speed up calculation by numba #205
Comments
Thank you for this nice study @liorshk!
@liorshk: Nice observation. But I do not see the urge to optimize this part. Yes, the functions could probably be made faster. However, if we want to exploit that, we would have to implement some annoying routines, making the code unnecessarily complicated. We probably would have to add a fourth kind of feature calculator. The return is not worth it from my point of view.
Well, we could at least use numba or something similar to speed up the calculation itself. Let's study this in greater detail.
It seems that @liorshk is right. Also, I've noticed that the return type of the "apply" calculators is a pandas Series, which is known to be much more expensive than simple data structures such as tuples (I checked this myself). BTW, is there any special reason why you chose to work with the pandas DataFrame format instead of dictionaries with 'id' keys and [time_vector, signal_vector] values? I suspect that the frequently used groupby('id') operation dramatically extends the program's runtime...
Can you provide some benchmarks for that? If I understand you correctly, you propose to change the return type of all feature calculators from pandas Series to ndarrays, lists, or tuples?
Right now there is no strict dependency on pandas DataFrames. Originally, we were aiming for a framework that would allow us to distribute the apply calls over a cluster. Also, the groupby routine was quite convenient because it saved us from writing a lot of code (as always in business, computation time is cheap but programming time is not). I like your idea with the id column as keys. Maybe we should benchmark such an internal representation against the current format.
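As a starting point for such a benchmark, here is a minimal sketch (not from the thread) comparing the two internal representations discussed above: iterating over a `groupby('id')` versus a plain dictionary keyed by id. The column names `id`, `time`, and `value` are assumed from tsfresh's usual long format:

```python
import time

import numpy as np
import pandas as pd

# Long-format frame: 1000 ids, 500 samples each (sizes chosen arbitrarily).
n_ids, n_samples = 1000, 500
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_ids), n_samples),
    "time": np.tile(np.arange(n_samples), n_ids),
    "value": np.random.randn(n_ids * n_samples),
})

# Representation A: iterate over groupby('id') and work on each chunk.
start = time.perf_counter()
res_a = {key: chunk["value"].to_numpy().mean() for key, chunk in df.groupby("id")}
t_groupby = time.perf_counter() - start

# Representation B: a plain dict of {id: (time_vector, signal_vector)}.
as_dict = {
    key: (chunk["time"].to_numpy(), chunk["value"].to_numpy())
    for key, chunk in df.groupby("id")
}
start = time.perf_counter()
res_b = {key: x.mean() for key, (t, x) in as_dict.items()}
t_dict = time.perf_counter() - start

print(f"groupby iteration: {t_groupby:.4f}s, dict of arrays: {t_dict:.4f}s")
```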
I have now implemented and tested a version where I use numpy arrays internally and return tuples instead of Series. The performance boost is not as high as I expected (10%), but still. I also like the logic more now, as there is no need to distinguish between apply and aggregate, and using numba should be easier. I will fix up the branch and make a PR.
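For illustration only, a tiny sketch of the return-type change described here: an apply-style calculator that used to build a pandas Series instead returns plain (feature_name, value) tuples. The function names are made up, not tsfresh's actual API:

```python
import numpy as np
import pandas as pd

def quantiles_as_series(x, qs=(0.1, 0.5, 0.9)):
    # Old style: build a pandas Series, which carries index overhead.
    return pd.Series({f"quantile_q_{q}": np.quantile(x, q) for q in qs})

def quantiles_as_tuples(x, qs=(0.1, 0.5, 0.9)):
    # New style: lightweight (feature_name, value) tuples instead.
    return [(f"quantile_q_{q}", np.quantile(x, q)) for q in qs]

x = np.random.randn(1000)
print(quantiles_as_tuples(x))
```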
10% decrease in runtime? Amazing. I am really curious to see the PR. It probably touches many parts of tsfresh?
All right, parts of it are now in tsfresh (head version). There is still more to do (we could still gain from numba etc.), but this requires some more work. I will leave this issue open for later reference.
@nils-braun then we should probably adapt the issue title. Maybe "speed up calculation by numba"?
Go for it, I am currently on my smartphone.
I think a great place to test that is the ...
I would like to take a look at this issue. Do you want a big PR with all feature calculators modified to use numba wherever possible, or can we work with incremental changes (converting feature calculators one by one)? I think some feature calculators can be improved really quickly, but we need to benchmark this.
Thank you very much! That would be great. Incremental PRs are fine if they make sense: if there is some initialization time involved and it only pays off once multiple calculators are converted, a larger PR might be better. Would be really interesting to see if this pays off!
I tried with ...
Do you have a branch where we can see what you have done?
https://github.com/thibaultbl/tsfresh/tree/numba
I created an "optimized_sample_entropy" function to benchmark against the sample_entropy version.
I think it is time to at least give the option of using numba. On average, the numba-based function is faster. This is a simple adaptation; checks and a benchmark follow:
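The original checks and timings from this comment are not reproduced above. Purely as a rough illustration, here is a sketch of what a numba-compiled maximum calculator and a simple timing comparison could look like; the names `maximum_numpy`, `maximum_numba`, and the benchmark setup are assumptions, not code from the thread:

```python
import time

import numpy as np
from numba import njit

def maximum_numpy(x):
    # Plain NumPy version, standing in for the current calculator.
    return np.max(x)

@njit
def maximum_numba(x):
    # numba-compiled equivalent: the explicit loop is compiled to machine code.
    best = x[0]
    for v in x:
        if v > best:
            best = v
    return best

x = np.random.randn(1_000)
maximum_numba(x)  # warm-up call so JIT compilation is not included in the timing

for label, fn in (("numpy", maximum_numpy), ("numba", maximum_numba)):
    start = time.perf_counter()
    for _ in range(10_000):
        fn(x)
    print(label, time.perf_counter() - start)
```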
I can help convert the functions if more help is needed @nils-braun |
Another example, using the function cid_ce. This one is ~40x faster.
Check and benchmarks:
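The cid_ce benchmark itself is likewise not reproduced above. Purely as an illustration, below is one possible numba version based on cid_ce's published definition (the square root of the sum of squared first differences, optionally computed on the z-normalized series); the function name and exact signature are assumptions:

```python
import numpy as np
from numba import njit

@njit
def cid_ce_numba(x, normalize):
    # Complexity-invariant distance: sqrt(sum(diff(x) ** 2)),
    # optionally computed on the z-normalized series.
    xs = x.astype(np.float64)
    if normalize:
        s = xs.std()
        if s == 0.0:
            return 0.0
        xs = (xs - xs.mean()) / s
    acc = 0.0
    for i in range(xs.shape[0] - 1):
        d = xs[i + 1] - xs[i]
        acc += d * d
    return np.sqrt(acc)

print(cid_ce_numba(np.random.randn(10_000), True))
```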
@arturdaraujo - thanks for looking into this! I would be very interested in how the numba implementation performs on the calculators with high computation cost (as marked by this flag), because I think speeding up the simpler minimal calculators such as ... will bring less benefit. How does the numba solution play with multiprocessing? Thank you very much!
Numba natively uses multiprocessing; it is set up from the get-go.
I will look into more computationally expensive functions, no worries.
Correct! But as also stated in the links you shared, it does not play well with the "normal" ...
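To make the multiprocessing concern concrete: numba's own parallelism runs on its internal thread pool (for example via `prange` under `@njit(parallel=True)`), which is typically what clashes with an outer process pool through CPU oversubscription. A minimal sketch, not taken from tsfresh:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def sum_of_squares(x):
    # prange spreads this loop over numba's own thread pool; nesting such a
    # kernel inside a process pool can oversubscribe the available cores.
    acc = 0.0
    for i in prange(x.shape[0]):
        acc += x[i] * x[i]
    return acc

print(sum_of_squares(np.random.randn(1_000_000)))
```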
I believe a safe approach is necessary if you are considering dropping your current multiprocessing approach altogether, but you could implement numba support while still maintaining the current multiprocessing. My take on this would be like this:
This approach would make it optional and also experimental in a sense. I believe there might be scenarios where running the "normal" functions is even faster than numba, so the community could check which datasets work better with numba and which with the normal tsfresh functions.

Here is one reason NOT to drop tsfresh's current multiprocessing: while researching this, I noticed that the numba functions I created from tsfresh (maximum and cid_ce) perform the same as tsfresh's current functions on an array of 10,000 elements (on my Linux machine, at least). So I imagine that if you have 10 time series of 20,000 rows, it would be better to use tsfresh's multiprocessing; if I have 100 time series with 1,000 rows, however, it would still be much better to run with numba. I also don't know how numba performs on different systems such as Linux, Windows, and macOS, so that is one more reason to test it thoroughly.

I hope I didn't overstep here. Your package is really awesome, and I am only trying to picture an easier way to implement this while keeping compatibility with old code and exploring new ways of processing.
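As a sketch of the opt-in idea described above: the `use_numba` flag and the fallback structure below are hypothetical, not tsfresh's actual API, and the default path stays identical to the plain implementation:

```python
import numpy as np

try:
    from numba import njit
    HAVE_NUMBA = True
except ImportError:
    # numba stays an optional dependency; nothing changes without it.
    HAVE_NUMBA = False

def _maximum_py(x):
    return np.max(x)

if HAVE_NUMBA:
    @njit
    def _maximum_nb(x):
        best = x[0]
        for v in x:
            if v > best:
                best = v
        return best

def maximum(x, use_numba=False):
    # Hypothetical dispatcher: the experimental numba path sits behind a flag,
    # while the default behaviour stays identical to the plain implementation.
    if use_numba and HAVE_NUMBA:
        return _maximum_nb(x)
    return _maximum_py(x)

x = np.random.randn(10_000)
print(maximum(x), maximum(x, use_numba=True))
```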
Hi,
I have made a performance comparison of three basic feature extractors (mean, sum of values, and standard deviation) between your feature extractors and pandas, and I found some major performance differences.
I changed the code in extraction.py at line 345
As you can see, I only called the pandas built-in functions instead of calling the feature extractor implementations.
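The modified extraction.py code is not included in the issue text above. Purely as an illustration of the comparison being made, computing the features with pandas' built-in (cythonized) groupby aggregations instead of applying a Python feature calculator per group might look roughly like this; the column names follow tsfresh's usual long format and are assumptions here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": np.repeat(np.arange(1000), 500),
    "value": np.random.randn(1000 * 500),
})

# Per-group Python feature calculators, roughly what the benchmark compares against.
slow = df.groupby("id")["value"].apply(lambda x: x.mean())

# pandas built-in (cythonized) aggregations over the same groups.
fast = df.groupby("id")["value"].agg(["mean", "sum", "std"])
```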
The following plot compares runtimes for different time series lengths with 1000 ids.

As you can see, there is a major performance difference.
The reason is probably that the pandas functions are optimized.
I think the feature extractors should be optimized using numba or Cython.