Scalability of feature extraction #363
Hi @rth, thanks for the issue and the benchmarks, it's always good to push the boundaries. From my point of view, you showed in your benchmark that it is possible to speed up the feature calculation for two very specific features. I am curious to see if the calculation of the other features can be sped up similarly. I have not yet worked with xarrays. When running your benchmark, you should keep in mind that apart from the pure calculation of features, tsfresh also contains logic around the extraction that eats up runtime. So it is natural that tsfresh will show a longer runtime.
This implies that all time series in the dataset have the same length, right? tsfresh supports datasets where the time series have different lengths.
There is no pressing need to use pandas DataFrames as the time series container. We developed the package that way because we were used to working with pandas DataFrames.
Thanks for your response @MaxBenChrist
Well, those two were somewhat randomly chosen. The main difference is between doing an operation on a 2D array versus a for loop over 1D arrays. I imagine that in some cases the former would be faster due to better vectorization in the BLAS used by numpy. Though, currently that wouldn't directly work for models that require
Well, xarray was used mostly for convenience (to keep labels associated with array dimensions); the same could be done with plain numpy arrays.
Yes, and the idea would be to do the corresponding validation etc. once per dataset instead of once per 1D array. But I have not looked at how this would interact with e.g. the feature selection in your pipeline.
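The 2D-vs-1D point above can be sketched with `abs_energy` (the sum of squares of a series, as defined by tsfresh): a Python loop over 1D rows versus a single vectorized reduction over the whole 2D array. This is a minimal illustration with synthetic data, not tsfresh code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 500))  # 1000 series ('uid'), 500 samples ('time')

# Per-series loop, the way 1D feature calculators are invoked today
loop_result = np.array([np.dot(x, x) for x in X])

# One vectorized reduction over the 2D array (summing over the 'time' axis)
vec_result = np.einsum("ij,ij->i", X, X)

assert np.allclose(loop_result, vec_result)
```

The batched version avoids per-series Python overhead and lets numpy dispatch one large operation instead of a thousand small ones.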
Not necessarily. It implies that all time series are sampled on the same time grid. Ideally it would also be uniformly sampled (tsfresh also makes this assumption in a few places). To handle series of different lengths, one would need to handle missing values (e.g. via NaNs) in the feature extraction metrics. Anyway, these are just some thoughts; I'm not sure this would make sense in the general use case of tsfresh.
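One way to sketch the missing-value idea mentioned above: pad shorter series with NaN onto a common time grid and use NaN-aware reductions. The `abs_energy` definition (sum of squares) is tsfresh's; the padding scheme is only an illustration:

```python
import numpy as np

# Two series of different lengths, padded with NaN onto a common time grid
series = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]
n = max(len(s) for s in series)
X = np.full((len(series), n), np.nan)
for i, s in enumerate(series):
    X[i, : len(s)] = s

# NaN-aware abs_energy (sum of squares), ignoring the padding
abs_energy = np.nansum(X * X, axis=1)
# → array([14., 41.])
```

Not every feature calculator has such a simple NaN-aware counterpart, which is part of the concern about per-calculator overhead raised below.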
FWIW, that's also more similar to the time series representation in tslearn...
Thanks for the detailed answer @rth
I am curious about the overhead and overall runtime costs if each feature calculator is checking for NaNs.
So, in essence, we need to identify features that benefit from being calculated under BLAS on two-dimensional arrays. Are you interested in working on this?
Yes, I imagine it could be significant, which would defeat the purpose a bit.
True. But I also realised that in my use case some features were taking much more time than others (even those not marked with
Unfortunately, I won't have the bandwidth to work on it anymore, but I hope these benchmarks can help whoever does.
@rth have you spent any time on implementing the feature extractors for xarray? I think you are onto something here.
No, I haven't looked into this since 2018.
In case anyone is interested (slightly related): I posted about the scaling behavior of tsfresh here: https://nils-braun.github.io/execution-time/
Thank you for this package!
Somewhat related to #356. As far as I understand, currently the input format is a pandas.DataFrame; the data is aggregated by the `date` and `uid` columns, and finally features are extracted from the individual time series. All functions in `tsfresh/feature_extraction/feature_calculators.py` appear to take as input either 1D arrays or `pd.Series`.

What I am wondering is whether computing features from 2D arrays (with axes `('uid', 'time')`) would be computationally more efficient. I have done a few performance benchmarks to investigate this, comparing the current approach against aggregating the DataFrame by user and date to convert it to a labelled array (with xarray, for convenience). The results are available in the following notebooks.

The conclusion appears to be that feature extraction on 2D arrays would be 10-30x faster. Parallelization is disabled, and these benchmarks are run on a 2-CPU-core laptop for the following features: `{'abs_energy': None, 'absolute_sum_of_changes': None}`. So even the default multi-threading in BLAS for numpy operations would not explain this difference.

In the end I'm trying to run feature extraction from tsfresh on ~5 GB of data (i.e. 10x larger than the larger benchmark), and by my rough estimation this would take half a day for all features. Of course dask parallelization can help, but I'm wondering if there is a computationally more efficient way of doing it before parallelizing (one that would also work with xarray).

Am I missing something? What do you think? I imagine that in the use case for which this package was developed, pandas.DataFrame made more sense than e.g. xarray to store time series data? Thank you.
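The approach described above can be sketched as follows: pivot the long-format input onto a `('uid', 'time')` grid and compute the two benchmarked features with one numpy reduction each. The feature names `abs_energy` and `absolute_sum_of_changes` match the tsfresh calculators; the pivot-based layout and column names are assumptions for illustration, not tsfresh's API:

```python
import numpy as np
import pandas as pd

# Long-format input, one row per (uid, time) observation
df = pd.DataFrame({
    "uid":   [0, 0, 0, 1, 1, 1],
    "time":  [0, 1, 2, 0, 1, 2],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Pivot onto a common ('uid', 'time') grid; missing samples become NaN
X = df.pivot(index="uid", columns="time", values="value").to_numpy()

# Each feature is computed once over the 2D array instead of per series
features = pd.DataFrame({
    "abs_energy": np.nansum(X**2, axis=1),
    "absolute_sum_of_changes": np.nansum(np.abs(np.diff(X, axis=1)), axis=1),
}, index=pd.Index([0, 1], name="uid"))
```

Note that `np.diff` propagates NaN across a gap, so series with internal missing values would need extra care in the `absolute_sum_of_changes` case.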