Feature extraction progress bar fills fast, but some part in the code takes a long time #703
These are all very valid points you raise. In principle we do not need to pivot at all - the data is already nicely partitioned. It should not be a problem to fix this. If you want, I can finally have a look into this (now that we know that it is a problem). If you already have a good idea, I am of course even happier ;-) Distributing the pivoting is unfortunately hard (not undoable, but hard) and it will involve a lot of shuffling (even the modin you quote has not implemented pivoting so far ;-)). How to distribute the work properly depends a lot on the data - something we do not know. But I think using custom logic instead of pivoting can speed this up - so we might not need this anymore.
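To illustrate the "custom logic instead of pivoting" idea, here is a minimal sketch, assuming each chunk's result belongs to exactly one id and arrives as a list of (feature_name, value) pairs; the actual per-chunk format inside tsfresh's extraction.py differs and the names below are hypothetical:

```python
# Minimal sketch (not tsfresh code): skip the global pivot and assemble the wide
# result directly, since every chunk already corresponds to exactly one id.
import pandas as pd

def assemble_without_pivot(chunk_results):
    # chunk_results: iterable of (ts_id, [(feature_name, value), ...]) pairs
    # (an assumed shape, for illustration only)
    rows = []
    for ts_id, features in chunk_results:
        rows.append(pd.DataFrame([dict(features)], index=[ts_id]))
    return pd.concat(rows, sort=False)
```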
Just a short comment: it seems like the pivoting can be improved by a large factor. Still sorting out the details, but it was totally worth looking into this. Thanks again!
Cool, thanks for #705, will look at it in the afternoon (most probably). I have approximately 300k ids. I will do some tests.
#705 removed a bit of computational effort (~half) from this part - but the issue is still not solved.
Thanks a lot @nils-braun. I am really impressed by the effort you put in and the help. Regarding your comments:
Yeah, thanks a lot. I already tested it 👍
That would be great. I do not care about the order, as I want to feed the result into a machine learning model with every row being one training sample.
Yeah, I guess this dict of dicts makes things very slow. Concerning (a), I don't know what the best solution would be. Concerning (b), we could create the array or pandas dataframe already on the workers (so it scales with the cores) - a small sketch of this idea follows after this comment.
I do not use impute at all (at the moment), so I don't know 🤷‍♂️
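A small sketch of idea (b), i.e. building the per-id frames already inside the worker processes so that this step scales with the number of cores; the multiprocessing setup and the per-chunk result shape are assumptions for illustration, not tsfresh internals:

```python
# Sketch: create the partial DataFrames inside worker processes and keep only the
# final (cheap) concat on the main process. Input shape is assumed for illustration.
from multiprocessing import Pool
import pandas as pd

def _frame_for_one_id(item):
    ts_id, features = item            # features: dict of feature_name -> value
    return pd.DataFrame([features], index=[ts_id])

def assemble_on_workers(chunk_results, n_workers=4):
    with Pool(n_workers) as pool:
        partial = pool.map(_frame_for_one_id, chunk_results)
    return pd.concat(partial, sort=False)
```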
I wrote an ugly workaround which (at least for my case) consumes way less memory and is way faster. My initial code:

```python
import pandas as pd
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import roll_time_series  # imports added for completeness


def compute_tsfresh_rolling_features_old(df, column_id="ride_id", column_sort="tick", window_size=30):
    df_rolled = roll_time_series(df, column_id=column_id, column_sort=column_sort,
                                 min_timeshift=window_size - 1, max_timeshift=window_size - 1)
    df_features = extract_features(df_rolled, column_id="id", column_sort=column_sort)
    return df_features
```

This took very long and went out of memory (even with 56 GB; 112 GB were enough, though). Now the dirty workaround (I also have a parallel version, left out as it is not needed for understanding). Computing the rolling dataframe is the same as before. Then, I follow some sort of divide & conquer approach to compute the features:

```python
def compute_tsfresh_rolling_features_sequential(df, column_id="ride_id", column_sort="tick", window_size=30):
    df_rolled = roll_time_series(df, column_id=column_id, column_sort=column_sort,
                                 min_timeshift=window_size - 1, max_timeshift=window_size - 1)
    groups = df_rolled.groupby("id", sort=False)
    dfs = []
    for name, df_group in groups:
        dfs.append(group_function(df_group))
    df_res = pd.concat(dfs, ignore_index=True, sort=False)
    return df_res


def group_function(group):
    # extract the features for a single rolled window id
    df_features = extract_features(group, column_id="id", column_sort="tick").reset_index()
    return df_features
```
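For reference, a hypothetical way to call the workaround on a tiny synthetic dataset (the column names follow the snippet above; the real data from the issue is of course not available):

```python
import numpy as np
import pandas as pd

# two rides with 60 ticks each and one sensor column
df = pd.DataFrame({
    "ride_id": np.repeat([1, 2], 60),
    "tick": np.tile(np.arange(60), 2),
    "speed": np.random.rand(120),
})

features = compute_tsfresh_rolling_features_sequential(df, window_size=30)
print(features.shape)  # one row per rolled window id
```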
Very interesting! Thanks for the investigation. I am just wondering why this is so different from now. That is definitely something to continue studying. Just a side note: if you want to automate this "prepare data once, then do feature extraction on each chunk, then merge it" workflow, I would suggest using a workflow automation tool.
That's a good question about the progress bar. Cannot tell at the moment, will investigate (hopefully) tomorrow.
A long time ago, I had a look at modin.
By the way: thank you for investigating so deeply and trying to solve the problem! Nice to see that people really want to push the boundaries!
A small update on this.
@beyondguo - it is hard to tell without the actual data. I would suggest you try with a smaller set of feature extractors, like the efficient feature extractors. Some of the extractors scale very badly with the length of a single time series, and that is what we might see here. Only one CPU doing work is strange, because the extract_features function basically enters multiprocessing right away. Can you check the default number of processes?
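In code, the suggestion could look like the following, using tsfresh's EfficientFCParameters/MinimalFCParameters settings objects and the n_jobs argument; df_rolled is assumed to be a rolled dataframe like in the snippets above:

```python
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters

features = extract_features(
    df_rolled,                                       # rolled dataframe (assumed from above)
    column_id="id",
    column_sort="tick",
    default_fc_parameters=EfficientFCParameters(),   # or MinimalFCParameters() for a quick test
    n_jobs=16,                                       # explicit number of worker processes
)
```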
When I run the feature extraction on this rolled dataframe, it takes about 10 minutes until the progress bar is filled to 100% (using all 16 cores on the system).
Then, it takes a very long time (approx. 20-30 minutes, maybe even more) until the feature extraction completes (utilizing just a single core).
If I read the code correctly, it must happen in this part of extraction.py (after _do_extraction_on_chunk): I did not do any profiling, but it must be the creation of the dataframe or the pivotization.
Do you have any idea how to speed this up?
If it is the pivot, could this be done on the workers?
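To make the suspicion concrete, here is a rough illustration (not the actual extraction.py code) of what the serial post-processing step amounts to: collecting the per-chunk results into a long dataframe and then pivoting it into the wide id x feature matrix; the tuple shape is an assumption:

```python
import pandas as pd

def pivot_results(results):
    # results: list of (ts_id, feature_name, value) tuples (assumed shape)
    long_df = pd.DataFrame(results, columns=["id", "variable", "value"])
    # both the DataFrame construction and the pivot run on the main process only
    return long_df.pivot(index="id", columns="variable", values="value")
```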
A second question (I don't want to open another issue for this):
Did you consider switching from native pandas to modin to parallelize e.g. concats and other pandas operations (https://github.com/modin-project/modin)? I know this would be a bigger change :)
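For context, switching to modin is (in the simplest case) just a change of the import line, since it aims to be a drop-in replacement for pandas; whether all pandas operations used inside tsfresh are already covered by modin would need to be checked:

```python
import modin.pandas as pd   # instead of: import pandas as pd

frames = [pd.DataFrame({"value": range(1000)}) for _ in range(10)]
df = pd.concat(frames, ignore_index=True)   # executed by modin's parallel backend
```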