Feature extraction progress bar fills fast, but some part in the code takes a long time #703
These are all very valid points you raise. In principle we do not need to pivot at all - the data is already nicely partitioned. It should not be a problem to fix this. If you want, I can finally have a look into this (now that we know that it is a problem). If you already have a good idea, I am of course even happier ;-) Distributing the pivoting is unfortunately hard (not undoable, but hard) and it will involve a lot of shuffling (even the modin you quote has not implemented pivoting so far ;-)). How to distribute the work properly depends a lot on the data - something we do not know. But I think using custom logic instead of pivoting can speed this up - so we might not need this anymore.
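To illustrate the "custom logic instead of pivoting" idea, here is a minimal sketch, assuming each chunk's result belongs to exactly one id and arrives as a list of (feature_name, value) pairs; the actual per-chunk format inside tsfresh's extraction.py differs and the names below are hypothetical:

```python
# Minimal sketch (not tsfresh code): skip the global pivot and assemble the wide
# result directly, since every chunk already corresponds to exactly one id.
import pandas as pd

def assemble_without_pivot(chunk_results):
    # chunk_results: iterable of (ts_id, [(feature_name, value), ...]) pairs
    # (an assumed shape, for illustration only)
    rows = []
    for ts_id, features in chunk_results:
        rows.append(pd.DataFrame([dict(features)], index=[ts_id]))
    return pd.concat(rows, sort=False)
```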
Just a short comment: it seems like the pivoting can be improved by a large factor. Still sorting out the details, but it was totally worth looking into this. Thanks again!
Cool, thanks for #705, will look at it in the afternoon (most probably). I have approximately 300k ids. I will do some tests.
#705 removed a bit of computational effort (~half) from this part - but the issue is still not solved.
Thanks a lot @nils-braun. I am really impressed by the effort you put in and the help. Regarding your comments:
Yeah, thanks a lot. I already tested it 👍
That would be great. I do not care about the order, as I want to feed the result into a machine learning model with every row being one training sample.
Yeah, I guess this dict of dicts makes things very slow. Concerning (a), I don't know what the best solution would be. Concerning (b), we could create the array or pandas dataframe already on the workers (so it scales with the cores) - a small sketch of this idea follows after this comment.
I do not use impute at all (at the moment), so I don't know 🤷‍♂️
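A small sketch of idea (b), i.e. building the per-id frames already inside the worker processes so that this step scales with the number of cores; the multiprocessing setup and the per-chunk result shape are assumptions for illustration, not tsfresh internals:

```python
# Sketch: create the partial DataFrames inside worker processes and keep only the
# final (cheap) concat on the main process. Input shape is assumed for illustration.
from multiprocessing import Pool
import pandas as pd

def _frame_for_one_id(item):
    ts_id, features = item            # features: dict of feature_name -> value
    return pd.DataFrame([features], index=[ts_id])

def assemble_on_workers(chunk_results, n_workers=4):
    with Pool(n_workers) as pool:
        partial = pool.map(_frame_for_one_id, chunk_results)
    return pd.concat(partial, sort=False)
```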
I wrote an ugly workaround which (at least for my case) consumes way less memory and is way faster. My initial code:

```python
import pandas as pd
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import roll_time_series  # imports added for completeness


def compute_tsfresh_rolling_features_old(df, column_id="ride_id", column_sort="tick", window_size=30):
    df_rolled = roll_time_series(df, column_id=column_id, column_sort=column_sort,
                                 min_timeshift=window_size - 1, max_timeshift=window_size - 1)
    df_features = extract_features(df_rolled, column_id="id", column_sort=column_sort)
    return df_features
```

This took very long and went out of memory (even with 56 GB; 112 GB were enough, though). Now the dirty workaround (I also have a parallel version, left out as it is not needed for understanding). Computing the rolling dataframe is the same as before. Then, I follow some sort of divide & conquer approach to compute the features:

```python
def compute_tsfresh_rolling_features_sequential(df, column_id="ride_id", column_sort="tick", window_size=30):
    df_rolled = roll_time_series(df, column_id=column_id, column_sort=column_sort,
                                 min_timeshift=window_size - 1, max_timeshift=window_size - 1)
    groups = df_rolled.groupby("id", sort=False)
    dfs = []
    for name, df_group in groups:
        dfs.append(group_function(df_group))
    df_res = pd.concat(dfs, ignore_index=True, sort=False)
    return df_res


def group_function(group):
    # extract the features for a single rolled window id
    df_features = extract_features(group, column_id="id", column_sort="tick").reset_index()
    return df_features
```
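For reference, a hypothetical way to call the workaround on a tiny synthetic dataset (the column names follow the snippet above; the real data from the issue is of course not available):

```python
import numpy as np
import pandas as pd

# two rides with 60 ticks each and one sensor column
df = pd.DataFrame({
    "ride_id": np.repeat([1, 2], 60),
    "tick": np.tile(np.arange(60), 2),
    "speed": np.random.rand(120),
})

features = compute_tsfresh_rolling_features_sequential(df, window_size=30)
print(features.shape)  # one row per rolled window id
```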
Very interesting! Thanks for the investigation. I am just wondering why this is so different from now. That is definitely something to continue studying. Just a side note: if you want to automate this "prepare data once, then do feature extraction on each chunk, then merge it" workflow, I would suggest using a workflow automation tool.
That's a good question about the progress bar. Cannot tell at the moment, will investigate (hopefully) tomorrow.
A long time ago, I had a look at modin.
By the way: thank you for investigating so deeply and trying to solve the problem! Nice to see that people really want to push the boundaries!
A small update on this.
@beyondguo - it is hard to tell without the actual data. I would suggest you try with a smaller set of feature extractors, like the efficient feature extractors. Some of the extractors scale very badly with the length of a single time series, and that is what we might see here. Only one CPU doing work is strange, because the extract_features function basically enters multiprocessing right away. Can you check the default number of processes?
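In code, the suggestion could look like the following, using tsfresh's EfficientFCParameters/MinimalFCParameters settings objects and the n_jobs argument; df_rolled is assumed to be a rolled dataframe like in the snippets above:

```python
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters

features = extract_features(
    df_rolled,                                       # rolled dataframe (assumed from above)
    column_id="id",
    column_sort="tick",
    default_fc_parameters=EfficientFCParameters(),   # or MinimalFCParameters() for a quick test
    n_jobs=16,                                       # explicit number of worker processes
)
```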
When I run the feature extraction on this rolled dataframe, it takes about 10 minutes until the progress bar is filled to 100% (using all 16 cores on the system).
Then, it takes a very long time (approx. 20-30 minutes, maybe even more) until the feature extraction completes (utilizing just a single core).
If I read the code correctly, it must happen in this part of extraction.py (after _do_extraction_on_chunk): I did not do any profiling, but it must be the creation of the dataframe or the pivotization.
Do you have any idea how to speed this up?
If it is the pivot, could this be done on the workers?
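To make the suspicion concrete, here is a rough illustration (not the actual extraction.py code) of what the serial post-processing step amounts to: collecting the per-chunk results into a long dataframe and then pivoting it into the wide id x feature matrix; the tuple shape is an assumption:

```python
import pandas as pd

def pivot_results(results):
    # results: list of (ts_id, feature_name, value) tuples (assumed shape)
    long_df = pd.DataFrame(results, columns=["id", "variable", "value"])
    # both the DataFrame construction and the pivot run on the main process only
    return long_df.pivot(index="id", columns="variable", values="value")
```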
A second question (I don't want to open another issue for this):
Did you consider switching from native pandas to modin to parallelize e.g. concats and other pandas operations (https://github.com/modin-project/modin)? I know this would be a bigger change :)
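For context, switching to modin is (in the simplest case) just a change of the import line, since it aims to be a drop-in replacement for pandas; whether all pandas operations used inside tsfresh are already covered by modin would need to be checked:

```python
import modin.pandas as pd   # instead of: import pandas as pd

frames = [pd.DataFrame({"value": range(1000)}) for _ in range(10)]
df = pd.concat(frames, ignore_index=True)   # executed by modin's parallel backend
```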