
speed up calculation by numba #205

Open
liorshk opened this issue May 27, 2017 · 24 comments

@liorshk

liorshk commented May 27, 2017

Hi,
I have run a performance comparison of three basic feature extractors (mean, sum of values, and standard deviation) between your implementations and the pandas built-ins, and I found some major differences.

I changed the code in extraction.py at line 345:

if isDirect:
    # Bypass the generic feature extractors for these three features and
    # let pandas' optimized groupby methods compute them directly.
    del column_name_to_aggregate_function[column_prefix + '__sum_values']
    del column_name_to_aggregate_function[column_prefix + '__standard_deviation']
    del column_name_to_aggregate_function[column_prefix + '__mean']
    extracted_features = pd.DataFrame(index=dataframe[column_id].unique())
    extracted_features[column_prefix + '__mean'] = dataframe.groupby(column_id)[column_value].mean()
    extracted_features[column_prefix + '__sum_values'] = dataframe.groupby(column_id)[column_value].sum()
    extracted_features[column_prefix + '__standard_deviation'] = dataframe.groupby(column_id)[column_value].std()
else:
    # Fall back to the original per-feature aggregation.
    extracted_features = dataframe.groupby(column_id)[column_value].aggregate(column_name_to_aggregate_function)

As you can see, I only called the pandas built-in functions instead of the feature extractor implementations.

The following plot compares different time series lengths with 1000 ids:
[Plot "TimeSeriesLength": extraction runtime of the pandas built-ins vs. the tsfresh feature extractors for increasing time series length]

As you can see, there is a major performance difference. The reason is probably that the pandas functions are optimized.

I think the feature extractors should be optimized using numba or cython.

@nils-braun
Collaborator

Thank you for this nice study @liorshk!
We will definitely have a look into this. For this we will have to change some internals in the feature calculators, but this should be doable.

@MaxBenChrist
Collaborator

MaxBenChrist commented May 28, 2017

@liorshk: Nice observation.

But I do not see an urgent need to optimize this part.

Yes, the functions .mean(), .max(), .min(), .std(), .sum() and .median() are optimized on a groupby object (so gp.f() is faster than gp.apply(f)).
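For illustration, a minimal timing sketch of this difference (toy data, arbitrary sizes):

import numpy as np
import pandas as pd

# Toy data: 1000 ids with 100 samples each.
df = pd.DataFrame({
    "id": np.repeat(np.arange(1000), 100),
    "value": np.random.randn(100_000),
})
gp = df.groupby("id")["value"]

%timeit gp.mean()          # fast path: optimized groupby kernel
%timeit gp.apply(np.mean)  # generic path: one Python-level call per group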

However, if we want to exploit that, we would have to implement some annoying routines, making the code unnecessarily complicated. We would probably have to add a fourth kind of feature calculator.

The return is not worth it from my point of view.

@nils-braun
Collaborator

Well, we could at least use numba or something similar to speed up the calculations themselves. Let's study this in greater detail.

@NoamGit

NoamGit commented Jun 11, 2017

It seems that @liorshk is right.

Also, I've noticed that the return type of the "apply" calculators is a pandas Series, which is known to be much more expensive than simple data structures such as tuples (I checked this myself).
Accordingly, it would be great if all types of calculators returned tuples instead.

BTW, is there any special reason why you chose to work with the pandas DataFrame format instead of dictionaries with 'id' keys and [time_vector, signal_vector] values? I suspect that the frequently used groupby('id') operation dramatically extends the program's runtime...
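A minimal sketch of the representation described above (the layout is hypothetical, not an existing tsfresh format):

import numpy as np

# One entry per id, holding the raw time and signal vectors directly.
timeseries = {
    1: [np.array([0.0, 1.0, 2.0]), np.array([0.3, 0.1, 0.5])],
    2: [np.array([0.0, 1.0]), np.array([1.2, 0.9])],
}

# Feature extraction becomes a plain loop, with no groupby involved.
means = {id_: float(np.mean(signal)) for id_, (time, signal) in timeseries.items()}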

@MaxBenChrist
Collaborator

Also, I've noticed that the return type of the "apply" calculators is a pandas Series, which is known to be much more expensive than simple data structures such as tuples (I checked this myself).
Accordingly, it would be great if all types of calculators returned tuples instead.

Can you provide some benchmarks for that? If I understand you correctly, you propose to change the return type of all feature calculators from pandas Series to ndarrays, lists, or tuples?

@MaxBenChrist
Collaborator

BTW, is there any special reason why you chose to work with the pandas DataFrame format instead of dictionaries with 'id' keys and [time_vector, signal_vector] values? I suspect that the frequently used groupby('id') operation dramatically extends the program's runtime.

Right now there is no strict dependency on pandas dataframes. Originally, we were aiming for a framework that would allow us to distribute the apply calls over a cluster. Also, the groupby routine was quite convenient because it saved us from writing a lot of code (as always in business, computation time is cheap but programming time is not).

I like your idea with the id column as keys. Maybe we should benchmark such an internal representation against the current format.

@nils-braun
Collaborator

I have now implemented and tested a version where I am using numpy arrays internally and do not return Series, but tuples. The performance boost is actually not as high as I expected (10%), but still.

Also, I like the logic more now, as there is no need to distinguish between apply and aggregate, and using numba should be easier.

I will fix up the branch and make a PR.
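A sketch of what such a tuple-returning calculator could look like (a hypothetical shape, not the actual code from the PR):

import numpy as np

def quantile_calculator(x, param):
    # x is a plain numpy array instead of a pandas Series; the result is a
    # list of (feature_name, value) tuples instead of a Series, so no
    # apply/aggregate distinction is needed downstream.
    return [("quantile__q_{}".format(p["q"]), float(np.quantile(x, p["q"])))
            for p in param]

# Usage: quantile_calculator(np.random.randn(100), [{"q": 0.1}, {"q": 0.9}])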

@MaxBenChrist
Collaborator

I have now implemented and tested a version where I am using numpy arrays internally and do not return Series, but tuples. The performance boost is actually not as high as I expected (10%), but still.
Also, I like the logic more now, as there is no need to distinguish between apply and aggregate, and using numba should be easier.
I will fix up the branch and make a PR.

A 10% decrease in runtime? Amazing. I am really curious to see the PR. It probably touches many parts of tsfresh?

@nils-braun
Collaborator

All right, parts of it are now in tsfresh (head version). There is still more to do (we could still gain from numba etc.), but this requires some more work. I will leave this issue open for later reference.

@MaxBenChrist
Collaborator

@nils-braun then we should probably adapt the issue title.

maybe "speed up calculation by numba"?

@nils-braun
Collaborator

Go for it, I am currently on my smartphone

@MaxBenChrist changed the title from "Performance comparison with pandas built in functions" to "speed up calculation by numba" on Jul 6, 2017
@MaxBenChrist
Collaborator

MaxBenChrist commented Jul 11, 2017

I think a great place to test that is the sample_entropy feature calculator.
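For reference, a rough numba port of the standard sample entropy definition could look like the sketch below (common defaults m = 2 and r = 0.2 * std assumed; this is not tsfresh's actual implementation):

import numpy as np
import numba as nb

@nb.njit
def sample_entropy_numba(x, m=2, r_factor=0.2):
    # Count template matches of length m (b) and m + 1 (a) within
    # tolerance r under the Chebyshev distance, and return -ln(a / b).
    n = x.shape[0]
    r = r_factor * np.std(x)
    a = 0
    b = 0
    for i in range(n - m):
        for j in range(i + 1, n - m):
            d = 0.0
            for k in range(m):
                d = max(d, abs(x[i + k] - x[j + k]))
            if d <= r:
                b += 1
                if abs(x[i + m] - x[j + m]) <= r:
                    a += 1
    if a == 0 or b == 0:
        return np.nan
    return -np.log(a / b)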

@thibaultbl
Contributor

I would like to take a look at this issue.

Do you want a big PR with all feature calculators modified to use numba wherever possible, or can we work with incremental changes (converting the feature calculators one by one)?

I think some feature calculators can be improved really quickly, but we need to benchmark this.

@nils-braun
Collaborator

Thank you very much! That would be great.

Incremental PRs are fine, if they make sense: if there is some initialization time involved and it only pays off once multiple calculators are converted, a larger PR might be better.
If that is really the case, we can also keep a "numba" branch around and do incremental changes against that one.

Would be really interesting to see if this pays off!

@thibaultbl
Contributor

I tried with sample_entropy and got a 20% improvement without multiprocessing, but once multiprocessing is at work, performance gets worse (really worse). I do not think I will be able to improve performance with numba at this point.
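One plausible cause (an assumption, not verified in this thread) is that every worker process JIT-compiles the function again, which can dominate the runtime. numba's cache=True stores the compiled code on disk so that new processes can reuse it, for example:

import numba as nb

# cache=True writes the compiled machine code to disk; subsequent processes
# load it instead of paying the JIT compilation cost again.
@nb.njit(cache=True)
def abs_energy_numba(x):
    total = 0.0
    for v in x:
        total += v * v
    return total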

@slamer59

Do you have a branch where we can see what you have done? To see if there is a bottleneck somewhere? I am interested.

@thibaultbl
Contributor

https://github.com/thibaultbl/tsfresh/tree/numba

I created an "optimized_sample_entropy" function to benchmark against the sample_entropy version.

@arturdaraujo

arturdaraujo commented Feb 28, 2023

I think it is time to at least give the option of using numba for some functions. The changes are minimal.

On average, the numba version of a function is 5x to 10x faster in most cases. I strongly believe this would be a very good way to speed up tsfresh, at least as an option. numba recognizes many numpy functions, which would make the implementation easy for many tsfresh functions.

Here is a simple adaptation and the corresponding benchmark:

import numpy as np
import numba as nb

def maximum(x):
    """
    Calculates the highest value of the time series x.

    :param x: the time series to calculate the feature of
    :type x: numpy.ndarray
    :return: the value of this feature
    :return type: float
    """
    return np.max(x)


@nb.jit(nopython=True)
def maximum_numba(x):
    """
    Calculates the highest value of the time series x.

    :param x: the time series to calculate the feature of
    :type x: numpy.ndarray
    :return: the value of this feature
    :return type: float
    """
    return np.max(x)

Checks and benchmarks:

sample_array = np.array([n**3 for n in range(1, 100)])

print(maximum_numba(sample_array))
# 970299
print(maximum(sample_array))
# 970299

%timeit maximum(sample_array)
# 2.77 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit maximum_numba(sample_array)
# 255 ns ± 7.58 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@arturdaraujo

I can help convert the functions if more help is needed @nils-braun

@arturdaraujo

Another example, using the function cid_ce. This one is ~40x faster.

import numpy as np
import numba as nb
import pandas as pd

def cid_ce(x, normalize):
    """
    This function calculator is an estimate for a time series complexity [1] (A more complex time series has more peaks,
    valleys etc.). It calculates the value of

    .. math::

        \\sqrt{ \\sum_{i=1}^{n-1} ( x_{i} - x_{i-1})^2 }

    .. rubric:: References

    |  [1] Batista, Gustavo EAPA, et al (2014).
    |  CID: an efficient complexity-invariant distance for time series.
    |  Data Mining and Knowledge Discovery 28.3 (2014): 634-669.

    :param x: the time series to calculate the feature of
    :type x: numpy.ndarray
    :param normalize: should the time series be z-transformed?
    :type normalize: bool

    :return: the value of this feature
    :return type: float
    """
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    if normalize:
        s = np.std(x)
        if s != 0:
            x = (x - np.mean(x)) / s
        else:
            return 0.0

    x = np.diff(x)
    return np.sqrt(np.dot(x, x))


@nb.jit(nopython=True)
def cid_ce_numba(x, normalize):
    """
    This function calculator is an estimate for a time series complexity [1] (A more complex time series has more peaks,
    valleys etc.). It calculates the value of

    .. math::

        \\sqrt{ \\sum_{i=1}^{n-1} ( x_{i} - x_{i-1})^2 }

    .. rubric:: References

    |  [1] Batista, Gustavo EAPA, et al (2014).
    |  CID: an efficient complexity-invariant distance for time series.
    |  Data Mining and Knowledge Discovery 28.3 (2014): 634-669.

    :param x: the time series to calculate the feature of
    :type x: numpy.ndarray
    :param normalize: should the time series be z-transformed?
    :type normalize: bool

    :return: the value of this feature
    :return type: float
    """
    if normalize:
        s = np.std(x)
        if s != 0:
            x = (x - np.mean(x)) / s
        else:
            return 0.0

    x = np.diff(x)
    return np.sqrt(np.dot(x, x))

Checks and benchmarks:

normalize = True
sample_array = np.array([n**3 for n in range(1, 100)])
sample_array = sample_array.astype(np.float64)

print(cid_ce(sample_array, normalize))
# 0.46831974531962184
print(cid_ce_numba(sample_array, normalize))
# 0.46831974531962184

%timeit cid_ce(sample_array, normalize)
# 37.8 µs ± 5.72 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit cid_ce_numba(sample_array, normalize)
# 1.04 µs ± 39.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@nils-braun
Collaborator

@arturdaraujo - thanks for looking into this! I would be very interested in how the numba implementation performs on the calculators with high computation cost (as marked by this flag). Speeding up the simpler, minimal calculators such as min or max is definitely nice, but I think most of the time is spent in just a few calculators.
Could you run your tests using the full set (or at least the efficient parameters)?

How does the numba solution play with multiprocessing?

Thank you very much!

@arturdaraujo

arturdaraujo commented Feb 28, 2023

How does the numba solution play with multiprocessing?

Numba natively supports parallel execution; it is available from the get-go.

Speeding up the simpler, minimal calculators such as min or max is definitely nice, but I think most of the time is spent in just a few calculators.
Could you run your tests using the full set (or at least the efficient parameters)?

I will look into more computationally expensive functions, no worries.

@nils-braun
Collaborator

Numba natively supports parallel execution; it is available from the get-go.

Correct! But as also stated in the links you shared, it does not play well with the "normal" multiprocessing we are using for parallelization among time series. I am not an expert in numba, but it might be interesting to understand:
(a) whether we can no longer use multiprocessing at all for parallelization among time series, and what the impact on runtime is
(b) where the tipping point lies between numba's parallelism within the data of a single time series and parallelization over multiple time series (using the multiprocessing we currently do)
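For point (b), a sketch of what parallelism inside numba could look like (a hypothetical example; batching all series into one matrix is not how tsfresh currently feeds its calculators):

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def abs_energy_batch(matrix):
    # One row per time series; numba distributes the outer loop over
    # threads, replacing the process-level parallelization over series.
    out = np.empty(matrix.shape[0])
    for i in nb.prange(matrix.shape[0]):
        s = 0.0
        for j in range(matrix.shape[1]):
            s += matrix[i, j] * matrix[i, j]
        out[i] = s
    return out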

@arturdaraujo

arturdaraujo commented Feb 28, 2023

I believe a careful approach is necessary if you are considering dropping your current multiprocessing approach altogether. But I believe you could introduce numba while still maintaining the current multiprocessing.

My take on this would be:

  1. Create a module with the tsfresh functions modified to use numba. The original module stays intact because we know it works.
  2. Add an argument to extract_features called activate_numba (or something like this) that uses the numba functions instead of the ones from the main tsfresh (see the sketch below).

This approach would make it optional and also experimental in a sense. There might be scenarios where running the "normal" functions is even faster than numba. This would let the community check which datasets work better with numba and which with the normal tsfresh functions.
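A minimal sketch of the proposed switch (activate_numba and the registry are the suggestion from above, not an existing tsfresh API):

import numpy as np
import numba as nb

@nb.njit(cache=True)
def maximum_numba(x):
    return np.max(x)

# Hypothetical registry mapping calculator names to their numba versions.
NUMBA_CALCULATORS = {"maximum": maximum_numba}

def get_calculator(name, activate_numba=False):
    # Fall back to the original implementation when the switch is off or
    # no numba version exists, so the old behaviour stays untouched.
    from tsfresh.feature_extraction import feature_calculators
    if activate_numba and name in NUMBA_CALCULATORS:
        return NUMBA_CALCULATORS[name]
    return getattr(feature_calculators, name)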

So here is a reason NOT to drop tsfresh's current multiprocessing:

For instance, while researching this I noticed that the numba functions I created from tsfresh (maximum and cid_ce) perform the same as tsfresh's current functions when you use an array of 10,000 values (on my Linux machine, at least). So I imagine that with 10 time series of 20,000 rows it would be better to use tsfresh's multiprocessing. With 100 time series of 1,000 rows, however, it would still be much better to run with numba. I also don't know how numba performs on different systems like Linux, Windows, and macOS, so that's one more reason to test it thoroughly.

I hope I didn't overstep here. Your package is really awesome, and I am only trying to picture an easier way to implement this while keeping compatibility with old code and exploring new ways of processing.
