
speed up calculation by numba #205

Open
liorshk opened this issue May 27, 2017 · 24 comments

@liorshk

liorshk commented May 27, 2017

Hi,
I have run a performance comparison of three basic feature extractors (mean, sum of values, and standard deviation) between your implementations and the pandas built-ins, and I found some major differences.

I changed the code in extraction.py at line 345:

if isDirect:
    # Bypass the generic feature extractors for these three features and
    # let pandas' optimized groupby methods compute them directly.
    del column_name_to_aggregate_function[column_prefix + '__sum_values']
    del column_name_to_aggregate_function[column_prefix + '__standard_deviation']
    del column_name_to_aggregate_function[column_prefix + '__mean']
    extracted_features = pd.DataFrame(index=dataframe[column_id].unique())
    extracted_features[column_prefix + '__mean'] = dataframe.groupby(column_id)[column_value].mean()
    extracted_features[column_prefix + '__sum_values'] = dataframe.groupby(column_id)[column_value].sum()
    extracted_features[column_prefix + '__standard_deviation'] = dataframe.groupby(column_id)[column_value].std()
else:
    # Fall back to the original per-feature aggregation.
    extracted_features = dataframe.groupby(column_id)[column_value].aggregate(column_name_to_aggregate_function)

As you can see, I only called the pandas built-in functions instead of the feature extractor implementations.

The following plot compares different time series lengths with 1000 ids:
[Plot "TimeSeriesLength": extraction runtime of the pandas built-ins vs. the tsfresh feature extractors for increasing time series length]

As you can see, there is a major performance difference. The reason is probably that the pandas functions are optimized.

I think the feature extractors should be optimized using numba or cython.

@nils-braun
Collaborator

Thank you for this nice study @liorshk!
We will definitely have a look into this. For this we will have to change some internals in the feature calculators, but this should be doable.

@MaxBenChrist
Collaborator

MaxBenChrist commented May 28, 2017

@liorshk: Nice observation.

But I do not see an urgent need to optimize this part.

Yes, the functions .mean(), .max(), .min(), .std(), .sum() and .median() are optimized on a groupby object (so gp.f() is faster than gp.apply(f)).
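For illustration, a minimal timing sketch of this difference (toy data, arbitrary sizes):

import numpy as np
import pandas as pd

# Toy data: 1000 ids with 100 samples each.
df = pd.DataFrame({
    "id": np.repeat(np.arange(1000), 100),
    "value": np.random.randn(100_000),
})
gp = df.groupby("id")["value"]

%timeit gp.mean()          # fast path: optimized groupby kernel
%timeit gp.apply(np.mean)  # generic path: one Python-level call per group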

However, if we want to exploit that, we would have to implement some annoying routines, making the code unnecessarily complicated. We would probably have to add a fourth kind of feature calculator.

The return is not worth it from my point of view.

@nils-braun
Collaborator

Well, we could at least use numba or something similar to speed up the calculations themselves. Let's study this in greater detail.

@NoamGit

NoamGit commented Jun 11, 2017

It seems that @liorshk is right.

Also, I've noticed that the return type of the "apply" calculators is a pandas Series, which is known to be much more expensive than simple data structures such as tuples (I checked this myself).
Accordingly, it would be great if all types of calculators returned tuples instead.

BTW, is there any special reason why you chose to work with the pandas DataFrame format instead of dictionaries with 'id' keys and [time_vector, signal_vector] values? I suspect that the frequently used groupby('id') operation dramatically extends the program's runtime...
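A minimal sketch of the representation described above (the layout is hypothetical, not an existing tsfresh format):

import numpy as np

# One entry per id, holding the raw time and signal vectors directly.
timeseries = {
    1: [np.array([0.0, 1.0, 2.0]), np.array([0.3, 0.1, 0.5])],
    2: [np.array([0.0, 1.0]), np.array([1.2, 0.9])],
}

# Feature extraction becomes a plain loop, with no groupby involved.
means = {id_: float(np.mean(signal)) for id_, (time, signal) in timeseries.items()}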

@MaxBenChrist
Collaborator

Also, I've noticed that the return type of the "apply" calculators is a pandas Series, which is known to be much more expensive than simple data structures such as tuples (I checked this myself).
Accordingly, it would be great if all types of calculators returned tuples instead.

Can you provide some benchmarks for that? If I understand you correctly, you propose to change the return type of all feature calculators from pandas Series to ndarrays, lists, or tuples?

@MaxBenChrist
Collaborator

BTW, is there any special reason why you chose to work with the pandas DataFrame format instead of dictionaries with 'id' keys and [time_vector, signal_vector] values? I suspect that the frequently used groupby('id') operation dramatically extends the program's runtime.

Right now there is no strict dependency on pandas dataframes. Originally, we were aiming for a framework that would allow us to distribute the apply calls over a cluster. Also, the groupby routine was quite convenient because it saved us from writing a lot of code (as always in business, computation time is cheap but programming time is not).

I like your idea with the id column as keys. Maybe we should benchmark such an internal representation against the current format.

@nils-braun
Collaborator

I have now implemented and tested a version where I am using numpy arrays internally and do not return Series, but tuples. The performance boost is actually not as high as I expected (10%), but still.

Also, I like the logic more now, as there is no need to distinguish between apply and aggregate, and using numba should be easier.

I will fix up the branch and make a PR.
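A sketch of what such a tuple-returning calculator could look like (a hypothetical shape, not the actual code from the PR):

import numpy as np

def quantile_calculator(x, param):
    # x is a plain numpy array instead of a pandas Series; the result is a
    # list of (feature_name, value) tuples instead of a Series, so no
    # apply/aggregate distinction is needed downstream.
    return [("quantile__q_{}".format(p["q"]), float(np.quantile(x, p["q"])))
            for p in param]

# Usage: quantile_calculator(np.random.randn(100), [{"q": 0.1}, {"q": 0.9}])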

@MaxBenChrist
Collaborator

I have now implemented and tested a version where I am using numpy arrays internally and do not return Series, but tuples. The performance boost is actually not as high as I expected (10%), but still.
Also, I like the logic more now, as there is no need to distinguish between apply and aggregate, and using numba should be easier.
I will fix up the branch and make a PR.

A 10% decrease in runtime? Amazing. I am really curious to see the PR. It probably touches many parts of tsfresh?

@nils-braun
Collaborator

All right, parts of it are now in tsfresh (head version). There is still more to do (we could still gain from numba etc.), but this requires some more work. I will leave this issue open for later reference.

@MaxBenChrist
Collaborator

@nils-braun then we should probably adapt the issue title.

maybe "speed up calculation by numba"?

@nils-braun
Collaborator

Go for it, I am currently on my smartphone

@MaxBenChrist changed the title from "Performance comparison with pandas built in functions" to "speed up calculation by numba" on Jul 6, 2017
@MaxBenChrist
Collaborator

MaxBenChrist commented Jul 11, 2017

I think a great place to test that is the sample_entropy feature calculator.
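For reference, a rough numba port of the standard sample entropy definition could look like the sketch below (common defaults m = 2 and r = 0.2 * std assumed; this is not tsfresh's actual implementation):

import numpy as np
import numba as nb

@nb.njit
def sample_entropy_numba(x, m=2, r_factor=0.2):
    # Count template matches of length m (b) and m + 1 (a) within
    # tolerance r under the Chebyshev distance, and return -ln(a / b).
    n = x.shape[0]
    r = r_factor * np.std(x)
    a = 0
    b = 0
    for i in range(n - m):
        for j in range(i + 1, n - m):
            d = 0.0
            for k in range(m):
                d = max(d, abs(x[i + k] - x[j + k]))
            if d <= r:
                b += 1
                if abs(x[i + m] - x[j + m]) <= r:
                    a += 1
    if a == 0 or b == 0:
        return np.nan
    return -np.log(a / b)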

@thibaultbl
Contributor

I would like to take a look at this issue.

Do you want a big PR with all feature calculators modified to use numba wherever possible, or can we work with incremental changes (converting the feature calculators one by one)?

I think some feature calculators can be improved really quickly, but we need to benchmark this.

@nils-braun
Collaborator

Thank you very much! That would be great.

Incremental PRs are fine, if they make sense: if there is some initialization time involved and it only pays off once multiple calculators are converted, a larger PR might be better.
If that is really the case, we can also keep a "numba" branch around and do incremental changes against that one.

Would be really interesting to see if this pays off!

@thibaultbl
Contributor

I tried with sample_entropy and got a 20% improvement without multiprocessing, but once multiprocessing is at work, performance gets worse (really worse). I do not think I will be able to improve performance with numba at this point.
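One plausible cause (an assumption, not verified in this thread) is that every worker process JIT-compiles the function again, which can dominate the runtime. numba's cache=True stores the compiled code on disk so that new processes can reuse it, for example:

import numba as nb

# cache=True writes the compiled machine code to disk; subsequent processes
# load it instead of paying the JIT compilation cost again.
@nb.njit(cache=True)
def abs_energy_numba(x):
    total = 0.0
    for v in x:
        total += v * v
    return total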

@slamer59

Do you have a branch where we can see what you have done? To see if there is a bottleneck somewhere? I am interested.

@thibaultbl
Contributor

https://github.com/thibaultbl/tsfresh/tree/numba

I created an "optimized_sample_entropy" function to benchmark against the sample_entropy version.

@arturdaraujo

arturdaraujo commented Feb 28, 2023

I think it is time to at least give the option of using numba for some functions. The changes are minimal.

On average, the numba version of a function is 5x to 10x faster in most cases. I strongly believe this would be a very good way to speed up tsfresh, at least as an option. numba recognizes many numpy functions, which would make the implementation easy for many tsfresh functions.

Here is a simple adaptation and the corresponding benchmark:

import numpy as np
import numba as nb

def maximum(x):
    """
    Calculates the highest value of the time series x.

    :param x: the time series to calculate the feature of
    :type x: numpy.ndarray
    :return: the value of this feature
    :return type: float
    """
    return np.max(x)


@nb.jit(nopython=True)
def maximum_numba(x):
    """
    Calculates the highest value of the time series x.

    :param x: the time series to calculate the feature of
    :type x: numpy.ndarray
    :return: the value of this feature
    :return type: float
    """
    return np.max(x)

Checks and benchmarks:

sample_array = np.array([n**3 for n in range(1, 100)])

print(maximum_numba(sample_array))
# 970299
print(maximum(sample_array))
# 970299

%timeit maximum(sample_array)
# 2.77 µs ± 104 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit maximum_numba(sample_array)
# 255 ns ± 7.58 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@arturdaraujo

I can help convert the functions if more help is needed @nils-braun

@arturdaraujo

Another example, using the function cid_ce. This one is ~40x faster.

import numpy as np
import numba as nb
import pandas as pd

def cid_ce(x, normalize):
    """
    This function calculator is an estimate for a time series complexity [1] (A more complex time series has more peaks,
    valleys etc.). It calculates the value of

    .. math::

        \\sqrt{ \\sum_{i=1}^{n-1} ( x_{i} - x_{i-1})^2 }

    .. rubric:: References

    |  [1] Batista, Gustavo EAPA, et al (2014).
    |  CID: an efficient complexity-invariant distance for time series.
    |  Data Mining and Knowledge Discovery 28.3 (2014): 634-669.

    :param x: the time series to calculate the feature of
    :type x: numpy.ndarray
    :param normalize: should the time series be z-transformed?
    :type normalize: bool

    :return: the value of this feature
    :return type: float
    """
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    if normalize:
        s = np.std(x)
        if s != 0:
            x = (x - np.mean(x)) / s
        else:
            return 0.0

    x = np.diff(x)
    return np.sqrt(np.dot(x, x))


@nb.jit(nopython=True)
def cid_ce_numba(x, normalize):
    """
    This function calculator is an estimate for a time series complexity [1] (A more complex time series has more peaks,
    valleys etc.). It calculates the value of

    .. math::

        \\sqrt{ \\sum_{i=1}^{n-1} ( x_{i} - x_{i-1})^2 }

    .. rubric:: References

    |  [1] Batista, Gustavo EAPA, et al (2014).
    |  CID: an efficient complexity-invariant distance for time series.
    |  Data Mining and Knowledge Discovery 28.3 (2014): 634-669.

    :param x: the time series to calculate the feature of
    :type x: numpy.ndarray
    :param normalize: should the time series be z-transformed?
    :type normalize: bool

    :return: the value of this feature
    :return type: float
    """
    if normalize:
        s = np.std(x)
        if s != 0:
            x = (x - np.mean(x)) / s
        else:
            return 0.0

    x = np.diff(x)
    return np.sqrt(np.dot(x, x))

Checks and benchmarks:

normalize = True
sample_array = np.array([n**3 for n in range(1, 100)])
sample_array = sample_array.astype(np.float64)

print(cid_ce(sample_array, normalize))
# 0.46831974531962184
print(cid_ce_numba(sample_array, normalize))
# 0.46831974531962184

%timeit cid_ce(sample_array, normalize)
# 37.8 µs ± 5.72 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit cid_ce_numba(sample_array, normalize)
# 1.04 µs ± 39.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@nils-braun
Collaborator

@arturdaraujo - thanks for looking into this! I would be very interested in how the numba implementation performs on the calculators with high computation cost (as marked by this flag). Speeding up the simpler, minimal calculators such as min or max is definitely nice, but I think most of the time is spent in just a few calculators.
Could you run your tests using the full set (or at least the efficient parameters)?

How does the numba solution play with multiprocessing?

Thank you very much!

@arturdaraujo

arturdaraujo commented Feb 28, 2023

How does the numba solution play with multiprocessing?

Numba natively supports parallel execution; it is available from the get-go.

Speeding up the simpler, minimal calculators such as min or max is definitely nice, but I think most of the time is spent in just a few calculators.
Could you run your tests using the full set (or at least the efficient parameters)?

I will look into more computationally expensive functions, no worries.

@nils-braun
Collaborator

Numba natively supports parallel execution; it is available from the get-go.

Correct! But as also stated in the links you shared, it does not play well with the "normal" multiprocessing we are using for parallelization among time series. I am not an expert in numba, but it might be interesting to understand:
(a) whether we can no longer use multiprocessing at all for parallelization among time series, and what the impact on runtime is
(b) where the tipping point lies between numba's parallelism within the data of a single time series and parallelization over multiple time series (using the multiprocessing we currently do)
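For point (b), a sketch of what parallelism inside numba could look like (a hypothetical example; batching all series into one matrix is not how tsfresh currently feeds its calculators):

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def abs_energy_batch(matrix):
    # One row per time series; numba distributes the outer loop over
    # threads, replacing the process-level parallelization over series.
    out = np.empty(matrix.shape[0])
    for i in nb.prange(matrix.shape[0]):
        s = 0.0
        for j in range(matrix.shape[1]):
            s += matrix[i, j] * matrix[i, j]
        out[i] = s
    return out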

@arturdaraujo

arturdaraujo commented Feb 28, 2023

I believe a careful approach is necessary if you are considering dropping your current multiprocessing approach altogether. But I believe you could introduce numba while still maintaining the current multiprocessing.

My take on this would be:

  1. Create a module with the tsfresh functions modified to use numba. The original module stays intact because we know it works.
  2. Add an argument to extract_features called activate_numba (or something like this) that uses the numba functions instead of the ones from the main tsfresh (see the sketch below).

This approach would make it optional and also experimental in a sense. There might be scenarios where running the "normal" functions is even faster than numba. This would let the community check which datasets work better with numba and which with the normal tsfresh functions.
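A minimal sketch of the proposed switch (activate_numba and the registry are the suggestion from above, not an existing tsfresh API):

import numpy as np
import numba as nb

@nb.njit(cache=True)
def maximum_numba(x):
    return np.max(x)

# Hypothetical registry mapping calculator names to their numba versions.
NUMBA_CALCULATORS = {"maximum": maximum_numba}

def get_calculator(name, activate_numba=False):
    # Fall back to the original implementation when the switch is off or
    # no numba version exists, so the old behaviour stays untouched.
    from tsfresh.feature_extraction import feature_calculators
    if activate_numba and name in NUMBA_CALCULATORS:
        return NUMBA_CALCULATORS[name]
    return getattr(feature_calculators, name)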

So here is a reason NOT to drop tsfresh's current multiprocessing:

For instance, while researching this I noticed that the numba functions I created from tsfresh (maximum and cid_ce) perform the same as tsfresh's current functions when you use an array of 10,000 values (on my Linux machine, at least). So I imagine that with 10 time series of 20,000 rows it would be better to use tsfresh's multiprocessing. With 100 time series of 1,000 rows, however, it would still be much better to run with numba. I also don't know how numba performs on different systems like Linux, Windows, and macOS, so that's one more reason to test it thoroughly.

I hope I didn't overstep here. Your package is really awesome, and I am only trying to picture an easier way to implement this while keeping compatibility with old code and exploring new ways of processing.
