
[4.1 Introduction]: why add_python is faster than add_numpy for vectorization add #74

Open
bingyao opened this issue Aug 3, 2018 · 14 comments


@bingyao

bingyao commented Aug 3, 2018

I reached the opposite conclusion when running the example code in 4.1 Introduction. The following are my results, tested in IPython 6.4.0 with Python 3.6.5 and NumPy 1.14.3:

In [1]: import numpy as np

In [2]: import random

In [3]: def add_python(Z1,Z2):
   ...:     return [z1+z2 for (z1,z2) in zip(Z1,Z2)]
   ...: 
   ...: def add_numpy(Z1,Z2):
   ...:     return np.add(Z1,Z2)
   ...: 

In [4]: Z1 = random.sample(range(1000), 100)

In [5]: Z2 = random.sample(range(1000), 100)

# For Python lists `Z1`, `Z2`, `add_python` is faster
In [6]: %timeit add_python(Z1, Z2)
8.25 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: %timeit add_numpy(Z1, Z2)
16.9 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [8]: a = np.random.randint(0, 1000, size=100)

In [9]: b = np.random.randint(0, 1000, size=100)

# For NumPy arrays `a`, `b`, `add_numpy` is faster
In [10]: %timeit add_python(a, b)
22.6 µs ± 816 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [11]: %timeit add_numpy(a, b)
851 ns ± 6.37 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
@rougier
Owner

rougier commented Aug 7, 2018

Interesting. I re-tested it using Python 3.7 and I got:

In [8]: %timeit add_python(Z1,Z2)
8.88 µs ± 423 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [9]: %timeit add_numpy(Z1,Z2)
14.4 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@dwt

dwt commented Dec 13, 2018

Same thing for me, using standard Python lists (Python 3.7, macOS Mojave):

%timeit add_python(Z1, Z2)
6 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit add_numpy(Z1, Z2)
11.1 µs ± 46.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Using numpy arrays instead, the timings change in an interesting way:

%timeit add_python(Z3, Z4)
28.5 µs ± 996 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit add_numpy(Z3, Z4)
540 ns ± 21.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit np.add(Z3, Z4)
488 ns ± 8.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Interestingly, the Python call overhead really starts to show in micro-benchmarks like these.

So to summarize:

  • numpy is about twice as slow for me with native Python lists
  • numpy is as fast as expected with numpy arrays, and Python is about twice as slow with numpy arrays as with native lists

I'd say that is about as expected, so maybe that is what the example should compare, instead of sending native Python lists down both compute paths?
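A minimal sketch of that fairer comparison, assuming the add_python/add_numpy definitions from the top of the thread (each path gets its native container):

import random
import numpy as np

Z1 = random.sample(range(1000), 100)    # native lists for the Python path
Z2 = random.sample(range(1000), 100)
Z3, Z4 = np.array(Z1), np.array(Z2)     # numpy arrays for the numpy path

%timeit add_python(Z1, Z2)   # list comprehension over lists
%timeit add_numpy(Z3, Z4)    # vectorized add over arrays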

@dwt

dwt commented Dec 13, 2018

I'd say the examples are just way too small to make the differences really visible. Scaling the input up a bit, I get this:

length = 100000

import random
Z1, Z2 = random.sample(range(length), length), random.sample(range(length), length)

%timeit add_python(Z1, Z2)
%timeit [z1+z2 for (z1,z2) in zip(Z1,Z2)]
19.1 ms ± 514 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
15.6 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit add_numpy(Z1, Z2)
%timeit np.add(Z1, Z2)
11 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.9 ms ± 63.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Z3, Z4 = np.random.sample(length) * 100, np.random.sample(length) * 100

%timeit add_python(Z3, Z4)
%timeit [z3+z4 for (z3,z4) in zip(Z3,Z4)]
16.8 ms ± 93.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16.7 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit add_numpy(Z3, Z4)
%timeit np.add(Z3, Z4)
43.1 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
42.7 µs ± 278 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
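To make the crossover visible, here is a rough sketch (sizes are illustrative) that scans for the input length at which np.add on plain lists overtakes the list comprehension, i.e. where the vectorized computation starts to outweigh the list-to-array conversion overhead:

import random
import timeit
import numpy as np

for n in (100, 1_000, 10_000, 100_000):
    Z1 = random.sample(range(n), n)
    Z2 = random.sample(range(n), n)
    t_py = timeit.timeit(lambda: [z1 + z2 for z1, z2 in zip(Z1, Z2)], number=100) / 100
    t_np = timeit.timeit(lambda: np.add(Z1, Z2), number=100) / 100
    print(f"n={n:>7}: python {t_py * 1e6:9.1f} µs, numpy {t_np * 1e6:9.1f} µs")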

@rougier
Copy link
Owner

rougier commented Dec 17, 2018

Nice. Could you make a PR for the book?

@dwt

dwt commented Dec 17, 2018

Sure, but it will probably take me until Christmas.

@dwt

dwt commented Dec 17, 2018

(Also, my English is poor, so you will probably have to improve it. Sorry.)

@rougier
Owner

rougier commented Dec 17, 2018

Mine is the same; not sure I can correct it :)

@inamoto85

Hi @dwt, I'm getting similar results. Can you explain why this is about as expected (due to recent Python optimizations on arrays)?

@dwt

dwt commented Feb 15, 2019

My thinking is that you have to consider a numpy operation in three parts: switching from the Python to the C layer, doing the actual computation, and then switching back to Python.

Now, the actual computation is pretty much always faster in numpy than doing the same computation in Python. BUT if the context switches take more time than the faster computation saves, then the pure Python solution can still win.

This is why larger lists / arrays / vectors make the switch to C more worthwhile: the savings in the computation come to dominate the cost of crossing into the C layer.
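A small sketch of that amortization argument (sizes are illustrative): the per-element cost of np.add drops as the arrays grow, because the roughly constant cost of crossing the Python/C boundary is spread over more elements.

import timeit
import numpy as np

for n in (10, 100, 10_000, 1_000_000):
    a = np.arange(n, dtype=np.float64)
    b = np.arange(n, dtype=np.float64)
    t = timeit.timeit(lambda: np.add(a, b), number=1_000) / 1_000
    print(f"n={n:>9}: {t * 1e9:12.0f} ns per call, {t / n * 1e9:8.2f} ns per element")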

@inamoto85

Thank you for the explanation!

@dr-neptune
Contributor

I've been playing around with this more today, and with list inputs the Python version seems faster most of the time. My assumption is that addition is already heavily optimized in Python, leaving the time dominated by numpy's overhead.

vec_length = 1_000_000
Z1, Z2 = random.sample(range(vec_length), vec_length), random.sample(range(vec_length), vec_length)

# %timeit add_python(Z1, Z2)
# 253 ms ± 4.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# %timeit add_numpy(Z1, Z2)
# 501 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I got similar results at other sizes. It might be worth swapping this example out for something more involved, to make the point:

def add_python(Z1, Z2):
    return [((z1**2 + z2**2)**0.5) + ((z1 + z2)**3) for z1, z2 in zip(Z1, Z2)]

def add_numpy(Z1, Z2):
    return np.sqrt(Z1**2 + Z2**2) + (Z1 + Z2)**3

vec_length = 1_000_000
Z1, Z2 = random.sample(range(vec_length), vec_length), random.sample(range(vec_length), vec_length)
Z1_np, Z2_np = np.array(Z1, dtype=np.float64), np.array(Z2, dtype=np.float64)

%timeit add_python(Z1, Z2)
# 665 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit add_numpy(Z1_np, Z2_np)
# 54.2 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
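As a quick check of the "overhead dominates" assumption above, a sketch reusing Z1 and Z1_np/Z2_np from the snippet: on list inputs np.add must first build arrays from the lists, and that conversion alone should account for a large share of the measured time.

%timeit np.array(Z1)           # list -> array conversion alone
%timeit np.add(Z1_np, Z2_np)   # the simple add on pre-built arrays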

@rougier
Owner

rougier commented Jan 22, 2024

I tried again with the simple add version and 1,000,000 elements, and I get:

%timeit add_python(Z1, Z2)
54.6 ms ± 331 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit add_numpy(Z1_np, Z2_np)
645 µs ± 3.91 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@dr-neptune
Contributor

Interesting -- my example ran on Python 3.11, Windows 10, and NumPy 1.24.3. Your results are not only much more pronounced, but much faster overall.

@rougier
Owner

rougier commented Jan 22, 2024

macOS, MacBook M1, Python 3.11, NumPy 1.26.0
