#### Equation of state
Lessons I learned by assembling these benchmarks: (your mileage may vary)

- The performance of JAX is very competitive, both on GPU and CPU. It is consistently among the top implementations on both platforms.
- PyTorch performs very well on GPU for large problems (slightly better than JAX), but its CPU performance is not great for tasks with many slicing operations.
- Numba is a great choice on CPU if you don't mind writing explicit for loops (which can be more readable than a vectorized implementation), being slightly faster than JAX with little effort (see the Numba sketch below).
- JAX performance on GPU seems to be quite hardware dependent. JAX performs significantly better (relatively speaking) on a Tesla P100 than on a Tesla K80.
- If you have embarrassingly parallel workloads, speedups of > 1000x are easy to achieve on high-end GPUs.
- TPUs are catching up to GPUs. We can now get similar performance to a high-end GPU on these workloads.
- TensorFlow is not great for applications like ours, since it lacks tools to apply partial updates to tensors (such as `tensor[2:-2] = 0.`); a common workaround is sketched below.
- If you use TensorFlow on CPU, make sure to use XLA (`experimental_compile`) for tremendous speedups (also shown in the TensorFlow sketch below).
- CuPy is nice! Often you don't need to change anything in your NumPy code to have it run on GPU (with decent, but not outstanding performance); see the CuPy sketch below.
- Reaching Fortran performance on CPU for non-trivial tasks is hard :)
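
To illustrate the Numba point, here is a minimal sketch (not taken from the benchmark suite; the function names and the stencil are made up for illustration) of an explicit loop under `numba.njit` next to its vectorized NumPy equivalent:

```python
# Hypothetical example, not part of the benchmarks: a 1D stencil written
# once as an explicit Numba loop and once as vectorized NumPy.
import numpy as np
from numba import njit


@njit
def stencil_numba(field, alpha):
    # Explicit loop over interior points; Numba compiles it to machine
    # code, so the loop itself carries no Python overhead.
    out = field.copy()
    for i in range(1, field.shape[0] - 1):
        out[i] = field[i] + alpha * (field[i - 1] - 2.0 * field[i] + field[i + 1])
    return out


def stencil_numpy(field, alpha):
    # Vectorized equivalent; shorter, but the slices allocate temporaries.
    out = field.copy()
    out[1:-1] = field[1:-1] + alpha * (field[:-2] - 2.0 * field[1:-1] + field[2:])
    return out


field = np.random.rand(1_000_000)
np.testing.assert_allclose(stencil_numba(field, 0.1), stencil_numpy(field, 0.1))
```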
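
The two TensorFlow points in code form (again a hypothetical sketch rather than benchmark code): TensorFlow tensors are immutable, so a NumPy-style slice assignment has to be emulated, and XLA compilation is opt-in:

```python
# Hypothetical example: emulating `tensor[2:-2] = 0.` in TensorFlow and
# opting in to XLA compilation.
import numpy as np
import tensorflow as tf

x_np = np.arange(10, dtype="float64")
x_np[2:-2] = 0.0  # in-place slice assignment: trivial in NumPy


# TensorFlow tensors are immutable, so one workaround is to rebuild the
# tensor from its unchanged edges plus the new interior.
@tf.function(experimental_compile=True)  # XLA; newer TF spells this `jit_compile=True`
def zero_interior(x):
    return tf.concat([x[:2], tf.zeros_like(x[2:-2]), x[-2:]], axis=0)


x_tf = zero_interior(tf.range(10, dtype=tf.float64))
np.testing.assert_allclose(x_tf.numpy(), x_np)
```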
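
Finally, a sketch of the CuPy point (hypothetical, not from the benchmarks): the same function body serves NumPy arrays on the CPU and CuPy arrays on the GPU:

```python
# Hypothetical example: one function, dispatched to NumPy (CPU) or CuPy (GPU)
# depending on the type of its input array.
import numpy as np
import cupy as cp


def rms(x):
    # get_array_module returns the numpy or cupy namespace matching `x`.
    xp = cp.get_array_module(x)
    return xp.sqrt(xp.mean(x ** 2))


print(rms(np.random.rand(1000)))  # runs on the CPU via NumPy
print(rms(cp.random.rand(1000)))  # runs on the GPU via CuPy
```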