Speedup arithmetic kernels (up to -25%) / not (-30%) #7457

Dandandan · 2025-04-29T18:09:35Z

Which issue does this PR close?

Closes #.

Rationale for this change

add(0)                  time:   [8.2259 µs 8.2342 µs 8.2441 µs]
                        change: [-0.5010% -0.2757% -0.0590%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

add_checked(0)          time:   [7.0879 µs 7.3277 µs 7.6011 µs]
                        change: [-6.7853% -4.9254% -3.0503%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 25 outliers among 100 measurements (25.00%)
  20 (20.00%) low severe
  2 (2.00%) high mild
  3 (3.00%) high severe

add_scalar(0)           time:   [3.6041 µs 3.6297 µs 3.6584 µs]
                        change: [-38.775% -37.843% -37.013%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) high mild
  14 (14.00%) high severe

subtract(0)             time:   [6.4811 µs 6.5582 µs 6.6578 µs]
                        change: [-14.791% -12.619% -10.742%] (p = 0.00 < 0.05)
                        Performance has improved.

subtract_checked(0)     time:   [7.9566 µs 7.9676 µs 7.9805 µs]
                        change: [-4.4506% -4.1728% -3.9161%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

subtract_scalar(0)      time:   [3.8308 µs 3.9523 µs 4.1161 µs]
                        change: [-23.490% -19.635% -15.749%] (p = 0.00 < 0.05)
                        Performance has improved.

multiply(0)             time:   [6.8732 µs 6.8898 µs 6.9040 µs]
                        change: [-16.540% -16.354% -16.162%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

multiply_checked(0)     time:   [6.7054 µs 6.8740 µs 7.0836 µs]
                        change: [-10.102% -7.8101% -5.9373%] (p = 0.00 < 0.05)
                        Performance has improved.

multiply_scalar(0)      time:   [3.6219 µs 3.6418 µs 3.6607 µs]
                        change: [-37.007% -36.422% -35.888%] (p = 0.00 < 0.05)
                        Performance has improved.


not                     time:   [103.75 ns 103.80 ns 103.86 ns]
                        change: [-28.869% -28.724% -28.584%] (p = 0.00 < 0.05)
                        Performance has improved.

not_sliced              time:   [226.43 ns 226.48 ns 226.52 ns]
                        change: [-13.416% -13.278% -13.154%] (p = 0.00 < 0.05)
                        Performance has improved.

What changes are included in this PR?

Are there any user-facing changes?

Dandandan · 2025-04-30T19:07:36Z

Could you also ru the boolean kernel bench?

alamb

This looks very cool @Dandandan -- I had a question about the performance with subtraction with no null and one potential API change, but otherwise looks great

arrow-array/src/array/primitive_array.rs

alamb · 2025-04-30T19:05:26Z

arrow-arith/src/arity.rs

-pub fn try_binary<A: ArrayAccessor, B: ArrayAccessor, F, O>(
-    a: A,
-    b: B,
+pub fn try_binary<A, B, F, O>(


is this an API change?

Good question 🤔, I believe it isn't?
I think it was bounded by primitive arrays (via ArrowPrimitiveType already?
I.e. you can not use it for a binary or boolean arrays.

I think this is a breaking change, you could use this with any array provided the output was primitive

Hm yeah, you are right!
Are we willing to make this change given that no test seems to break and other methods seem to be on PrimitiveArray instead (unary, unary_mut, try_unary, try_unary_mut, binary, etc. are on Primitive array)?

I think it would be a non-trivial functionality regression, even if in arrow-rs we aren't exploiting it - I could see this being useful for processing primitive dictionaries, for example.

FWIW: I checked for DataFusion, here it is also only used (1 usage) for a primitive array.

Can we go back to the previous definition? I suspect you had a good reason for doing it this way, but I don't understand what it is

If we need to retain backward compatibility and want to release this before July, I recommend adding a new method called try_binary_primitive and add a link in the documentation for try_binary saying it is faster for primitive arrays to use that new function

We can do that, but it seems not possible to have a safe api without regressing (because we need TrustedLen )

Dandandan · 2025-04-30T19:10:32Z

It looks like performance difference is bigger for the arithmetic changes on my computer 🤔 I think it might be something related to the osx allocator.

Other than that I think this is still improving the kernels on the use of safety / showing the from_trusted_len_iter is not needed as much.

alamb · 2025-04-30T19:16:44Z

Other than that I think this is still improving the kernels on the use of safety / showing the from_trusted_len_iter is not needed as much.

I agree this PR is an improvement in general. I'll rerun the benchmarks to see if they are repeatable

alamb · 2025-04-30T19:44:04Z

🤔 second benchmark runs look very good performance wise. I'll run one more

alamb · 2025-04-30T19:44:24Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing speedup_arith (3f5f977) to 07093a4 diff
BENCH_NAME=arithmetic_kernels
BENCH_COMMAND=cargo bench --all-features --bench arithmetic_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=speedup_arith
Results will be posted here when complete

alamb · 2025-04-30T20:05:42Z

🤖: Benchmark completed

Details

group                    main                                   speedup_arith
-----                    ----                                   -------------
add(0)                   1.07     14.9±0.15µs        ? ?/sec    1.00     14.0±0.20µs        ? ?/sec
add(0.1)                 1.00     12.6±0.17µs        ? ?/sec    1.01     12.7±0.19µs        ? ?/sec
add(0.5)                 1.00     12.6±0.13µs        ? ?/sec    1.02     12.9±0.22µs        ? ?/sec
add(0.9)                 1.00     12.5±0.09µs        ? ?/sec    1.02     12.7±0.27µs        ? ?/sec
add(1)                   1.00     12.5±0.09µs        ? ?/sec    1.02     12.8±0.21µs        ? ?/sec
add_checked(0)           1.00     11.3±0.09µs        ? ?/sec    1.01     11.4±0.15µs        ? ?/sec
add_checked(0.1)         1.00     12.5±0.08µs        ? ?/sec    1.02     12.7±0.19µs        ? ?/sec
add_checked(0.5)         1.00     12.6±0.10µs        ? ?/sec    1.00     12.7±0.25µs        ? ?/sec
add_checked(0.9)         1.00     12.4±0.09µs        ? ?/sec    1.01     12.6±0.22µs        ? ?/sec
add_checked(1)           1.00     12.4±0.07µs        ? ?/sec    1.03     12.8±0.21µs        ? ?/sec
add_scalar(0)            1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.08µs        ? ?/sec
add_scalar(0.1)          1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.05µs        ? ?/sec
add_scalar(0.5)          1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.03µs        ? ?/sec
add_scalar(0.9)          1.01      7.1±0.03µs        ? ?/sec    1.00      7.0±0.05µs        ? ?/sec
add_scalar(1)            1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.03µs        ? ?/sec
divide(0)                1.00     13.4±0.06µs        ? ?/sec    1.01     13.5±0.13µs        ? ?/sec
divide(0.1)              1.00     14.8±0.08µs        ? ?/sec    1.01     14.9±0.12µs        ? ?/sec
divide(0.5)              1.00     14.9±0.07µs        ? ?/sec    1.00     14.9±0.08µs        ? ?/sec
divide(0.9)              1.00     14.8±0.05µs        ? ?/sec    1.01     14.9±0.07µs        ? ?/sec
divide(1)                1.00     14.7±0.10µs        ? ?/sec    1.01     14.8±0.08µs        ? ?/sec
divide_scalar(0)         1.00     13.4±0.02µs        ? ?/sec    1.00     13.3±0.02µs        ? ?/sec
divide_scalar(0.1)       1.00     13.4±0.02µs        ? ?/sec    1.00     13.3±0.02µs        ? ?/sec
divide_scalar(0.5)       1.00     13.4±0.02µs        ? ?/sec    1.00     13.3±0.04µs        ? ?/sec
divide_scalar(0.9)       1.01     13.4±0.04µs        ? ?/sec    1.00     13.3±0.02µs        ? ?/sec
divide_scalar(1)         1.00     13.4±0.02µs        ? ?/sec    1.00     13.3±0.02µs        ? ?/sec
modulo(0)                1.00    333.1±0.60µs        ? ?/sec    1.03    341.6±0.72µs        ? ?/sec
modulo(0.1)              1.00    382.4±0.47µs        ? ?/sec    1.02    391.6±0.87µs        ? ?/sec
modulo(0.5)              1.00    525.4±0.81µs        ? ?/sec    1.05    549.2±1.52µs        ? ?/sec
modulo(0.9)              1.00    294.3±1.94µs        ? ?/sec    1.02    298.7±0.54µs        ? ?/sec
modulo(1)                1.00    237.9±0.36µs        ? ?/sec    1.02    242.5±0.48µs        ? ?/sec
modulo_scalar(0)         1.00    493.0±1.59µs        ? ?/sec    1.02    501.2±0.97µs        ? ?/sec
modulo_scalar(0.1)       1.00    466.7±2.01µs        ? ?/sec    1.03    478.4±1.55µs        ? ?/sec
modulo_scalar(0.5)       1.00    313.0±0.48µs        ? ?/sec    1.03    322.8±0.57µs        ? ?/sec
modulo_scalar(0.9)       1.04    165.5±0.23µs        ? ?/sec    1.00    159.7±0.42µs        ? ?/sec
modulo_scalar(1)         1.04    122.6±0.33µs        ? ?/sec    1.00    118.0±0.78µs        ? ?/sec
multiply(0)              1.00     11.2±0.08µs        ? ?/sec    1.00     11.3±0.09µs        ? ?/sec
multiply(0.1)            1.00     12.5±0.25µs        ? ?/sec    1.02     12.8±0.18µs        ? ?/sec
multiply(0.5)            1.00     12.7±0.14µs        ? ?/sec    1.01     12.9±0.22µs        ? ?/sec
multiply(0.9)            1.00     12.6±0.13µs        ? ?/sec    1.00     12.6±0.15µs        ? ?/sec
multiply(1)              1.00     12.5±0.09µs        ? ?/sec    1.01     12.6±0.15µs        ? ?/sec
multiply_checked(0)      1.00     11.3±0.09µs        ? ?/sec    1.01     11.4±0.14µs        ? ?/sec
multiply_checked(0.1)    1.00     12.6±0.10µs        ? ?/sec    1.03     12.9±0.17µs        ? ?/sec
multiply_checked(0.5)    1.00     12.6±0.08µs        ? ?/sec    1.02     12.8±0.17µs        ? ?/sec
multiply_checked(0.9)    1.00     12.6±0.12µs        ? ?/sec    1.01     12.7±0.18µs        ? ?/sec
multiply_checked(1)      1.00     12.5±0.05µs        ? ?/sec    1.02     12.7±0.16µs        ? ?/sec
multiply_scalar(0)       1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.02µs        ? ?/sec
multiply_scalar(0.1)     1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.05µs        ? ?/sec
multiply_scalar(0.5)     1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.09µs        ? ?/sec
multiply_scalar(0.9)     1.01      7.1±0.01µs        ? ?/sec    1.00      7.0±0.04µs        ? ?/sec
multiply_scalar(1)       1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.02µs        ? ?/sec
subtract(0)              1.00     11.3±0.07µs        ? ?/sec    1.01     11.4±0.13µs        ? ?/sec
subtract(0.1)            1.00     12.5±0.16µs        ? ?/sec    1.01     12.6±0.25µs        ? ?/sec
subtract(0.5)            1.00     12.5±0.21µs        ? ?/sec    1.00     12.5±0.21µs        ? ?/sec
subtract(0.9)            1.00     12.6±0.15µs        ? ?/sec    1.00     12.6±0.20µs        ? ?/sec
subtract(1)              1.00     12.5±0.16µs        ? ?/sec    1.01     12.7±0.21µs        ? ?/sec
subtract_checked(0)      1.00     11.3±0.07µs        ? ?/sec    1.02     11.4±0.16µs        ? ?/sec
subtract_checked(0.1)    1.00     12.5±0.20µs        ? ?/sec    1.02     12.7±0.27µs        ? ?/sec
subtract_checked(0.5)    1.00     12.5±0.22µs        ? ?/sec    1.02     12.7±0.31µs        ? ?/sec
subtract_checked(0.9)    1.01     12.6±0.14µs        ? ?/sec    1.00     12.5±0.15µs        ? ?/sec
subtract_checked(1)      1.00     12.4±0.20µs        ? ?/sec    1.00     12.4±0.19µs        ? ?/sec
subtract_scalar(0)       1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.03µs        ? ?/sec
subtract_scalar(0.1)     1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.01µs        ? ?/sec
subtract_scalar(0.5)     1.01      7.1±0.12µs        ? ?/sec    1.00      7.0±0.02µs        ? ?/sec
subtract_scalar(0.9)     1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.02µs        ? ?/sec
subtract_scalar(1)       1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.01µs        ? ?/sec

alamb · 2025-04-30T20:05:45Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing speedup_arith (3f5f977) to 07093a4 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --all-features --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=speedup_arith
Results will be posted here when complete

alamb · 2025-04-30T20:08:45Z

🤖: Benchmark completed

Details

group         main                                   speedup_arith
-----         ----                                   -------------
and           1.00    291.5±0.51ns        ? ?/sec    1.01    293.4±0.90ns        ? ?/sec
and_sliced    1.01   1139.3±1.03ns        ? ?/sec    1.00   1132.0±1.82ns        ? ?/sec
not           1.14    263.6±2.41ns        ? ?/sec    1.00    231.6±0.38ns        ? ?/sec
not_sliced    1.42   1014.9±3.10ns        ? ?/sec    1.00    716.5±1.44ns        ? ?/sec
or            1.14    304.2±1.41ns        ? ?/sec    1.00    267.7±0.58ns        ? ?/sec
or_sliced     1.00   1134.4±1.59ns        ? ?/sec    1.00   1134.4±1.34ns        ? ?/sec

Dandandan · 2025-05-01T09:02:51Z

🤔 second benchmark runs look very good performance wise. I'll run one more

This seems like similar to my results (although slightly less) which often shows better performance and sometimes 0% and sometimes 10-25% change for similar set of benchmarks.
I think it's something allocator related, but not sure what the exact cause is of this being sometimes faster.

alamb · 2025-05-01T15:16:44Z

sorry -- clicked the wrong button

Dandandan · 2025-05-01T17:20:26Z

What do we want to do with the API breakage?
I think it would be nice to avoid the unsafe / move to Iterator::collect (and get some perf. improvement), but using ArrayAccessor or even ArrayIter I think still needs the use of the unsafe API's to avoid big regressions (in large part because of https://doc.rust-lang.org/std/iter/trait.TrustedLen.html is not stable).
We can also look at what makes the difference in performance from time to time and change from_trusted_len_iter

At least I can break out the not improvement to another PR.

Dandandan · 2025-05-05T17:02:14Z

Ill open another PR for improving not

alamb · 2025-05-05T18:14:37Z

What do we want to do with the API breakage? I think it would be nice to avoid the unsafe / move to Iterator::collect (and get some perf. improvement), but using ArrayAccessor or even ArrayIter I think still needs the use of the unsafe API's to avoid big regressions (in large part because of https://doc.rust-lang.org/std/iter/trait.TrustedLen.html is not stable). We can also look at what makes the difference in performance from time to time and change from_trusted_len_iter

Why do we need trusted_len if we know both inputs are Arrays?

Dandandan · 2025-05-05T18:30:48Z

What do we want to do with the API breakage? I think it would be nice to avoid the unsafe / move to Iterator::collect (and get some perf. improvement), but using ArrayAccessor or even ArrayIter I think still needs the use of the unsafe API's to avoid big regressions (in large part because of https://doc.rust-lang.org/std/iter/trait.TrustedLen.html is not stable). We can also look at what makes the difference in performance from time to time and change from_trusted_len_iter

Why do we need trusted_len if we know both inputs are Arrays?

TrustedLen (https://doc.rust-lang.org/std/iter/trait.TrustedLen.html) is required for collecting into Vec to be performant.
ArrayAccessor or Array can not give that guarantee, as we cannot implement TrustedLen in non-nightly rust for our own types.

Currently the same code relies on Buffer::from_trusted_len_iter for the same reason, only in this case the caller is responsible for ensuring the iter is of "trusted len".

Speedup arith

ca81704

github-actions bot added the arrow Changes to the arrow crate label Apr 29, 2025

Dandandan added 7 commits April 29, 2025 20:17

Speedup arith scalar

c31011c

Speedup arith

9afc940

Merge remote-tracking branch 'upstream/main' into speedup_arith

609b625

Speedup arith

675239e

Speedup arith

5cc033e

Speedup arith

f556fa6

Speedup bitwise_unary_op_helper

3f5f977

Dandandan changed the title ~~Speedup arithmetic kernels~~ Speedup arithmetic kernels / not Apr 30, 2025

Dandandan changed the title ~~Speedup arithmetic kernels / not~~ Speedup arithmetic kernels (up to -25%) / not (-30%) Apr 30, 2025

This comment was marked as outdated.

Sign in to view

alamb reviewed Apr 30, 2025

View reviewed changes

This comment was marked as outdated.

Sign in to view

tustvold added the api-change Changes to the arrow API label May 1, 2025

Add back safety note

7be4bc0

Add back safety note

55cb9bb

This comment was marked as outdated.

Sign in to view

alamb closed this May 1, 2025

alamb reopened this May 1, 2025

Dandandan closed this May 5, 2025

Speedup arithmetic kernels (up to -25%) / not (-30%) #7457

Speedup arithmetic kernels (up to -25%) / not (-30%) #7457

Uh oh!

Conversation

Dandandan commented Apr 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

This comment was marked as outdated.

This comment was marked as outdated.

Dandandan commented Apr 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alamb Apr 30, 2025

Choose a reason for hiding this comment

Uh oh!

Dandandan Apr 30, 2025

Choose a reason for hiding this comment

Uh oh!

tustvold May 1, 2025

Choose a reason for hiding this comment

Uh oh!

Dandandan May 1, 2025

Choose a reason for hiding this comment

Uh oh!

tustvold May 1, 2025

Choose a reason for hiding this comment

Uh oh!

Dandandan May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb May 5, 2025

Choose a reason for hiding this comment

Uh oh!

Dandandan May 5, 2025

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

Dandandan commented Apr 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as outdated.

alamb commented Apr 30, 2025

Uh oh!

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

alamb commented Apr 30, 2025

Uh oh!

alamb commented Apr 30, 2025

Uh oh!

alamb commented Apr 30, 2025

Uh oh!

alamb commented Apr 30, 2025

Uh oh!

alamb commented Apr 30, 2025

Uh oh!

Dandandan commented May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as outdated.

alamb commented May 1, 2025

Uh oh!

Dandandan commented May 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Dandandan commented May 5, 2025

Uh oh!

alamb commented May 5, 2025

Uh oh!

Dandandan commented May 5, 2025

Dandandan commented Apr 29, 2025 •

edited

Loading

Dandandan commented Apr 30, 2025 •

edited

Loading

Dandandan May 1, 2025 •

edited

Loading

Dandandan commented Apr 30, 2025 •

edited

Loading

Dandandan commented May 1, 2025 •

edited

Loading

Dandandan commented May 1, 2025 •

edited

Loading