Skip to content

Speedup arithmetic kernels (up to -25%) / not (-30%) #7457

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Apr 29, 2025

Which issue does this PR close?

Closes #.

Rationale for this change

add(0)                  time:   [8.2259 µs 8.2342 µs 8.2441 µs]
                        change: [-0.5010% -0.2757% -0.0590%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

add_checked(0)          time:   [7.0879 µs 7.3277 µs 7.6011 µs]
                        change: [-6.7853% -4.9254% -3.0503%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 25 outliers among 100 measurements (25.00%)
  20 (20.00%) low severe
  2 (2.00%) high mild
  3 (3.00%) high severe

add_scalar(0)           time:   [3.6041 µs 3.6297 µs 3.6584 µs]
                        change: [-38.775% -37.843% -37.013%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) high mild
  14 (14.00%) high severe

subtract(0)             time:   [6.4811 µs 6.5582 µs 6.6578 µs]
                        change: [-14.791% -12.619% -10.742%] (p = 0.00 < 0.05)
                        Performance has improved.

subtract_checked(0)     time:   [7.9566 µs 7.9676 µs 7.9805 µs]
                        change: [-4.4506% -4.1728% -3.9161%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

subtract_scalar(0)      time:   [3.8308 µs 3.9523 µs 4.1161 µs]
                        change: [-23.490% -19.635% -15.749%] (p = 0.00 < 0.05)
                        Performance has improved.

multiply(0)             time:   [6.8732 µs 6.8898 µs 6.9040 µs]
                        change: [-16.540% -16.354% -16.162%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

multiply_checked(0)     time:   [6.7054 µs 6.8740 µs 7.0836 µs]
                        change: [-10.102% -7.8101% -5.9373%] (p = 0.00 < 0.05)
                        Performance has improved.

multiply_scalar(0)      time:   [3.6219 µs 3.6418 µs 3.6607 µs]
                        change: [-37.007% -36.422% -35.888%] (p = 0.00 < 0.05)
                        Performance has improved.


not                     time:   [103.75 ns 103.80 ns 103.86 ns]
                        change: [-28.869% -28.724% -28.584%] (p = 0.00 < 0.05)
                        Performance has improved.

not_sliced              time:   [226.43 ns 226.48 ns 226.52 ns]
                        change: [-13.416% -13.278% -13.154%] (p = 0.00 < 0.05)
                        Performance has improved.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 29, 2025
@Dandandan Dandandan changed the title Speedup arithmetic kernels Speedup arithmetic kernels / not Apr 30, 2025
@Dandandan Dandandan changed the title Speedup arithmetic kernels / not Speedup arithmetic kernels (up to -25%) / not (-30%) Apr 30, 2025
@alamb

This comment was marked as outdated.

@alamb

This comment was marked as outdated.

@Dandandan
Copy link
Contributor Author

Dandandan commented Apr 30, 2025

Could you also ru the boolean kernel bench?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very cool @Dandandan -- I had a question about the performance with subtraction with no null and one potential API change, but otherwise looks great

pub fn try_binary<A: ArrayAccessor, B: ArrayAccessor, F, O>(
a: A,
b: B,
pub fn try_binary<A, B, F, O>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this an API change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question 🤔, I believe it isn't?
I think it was bounded by primitive arrays (via ArrowPrimitiveType already?
I.e. you can not use it for a binary or boolean arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a breaking change, you could use this with any array provided the output was primitive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm yeah, you are right!
Are we willing to make this change given that no test seems to break and other methods seem to be on PrimitiveArray instead (unary, unary_mut, try_unary, try_unary_mut, binary, etc. are on Primitive array)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a non-trivial functionality regression, even if in arrow-rs we aren't exploiting it - I could see this being useful for processing primitive dictionaries, for example.

Copy link
Contributor Author

@Dandandan Dandandan May 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: I checked for DataFusion, here it is also only used (1 usage) for a primitive array.

@alamb

This comment was marked as outdated.

@Dandandan
Copy link
Contributor Author

Dandandan commented Apr 30, 2025

It looks like performance difference is bigger for the arithmetic changes on my computer 🤔 I think it might be something related to the osx allocator.

Other than that I think this is still improving the kernels on the use of safety / showing the from_trusted_len_iter is not needed as much.

@alamb

This comment was marked as outdated.

@alamb
Copy link
Contributor

alamb commented Apr 30, 2025

Other than that I think this is still improving the kernels on the use of safety / showing the from_trusted_len_iter is not needed as much.

I agree this PR is an improvement in general. I'll rerun the benchmarks to see if they are repeatable

@alamb

This comment was marked as outdated.

@alamb

This comment was marked as outdated.

@alamb

This comment was marked as outdated.

@alamb

This comment was marked as outdated.

@alamb
Copy link
Contributor

alamb commented Apr 30, 2025

🤔 second benchmark runs look very good performance wise. I'll run one more

@alamb
Copy link
Contributor

alamb commented Apr 30, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing speedup_arith (3f5f977) to 07093a4 diff
BENCH_NAME=arithmetic_kernels
BENCH_COMMAND=cargo bench --all-features --bench arithmetic_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=speedup_arith
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Apr 30, 2025

🤖: Benchmark completed

Details

group                    main                                   speedup_arith
-----                    ----                                   -------------
add(0)                   1.07     14.9±0.15µs        ? ?/sec    1.00     14.0±0.20µs        ? ?/sec
add(0.1)                 1.00     12.6±0.17µs        ? ?/sec    1.01     12.7±0.19µs        ? ?/sec
add(0.5)                 1.00     12.6±0.13µs        ? ?/sec    1.02     12.9±0.22µs        ? ?/sec
add(0.9)                 1.00     12.5±0.09µs        ? ?/sec    1.02     12.7±0.27µs        ? ?/sec
add(1)                   1.00     12.5±0.09µs        ? ?/sec    1.02     12.8±0.21µs        ? ?/sec
add_checked(0)           1.00     11.3±0.09µs        ? ?/sec    1.01     11.4±0.15µs        ? ?/sec
add_checked(0.1)         1.00     12.5±0.08µs        ? ?/sec    1.02     12.7±0.19µs        ? ?/sec
add_checked(0.5)         1.00     12.6±0.10µs        ? ?/sec    1.00     12.7±0.25µs        ? ?/sec
add_checked(0.9)         1.00     12.4±0.09µs        ? ?/sec    1.01     12.6±0.22µs        ? ?/sec
add_checked(1)           1.00     12.4±0.07µs        ? ?/sec    1.03     12.8±0.21µs        ? ?/sec
add_scalar(0)            1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.08µs        ? ?/sec
add_scalar(0.1)          1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.05µs        ? ?/sec
add_scalar(0.5)          1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.03µs        ? ?/sec
add_scalar(0.9)          1.01      7.1±0.03µs        ? ?/sec    1.00      7.0±0.05µs        ? ?/sec
add_scalar(1)            1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.03µs        ? ?/sec
divide(0)                1.00     13.4±0.06µs        ? ?/sec    1.01     13.5±0.13µs        ? ?/sec
divide(0.1)              1.00     14.8±0.08µs        ? ?/sec    1.01     14.9±0.12µs        ? ?/sec
divide(0.5)              1.00     14.9±0.07µs        ? ?/sec    1.00     14.9±0.08µs        ? ?/sec
divide(0.9)              1.00     14.8±0.05µs        ? ?/sec    1.01     14.9±0.07µs        ? ?/sec
divide(1)                1.00     14.7±0.10µs        ? ?/sec    1.01     14.8±0.08µs        ? ?/sec
divide_scalar(0)         1.00     13.4±0.02µs        ? ?/sec    1.00     13.3±0.02µs        ? ?/sec
divide_scalar(0.1)       1.00     13.4±0.02µs        ? ?/sec    1.00     13.3±0.02µs        ? ?/sec
divide_scalar(0.5)       1.00     13.4±0.02µs        ? ?/sec    1.00     13.3±0.04µs        ? ?/sec
divide_scalar(0.9)       1.01     13.4±0.04µs        ? ?/sec    1.00     13.3±0.02µs        ? ?/sec
divide_scalar(1)         1.00     13.4±0.02µs        ? ?/sec    1.00     13.3±0.02µs        ? ?/sec
modulo(0)                1.00    333.1±0.60µs        ? ?/sec    1.03    341.6±0.72µs        ? ?/sec
modulo(0.1)              1.00    382.4±0.47µs        ? ?/sec    1.02    391.6±0.87µs        ? ?/sec
modulo(0.5)              1.00    525.4±0.81µs        ? ?/sec    1.05    549.2±1.52µs        ? ?/sec
modulo(0.9)              1.00    294.3±1.94µs        ? ?/sec    1.02    298.7±0.54µs        ? ?/sec
modulo(1)                1.00    237.9±0.36µs        ? ?/sec    1.02    242.5±0.48µs        ? ?/sec
modulo_scalar(0)         1.00    493.0±1.59µs        ? ?/sec    1.02    501.2±0.97µs        ? ?/sec
modulo_scalar(0.1)       1.00    466.7±2.01µs        ? ?/sec    1.03    478.4±1.55µs        ? ?/sec
modulo_scalar(0.5)       1.00    313.0±0.48µs        ? ?/sec    1.03    322.8±0.57µs        ? ?/sec
modulo_scalar(0.9)       1.04    165.5±0.23µs        ? ?/sec    1.00    159.7±0.42µs        ? ?/sec
modulo_scalar(1)         1.04    122.6±0.33µs        ? ?/sec    1.00    118.0±0.78µs        ? ?/sec
multiply(0)              1.00     11.2±0.08µs        ? ?/sec    1.00     11.3±0.09µs        ? ?/sec
multiply(0.1)            1.00     12.5±0.25µs        ? ?/sec    1.02     12.8±0.18µs        ? ?/sec
multiply(0.5)            1.00     12.7±0.14µs        ? ?/sec    1.01     12.9±0.22µs        ? ?/sec
multiply(0.9)            1.00     12.6±0.13µs        ? ?/sec    1.00     12.6±0.15µs        ? ?/sec
multiply(1)              1.00     12.5±0.09µs        ? ?/sec    1.01     12.6±0.15µs        ? ?/sec
multiply_checked(0)      1.00     11.3±0.09µs        ? ?/sec    1.01     11.4±0.14µs        ? ?/sec
multiply_checked(0.1)    1.00     12.6±0.10µs        ? ?/sec    1.03     12.9±0.17µs        ? ?/sec
multiply_checked(0.5)    1.00     12.6±0.08µs        ? ?/sec    1.02     12.8±0.17µs        ? ?/sec
multiply_checked(0.9)    1.00     12.6±0.12µs        ? ?/sec    1.01     12.7±0.18µs        ? ?/sec
multiply_checked(1)      1.00     12.5±0.05µs        ? ?/sec    1.02     12.7±0.16µs        ? ?/sec
multiply_scalar(0)       1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.02µs        ? ?/sec
multiply_scalar(0.1)     1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.05µs        ? ?/sec
multiply_scalar(0.5)     1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.09µs        ? ?/sec
multiply_scalar(0.9)     1.01      7.1±0.01µs        ? ?/sec    1.00      7.0±0.04µs        ? ?/sec
multiply_scalar(1)       1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.02µs        ? ?/sec
subtract(0)              1.00     11.3±0.07µs        ? ?/sec    1.01     11.4±0.13µs        ? ?/sec
subtract(0.1)            1.00     12.5±0.16µs        ? ?/sec    1.01     12.6±0.25µs        ? ?/sec
subtract(0.5)            1.00     12.5±0.21µs        ? ?/sec    1.00     12.5±0.21µs        ? ?/sec
subtract(0.9)            1.00     12.6±0.15µs        ? ?/sec    1.00     12.6±0.20µs        ? ?/sec
subtract(1)              1.00     12.5±0.16µs        ? ?/sec    1.01     12.7±0.21µs        ? ?/sec
subtract_checked(0)      1.00     11.3±0.07µs        ? ?/sec    1.02     11.4±0.16µs        ? ?/sec
subtract_checked(0.1)    1.00     12.5±0.20µs        ? ?/sec    1.02     12.7±0.27µs        ? ?/sec
subtract_checked(0.5)    1.00     12.5±0.22µs        ? ?/sec    1.02     12.7±0.31µs        ? ?/sec
subtract_checked(0.9)    1.01     12.6±0.14µs        ? ?/sec    1.00     12.5±0.15µs        ? ?/sec
subtract_checked(1)      1.00     12.4±0.20µs        ? ?/sec    1.00     12.4±0.19µs        ? ?/sec
subtract_scalar(0)       1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.03µs        ? ?/sec
subtract_scalar(0.1)     1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.01µs        ? ?/sec
subtract_scalar(0.5)     1.01      7.1±0.12µs        ? ?/sec    1.00      7.0±0.02µs        ? ?/sec
subtract_scalar(0.9)     1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.02µs        ? ?/sec
subtract_scalar(1)       1.01      7.1±0.02µs        ? ?/sec    1.00      7.0±0.01µs        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Apr 30, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing speedup_arith (3f5f977) to 07093a4 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --all-features --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=speedup_arith
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Apr 30, 2025

🤖: Benchmark completed

Details

group         main                                   speedup_arith
-----         ----                                   -------------
and           1.00    291.5±0.51ns        ? ?/sec    1.01    293.4±0.90ns        ? ?/sec
and_sliced    1.01   1139.3±1.03ns        ? ?/sec    1.00   1132.0±1.82ns        ? ?/sec
not           1.14    263.6±2.41ns        ? ?/sec    1.00    231.6±0.38ns        ? ?/sec
not_sliced    1.42   1014.9±3.10ns        ? ?/sec    1.00    716.5±1.44ns        ? ?/sec
or            1.14    304.2±1.41ns        ? ?/sec    1.00    267.7±0.58ns        ? ?/sec
or_sliced     1.00   1134.4±1.59ns        ? ?/sec    1.00   1134.4±1.34ns        ? ?/sec

@tustvold tustvold added the api-change Changes to the arrow API label May 1, 2025
@Dandandan
Copy link
Contributor Author

Dandandan commented May 1, 2025

🤔 second benchmark runs look very good performance wise. I'll run one more

This seems like similar to my results (although slightly less) which often shows better performance and sometimes 0% and sometimes 10-25% change for similar set of benchmarks.
I think it's something allocator related, but not sure what the exact cause is of this being sometimes faster.

@alamb

This comment was marked as outdated.

@alamb alamb closed this May 1, 2025
@alamb alamb reopened this May 1, 2025
@alamb
Copy link
Contributor

alamb commented May 1, 2025

sorry -- clicked the wrong button

@Dandandan
Copy link
Contributor Author

Dandandan commented May 1, 2025

What do we want to do with the API breakage?
I think it would be nice to avoid the unsafe / move to Iterator::collect (and get some perf. improvement), but using ArrayAccessor or even ArrayIter I think still needs the use of the unsafe API's to avoid big regressions (in large part because of https://doc.rust-lang.org/std/iter/trait.TrustedLen.html is not stable).
We can also look at what makes the difference in performance from time to time and change from_trusted_len_iter

At least I can break out the not improvement to another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants