perf: optimize ARM64 NEON min/max assembly #748
Open
zeroshade wants to merge 2 commits into apache:main from
Conversation
Improve the NEON assembly for min/max operations in two ways:

1. 32-bit (int32/uint32): Widen from .2s (64-bit D-registers, 2 lanes) to .4s (128-bit Q-registers, 4 lanes), doubling throughput with the same instruction count. Use sminv/smaxv/uminv/umaxv for horizontal reduction instead of manual dup+compare.
2. 64-bit (int64/uint64): Replace BSL (bit select) + 4 MOV register saves per loop iteration with BIT/BIF (bit insert if true/false), which are destructive to the accumulator rather than the mask, eliminating all register save/restore overhead. Restructure comparisons for maximum ILP on out-of-order cores.

Also adds correctness tests validating NEON results against the pure Go implementation across boundary sizes (1, 3, 4, 7, 8, 9, 15, 16, etc.) and benchmarks for int32/uint32/int64/uint64 at various input sizes.
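For context, the pure Go fallback that the NEON kernels are validated against is a simple single-pass scan. A hypothetical sketch (function name is illustrative, not the actual internal/utils API):

```go
package main

import "fmt"

// minMaxInt32 returns the minimum and maximum of a non-empty slice in a
// single pass. Sketch of the pure Go reference path that the NEON
// kernels are checked against; the real function name may differ.
func minMaxInt32(values []int32) (min, max int32) {
	min, max = values[0], values[0]
	for _, v := range values[1:] {
		if v < min {
			min = v
		}
		if v > max {
			max = v
		}
	}
	return
}

func main() {
	mn, mx := minMaxInt32([]int32{3, -7, 42, 0, 15})
	fmt.Println(mn, mx) // -7 42
}
```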
Rationale for this change
The NEON assembly in internal/utils/min_max_neon_arm64.s was machine-translated from compiler output (via asm2plan9s) and had two significant inefficiencies:

- The 32-bit paths used the .2s arrangement (64-bit D-registers, 2 lanes) instead of .4s (128-bit Q-registers, 4 lanes), leaving half the hardware throughput on the table.
- BSL (bit select) is destructive to its mask operand, forcing register saves before each compare+select. ARM64 provides BIT/BIF (bit insert if true/false), which are destructive to the accumulator instead, eliminating the need for saves entirely.

What changes are included in this PR?
Assembly optimizations (min_max_neon_arm64.s):

- 32-bit paths: widen from .2s to .4s, processing 8 elements per loop iteration instead of 4. Use sminv/smaxv/uminv/umaxv for single-instruction horizontal reduction instead of manual dup + compare pairs. Adjust the loop mask from 0xfffffffc (multiples of 4) to 0xfffffff8 (multiples of 8) and the scalar tail threshold from 3 to 7.
- 64-bit paths: replace BSL + 4× MOV register saves with BIT/BIF instructions. Restructure the 4 independent comparisons to be grouped together for maximum instruction-level parallelism on out-of-order cores, followed by 4 independent select operations.
- Replace LBB0_3-style labels with descriptive names (int32_neon, int32_loop, int32_scalar, etc.).

New test file (min_max_test.go).

Benchmark results (Apple M4, 6 iterations, benchstat):
32-bit: ~2× throughput (38 GB/s → 81 GB/s at n=1024). Geomean: -27.9% latency, +38.7% throughput.
The 64-bit improvement is small (~0.4%) because the M4's out-of-order engine already absorbs MOV latency via register renaming. On in-order or narrower cores (Cortex-A55/A76) the BIT/BIF optimization would show a larger improvement.
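The register-pressure difference between BSL and BIT/BIF comes from which operand each instruction overwrites. A bitwise sketch of their semantics in Go (illustrative only; the real kernels operate on 128-bit vector registers, not uint64):

```go
package main

import "fmt"

// bsl models NEON BSL Vd, Vn, Vm: result bits come from Vn where the
// mask (Vd) is 1 and from Vm where it is 0. The mask register is
// overwritten, so it must be saved (via MOV) if needed again.
func bsl(d, n, m uint64) uint64 { return (n & d) | (m &^ d) }

// bit models NEON BIT Vd, Vn, Vm: bits from Vn are inserted into Vd
// where the mask Vm is 1. The accumulator (Vd) is overwritten instead
// of the mask, so the compare mask survives and no save is needed.
func bit(d, n, m uint64) uint64 { return (d &^ m) | (n & m) }

func main() {
	mask := uint64(0xFF00FF00FF00FF00)
	fmt.Printf("%016x\n", bsl(mask, 0xAAAAAAAAAAAAAAAA, 0x5555555555555555)) // aa55aa55aa55aa55
	fmt.Printf("%016x\n", bit(0x1111111111111111, 0xAAAAAAAAAAAAAAAA, mask)) // aa11aa11aa11aa11
}
```

Because BIT/BIF clobber the accumulator (which is being updated anyway) rather than the freshly computed compare mask, the per-iteration MOV saves disappear.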
Are these changes tested?
Yes. New correctness tests validate all 4 NEON functions against the pure Go reference implementation across 15 input sizes (1, 3, 4, 7, 8, 9, 15, 16, etc.) that exercise the NEON loop boundaries and scalar tail handling.
Each test forces MinInt/MaxInt values at random positions to verify extreme values are handled correctly.

Are there any user-facing changes?
No API changes. This is a pure performance improvement to internal SIMD routines used by Parquet statistics computation and Arrow dictionary operations.