perf: optimize ARM64 NEON min/max assembly #748

Open
zeroshade wants to merge 2 commits into apache:main from zeroshade:optimize-neon-min-max

@zeroshade
Member

Rationale for this change

The NEON assembly in internal/utils/min_max_neon_arm64.s was machine-translated from compiler output (via asm2plan9s) and had two significant inefficiencies:

  1. 32-bit functions used half the available NEON register width: .2s (64-bit D-registers, 2 lanes) instead of .4s (128-bit Q-registers, 4 lanes), leaving half the hardware throughput on the table.
  2. 64-bit functions wasted 4 MOV instructions per loop iteration: BSL (bit select) is destructive to its mask operand, forcing register saves before each compare+select. ARM64 provides BIT/BIF (bit insert if true/false), which are destructive to the accumulator instead, eliminating the need for saves entirely.
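
To make the BSL versus BIT/BIF distinction concrete, here is a minimal Go sketch that models the three instructions on a single 64-bit lane. It is only an illustration of the bitwise semantics, not code from this PR; the helper names (`bsl`, `bit`, `bif`, `cmgt`, `minStep`, `maxStep`) are hypothetical.

```go
package main

// bsl models BSL on one 64-bit lane. The destination register holds the mask
// on entry and the selected result on exit, so the mask is overwritten.
func bsl(mask, n, m uint64) uint64 {
	return (mask & n) | (^mask & m)
}

// bit models BIT: bits of n are inserted into dst where mask is 1.
// The destination is the accumulator; the mask register is left intact.
func bit(dst, n, mask uint64) uint64 {
	return (dst &^ mask) | (n & mask)
}

// bif models BIF: bits of n are inserted into dst where mask is 0.
func bif(dst, n, mask uint64) uint64 {
	return (dst & mask) | (n &^ mask)
}

// cmgt models a signed compare-greater-than on one lane:
// all ones if a > b, otherwise zero.
func cmgt(a, b int64) uint64 {
	if a > b {
		return ^uint64(0)
	}
	return 0
}

// minStep updates a running minimum: where acc > x, insert x into acc.
func minStep(acc, x int64) int64 {
	mask := cmgt(acc, x)
	return int64(bit(uint64(acc), uint64(x), mask))
}

// maxStep updates a running maximum with the same kind of mask:
// where acc > x is false, insert x into acc.
func maxStep(acc, x int64) int64 {
	mask := cmgt(acc, x)
	return int64(bif(uint64(acc), uint64(x), mask))
}
```

With `bsl`, the result lands in the register that held the mask, so a running accumulator has to be copied around the select. With `bit` and `bif`, the accumulator is updated in place, which is why the MOVs become unnecessary.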

What changes are included in this PR?

Assembly optimizations (min_max_neon_arm64.s):

  • 32-bit (int32/uint32): Widen all NEON operations from .2s to .4s, processing 8 elements per loop iteration instead of 4. Use sminv/smaxv/uminv/umaxv for single-instruction horizontal reduction instead of manual dup + compare pairs. Adjust loop mask from 0xfffffffc (multiples of 4) to 0xfffffff8 (multiples of 8) and scalar tail threshold from 3 to 7.
  • 64-bit (int64/uint64): Replace BSL + 4×MOV register saves with BIT/BIF instructions. Restructure the 4 independent comparisons to be grouped together for maximum instruction-level parallelism on out-of-order cores, followed by 4 independent select operations.
  • Readability: Replace LBB0_3 style labels with descriptive names (int32_neon, int32_loop, int32_scalar, etc.).
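
The 32-bit loop shape described above can be modeled in plain Go. This is a rough sketch under stated assumptions, not a translation of the assembly: the function name `minMaxInt32Model`, the use of two 4-lane accumulators per result, and the return values for empty input are all assumptions made for illustration.

```go
package main

import "math"

// minMaxInt32Model mirrors the described structure: an 8-elements-per-iteration
// main loop over four-lane accumulators, a horizontal reduction, and a scalar
// tail. It assumes len(values) fits in a uint32.
func minMaxInt32Model(values []int32) (lo, hi int32) {
	lo, hi = math.MaxInt32, math.MinInt32
	n := uint32(len(values))
	if n > 7 { // 7 or fewer elements take the scalar path only
		body := n & 0xfffffff8 // largest multiple of 8 not exceeding n

		// Stand-ins for four .4s registers (4 lanes of int32 each).
		var minA, minB, maxA, maxB [4]int32
		for l := 0; l < 4; l++ {
			minA[l], minB[l] = math.MaxInt32, math.MaxInt32
			maxA[l], maxB[l] = math.MinInt32, math.MinInt32
		}

		for i := uint32(0); i < body; i += 8 {
			for l := uint32(0); l < 4; l++ {
				a, b := values[i+l], values[i+4+l]
				if a < minA[l] {
					minA[l] = a
				}
				if a > maxA[l] {
					maxA[l] = a
				}
				if b < minB[l] {
					minB[l] = b
				}
				if b > maxB[l] {
					maxB[l] = b
				}
			}
		}

		// Horizontal reduction across lanes, the job done by sminv/smaxv.
		for l := 0; l < 4; l++ {
			for _, v := range [2]int32{minA[l], minB[l]} {
				if v < lo {
					lo = v
				}
			}
			for _, v := range [2]int32{maxA[l], maxB[l]} {
				if v > hi {
					hi = v
				}
			}
		}
		values = values[body:]
	}

	// Scalar tail: at most 7 remaining elements.
	for _, v := range values {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	return lo, hi
}
```

For example, a 100-element input runs 12 main-loop iterations (96 elements) and leaves 4 elements for the scalar tail, since `100 & 0xfffffff8` is 96.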

New test file (min_max_test.go):

  • Correctness tests for all 4 types (int32, uint32, int64, uint64) validating NEON results against pure Go implementation across 15 boundary sizes including NEON/scalar transition points (1, 3, 4, 7, 8, 9, 15, 16, 31, 63, 64, 100, 1024).
  • Benchmarks for all 4 types at 5 input sizes (64, 256, 1024, 8192, 65536) with throughput reporting.

Benchmark results (Apple M4, 6 iterations, benchstat):

                        │ before        │     after                              │
                        │    sec/op     │   sec/op     vs base                  │
MinMaxInt32/n=64-10       5.992n ± 1%    3.675n ± 0%   -38.67% (p=0.002 n=6)
MinMaxInt32/n=256-10      20.80n ± 1%    10.75n ± 1%   -48.35% (p=0.002 n=6)
MinMaxInt32/n=1024-10    107.20n ± 0%    50.70n ± 0%   -52.71% (p=0.002 n=6)
MinMaxInt32/n=8192-10     921.6n ± 0%    466.5n ± 0%   -49.39% (p=0.002 n=6)
MinMaxInt32/n=65536-10    7.570µ ± 1%    3.909µ ± 0%   -48.37% (p=0.002 n=6)
MinMaxUint32/n=64-10      6.039n ± 1%    3.694n ± 0%   -38.83% (p=0.002 n=6)
MinMaxUint32/n=256-10     21.25n ± 0%    10.89n ± 0%   -48.76% (p=0.002 n=6)
MinMaxUint32/n=1024-10   109.75n ± 0%    51.81n ± 0%   -52.79% (p=0.002 n=6)
MinMaxUint32/n=8192-10    936.9n ± 0%    474.6n ± 0%   -49.34% (p=0.002 n=6)
MinMaxUint32/n=65536-10   7.667µ ± 0%    3.960µ ± 0%   -48.36% (p=0.002 n=6)
MinMaxInt64/n=64-10       11.18n ± 0%    11.10n ± 0%    -0.72% (p=0.002 n=6)
MinMaxInt64/n=256-10      51.09n ± 0%    50.96n ± 0%    -0.24% (p=0.022 n=6)
MinMaxInt64/n=1024-10     233.2n ± 0%    232.2n ± 0%    -0.41% (p=0.013 n=6)
MinMaxInt64/n=8192-10     1.917µ ± 0%    1.910µ ± 1%    -0.37% (p=0.002 n=6)
MinMaxInt64/n=65536-10    15.59µ ± 0%    15.53µ ± 0%    -0.40% (p=0.004 n=6)
MinMaxUint64/n=64-10      11.10n ± 0%    11.06n ± 0%    -0.41% (p=0.004 n=6)
MinMaxUint64/n=256-10     51.29n ± 0%    51.11n ± 0%         ~ (p=0.052 n=6)
MinMaxUint64/n=1024-10    233.9n ± 1%    233.1n ± 0%         ~ (p=0.219 n=6)
MinMaxUint64/n=8192-10    1.929µ ± 0%    1.917µ ± 0%    -0.60% (p=0.006 n=6)
MinMaxUint64/n=65536-10   15.65µ ± 0%    15.59µ ± 0%    -0.38% (p=0.024 n=6)
geomean                    228.5n         164.8n        -27.87%

32-bit: ~2× throughput (38 GB/s → 81 GB/s at n=1024). Geomean: -27.9% latency, +38.7% throughput.
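
The throughput figures follow directly from the table. A small Go sketch of the arithmetic, assuming 4 bytes per int32 element and 1 GB = 1e9 bytes (the helper names are hypothetical):

```go
package main

// throughputGBps converts a benchmark time into GB/s.
// One byte per nanosecond equals 1e9 bytes per second, so no scaling is needed.
func throughputGBps(elems, elemBytes int, nsPerOp float64) float64 {
	return float64(elems*elemBytes) / nsPerOp
}

// throughputGain returns the relative throughput increase implied by two
// timings, e.g. 0.25 for a 25% gain.
func throughputGain(beforeNs, afterNs float64) float64 {
	return beforeNs/afterNs - 1
}
```

For MinMaxInt32 at n=1024, `throughputGBps(1024, 4, 107.20)` gives roughly 38 GB/s and `throughputGBps(1024, 4, 50.70)` roughly 81 GB/s. `throughputGain(228.5, 164.8)` reproduces the geomean throughput gain of about 38.7%, up to rounding of the geomean values.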

The 64-bit improvement is small (~0.4%) because the M4's out-of-order engine already absorbs MOV latency via register renaming. On in-order or narrower cores (Cortex-A55/A76) the BIT/BIF optimization would show a larger improvement.

Are these changes tested?

Yes. New correctness tests validate all 4 NEON functions against the pure Go reference implementation across 15 input sizes that exercise:

  • Empty input (length 0)
  • Scalar-only paths (length 1–7 for 32-bit, 1–3 for 64-bit)
  • Exact NEON boundary (length 8 for 32-bit, length 4 for 64-bit)
  • NEON + scalar tail (length 9, 15, 31, 63, 100)
  • Pure NEON (length 16, 64, 1024)

Each test forces MinInt/MaxInt values at random positions to verify extreme values are handled correctly.
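
A check of this kind might be structured roughly as follows. This is a sketch, not the contents of min_max_test.go: `refMinMax`, `checkMinMax`, the value range, and the empty-input result of the reference are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// refMinMax is a plain-Go reference. Returning (MaxInt32, MinInt32) for an
// empty slice is an assumption of this sketch.
func refMinMax(v []int32) (lo, hi int32) {
	lo, hi = math.MaxInt32, math.MinInt32
	for _, x := range v {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	return lo, hi
}

// checkMinMax compares impl against refMinMax at each size, planting the
// extreme int32 values at random positions. The two positions may coincide,
// in which case only MaxInt32 is planted; the comparison remains valid.
func checkMinMax(impl func([]int32) (int32, int32), sizes []int, seed int64) error {
	rng := rand.New(rand.NewSource(seed))
	for _, n := range sizes {
		v := make([]int32, n)
		for i := range v {
			v[i] = int32(rng.Intn(2001) - 1000)
		}
		if n > 0 {
			v[rng.Intn(n)] = math.MinInt32
			v[rng.Intn(n)] = math.MaxInt32
		}
		wantLo, wantHi := refMinMax(v)
		gotLo, gotHi := impl(v)
		if gotLo != wantLo || gotHi != wantHi {
			return fmt.Errorf("n=%d: got (%d, %d), want (%d, %d)",
				n, gotLo, gotHi, wantLo, wantHi)
		}
	}
	return nil
}
```

In a real test, `impl` would be the NEON-backed function, and `sizes` would include the lengths on both sides of each transition, such as 7, 8, and 9 for the 32-bit routines.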

Are there any user-facing changes?

No API changes. This is a pure performance improvement to internal SIMD routines used by Parquet statistics computation and Arrow dictionary operations.

Improve the NEON assembly for min/max operations in two ways:

1. 32-bit (int32/uint32): Widen from .2s (64-bit D-registers, 2 lanes)
   to .4s (128-bit Q-registers, 4 lanes), doubling throughput with the
   same instruction count. Use sminv/smaxv/uminv/umaxv for horizontal
   reduction instead of manual dup+compare.

2. 64-bit (int64/uint64): Replace BSL (bit select) + 4 MOV register
   saves per loop iteration with BIT/BIF (bit insert if true/false),
   which are destructive to the accumulator rather than the mask,
   eliminating all register save/restore overhead. Restructure
   comparisons for maximum ILP on out-of-order cores.

Also adds correctness tests validating NEON results against the pure Go
implementation across boundary sizes (1, 3, 4, 7, 8, 9, 15, 16, etc.)
and benchmarks for int32/uint32/int64/uint64 at various input sizes.