perf: optimize ARM64 NEON min/max assembly by zeroshade · Pull Request #748 · apache/arrow-go

zeroshade · 2026-04-03T21:42:02Z

Rationale for this change

The NEON assembly in internal/utils/min_max_neon_arm64.s was machine-translated from compiler output (via asm2plan9s) and had two significant inefficiencies:

32-bit functions used half the available NEON register width — .2s (64-bit D-registers, 2 lanes) instead of .4s (128-bit Q-registers, 4 lanes), leaving half the hardware throughput on the table.
64-bit functions wasted 4 MOV instructions per loop iteration — BSL (bit select) is destructive to its mask operand, forcing register saves before each compare+select. ARM64 also provides BIT/BIF (bit insert if true/false), which are destructive to the accumulator instead, eliminating the need for saves entirely.
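To make the BSL versus BIT/BIF difference concrete, here is a minimal Go sketch that models each instruction on a single 64-bit lane. It is an illustration only, not the PR's assembly: the helper names (`bsl`, `bit`, `bif`, `cmhi`, `minMaxBSL`, `minMaxBIT`) are hypothetical, and the real register allocation in min_max_neon_arm64.s is not reproduced here.

```go
package main

import "fmt"

// bsl models BSL Vd, Vn, Vm on one lane: Vd is the mask on entry and the
// selected result on exit, so the mask register is overwritten.
func bsl(d *uint64, n, m uint64) { *d = (*d & n) | (^*d & m) }

// bit models BIT Vd, Vn, Vm: copy bits of n into d where m is 1.
func bit(d *uint64, n, m uint64) { *d = (*d &^ m) | (n & m) }

// bif models BIF Vd, Vn, Vm: copy bits of n into d where m is 0.
func bif(d *uint64, n, m uint64) { *d = (*d & m) | (n &^ m) }

// cmhi models an unsigned compare: all ones if a > b, otherwise zero.
func cmhi(a, b uint64) uint64 {
	if a > b {
		return ^uint64(0)
	}
	return 0
}

// minMaxBSL keeps running min/max with BSL. The result lands in the mask
// register, so each update needs an extra copy back into the accumulator.
func minMaxBSL(values []uint64) (lo, hi uint64) {
	lo, hi = ^uint64(0), 0
	for _, v := range values {
		t := cmhi(lo, v) // mask: lo > v
		bsl(&t, v, lo)   // t = min(lo, v), mask is gone
		lo = t           // extra move
		t = cmhi(v, hi)  // mask: v > hi
		bsl(&t, v, hi)   // t = max(hi, v)
		hi = t           // extra move
	}
	return lo, hi
}

// minMaxBIT updates the accumulators in place, with no extra copies.
func minMaxBIT(values []uint64) (lo, hi uint64) {
	lo, hi = ^uint64(0), 0
	for _, v := range values {
		bit(&lo, v, cmhi(lo, v)) // insert v where lo > v
		bif(&hi, v, cmhi(hi, v)) // insert v where hi > v is false
	}
	return lo, hi
}

func main() {
	in := []uint64{5, 2, 9, 7}
	fmt.Println(minMaxBSL(in))
	fmt.Println(minMaxBIT(in))
}
```

Both functions compute the same result; the only difference is where the select writes its output, which is what removes the moves in the BIT/BIF form.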

What changes are included in this PR?

Assembly optimizations (min_max_neon_arm64.s):

32-bit (int32/uint32): Widen all NEON operations from .2s to .4s, processing 8 elements per loop iteration instead of 4. Use sminv/smaxv/uminv/umaxv for single-instruction horizontal reduction instead of manual dup + compare pairs. Adjust loop mask from 0xfffffffc (multiples of 4) to 0xfffffff8 (multiples of 8) and scalar tail threshold from 3 to 7.
64-bit (int64/uint64): Replace BSL + 4×MOV register saves with BIT/BIF instructions. Restructure the 4 independent comparisons to be grouped together for maximum instruction-level parallelism on out-of-order cores, followed by 4 independent select operations.
Readability: Replace LBB0_3 style labels with descriptive names (int32_neon, int32_loop, int32_scalar, etc.).
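The 32-bit loop shape described above can be sketched in plain Go. This is a rough model under stated assumptions, not a translation of the assembly: `vec4` stands in for one 128-bit register with four int32 lanes, `hmin`/`hmax` stand in for sminv/smaxv, and all function names are hypothetical. It also assumes the slice length fits in 32 bits, which is what the 0xfffffff8 mask implies.

```go
package main

import (
	"fmt"
	"math"
)

// vec4 stands in for one 128-bit Q register viewed as four int32 lanes (.4s).
type vec4 [4]int32

func splat(x int32) vec4 { return vec4{x, x, x, x} }

// load4 copies four consecutive elements into a vector.
func load4(src []int32) (v vec4) {
	copy(v[:], src)
	return v
}

// vmin and vmax model lane-wise signed min and max.
func vmin(a, b vec4) vec4 {
	for i := range a {
		if b[i] < a[i] {
			a[i] = b[i]
		}
	}
	return a
}

func vmax(a, b vec4) vec4 {
	for i := range a {
		if b[i] > a[i] {
			a[i] = b[i]
		}
	}
	return a
}

// hmin and hmax model the horizontal reductions (sminv / smaxv).
func hmin(v vec4) int32 {
	m := v[0]
	for _, x := range v[1:] {
		if x < m {
			m = x
		}
	}
	return m
}

func hmax(v vec4) int32 {
	m := v[0]
	for _, x := range v[1:] {
		if x > m {
			m = x
		}
	}
	return m
}

func minMaxInt32(values []int32) (lo, hi int32) {
	lo, hi = math.MaxInt32, math.MinInt32
	n := uint32(len(values))    // assumption: length fits in 32 bits
	body := int(n & 0xfffffff8) // largest multiple of 8 not above n
	if body > 0 {
		minA, minB := splat(lo), splat(lo)
		maxA, maxB := splat(hi), splat(hi)
		for i := 0; i < body; i += 8 { // two 4-lane vectors per iteration
			a, b := load4(values[i:]), load4(values[i+4:])
			minA, maxA = vmin(minA, a), vmax(maxA, a)
			minB, maxB = vmin(minB, b), vmax(maxB, b)
		}
		lo = hmin(vmin(minA, minB))
		hi = hmax(vmax(maxA, maxB))
	}
	for _, v := range values[body:] { // scalar tail: at most 7 elements
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	return lo, hi
}

func main() {
	fmt.Println(minMaxInt32([]int32{5, -3, 9, 0, 2, 2, -8, 7, 11, 4, -1}))
}
```

Inputs of length 7 or less never enter the vector loop, which matches the scalar tail threshold of 7 mentioned above.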

New test file (min_max_test.go):

Correctness tests for all 4 types (int32, uint32, int64, uint64) validating NEON results against pure Go implementation across 15 boundary sizes including NEON/scalar transition points (1, 3, 4, 7, 8, 9, 15, 16, 31, 63, 64, 100, 1024).
Benchmarks for all 4 types at 5 input sizes (64, 256, 1024, 8192, 65536) with throughput reporting.

Benchmark results (Apple M4, 6 iterations, benchstat):

                        │ before        │     after                              │
                        │    sec/op     │   sec/op     vs base                  │
MinMaxInt32/n=64-10       5.992n ± 1%    3.675n ± 0%   -38.67% (p=0.002 n=6)
MinMaxInt32/n=256-10      20.80n ± 1%    10.75n ± 1%   -48.35% (p=0.002 n=6)
MinMaxInt32/n=1024-10    107.20n ± 0%    50.70n ± 0%   -52.71% (p=0.002 n=6)
MinMaxInt32/n=8192-10     921.6n ± 0%    466.5n ± 0%   -49.39% (p=0.002 n=6)
MinMaxInt32/n=65536-10    7.570µ ± 1%    3.909µ ± 0%   -48.37% (p=0.002 n=6)
MinMaxUint32/n=64-10      6.039n ± 1%    3.694n ± 0%   -38.83% (p=0.002 n=6)
MinMaxUint32/n=256-10     21.25n ± 0%    10.89n ± 0%   -48.76% (p=0.002 n=6)
MinMaxUint32/n=1024-10   109.75n ± 0%    51.81n ± 0%   -52.79% (p=0.002 n=6)
MinMaxUint32/n=8192-10    936.9n ± 0%    474.6n ± 0%   -49.34% (p=0.002 n=6)
MinMaxUint32/n=65536-10   7.667µ ± 0%    3.960µ ± 0%   -48.36% (p=0.002 n=6)
MinMaxInt64/n=64-10       11.18n ± 0%    11.10n ± 0%    -0.72% (p=0.002 n=6)
MinMaxInt64/n=256-10      51.09n ± 0%    50.96n ± 0%    -0.24% (p=0.022 n=6)
MinMaxInt64/n=1024-10     233.2n ± 0%    232.2n ± 0%    -0.41% (p=0.013 n=6)
MinMaxInt64/n=8192-10     1.917µ ± 0%    1.910µ ± 1%    -0.37% (p=0.002 n=6)
MinMaxInt64/n=65536-10    15.59µ ± 0%    15.53µ ± 0%    -0.40% (p=0.004 n=6)
MinMaxUint64/n=64-10      11.10n ± 0%    11.06n ± 0%    -0.41% (p=0.004 n=6)
MinMaxUint64/n=256-10     51.29n ± 0%    51.11n ± 0%         ~ (p=0.052 n=6)
MinMaxUint64/n=1024-10    233.9n ± 1%    233.1n ± 0%         ~ (p=0.219 n=6)
MinMaxUint64/n=8192-10    1.929µ ± 0%    1.917µ ± 0%    -0.60% (p=0.006 n=6)
MinMaxUint64/n=65536-10   15.65µ ± 0%    15.59µ ± 0%    -0.38% (p=0.024 n=6)
geomean                    228.5n         164.8n        -27.87%

32-bit: ~2× throughput (38 GB/s → 81 GB/s at n=1024). Geomean: -27.9% latency, +38.7% throughput.
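The throughput figures follow from the table by dividing bytes processed by time per operation. A small Go sketch of that arithmetic, using the n=1024 int32 rows and the geomean row from the table above (the function name `throughputGBps` is made up for this example):

```go
package main

import "fmt"

// throughputGBps converts a benchmark result to GB/s.
// Bytes per nanosecond is numerically equal to 10^9 bytes per second.
func throughputGBps(n, elemBytes int, nsPerOp float64) float64 {
	return float64(n*elemBytes) / nsPerOp
}

func main() {
	before := throughputGBps(1024, 4, 107.20) // MinMaxInt32/n=1024, before
	after := throughputGBps(1024, 4, 50.70)   // MinMaxInt32/n=1024, after
	fmt.Printf("before: %.1f GB/s, after: %.1f GB/s\n", before, after)

	// Geomean row: 228.5 ns before, 164.8 ns after.
	timeChange := (164.8/228.5 - 1) * 100
	throughputChange := (228.5/164.8 - 1) * 100
	fmt.Printf("time: %+.1f%%, throughput: %+.1f%%\n", timeChange, throughputChange)
}
```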

The 64-bit improvement is small (~0.4%) because the M4's out-of-order engine already absorbs MOV latency via register renaming. On in-order or narrower cores (Cortex-A55/A76) the BIT/BIF optimization would likely show a larger improvement.

Are these changes tested?

Yes. New correctness tests validate all 4 NEON functions against the pure Go reference implementation across 15 input sizes that exercise:

Empty input (length 0)
Scalar-only paths (length 1–7 for 32-bit, 1–3 for 64-bit)
Exact NEON boundary (length 8 for 32-bit, length 4 for 64-bit)
NEON + scalar tail (length 9, 15, 31, 63, 100)
Pure NEON (length 16, 64, 1024)

Each test forces MinInt/MaxInt values at random positions to verify extreme values are handled correctly.

Are there any user-facing changes?

No API changes. This is a pure performance improvement to internal SIMD routines used by Parquet statistics computation and Arrow dictionary operations.

Improve the NEON assembly for min/max operations in two ways:

1. 32-bit (int32/uint32): Widen from .2s (64-bit D-registers, 2 lanes) to .4s (128-bit Q-registers, 4 lanes), doubling throughput with the same instruction count. Use sminv/smaxv/uminv/umaxv for horizontal reduction instead of manual dup+compare.
2. 64-bit (int64/uint64): Replace BSL (bit select) + 4 MOV register saves per loop iteration with BIT/BIF (bit insert if true/false), which are destructive to the accumulator rather than the mask, eliminating all register save/restore overhead. Restructure comparisons for maximum ILP on out-of-order cores.

Also adds correctness tests validating NEON results against the pure Go implementation across boundary sizes (1, 3, 4, 7, 8, 9, 15, 16, etc.) and benchmarks for int32/uint32/int64/uint64 at various input sizes.

zeroshade added 2 commits April 3, 2026 14:40

style: restore Go ABI frame pointer comments in NEON assembly

77847ba

zeroshade wants to merge 2 commits into apache:main from zeroshade:optimize-neon-min-max
