perf: Use AArch64 SVE gather to speed up RLE dictionary decoding #10036

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

When reading dictionary-encoded columns from Parquet, `RleDecoder::get_batch_with_dict` (in `parquet/src/encodings/rle.rs`) is on a very hot path. In the bit-packed branch, the decoder unpacks the indices into a scratch buffer and then materializes the output with a scalar, per-element dictionary lookup:

```rust
buffer[values_read..values_read + num_values]
    .iter_mut()
    .zip(index_buf[..num_values].iter())
    .for_each(|(b, i)| b.clone_from(&dict[*i as usize]));
```

This is a gather: a sequence of data-dependent loads whose addresses come from the decoded indices. It dominates decode time for dictionary columns with primitive value types. On AArch64 CPUs that implement SVE (e.g. Kunpeng 920B / Neoverse-class server cores), this scalar loop leaves the hardware gather capability completely unused, so dictionary decode is slower than necessary on this architecture.

`perf` profiling of a TPC-H workload on AArch64 SVE hardware shows this gather as one of the top hotspots in the Parquet read path for dictionary-encoded primitive columns.

**Describe the solution you'd like**

Add an AArch64-only SVE fast path for the dictionary gather in `get_batch_with_dict`, keeping the scalar implementation as the fallback:

- A small `#[cfg(target_arch = "aarch64")]` module that gathers 4-byte (i32/f32) and 8-byte (i64/f64) dictionary values using SVE indexed loads (`ld1w` / `ld1d` with a vector index), processing one vector-length of elements per iteration via `whilelt` predication (vector-length agnostic).
- Runtime SVE detection via `std::arch::is_aarch64_feature_detected!("sve")`, cached in an `AtomicU8` so the check amortizes to a single relaxed load on the hot path.
- The fast path only engages when `size_of::<T>()` is 4 or 8; all other types, and all non-AArch64 / non-SVE targets, fall back to the existing scalar `clone_from` loop. Results are bit-for-bit identical to the scalar path; only the gather is accelerated.
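
A minimal sketch of how the cached detection and size-based dispatch described in the bullets above could fit together. All names here (`SVE_STATE`, `probe_sve`, `sve_available`, `GatherPath`, `select_path`, `gather`) are hypothetical and not taken from the actual patch. The `T: Copy` bound is an assumption of the sketch, since a size check alone does not prove that a bitwise copy is equivalent to `clone_from`. The SVE branch reuses the scalar loop as a stand-in so the example runs on any target.

```rust
use std::mem::size_of;
use std::sync::atomic::{AtomicU8, Ordering};

// Cache states for the one-time SVE probe (illustrative names).
const SVE_UNKNOWN: u8 = 0;
const SVE_ABSENT: u8 = 1;
const SVE_PRESENT: u8 = 2;

static SVE_STATE: AtomicU8 = AtomicU8::new(SVE_UNKNOWN);

/// Asks the standard library whether the running CPU implements SVE.
#[cfg(target_arch = "aarch64")]
fn probe_sve() -> bool {
    std::arch::is_aarch64_feature_detected!("sve")
}

/// On every other architecture the fast path is never available.
#[cfg(not(target_arch = "aarch64"))]
fn probe_sve() -> bool {
    false
}

/// Cached check: after the first call this is a single relaxed load.
fn sve_available() -> bool {
    match SVE_STATE.load(Ordering::Relaxed) {
        SVE_PRESENT => true,
        SVE_ABSENT => false,
        _ => {
            let present = probe_sve();
            let state = if present { SVE_PRESENT } else { SVE_ABSENT };
            SVE_STATE.store(state, Ordering::Relaxed);
            present
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GatherPath {
    Scalar,
    Sve,
}

/// Picks the SVE path only for 4- or 8-byte elements on SVE hardware.
fn select_path<T>() -> GatherPath {
    if matches!(size_of::<T>(), 4 | 8) && sve_available() {
        GatherPath::Sve
    } else {
        GatherPath::Scalar
    }
}

/// The existing scalar loop, lifted into a helper.
fn gather_scalar<T: Clone>(dict: &[T], indices: &[i32], out: &mut [T]) {
    out.iter_mut()
        .zip(indices.iter())
        .for_each(|(b, i)| b.clone_from(&dict[*i as usize]));
}

/// Dispatches the gather and reports which path was selected.
fn gather<T: Copy>(dict: &[T], indices: &[i32], out: &mut [T]) -> GatherPath {
    let path = select_path::<T>();
    match path {
        // Assumption: a real fast path would call the SVE kernel here.
        // The sketch falls through to the scalar loop in both cases.
        GatherPath::Sve | GatherPath::Scalar => gather_scalar(dict, indices, out),
    }
    path
}
```

Two threads racing on the first call may both run the probe, but they store the same value, so relaxed ordering is enough for this cache.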

This is purely additive: no public API change, and no behaviour change on any existing platform.

**Measured improvement.** Benchmarked on Kunpeng 920B (SVE, 256-bit) over the full TPC-H query set against ~140 GB of data. Build flags were identical for the baseline and the patched build; the only difference is this SVE fast path. Per-function times were measured with `perf`, aggregated by symbol; each value is the mean of 3 runs. The SVE path was confirmed active at runtime via `is_aarch64_feature_detected!("sve")`.

- **Target function (`get_batch_with_dict`), summed over all 22 queries:** **2875 ms → 1622 ms** — a **43.6% reduction (1.77× faster)** on the optimized kernel.
- **End-to-end TPC-H (22 queries):** **+1.83% overall** (table below); 20/22 queries are faster, and the 2 slower ones (Q3 at −1.59%, Q10 at −1.01%) are within run-to-run noise. The end-to-end figure is smaller because dictionary decode is only a fraction of total query time; the kernel-level number above isolates the actual win.

| Query | Before (s) | After (s) | Δ (positive = faster) |
| ----- | ---------- | --------- | --------------------- |
| Q1    | 6.037      | 5.987     | +0.83%     |
| Q2    | 1.208      | 1.190     | +1.45%     |
| Q3    | 5.584      | 5.673     | −1.59%     |
| Q4    | 3.190      | 3.127     | +1.98%     |
| Q5    | 4.432      | 4.407     | +0.57%     |
| Q6    | 1.473      | 1.398     | +5.06%     |
| Q7    | 6.441      | 6.265     | +2.73%     |
| Q8    | 6.057      | 5.983     | +1.23%     |
| Q9    | 23.494     | 22.921    | +2.44%     |
| Q10   | 6.302      | 6.366     | −1.01%     |
| Q11   | 2.465      | 2.436     | +1.18%     |
| Q12   | 3.070      | 2.953     | +3.82%     |
| Q13   | 9.954      | 9.708     | +2.46%     |
| Q14   | 3.676      | 3.631     | +1.24%     |
| Q15   | 2.862      | 2.798     | +2.25%     |
| Q16   | 3.458      | 3.402     | +1.62%     |
| Q17   | 3.349      | 3.327     | +0.66%     |
| Q18   | 10.431     | 10.278    | +1.46%     |
| Q19   | 4.756      | 4.610     | +3.07%     |
| Q20   | 4.888      | 4.845     | +0.87%     |
| Q21   | 50.797     | 49.641    | +2.28%     |
| Q22   | 3.937      | 3.837     | +2.55%     |
| Total | 167.86     | 164.78    | +1.83%     |
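
For reference, the headline figures follow from the quoted measurements when Δ is read as the relative time saved, (before − after) / before:

```math
\begin{aligned}
\Delta_{\text{kernel}} &= \frac{2875 - 1622}{2875} = \frac{1253}{2875} \approx 43.6\% \\
\text{speedup}_{\text{kernel}} &= \frac{2875}{1622} \approx 1.77\times \\
\Delta_{\text{total}} &= \frac{167.86 - 164.78}{167.86} = \frac{3.08}{167.86} \approx 1.83\%
\end{aligned}
```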

**Describe alternatives you've considered**

- **Rely on autovectorization** — the compiler does not turn this arbitrary-index gather into SVE gather instructions.
- **`std::simd` / portable SIMD** — gather with arbitrary indices is not available on stable, and portable fixed-width SIMD cannot express SVE's vector-length-agnostic (VLA) gather.
- **Stable `std::arch` SVE intrinsics** — SVE intrinsics are still unstable in Rust, which is why a small, audited `asm!` block is used; it can be swapped for intrinsics once they stabilize. This is the main difference from existing SIMD in the repo: `arrow-arith`'s AVX paths are `target_feature`-gated at compile time and `parquet`'s `simdutf8` path is feature-gated, whereas here runtime detection is needed because a portable binary cannot know at compile time whether SVE is available or how wide its vectors are.
- **NEON** — fixed-width NEON has no true gather instruction, so it offers little benefit for this access pattern.
- **Leave as-is** — simplest, but forfeits a meaningful win on a growing class of AArch64 SVE server CPUs.

**Additional context**

- Scope is limited to `RleDecoder::get_batch_with_dict`; the encoder, `get`, `get_batch`, and `skip` are untouched.
- Prior art in the repo for arch-specific SIMD acceleration: `arrow-arith/src/aggregate.rs` (AVX512/AVX dispatch) and `parquet/src/util/utf8.rs` (`simdutf8`). This proposal follows the same spirit, adding runtime-detected SVE for AArch64.
- The SVE path uses `unsafe` inline assembly. Safety contract for each helper: `dict` must be valid for reads at every index in `indices` (each index must be non-negative and less than the dictionary length), `indices` must point to `count` valid `i32`s, and `output` must have `count` writable slots; the public entry point only dispatches into the helper after confirming SVE availability and that `size_of::<T>()` is 4 or 8.
- I'm happy to open a PR with the implementation, an SVE-specific test plus a Criterion benchmark, and CI notes for exercising the AArch64 path. I've implemented this with runtime detection (zero cost on other targets, automatic on SVE hardware); happy to gate it behind a Cargo feature instead if you'd prefer a more conservative default.
- This is my first contribution to arrow-rs, so apologies in advance if I've missed any conventions — happy to adjust the issue/PR format, benchmarks, or anything else per your guidance. Just let me know.
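
To make the safety contract above concrete, here is a hedged sketch of what a 4-byte gather helper and a checked wrapper might look like. It is not the proposed implementation: the function names (`gather_u32_sve`, `try_sve`, `gather_u32_checked`), the fixed `u32` element type, and the up-front index validation are assumptions for illustration. The assembly is the usual SVE `whilelt` / `ld1w` loop, and `b.mi` is the plain condition-code spelling of `b.first`.

```rust
/// Gathers `count` 4-byte values: `output[k] = dict[indices[k]]`.
///
/// # Safety
/// - the running CPU must implement SVE;
/// - `indices` must point to `count` readable `i32`s, each a
///   non-negative, in-bounds index into the allocation behind `dict`;
/// - `output` must point to `count` writable `u32` slots.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "sve")]
unsafe fn gather_u32_sve(dict: *const u32, indices: *const i32, output: *mut u32, count: usize) {
    unsafe {
        std::arch::asm!(
            // p0 marks the lanes whose element number i + lane is < count.
            "whilelt p0.s, {i}, {n}",
            "2:",
            // Contiguous load of one vector of indices.
            "ld1w {{ z0.s }}, p0/z, [{idx}, {i}, lsl #2]",
            // Gather: load from dict + sign_extend(index) * 4.
            "ld1w {{ z1.s }}, p0/z, [{dict}, z0.s, sxtw #2]",
            // Contiguous store of the gathered values.
            "st1w {{ z1.s }}, p0, [{out}, {i}, lsl #2]",
            // Advance by the number of 32-bit lanes per vector.
            "incw {i}",
            "whilelt p0.s, {i}, {n}",
            // Loop while the first lane is still active (b.first).
            "b.mi 2b",
            i = inout(reg) 0usize => _,
            n = in(reg) count,
            idx = in(reg) indices,
            dict = in(reg) dict,
            out = in(reg) output,
            out("p0") _,
            out("v0") _,
            out("v1") _,
            options(nostack),
        );
    }
}

/// Runs the SVE kernel if the CPU supports it; returns whether it ran.
#[cfg(target_arch = "aarch64")]
fn try_sve(dict: &[u32], indices: &[i32], output: &mut [u32]) -> bool {
    if !std::arch::is_aarch64_feature_detected!("sve") {
        return false;
    }
    // SAFETY: SVE was detected above, and the caller has already checked
    // that the slices have matching lengths and every index is in bounds.
    unsafe { gather_u32_sve(dict.as_ptr(), indices.as_ptr(), output.as_mut_ptr(), indices.len()) };
    true
}

/// Other architectures never take the SVE path.
#[cfg(not(target_arch = "aarch64"))]
fn try_sve(_dict: &[u32], _indices: &[i32], _output: &mut [u32]) -> bool {
    false
}

/// Safe entry point: enforces the contract, then dispatches.
fn gather_u32_checked(dict: &[u32], indices: &[i32], output: &mut [u32]) {
    assert_eq!(indices.len(), output.len(), "one output slot per index");
    // A negative i32 becomes a huge usize, so it fails this check too.
    assert!(
        indices.iter().all(|&i| (i as usize) < dict.len()),
        "index out of dictionary bounds"
    );
    if try_sve(dict, indices, output) {
        return;
    }
    // Scalar fallback, equivalent to the existing per-element lookup.
    for (slot, &i) in output.iter_mut().zip(indices) {
        *slot = dict[i as usize];
    }
}
```

Validating every index before the kernel adds a pass over the index buffer. The sketch does it to keep the wrapper safe in isolation; whether a real implementation validates here or relies on checks elsewhere in the decoder is a design choice this example does not settle.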
