Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When reading dictionary-encoded columns from Parquet, RleDecoder::get_batch_with_dict (in parquet/src/encodings/rle.rs) is on a very hot path. In the bit-packed branch, the decoder unpacks the indices into a scratch buffer and then materializes the output with a scalar, per-element dictionary lookup:

    buffer[values_read..values_read + num_values]
        .iter_mut()
        .zip(index_buf[..num_values].iter())
        .for_each(|(b, i)| b.clone_from(&dict[*i as usize]));

This is a sequence of data-dependent loads (a gather), and it dominates decode time for dictionary columns with primitive value types. On AArch64 CPUs that implement SVE (e.g. Kunpeng 920 / Neoverse-class server cores), this loop leaves the hardware gather capability completely unused, so dictionary decode is slower than necessary on this architecture.
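For reference, the scalar gather can be isolated as a small free function. This is a minimal sketch rather than the actual arrow-rs code: the name `gather_scalar` and its parameters are hypothetical, and it assumes every index is non-negative and in range (it panics otherwise).

```rust
/// Scalar dictionary gather: out[k] = dict[indices[k]].
/// Hypothetical stand-alone helper that mirrors the loop quoted in this issue.
fn gather_scalar<T: Clone>(dict: &[T], indices: &[i32], out: &mut [T]) {
    assert!(out.len() >= indices.len(), "output too short");
    out.iter_mut()
        .zip(indices.iter())
        // One bounds-checked, data-dependent load per element.
        .for_each(|(slot, &i)| slot.clone_from(&dict[i as usize]));
}
```

Each iteration computes its load address from an index value, which is the access pattern a hardware gather instruction is designed for.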
perf profiling of a TPC-H workload on AArch64 SVE hardware shows this gather as one of the top hotspots in the Parquet read path for dictionary-encoded primitive columns.
Describe the solution you'd like
Add an AArch64-only SVE fast path for the dictionary gather in get_batch_with_dict, keeping the scalar implementation as the fallback:
- A small #[cfg(target_arch = "aarch64")] module that gathers 4-byte (i32/f32) and 8-byte (i64/f64) dictionary values using SVE indexed loads (ld1w / ld1d with a vector index), processing one vector length of elements per iteration via whilelt predication (vector-length agnostic).
- Runtime SVE detection via std::arch::is_aarch64_feature_detected!("sve"), cached in an AtomicU8 so the check amortizes to a single relaxed load on the hot path.
- The fast path only engages when size_of::<T>() is 4 or 8; all other types, and all non-AArch64 / non-SVE targets, fall back to the existing scalar clone_from loop. Results are bit-for-bit identical to the scalar path; only the gather is accelerated.
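The vector-length-agnostic loop structure described in the list above can be modelled in portable Rust. This is only a scalar model under stated assumptions, not the proposed asm! kernel: `gather_vla_model` is a hypothetical name, and the `lanes` parameter stands in for the hardware vector length in elements (for example, 8 lanes of 32 bits on a 256-bit SVE implementation).

```rust
/// Portable model of a predicated, vector-length-agnostic gather loop.
/// `lanes` plays the role of the hardware vector length in elements.
fn gather_vla_model<T: Copy>(dict: &[T], indices: &[i32], out: &mut [T], lanes: usize) {
    assert!(lanes > 0, "vector length must be non-zero");
    assert!(out.len() >= indices.len(), "output too short");
    let n = indices.len();
    let mut i = 0;
    while i < n {
        // Model of the predicate: only lanes whose element index is < n are active,
        // so the final partial vector needs no separate scalar tail loop.
        let active = lanes.min(n - i);
        for lane in 0..active {
            // Model of the gather load followed by a contiguous store.
            out[i + lane] = dict[indices[i + lane] as usize];
        }
        // Advance by one full vector of elements, whatever the vector length is.
        i += lanes;
    }
}
```

Because the result does not depend on `lanes`, the same loop is correct for any vector length, which is the property that lets one binary run on SVE implementations of different widths.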
This is purely additive: no public API change, and no behaviour change on any existing platform.
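A sketch of how the cached detection and the size check might fit together is shown here. The names `SVE_STATE`, `detect_sve`, `sve_available`, and `use_sve_gather` are hypothetical, as is the three-state encoding; only `std::arch::is_aarch64_feature_detected!("sve")` and `AtomicU8` come from the proposal itself.

```rust
use std::mem::size_of;
use std::sync::atomic::{AtomicU8, Ordering};

// Hypothetical three-state cache: not yet probed, SVE absent, SVE present.
const UNKNOWN: u8 = 0;
const ABSENT: u8 = 1;
const PRESENT: u8 = 2;

static SVE_STATE: AtomicU8 = AtomicU8::new(UNKNOWN);

#[cfg(target_arch = "aarch64")]
fn detect_sve() -> bool {
    std::arch::is_aarch64_feature_detected!("sve")
}

#[cfg(not(target_arch = "aarch64"))]
fn detect_sve() -> bool {
    // Non-AArch64 targets never take the SVE path.
    false
}

/// Returns whether SVE is usable; after the first call this is one relaxed load.
fn sve_available() -> bool {
    match SVE_STATE.load(Ordering::Relaxed) {
        PRESENT => true,
        ABSENT => false,
        _ => {
            // A racing thread would compute and store the same value, so Relaxed suffices.
            let found = detect_sve();
            SVE_STATE.store(if found { PRESENT } else { ABSENT }, Ordering::Relaxed);
            found
        }
    }
}

/// The fast path is eligible only for 4-byte and 8-byte value types on SVE hardware.
fn use_sve_gather<T>() -> bool {
    matches!(size_of::<T>(), 4 | 8) && sve_available()
}
```

When `use_sve_gather::<T>()` returns false, the caller would simply run the existing scalar loop, which is what keeps behaviour unchanged on every other platform.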
Measured improvement. Benchmarked on Kunpeng 920B (SVE, 256-bit) over the full TPC-H query set against ~140 GB of data. Build flags were identical for the baseline and the patched build; the only difference is this SVE fast path. Per-function times were measured with perf, aggregated by symbol; each value is the mean of 3 runs. The SVE path was confirmed active at runtime via is_aarch64_feature_detected!("sve").
- Target function (get_batch_with_dict), summed over all 22 queries: 2875 ms → 1622 ms, a 43.6% reduction (1.77× faster) on the optimized kernel.
- End-to-end TPC-H (22 queries): +1.83% overall (table below); 20/22 queries are faster, and the 2 outliers (Q3 at −1.59%, Q10 at −1.01%) are within run-to-run noise. The end-to-end figure is smaller because dictionary decode is only a fraction of total query time; the kernel-level number above isolates the actual win.
| Query | Before (s) | After (s) | Δ (+ = faster) |
|---|---|---|---|
| Q1 | 6.037 | 5.987 | +0.83% |
| Q2 | 1.208 | 1.190 | +1.45% |
| Q3 | 5.584 | 5.673 | −1.59% |
| Q4 | 3.190 | 3.127 | +1.98% |
| Q5 | 4.432 | 4.407 | +0.57% |
| Q6 | 1.473 | 1.398 | +5.06% |
| Q7 | 6.441 | 6.265 | +2.73% |
| Q8 | 6.057 | 5.983 | +1.23% |
| Q9 | 23.494 | 22.921 | +2.44% |
| Q10 | 6.302 | 6.366 | −1.01% |
| Q11 | 2.465 | 2.436 | +1.18% |
| Q12 | 3.070 | 2.953 | +3.82% |
| Q13 | 9.954 | 9.708 | +2.46% |
| Q14 | 3.676 | 3.631 | +1.24% |
| Q15 | 2.862 | 2.798 | +2.25% |
| Q16 | 3.458 | 3.402 | +1.62% |
| Q17 | 3.349 | 3.327 | +0.66% |
| Q18 | 10.431 | 10.278 | +1.46% |
| Q19 | 4.756 | 4.610 | +3.07% |
| Q20 | 4.888 | 4.845 | +0.87% |
| Q21 | 50.797 | 49.641 | +2.28% |
| Q22 | 3.937 | 3.837 | +2.55% |
| Total | 167.86 | 164.78 | +1.83% |
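For readers who want to check the headline figures, the percentages follow from the usual definitions: reduction is (before − after) / before, and speedup is before / after. The helper names in this small sketch are hypothetical; the inputs in the usage note are the numbers reported in this issue.

```rust
/// Percentage of time saved when going from `before` to `after`.
fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Speedup factor: how many times faster `after` is than `before`.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}
```

With the kernel totals, `percent_reduction(2875.0, 1622.0)` is about 43.6 and `speedup(2875.0, 1622.0)` is about 1.77; with the end-to-end totals, `percent_reduction(167.86, 164.78)` is about 1.83.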
Describe alternatives you've considered
- Rely on autovectorization: the compiler does not turn this arbitrary-index gather into SVE gather instructions.
- std::simd / portable SIMD: gather with arbitrary indices is not available on stable, and portable fixed-width SIMD cannot express SVE's vector-length-agnostic (VLA) gather.
- Stable std::arch SVE intrinsics: SVE intrinsics are still unstable in Rust, which is why a small, audited asm! block is used; it can be swapped for intrinsics once they stabilize. This is the main difference from existing SIMD in the repo: arrow-arith's AVX paths are target_feature-gated at compile time and parquet's simdutf8 path is feature-gated, whereas here runtime detection is needed because SVE availability and vector width are not known at compile time for portable binaries.
- NEON: fixed-width NEON has no true gather instruction, so it offers little benefit for this access pattern.
- Leave as-is: simplest, but forfeits a meaningful win on a growing class of AArch64 SVE server CPUs.
Additional context
- Scope is limited to RleDecoder::get_batch_with_dict; the encoder, get, get_batch, and skip are untouched.
- Prior art in the repo for arch-specific SIMD acceleration: arrow-arith/src/aggregate.rs (AVX512/AVX dispatch) and parquet/src/util/utf8.rs (simdutf8). This proposal follows the same spirit, adding runtime-detected SVE for AArch64.
- The SVE path uses unsafe inline assembly. Safety contract for each helper: dict must be valid for reads up to the maximum index, indices must point to count valid i32s, and output must have count writable slots; the public entry point only dispatches into the helpers after confirming SVE availability and that size_of::<T>() is 4 or 8.
- I'm happy to open a PR with the implementation, an SVE-specific test plus a Criterion benchmark, and CI notes for exercising the AArch64 path. I've implemented this with runtime detection (zero cost on other targets, automatic on SVE hardware); happy to gate it behind a Cargo feature instead if you'd prefer a more conservative default.
- This is my first contribution to arrow-rs, so apologies in advance if I've missed any conventions — happy to adjust the issue/PR format, benchmarks, or anything else per your guidance. Just let me know.
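To make the safety contract mentioned above concrete, here is a minimal sketch of an unsafe helper with a documented contract and a safe wrapper that establishes it. Everything in it is an assumption for illustration: `gather_unchecked` is a scalar stand-in for the SVE asm! kernel, `gather_checked` is a hypothetical wrapper, and the up-front index validation is shown only to spell out the contract, not as a claim about how the decoder enforces it.

```rust
/// Scalar stand-in for the unsafe kernel (the proposal uses SVE inline assembly).
///
/// # Safety
/// - `indices` must point to `count` readable `i32` values,
/// - every index must be non-negative and in bounds for the dictionary behind `dict`,
/// - `output` must point to `count` writable slots of `T`.
unsafe fn gather_unchecked<T: Copy>(
    dict: *const T,
    indices: *const i32,
    output: *mut T,
    count: usize,
) {
    for k in 0..count {
        unsafe {
            let idx = indices.add(k).read() as usize;
            output.add(k).write(dict.add(idx).read());
        }
    }
}

/// Hypothetical safe wrapper: checks the contract, then calls the unsafe helper.
fn gather_checked<T: Copy>(dict: &[T], indices: &[i32], output: &mut [T]) -> Result<(), String> {
    if output.len() < indices.len() {
        return Err(format!(
            "output has {} slots, need {}",
            output.len(),
            indices.len()
        ));
    }
    if let Some(&bad) = indices.iter().find(|&&i| i < 0 || i as usize >= dict.len()) {
        return Err(format!("index {} out of range for dictionary of {}", bad, dict.len()));
    }
    // SAFETY: lengths and index bounds were verified above.
    unsafe { gather_unchecked(dict.as_ptr(), indices.as_ptr(), output.as_mut_ptr(), indices.len()) };
    Ok(())
}
```

A test along these lines, comparing the fast path against the scalar loop for in-range inputs and asserting rejection of out-of-range ones, is one way the contract could be exercised in CI.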