Bulk-fill definition levels for majority-null leaf columns#9967
Conversation
HippoBaro
left a comment
There was a problem hiding this comment.
I love it! I think the code can be simplified a bit, and we can probably remove the unsafe by slicing the NullBuffer and using nulls.valid_indices()/nulls.iter() instead of constructing BitIndexIterator directly.
NullBuffer will do the BitIndexIterator business for you!
| max_def_level - (!valid as i16) | ||
| })); | ||
| } | ||
| info.non_null_indices.reserve(len - valid_in_range); |
There was a problem hiding this comment.
This is the null count right? We should reserve valid_in_range directly I think?
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/arrow-parquet-null-heavy (7341370) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
…umns When writing a nullable leaf (primitive) Arrow array, `write_leaf` built the definition-level buffer one element at a time, mapping each null bit to a level. For columns that are mostly null this does ~num_rows of branchy work and allocates a num_rows level buffer even though almost every level is the same value. Add a length-gated bulk-fill path: when the column is majority-null and the sub-range is large enough to amortize the gate's per-call cost, build the definition levels by bulk-filling the null level (a vectorized memset) and overwriting only the non-null positions found via `NullBuffer::valid_indices()`. The per-row path is kept for non-majority-null arrays and for the small sub-ranges produced by list/struct write paths, so those shapes are not regressed. Contributes to apache#9731. Complements apache#9954's all-null fast path by covering the sparse (mostly-but-not-entirely-null) case it does not handle. Threshold sweep on Ryzen 9 9950X (parquet/arrow_writer benches, /default variant, vs main): T primitive list_primitive primitive_sparse list_primitive_sparse ---------------------------------------------------------------------- 0 -3.0% +2.6% -36.1% +7.8% 16 -1.4% +1.8% -34.8% +2.8% 32 -1.1% -0.1% -35.1% +1.7% 64 -1.1% +0.7% -34.5% +1.7% <- chosen 128 -1.0% +1.5% -35.1% +2.4% 256 -1.4% +1.4% -35.1% +2.7% T=0 reproduces the per-call slice/popcount regression on list_primitive_sparse_99pct_null (+7.8%, matches the criterion bot's original measurement on the unguarded version). The +1.7% floor at T>=32 is the structural cost of evaluating the gate itself across ~10K small write_leaf calls in the list path; reducing it further would require hoisting the decision into the caller. T=64 matches T=32 on every shape and gives ~12x margin over the avg list length of ~5. Final benchmarks vs main on Ryzen 9 9950X (T=64, /default variants): primitive/default -1.5% primitive_non_null/default -2.8% primitive_sparse_99pct_null/default -35.1% primitive_all_null/default -66.4% list_primitive/default +1.8% (within noise) list_primitive_non_null/default -0.7% (no change, p=0.45) list_primitive_sparse_99pct_null/default +3.0% (gate-check floor) struct_sparse_99pct_null/default -4.9% bool/default +2.2%
7341370 to
d825a1b
Compare
|
Thanks — applied both. Switched the fast path to Also caught a regression the benchmark bot surfaced that I had missed in my own measurement — I hadn't benchmarked the list paths. Added a length gate on entering the new path:
Breakeven for the list-sparse case is between T=0 and T=32. The +1.7% floor at T≥32 is the structural cost of evaluating the gate across ~10K calls, not the fast-path execution; reducing it further would require hoisting the decision into Re-triggering the bench. |
|
run benchmark arrow_writer |
|
Hi @RyanJamesStewart, thanks for the request (#9967 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, coderfender, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mkleen, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas. File an issue against this benchmark runner |
|
Pushed a fmt fixup (the On the bench: the run on On the bench re-trigger: would be good to have the post-gate numbers on record before further review, if anyone on the whitelist is willing to re-run |
| let bits = nulls.inner(); | ||
| info.def_levels.extend_from_iter(range.clone().map(|i| { | ||
| // Safety: range.end was asserted to be in bounds earlier | ||
| let valid = unsafe { bits.value_unchecked(i) }; | ||
| max_def_level - (!valid as i16) | ||
| })); | ||
| info.non_null_indices.reserve(len); | ||
| info.non_null_indices.extend( | ||
| BitIndexIterator::new(bits.inner(), bits.offset() + range.start, len) | ||
| .map(|i| i + range.start), | ||
| ); |
There was a problem hiding this comment.
| let bits = nulls.inner(); | |
| info.def_levels.extend_from_iter(range.clone().map(|i| { | |
| // Safety: range.end was asserted to be in bounds earlier | |
| let valid = unsafe { bits.value_unchecked(i) }; | |
| max_def_level - (!valid as i16) | |
| })); | |
| info.non_null_indices.reserve(len); | |
| info.non_null_indices.extend( | |
| BitIndexIterator::new(bits.inner(), bits.offset() + range.start, len) | |
| .map(|i| i + range.start), | |
| ); | |
| info.def_levels.extend_from_iter( | |
| nulls.iter().map(|valid| max_def_level - (!valid as i16)), | |
| ); |
No?
There was a problem hiding this comment.
Want to make sure I'm reading the suggestion right. nulls here is the full-array NullBuffer (line 953, unsliced; we just assert range.end <= nulls.len()), so nulls.iter() would produce nulls.len() levels rather than len. Did you have in mind slicing first the way the bulk-fill branch a few lines up does, i.e. let range_nulls = nulls.slice(range.start, len) and then iterating range_nulls? That would also let non_null_indices use range_nulls.valid_indices().map(|i| i + range.start), which I think still needs to happen for the downstream value path. Or were you proposing a broader restructure where non_null_indices isn't populated on this branch?
There was a problem hiding this comment.
Did you have in mind slicing first the way the bulk-fill branch a few lines up does, i.e.
let range_nulls = nulls.slice(range.start, len)and then iteratingrange_nulls? That would also letnon_null_indicesuserange_nulls.valid_indices().map(|i| i + range.start), which I think still needs to happen for the downstream value path
I think this.
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/arrow-parquet-null-heavy (a717169) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
etseidl
left a comment
There was a problem hiding this comment.
Thanks @RyanJamesStewart, this looks like a nice improvement. Just a few nits.
| // Bulk-fill fast path. Gated on: | ||
| // - `len >= 64`: per-call slice/popcount/iterator overhead only | ||
| // amortizes on sizable sub-ranges. List/struct paths call | ||
| // write_leaf many times with tiny ranges (avg list length 1-5); | ||
| // paying any per-call popcount there would regress them. A | ||
| // threshold sweep at T={0,16,32,64,128,256} on Ryzen 9 9950X | ||
| // shows the regression floor settles by T=32 and the choice of | ||
| // 64 gives ~12x margin over avg list length without losing the | ||
| // flat-primitive wins. | ||
| // - `nulls.null_count() * 2 >= nulls.len()`: cached `null_count()` | ||
| // is O(1), so this check is free. We use the buffer-level density | ||
| // as a heuristic for the sub-range; for full-array writes (the | ||
| // primary target — flat primitive columns) it's exact. | ||
| // Note: even when this gate skips the fast path, evaluating the gate | ||
| // itself across high-frequency call sites (~10K calls in some list | ||
| // benchmarks) is a small structural cost (~+1-2% on list-sparse | ||
| // cases). It's the price of having any gate at all on this hot path; | ||
| // reducing it further would require hoisting the decision into the | ||
| // caller. The wins on the targeted shapes (-35% sparse-primitive, | ||
| // -66% all-null primitive) far outweigh it. |
There was a problem hiding this comment.
Please refer to https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md#ai-generated-submissions
I think this exposition is better in the PR than the code.
| // reducing it further would require hoisting the decision into the | ||
| // caller. The wins on the targeted shapes (-35% sparse-primitive, | ||
| // -66% all-null primitive) far outweigh it. | ||
| if len >= 64 && nulls.null_count() * 2 >= nulls.len() { |
There was a problem hiding this comment.
Please replace 64 with a constant. Documentation for the constant can point to this PR for an explanation as to how 64 was arrived at.
- Extract the `64` threshold into `BULK_FILL_MIN_LEN` const with a doc comment pointing to this PR for the sweep rationale. - Trim the in-code comment block; threshold-sweep rationale moves to the PR description per CONTRIBUTING.md guidance on AI-generated submissions. - Refactor the slow-path else branch to mirror the fast-path slice idiom: `let range_nulls = nulls.slice(range.start, len);` then `range_nulls.iter()` for `def_levels` and `range_nulls.valid_indices().map(|i| i + range.start)` for `non_null_indices`. Drops the `unsafe value_unchecked` and the manual `BitIndexIterator::new` offset arithmetic. - Drop the now-unused `BitIndexIterator` import. Behavior unchanged. `cargo test -p parquet --features arrow --lib arrow_writer` green (136 tests); clippy and fmt clean.
a717169 to
4eaed80
Compare
|
Pushed v2. Four changes from review:
Tests, clippy, and fmt clean locally; CI is in flight. The fast-path branch and gate are unchanged, so the targeted bench shape (the latest bot run showed |
etseidl
left a comment
There was a problem hiding this comment.
Thanks @RyanJamesStewart, looking better. A few suggestions to reduce some code duplication.
| info.non_null_indices | ||
| .extend(range_nulls.valid_indices().map(|i| i + range.start)); | ||
| } else { | ||
| let range_nulls = nulls.slice(range.start, len); |
| info.non_null_indices | ||
| .extend(range_nulls.valid_indices().map(|i| i + range.start)); |
There was a problem hiding this comment.
and remove this (seems a reserve went missing as well).
…ulls arm The `logical_nulls = None` arm of `write_leaf` extends `non_null_indices` by `len` items but didn't pre-reserve capacity. Add the matching `reserve(len)` before the `extend(range.clone())` so the no-nulls path allocates once instead of relying on `Vec`'s amortized growth. Behavior unchanged. Caught by @etseidl's review round on PR apache#9967.
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/arrow-parquet-null-heavy (3f69dff) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer env:
BENCH_FILTER: list_prim |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/arrow-parquet-null-heavy (3f69dff) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
Hoist the shared work in `write_leaf`'s `Some(nulls)` arm out of the if/else where it can be done without changing the cost profile: - `let range_nulls = nulls.slice(range.start, len)`: O(1) metadata adjustment, safe to compute once and share between branches. - `info.non_null_indices.extend(range_nulls.valid_indices()...)`: a single iterator consumption after the branches, identical work in both paths. The `range_nulls.null_count()` popcount and `reserve(valid_in_range)` stay INSIDE the fast-path branch where bulk-fill needs the count anyway. An earlier revision (59ced5b) hoisted the popcount out and caused 11-24% regressions on `list_primitive*` benches because `write_leaf` is called ~10K times with avg range ~5 from list paths; paying the popcount per call dominated the saved hash probes. Bench (`list_primitive_sparse_99pct_null/default`, local hardware, this revision): 9.83 ms, in line with main's 10.7 ms on GKE c4a-highmem-16, and well below the 12.1 ms the prior dedup attempt showed. Behavior unchanged. 136 arrow_writer tests green; clippy + fmt clean.
…ulls arm The `logical_nulls = None` arm of `write_leaf` extends `non_null_indices` by `len` items but didn't pre-reserve capacity. Add the matching `reserve(len)` before the `extend(range.clone())` so the no-nulls path allocates once instead of relying on `Vec`'s amortized growth. Behavior unchanged. Caught by @etseidl's review round on PR apache#9967.
3f69dff to
df1f323
Compare
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
…erve The partial dedup in 695f178 (hoisting `let range_nulls = nulls.slice(...)` before the gate and `info.non_null_indices.extend(...)` after the if/else) regressed `list_primitive/default` 21% and `list_primitive_sparse_99pct_null/default` 13% on the GKE c4a-highmem-16 bench. Same operations as the v2 shape, but the compiler produced different code for the nullable-list path. Restore the v2 structure (slice declared inside each branch, both extends inline). Keep the `None`-arm `info.non_null_indices.reserve(len)` from df1f323 since that fixed `list_primitive_non_null/default` 1.12 -> 1.00 and addresses etseidl's reserve nit substantively. Bench shapes by revision on c4a-highmem-16: - v2 (4eaed80): list_primitive/default 1.00, non_null 1.00, sparse_99pct 1.00 - v3 (695f178, full dedup): 1.24, 1.12, 1.12 - v4 (df1f323, partial dedup + reserve): 1.21, 1.00, 1.13 - v5 (this commit): expected back to v2's clean shape with reserve fix retained 136 arrow_writer tests green; fmt + clippy clean.
|
Sorry @RyanJamesStewart I led you astray. I think the regression is due to the } else {
let nulls = nulls.inner();
info.def_levels.extend_from_iter(range.clone().map(|i| {
// Safety: range.end was asserted to be in bounds earlier
let valid = unsafe { nulls.value_unchecked(i) };
max_def_level - (!valid as i16)
}));
info.non_null_indices.reserve(len);
info.non_null_indices.extend(
BitIndexIterator::new(nulls.inner(), nulls.offset() + range.start, len)
.map(|i| i + range.start),
);
} |
|
run benchmark arrow_writer env:
BENCH_FILTER: list_primitive/ |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/arrow-parquet-null-heavy (cd166cc) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
Per etseidl's diagnosis on PR apache#9967: the `nulls.slice()` call in write_leaf's not-enough-nulls arm is the regression source on list_primitive paths, not the dedup structure. The list_primitive paths call write_leaf ~10K times with avg range ~5, where per-call NullBuffer slice + range_nulls.iter() / valid_indices overhead dominates. Restore the v1 (original) slow-arm pattern: - `let bits = nulls.inner()` once - `def_levels.extend_from_iter(range.clone().map(...))` with `bits.value_unchecked(i)` (safe: range.end asserted in bounds) - `non_null_indices` via `BitIndexIterator::new(bits.inner(), bits.offset() + range.start, len)` Fast path keeps `nulls.slice()` / `valid_indices()` since bulk-fill needs them and `len >= BULK_FILL_MIN_LEN` (64) amortizes the cost. 136 arrow_writer tests green; fmt + clippy clean.
|
Pushed 552c773 with that slow-arm pattern: direct BitIndexIterator on bits.inner() with bits.offset() + range.start, value_unchecked inside the def_levels extend. No nulls.slice() per call on the cold path. Fast path keeps nulls.slice() + valid_indices() since the len >= 64 gate amortizes the slice cost there. Tests, clippy, fmt clean locally on the branch. Ready when you can retrigger the bench. |
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/arrow-parquet-null-heavy (552c773) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer env:
BENCH_FILTER: string/zstd_parquet_2|string_non_null/parquet_2 |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/arrow-parquet-null-heavy (552c773) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
etseidl
left a comment
There was a problem hiding this comment.
Benches look good now. Thanks for working with me @RyanJamesStewart.
@HippoBaro any last thoughts?
|
I'll merge this one in and if @HippoBaro has any additional thoughts or suggestions perhaps we can address them as a follow on PR |
|
Apologies I was OOO. This looks good to me, thanks! |
Which issue does this PR close?
AI assistance
Implementation drafted with AI assistance and iterated against the benchmarks below. I've reviewed and own the code, including the gate threshold which I picked from the sweep in Threshold (
BULK_FILL_MIN_LEN). Per the project's CONTRIBUTING guidance on AI-generated submissions.Rationale for this change
When writing a nullable leaf (primitive) Arrow array,
write_leafbuilds the definition-level buffer one element at a time, mapping each null bit to a level. For columns that are mostly null this does ~num_rowsof branchy work and allocates anum_rows-element level buffer even though almost every produced level is the same value. #9954 adds an O(1) fast path for the entirely null case; this PR covers the sparse (mostly-but-not-entirely null) case it doesn't handle, the literal subject of #9731 ("a column that is 99% null … ~100x more work than necessary").What changes are included in this PR?
A single popcount pass over the null mask (
Buffer::count_set_bits_offset, O(num_rows/64)) counts the valid values in the range. When the slice is majority-null, the definition-level buffer is bulk-filled with the null level (a vectorizedVec::resizememset) and only the non-null positions (fromNullBuffer::valid_indices()) are overwritten. The existing per-row path is kept for non-majority-null slices, so balanced and null-light columns are unaffected. Both branches share the samelet range_nulls = nulls.slice(range.start, len)slicing idiom; the slow path usesrange_nulls.iter()for the def-level map andrange_nulls.valid_indices().map(|i| i + range.start)fornon_null_indices, with nounsafe. Output is byte-identical: the level values are unchanged, just produced via memset+scatter (fast path) or via the high-levelNullBufferiterators (slow path) instead of a manualBitIndexIteratorwalk.Threshold (
BULK_FILL_MIN_LEN)The bulk-fill fast path is gated on two conditions:
len >= BULK_FILL_MIN_LEN(currently 64). Per-call slice/popcount/iterator overhead only amortizes on sizable sub-ranges. List/struct paths callwrite_leafmany times with tiny ranges (avg list length 1-5); paying any per-call popcount there would regress them. A threshold sweep at T = {0, 16, 32, 64, 128, 256} on Ryzen 9 9950X shows the regression floor settles by T=32, and the choice of 64 gives ~12x margin over the average list length without losing the flat-primitive wins.nulls.null_count() * 2 >= nulls.len(). The cachednull_count()is O(1), so this check is free. We use the buffer-wide density as a heuristic for the sub-range; for full-array writes (the primary target, flat primitive columns) it's exact.Even when the gate skips the fast path, evaluating it across high-frequency call sites (~10K calls in some list benchmarks) is a small structural cost (~1-2% on list-sparse cases). The wins on the targeted shapes (-35% sparse-primitive, -66% all-null primitive) far outweigh that. Reducing the cost further would require hoisting the decision into the caller.
Are these changes tested?
Existing tests cover this path:
cargo test -p parquet --features arrow --lib arrow_writeris green (136 tests, full of nulls and roundtrips); fullcargo test -p parquet --features arrowgreen modulo the pre-existingPARQUET_TEST_DATAsubmodule failures (unrelated, same onmain).cargo clippy -p parquet --features arrow --libandcargo fmt --checkclean. Theunsafe get_unchecked_mutflagged in the original revision was replaced viaNullBuffer::valid_indices(); the slow-path also dropped itsunsafe value_uncheckedfor the same reason.Are there any user-facing changes?
None.
Benchmarks
cargo bench -p parquet --bench arrow_writer, 1M rows × 7 nullable primitive columns, local Ryzen 9 9950X:The CI benchmark bot (GKE
c4a-highmem-16, Neoverse-V2) on the post-fixup revision shows the same shape with stronger relative wins on the targeted cases:Microbench of the definition-level fill in isolation: 10.3x @ 100%-null, 8.6x @ 99%, 5.2x @ 90%, 1.9x @ 50%, 0.93x @ 10%, 0.81x @ 0%. Crossover ≈ 12-15% null, clean win above ~25%; the
>= 50% nullguard is conservative.This is the materialization-cost half of #9731 (~30% of the 99%-null write); the walk-cost half, a run-length input to the level encoder so the column writer doesn't even iterate all
num_rowslevels, is the larger structural change #9653 is heading toward. This PR is deliberately small and isolated so it lands independently of and rebases cleanly under that work.