fix(parquet): bound data page byte size for large variable-width values#9972
fix(parquet): bound data page byte size for large variable-width values#9972adriangb wants to merge 17 commits into
Conversation
|
run benchmark arrow_writer |
393ead0 to
4823429
Compare
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (4823429) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer |
0fd6dcb to
24b83c7
Compare
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (24b83c7) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (24b83c7) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (70dc497) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (bbe2b7e) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (bbe2b7e) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
|
Have you considered making the batch size configurable per column? |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
Yes, that may be a simpler approach. But I'm hoping we can get to a place where users don't have to think about / configure this. Given they gave us a page size limit it'd be nice if we can always adhere to that... |
| /// push a page far past the configured page byte limit before the | ||
| /// post-write size check fires. | ||
| #[inline] | ||
| fn byte_size(&self) -> usize { |
There was a problem hiding this comment.
This seems to duplicate dict_encoding_size. Also, #9700 wants to rename dict_encoding_size and instead implement it pretty much the same way as here.
|
Another thought...maybe add another chunker like the CDC work added ( ). If we compute batches up front when we know the shape of the data that might be faster 🤷 |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (145ea5d) to 48fa8a7 (merge-base) diff File an issue against this benchmark runner |
The GKE bench shows `string_dictionary/*` consistently ~+80% across every branch commit, even though the chunker's fast path returns `chunk_size` with a single struct-field load while `has_dictionary()` is true (which it is for the entire `string_dictionary` bench since `create_random_batch` produces a low-cardinality dict that doesn't spill the writer's encoder). Working hypothesis: the regression is icache pressure from the new code's mere presence. The cold path (`byte_budget_sub_batch_size`, `write_granular_chunk`) is never executed for `string_dictionary` but sits inline near the encoder's hot path and pushes hot bytes out of L1i. Mark both cold paths `#[cold]` so LLVM places them in a separate text section. The hot encoder loop should stay tighter in icache. This is a hypothesis-driven attempt; if GKE doesn't move it tells us the regression source is somewhere else and we keep digging. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
0b13cb9 to
77ebc07
Compare
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer |
`value_count_strategy` picked the `Sorted` strategy for any arrow column with `value_indices`. `Sorted` runs `partition_point(|&i| i < end_offset)`, comparing leaf-value indices against a level offset — coordinate spaces that only coincide for flat columns. For repeated/nested columns the leaf values array is decoupled from the rep/def level stream, so `vals_in_chunk` drifts upward without bound as empty-list / sub-`max_def` levels accumulate, spuriously triggering granular sub-batching on columns whose values are small. This was the consistent `list_primitive_non_null` regression (+18-32% across CI runs). Repeated columns (`max_rep_level > 0`) now count values via `DefLevelScan`. Also in this commit: - `count_within_budget_offsets`: drop the Stage-1 contiguity gate. The offset span is a valid O(1) upper bound for any sorted index set, so nullable offset columns skip the O(n) per-chunk walk too. - `write_granular_chunk`: pack whole records up to `sub_batch_size` per mini-batch instead of one record per mini-batch (~25x fewer `write_mini_batch` calls when granular mode fires on lists of large values). - Move `plain_encoded_byte_size` to the end of `encoder.rs`: defining it above the `ColumnValueEncoder` trait shifted downstream compiled code and perturbed unrelated string-writer benchmark layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (4b92635) to 2e8e0c7 (merge-base) diff File an issue against this benchmark runner |
`ValueCountStrategy` was a 3-way precomputed enum (`AllPresent` / `Sorted` / `DefLevelScan`) for answering "how many of this chunk's levels carry a value". `LevelDataRef::value_count` already answers that correctly for every column shape — `Absent`/`Uniform` def levels resolve in O(1), and the O(n) scan only runs for genuinely materialized (nullable/nested) def levels, on the variable-width slow path the chunker is already on. The `Sorted` variant — `partition_point` of leaf-value indices against a level offset — was only ever valid for flat columns; for nested columns those indices live in a different coordinate space, which is what made `vals_in_chunk` drift and spuriously trigger granular sub-batching (`list_primitive_non_null` regression). Deleting the enum removes that bug class structurally rather than guarding against it. Net effect: the chunker module drops from ~320 to ~173 lines, the `'a` lifetime and two parameters disappear from the chunker API, and `ByteBudgetChunker` just stores `max_def_level`. `pick_sub_batch_size` goes back to a plain `#[inline]` (the `#[inline(always)]` was added chasing a `string_dictionary` swing later confirmed to be code-layout noise, not an inlining effect). Perf-neutral — `value_count` vs the old `partition_point` is negligible and only on the post-dict-spill path. `LevelDataRef::value_count` gains a unit test as the now load-bearing value-counting primitive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (beb5fc2) to 2e8e0c7 (merge-base) diff File an issue against this benchmark runner |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
A 4-round alternating A/B (main vs branch-with-#[cold] vs branch-without, 8 benches) showed `#[cold]` is not load-bearing: removing it moved every bench by ≤1% with no consistent direction — pure noise. The comment justifying `#[cold]` on `byte_budget_sub_batch_size` was also wrong: it claimed the path "fires only on columns whose values are individually larger than data_page_size_limit / write_batch_size", when in fact it runs once per chunk for *every* variable-width column once dictionary encoding is abandoned (e.g. the small-string benchmarks). Dropped both `#[cold]` hints and the benchmark-archaeology comments. `#[inline(never)]` is kept on both slow-path helpers. The symbol table confirms it is doing real work — without it `byte_budget_sub_batch_size` and `write_granular_chunk` are inlined bodily into the hot `write_batch_internal` loop (0-1 vs 7 out-of-line copies). Keeping a ~40-line rarely-taken helper out of the hot loop is standard slow-path outlining; the A/B shows it costs nothing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (97e8a45) to 2e8e0c7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (97e8a45) to 2e8e0c7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
@etseidl after much profiling, debugging, etc. I've been able to get this to work with no performance impact (within noise). I recognize this is a non-trivial change but it introduces no public APIs and in theory if it is problematic in any way we can back out of it. The benefit of doing things this way is that we automatically patch buggy / problematic page size blowouts for everyone, without code changes needed on their end or guessing of column sizes necessary. One thing we could do to derisk if you want: add a config option to disable this behavior. Thanks for reviewing this, I hope we can make it work 😄 |
`ByteBudgetChunker` sub-batches mini-batches so a data page cannot overflow `data_page_size_limit`, but it short-circuited entirely while the encoder was dictionary-encoding. The dictionary page accumulates the distinct values themselves, and its spill check runs only once per mini-batch — so a single `write_batch_size` mini-batch of large values could buffer `write_batch_size * value_size` bytes into the dictionary page before the check fired (e.g. a 1024-row mini-batch of 256 KiB values produced a 256 MiB dictionary page against a 1 MiB limit). Drop the dict-encoding short-circuit: the chunker now sub-batches the dictionary-encoding phase against the dictionary page's remaining budget, symmetric with how it already bounds the data page. Fixed-width columns still short-circuit via `static_dict_always_fits`. callgrind shows +0.09% on the dictionary-encoding hot path. Adds `test_column_writer_caps_dictionary_page_size`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (6980026) to 2e8e0c7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
Commit 6980026 ("fix(parquet): bound dictionary page size during dict encoding") made ByteBudgetChunker sub-batch the dictionary-encoding phase, so the dictionary page is bounded at its configured limit instead of overshooting by up to one mini-batch. In the `test_string` dictionary-spill case this tightens the dictionary page from 1040 to exactly the 1000-byte limit and spills one mini-batch earlier (125 rows rather than 130) -- the intended effect of that commit, but its hardcoded layout expectations were not updated, so the `arrow_writer_layout` integration test failed. Update the expected layout to match. Test-only change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ByteBudgetChunker inspects per-value byte sizes to keep a data or dictionary page within its budget. count_within_budget_offsets (Utf8, Binary) has a Stage-1 O(1) span check that skips the per-value walk when the whole chunk provably fits -- but the view-array path (count_within_budget_views) had no equivalent and always did a full O(n) scan, even for tiny values that cannot come close to a page limit. View arrays have no prefix-sum offsets buffer, so an exact O(1) span is unavailable, but a conservative one is: a view value's byte length is at most max(12, largest data buffer length), so n values are guaranteed to fit when n * (max_value_len + 4) <= byte_budget. This skips the scan for the common small-value case -- what view arrays are built for, and exactly the case where there is nothing to bound. Columns with genuinely large values still fall through to the exact per-value scan and are still bounded precisely. callgrind on the string_and_binary_view writer benchmark (1M rows): instructions were +4.49% vs main before this change, +0.42% after. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing parquet-page-size-mid-batch (7a3c97c) to 2e8e0c7 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
`MAX_INLINE_VIEW_LEN` is a function-local const; rustdoc cannot resolve it as an intra-doc link, so the `-Dwarnings` docs CI job failed with "unresolved link to `MAX_INLINE_VIEW_LEN`". Reference the value (12) directly in prose instead. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
We write large values into our parquet files (e.g. a 5MB LLM prompt). A naive write will cause massive pages (we've seen up to 2GB) at default write settings. The main knob to control this is
write_batch_sizewhich defaults to 1024. But if each row is 5MB that's 5GB. On the other hand setting this to something small like 32 kills write performance and is completely unnecessary for other fixed width columns.The writer even documents this (
parquet/src/column/writer/mod.rs):This PR makes the mini-batch size byte-budget aware:
bytes_per_valuefrom the values about to be written and picksub_batch_size = page_byte_limit / bytes_per_value(clamped ≥ 1).sub_batch_size≥ chunk size, so we stay on the existing batched fast path with zero behavior change.Implementation notes
Skip the byte-size check while parquet dictionary encoding is active:
estimated_value_bytesreturns plain-encoded size but a dict-encoded data page only stores small RLE indices, so the estimate would spuriously shrink pages. Dict fallback bounds dict-encoded pages independently.For repeated/nested columns the sub-batch steps record-by-record (rep == 0 boundaries) so a record never spans data pages, matching the parquet format rule.
Regression test
test_column_writer_caps_page_size_for_large_byte_array_valueswrites 64 × 64 KiB BYTE_ARRAY values with a 16 KiB page byte limit. Before this fix that produced a single ~4 MiB page; after, it's one page per value (~64 pages, all within ~2× the value size).Bench results
5-run medians, criterion
arrow_writerbench, default writer properties, on a noisy laptop (run-to-run variance ~±1.6%):primitive/default(i32 25% null)primitive_non_null/defaultbool_non_null/defaultstring/defaultshort_string_non_null/default(new, 1M × 8 B)large_string_non_null/default(new, 1024 × 256 KiB)string_non_null/defaultstring_dictionary/defaultlist_primitive/defaultlist_primitive_non_null/default🤖 Generated with Claude Code