fix: prevent panic in record reader when row group metadata overcounts num_rows#9993
fix: prevent panic in record reader when row group metadata overcounts num_rows#9993BoazC-MSFT wants to merge 3 commits into
Conversation
…s num_rows TypedTripletIter::read_next clears the def_levels, rep_levels, and values buffers when a column is exhausted but returns without resetting curr_triplet_index. If ReaderIter drives iteration past the actual data (because it trusts row group metadata num_rows which may overcount), Reader::read_field calls current_def_level on the exhausted iterator, indexing into an empty vec with the stale index. This causes a panic like "index out of bounds: the len is 0 but the index is 70093". The fix has three layers: (1) reset curr_triplet_index to 0 in the exhaustion path of read_next so the internal state is consistent, (2) return 0 from current_def_level and current_rep_level when has_next is false as a defense-in-depth measure, and (3) check has_next before consuming column data in all read_field variants (PrimitiveReader, OptionReader, RepeatedReader, KeyValueReader) and return an error instead of panicking. Tests cover the triplet-level accessors after exhaustion and three integration scenarios matching the production call stack through ReaderIter with inflated num_records for optional, repeated, and map field types.
|
run benchmarks arrow_reader arrow_reader_clickbench |
1 similar comment
|
run benchmarks arrow_reader arrow_reader_clickbench |
|
Thank you @BoazC-MSFT |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing fix-parquet-record-triplet-index-out-of-bounds (89ece44) to fd1c5b3 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing fix-parquet-record-triplet-index-out-of-bounds (89ece44) to fd1c5b3 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing fix-parquet-record-triplet-index-out-of-bounds (89ece44) to fd1c5b3 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing fix-parquet-record-triplet-index-out-of-bounds (89ece44) to fd1c5b3 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
Which issue does this PR close?
Rationale for this change
The record reader (
RowIter/get_row_iter) panics withindex out of boundswhen a Parquet file's row group metadata declares more rows than a column chunk actually contains. This happens in production when reading third-party Parquet files with mismatched metadata. Instead of panicking, the reader should return an error.What changes are included in this PR?
Three layers of fix in
parquet/src/record/:triplet.rs - fix the inconsistent internal state:
curr_triplet_indexto 0 in the exhaustion path ofread_next, so the stale index from the previous batch never persists alongside empty buffers.current_def_levelandcurrent_rep_levelwhenhas_nextis false, as defense-in-depth against any caller that skips thehas_nextcheck.reader.rs - return errors instead of panicking:
has_next()guards before consuming column data in allread_fieldvariants:PrimitiveReader,OptionReader,RepeatedReader, andKeyValueReader. When a column is exhausted mid-iteration,read_fieldnow returnsErr("Unexpected end of column data")which propagates throughReaderIter::nextasSome(Err(...)).Are these changes tested?
Yes. Five new tests:
test_current_def_level_safe_after_exhaustion- drives aTripletIterto exhaustion on an optional column and assertscurrent_def_level()returns 0 instead of panicking.test_current_rep_level_safe_after_exhaustion- same forcurrent_rep_level()on a repeated column.test_reader_iter_returns_error_when_num_records_exceeds_data- exercises the fullReaderIterstack with an optional field (vianulls.snappy.parquet).test_reader_iter_returns_error_for_repeated_field_when_num_records_exceeds_data- same for a repeated primitive field (viarepeated_primitive_no_list.parquet).test_reader_iter_returns_error_for_map_field_when_num_records_exceeds_data- same for a map field projected alone (viamap_no_value.parquet).Each integration test inflates
num_recordsby 1 beyond actual data, asserts all real rows returnOk, and asserts the extra iteration returnsErrcontaining "Unexpected end of column data".Are there any user-facing changes?
Callers of
get_row_iterorRowIterthat previously hit a panic on corrupt/truncated files will now receive anErrfrom the iterator instead. No API signature changes.