[Parquet] Split byte-array batches transparently when i32 offset would overflow #9504
```diff
@@ -397,17 +397,26 @@ impl ByteArrayDecoderPlain {
                 return Err(ParquetError::EOF("eof decoding byte array".into()));
             }

+            // Stop early rather than overflow the 32-bit offset buffer.
+            // self.offset is NOT advanced, so the next read() call resumes
+            // from this value in the following batch.
+            if output.would_overflow(len as usize) {
+                break;
+            }
+
             output.try_push(&buf[start_offset..end_offset], self.validate_utf8)?;

             self.offset = end_offset;
             read += 1;
         }
-        self.max_remaining_values -= to_read;
+        // Use actual reads, not the requested amount, so max_remaining_values
+        // stays correct when we stop early.
+        self.max_remaining_values -= read;

         if self.validate_utf8 {
-            output.check_valid_utf8(initial_values_length)?;
+            output.check_valid_utf8(initial_values_length)?
         }
-        Ok(to_read)
+        Ok(read)
     }

     pub fn skip(&mut self, to_skip: usize) -> Result<usize> {
```
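The guard above breaks out of the loop before a value would push the 32-bit offset buffer past `i32::MAX`. A minimal sketch of what such a check has to cover (the real `would_overflow` lives on the PR's output buffer type; this free function is only illustrative):

```rust
// Illustrative stand-in for the PR's `would_overflow` check: a 32-bit
// offset buffer can address at most i32::MAX bytes of value data, so a
// push that would move the end offset past that limit must be refused.
fn would_overflow(current_bytes: usize, value_len: usize) -> bool {
    current_bytes
        .checked_add(value_len)
        // checked_add returning None means even usize overflowed.
        .map_or(true, |end| end > i32::MAX as usize)
}

fn main() {
    // A small value on a mostly-empty buffer fits.
    assert!(!would_overflow(100, 50));
    // Crossing the i32::MAX boundary triggers the early break.
    assert!(would_overflow(i32::MAX as usize - 10, 11));
    // usize overflow itself also counts as "would overflow".
    assert!(would_overflow(usize::MAX, 1));
}
```

Because the decoder does not advance `self.offset` for the refused value, the next `read()` call picks up exactly where this batch stopped.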
```diff
@@ -489,22 +498,33 @@ impl ByteArrayDecoderDeltaLength {
         output.values.reserve(total_bytes);

         let mut current_offset = self.data_offset;
+        let mut read = 0;
         for length in src_lengths {
-            let end_offset = current_offset + *length as usize;
+            let value_len = *length as usize;
+            let end_offset = current_offset + value_len;
+
+            // Stop early rather than overflow the 32-bit offset buffer.
+            // length_offset / data_offset are only advanced by what was
+            // actually consumed, so the next read() resumes correctly.
+            if output.would_overflow(value_len) {
+                break;
+            }
+
             output.try_push(
                 &self.data.as_ref()[current_offset..end_offset],
                 self.validate_utf8,
             )?;
             current_offset = end_offset;
+            read += 1;
         }

         self.data_offset = current_offset;
-        self.length_offset += to_read;
+        self.length_offset += read;

         if self.validate_utf8 {
-            output.check_valid_utf8(initial_values_length)?;
+            output.check_valid_utf8(initial_values_length)?
```
Suggested change:

```diff
-            output.check_valid_utf8(initial_values_length)?
+            output.check_valid_utf8(initial_values_length)?;
```
Copilot AI · Mar 4, 2026
`let key = keys[0] as usize;` can wrap negative dictionary indices to huge `usize` values, and `key + 1` can overflow (e.g. `key == -1`) causing a panic in debug builds. Convert with `usize::try_from(keys[0])` (returning an error on negatives) and avoid `key + 1` overflow by comparing `key >= dict_offsets.len().saturating_sub(1)` or using `checked_add(1)`.
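The conversion the comment suggests can be sketched as follows. This is a hedged illustration, not the PR's code: `checked_key` and its error handling are hypothetical, and `dict_offsets` follows the comment's naming for a buffer holding one more offset than there are dictionary values.

```rust
// Hypothetical sketch: reject negative dictionary indices with
// usize::try_from instead of `as usize`, and guard the `key + 1` offset
// lookup with checked_add so it cannot overflow in debug builds.
fn checked_key(raw: i32, dict_offsets: &[i32]) -> Result<usize, String> {
    let key = usize::try_from(raw)
        .map_err(|_| format!("negative dictionary index: {raw}"))?;
    // A valid key needs both a start offset (key) and an end offset (key + 1).
    match key.checked_add(1) {
        Some(end) if end < dict_offsets.len() => Ok(key),
        _ => Err(format!("dictionary index {key} out of bounds")),
    }
}

fn main() {
    let offsets = [0, 3, 7, 12]; // offsets for 3 dictionary values
    assert_eq!(checked_key(0, &offsets), Ok(0));
    assert_eq!(checked_key(2, &offsets), Ok(2));
    assert!(checked_key(-1, &offsets).is_err()); // would wrap with `as usize`
    assert!(checked_key(3, &offsets).is_err()); // end offset out of range
}
```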
Copilot AI · Mar 4, 2026
This dictionary path now calls `self.decoder.read(1, ...)` in a loop, which adds substantial per-value overhead compared to decoding indices in batches. If possible, consider decoding a larger chunk of keys at a time and stopping without losing already-consumed keys (e.g., by tracking how many keys in the provided slice were successfully processed and advancing the decoder by that amount), or otherwise document/justify the performance tradeoff since dictionary pages are often hot paths.
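The batching the comment proposes can be sketched like this. All names (`process_keys`, `value_lens`, `budget`) are illustrative stand-ins, not the crate's API: the point is only that the caller learns how many keys were consumed, so it can advance the decoder by that count rather than re-reading one key at a time.

```rust
// Hypothetical sketch: process a decoded chunk of dictionary keys in one
// pass, stop at the first value that would exceed the byte budget, and
// report how many keys were actually consumed so nothing is lost or
// re-read when the batch is split.
fn process_keys(keys: &[usize], value_lens: &[usize], budget: usize) -> usize {
    let mut used = 0; // bytes accepted so far
    let mut consumed = 0; // keys actually processed
    for &key in keys {
        let len = value_lens[key];
        if used + len > budget {
            break; // stop early; `consumed` tells the caller where to resume
        }
        used += len;
        consumed += 1;
    }
    consumed
}

fn main() {
    let lens = [4, 8, 2];
    // With a 10-byte budget, key 0 (4 bytes) fits but key 1 (8 bytes) does not,
    // so processing stops after one key.
    assert_eq!(process_keys(&[0, 1, 2], &lens, 10), 1);
    // 4 + 2 + 2 = 8 bytes fits the budget, so all three keys are consumed.
    assert_eq!(process_keys(&[0, 2, 2], &lens, 10), 3);
}
```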
The `if self.validate_utf8 { ... }` block isn't terminated with a semicolon before `Ok(read)`, which makes this function fail to compile. Add a trailing `;` after the `if` block (or after the `check_valid_utf8(...)?` inside it).