-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat: add boundary utility functions for aligned byte fetching in JSON #19687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
||
| if let Some(file_range) = file_range.as_ref() { | ||
| let raw_start = file_range.start as usize; | ||
| let raw_end = file_range.end as usize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you use u64 for these ?
Here you cast them to usize and at https://github.com/apache/datafusion/pull/19687/changes#diff-d0a9c47dbd0bdb20995b4a685f0f7551bebf22287035c99636f2b98013f203b0R52 you cast them back to u64.
The castings here may lead to problems on 32bit systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update in c188abc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also address the clippy errors - https://github.com/apache/datafusion/actions/runs/20811221444/job/59776119936?pr=19687
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in 5eec399
| let mut cursor = fetch_end as u64; | ||
|
|
||
| while cursor < file_size as u64 { | ||
| let chunk_end = std::cmp::min(cursor + scan_window as u64, file_size as u64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add a check that scan_window is bigger than 0, otherwise here the get_range() below will use an empty range.
…dling for zero scan window
Which issue does this PR close?
Rationale for this change
When scanning partitioned JSON files on remote storage systems (HDFS, S3), the current
calculate_range()implementation causes severe read amplification:Current behavior:
find_first_newline(start, file_size)[start..file_size)from object storeWhy this is problematic:
What changes are included in this PR?
Implements
get_aligned_bytes()with efficient in-memory boundary alignment:Start boundary alignment:
[start-1..end]rangestart == 0, use as-isbytes[0](positionstart-1) is newlinestart; if no, scan forward in memory for first newlineNoneif no newline found in fetched rangeEnd boundary alignment:
end >= file_sizeor last byte is newline, return immediately (zero-copy)Key optimization:
Bytes::slice()for zero-copy when possibleVecwhen extension is actually neededAre these changes tested?
Yes
Are there any user-facing changes?
No