
Conversation

@Weijun-H (Member) commented Jan 7, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

When scanning partitioned JSON files on remote storage systems (HDFS, S3), the current calculate_range() implementation causes severe read amplification:

Current behavior (sketched below):

  • For each partition boundary, calls find_first_newline(start, file_size)
  • Requests the byte range [start..file_size) from the object store
  • Observed 4-7x read amplification in production (reading 278 MB-1084 MB just to find a newline)
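For illustration, the pattern described above amounts to something like the following minimal sketch; the function name find_first_newline comes from the PR text, while the signature and the object_store usage are assumptions (a recent object_store where get_range takes Range<u64>):

```rust
use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Hedged sketch of the pre-PR pattern: for each partition boundary, the
// whole remainder of the file is requested just to find the next newline.
async fn find_first_newline(
    store: &dyn ObjectStore,
    location: &Path,
    start: u64,
    file_size: u64,
) -> object_store::Result<Option<u64>> {
    // Requests [start..file_size) -- this is the read amplification:
    // hundreds of MB may be transferred to locate a byte a few KB away.
    let tail: Bytes = store.get_range(location, start..file_size).await?;
    Ok(tail.iter().position(|&b| b == b'\n').map(|p| start + p as u64))
}
```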

Why this is problematic:

  1. Remote storage systems have high latency for range requests
  2. Reading unnecessary data wastes network bandwidth
  3. For large files (100MB+), boundary checks dominate scan time
  4. The newline character is typically within a few KB of the boundary

What changes are included in this PR?

Implements get_aligned_bytes() with efficient in-memory boundary alignment:

Start boundary alignment (see the sketch after this list):

  • Fetch only the [start-1..end] range
  • If start == 0, use the range as-is
  • Else check whether bytes[0] (file position start-1) is a newline
  • If yes, start from start; if not, scan forward in memory for the first newline
  • Return None if no newline is found in the fetched range
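A minimal sketch of this rule, assuming the fetched buffer covers [start-1..end]; the function name and signature are illustrative, not the PR's actual code:

```rust
// Illustrative: `bytes` holds the fetched range [start-1..end]; the return
// value is the offset into `bytes` where this partition's first complete
// line begins, or None if no line boundary exists in the fetched range.
fn align_start(bytes: &[u8], start: u64) -> Option<usize> {
    if start == 0 {
        return Some(0); // beginning of file: use as-is
    }
    match bytes.first() {
        // Byte at file position start-1 is a newline, so `start` already
        // begins a line; skip only the single lookback byte.
        Some(&b'\n') => Some(1),
        // Otherwise the boundary fell mid-record: scan forward in memory
        // for the first newline and start just after it.
        Some(_) => bytes.iter().position(|&b| b == b'\n').map(|p| p + 1),
        // Empty fetch window: nothing to align.
        None => None,
    }
}
```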

End boundary alignment (see the sketch after this list):

  • Fast path: if end >= file_size or the last byte is a newline, return immediately (zero-copy)
  • Slow path: extend in small chunks (4 KB default) until a newline is found
  • Pre-allocate capacity to reduce reallocations
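A hedged sketch of this fast/slow path split; the signature is illustrative, scan_window is the name used in the diff reviewed below, and a recent object_store (get_range taking Range<u64>) is assumed:

```rust
use bytes::{Bytes, BytesMut};
use object_store::{path::Path, ObjectStore};

// Illustrative sketch of the end-boundary logic, not the PR's actual code.
async fn align_end(
    store: &dyn ObjectStore,
    location: &Path,
    bytes: Bytes,     // data already fetched for [fetch_start..end]
    end: u64,
    file_size: u64,
    scan_window: u64, // e.g. 4096; must be > 0 (see the review comment below)
) -> object_store::Result<Bytes> {
    // Fast path: at EOF, or the range already ends on a newline -> zero-copy.
    if end >= file_size || bytes.last() == Some(&b'\n') {
        return Ok(bytes);
    }
    // Slow path: pre-allocate, then extend in small chunks until a newline.
    let mut buf = BytesMut::with_capacity(bytes.len() + scan_window as usize);
    buf.extend_from_slice(&bytes);
    let mut cursor = end;
    while cursor < file_size {
        let chunk_end = std::cmp::min(cursor + scan_window, file_size);
        let chunk = store.get_range(location, cursor..chunk_end).await?;
        if let Some(pos) = chunk.iter().position(|&b| b == b'\n') {
            buf.extend_from_slice(&chunk[..=pos]); // keep through the newline
            return Ok(buf.freeze());
        }
        buf.extend_from_slice(&chunk);
        cursor = chunk_end;
    }
    Ok(buf.freeze()) // reached EOF: the final line has no trailing newline
}
```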

Key optimization:

  • 95%+ of cases hit the fast path (boundaries already aligned)
  • Uses Bytes::slice() for zero-copy when possible
  • Only allocates a Vec when extension is actually needed

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions bot added the datasource label (Changes to the datasource crate) Jan 7, 2026

A reviewer (Member) commented on this hunk:

```rust
if let Some(file_range) = file_range.as_ref() {
    let raw_start = file_range.start as usize;
    let raw_end = file_range.end as usize;
```

Why don't you use u64 for these? Here you cast them to usize, and at https://github.com/apache/datafusion/pull/19687/changes#diff-d0a9c47dbd0bdb20995b4a685f0f7551bebf22287035c99636f2b98013f203b0R52 you cast them back to u64. The casts here may lead to problems on 32-bit systems.
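For illustration, a cast-free version along the lines the reviewer suggests might look like this; the conversion shown is an assumption, not the actual fix that landed:

```rust
// Illustrative: keep offsets as u64 throughout instead of round-tripping
// through usize, which is only 32 bits wide on 32-bit targets.
if let Some(file_range) = file_range.as_ref() {
    let raw_start = file_range.start as u64;
    let raw_end = file_range.end as u64;
    // ... do all range arithmetic in u64; convert to usize only where an
    // in-memory buffer is indexed, e.g. via usize::try_from(offset)?
}
```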

Weijun-H (Member, Author) replied:

update in c188abc

Replying to a further review comment, Weijun-H (Member, Author) wrote:

fixed in 5eec399

A reviewer (Member) commented on this hunk:

```rust
let mut cursor = fetch_end as u64;

while cursor < file_size as u64 {
    let chunk_end = std::cmp::min(cursor + scan_window as u64, file_size as u64);
```

It would be good to add a check that scan_window is bigger than 0; otherwise the get_range() below will use an empty range.
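For illustration, such a guard might look like the following; exec_err! is DataFusion's error macro, but the exact placement and wording are assumptions:

```rust
// Illustrative guard: reject a zero-sized scan window up front, so the
// chunked get_range() calls in the loop never request an empty range
// (which would also prevent `cursor` from ever advancing).
if scan_window == 0 {
    return exec_err!("scan_window must be greater than 0");
}
```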
