
Conversation

@Weijun-H (Member) commented Jan 7, 2026

Which issue does this PR close?

  • Closes #.

Rationale for this change

When scanning partitioned JSON files on remote storage systems (HDFS, S3), the current calculate_range() implementation causes severe read amplification:

Current behavior (sketched below):

  • For each partition boundary, calls find_first_newline(start, file_size)
  • Requests the byte range [start..file_size) from the object store
  • Observed 4-7x read amplification in production (reading 278 MB-1084 MB just to find a newline)
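For illustration, the pattern described above amounts to something like the following minimal sketch; the function name find_first_newline comes from the PR text, while the signature and the object_store usage are assumptions (a recent object_store where get_range takes Range<u64>):

```rust
use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Hedged sketch of the pre-PR pattern: for each partition boundary, the
// whole remainder of the file is requested just to find the next newline.
async fn find_first_newline(
    store: &dyn ObjectStore,
    location: &Path,
    start: u64,
    file_size: u64,
) -> object_store::Result<Option<u64>> {
    // Requests [start..file_size) -- this is the read amplification:
    // hundreds of MB may be transferred to locate a byte a few KB away.
    let tail: Bytes = store.get_range(location, start..file_size).await?;
    Ok(tail.iter().position(|&b| b == b'\n').map(|p| start + p as u64))
}
```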

Why this is problematic:

  1. Remote storage systems have high latency for range requests
  2. Reading unnecessary data wastes network bandwidth
  3. For large files (100MB+), boundary checks dominate scan time
  4. The newline character is typically within a few KB of the boundary

What changes are included in this PR?

Implements get_aligned_bytes() with efficient in-memory boundary alignment:

Start boundary alignment (see the sketch after this list):

  • Fetch only the [start-1..end] range
  • If start == 0, use the range as-is
  • Else check whether bytes[0] (file position start-1) is a newline
  • If yes, start from start; if not, scan forward in memory for the first newline
  • Return None if no newline is found in the fetched range
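A minimal sketch of this rule, assuming the fetched buffer covers [start-1..end]; the function name and signature are illustrative, not the PR's actual code:

```rust
// Illustrative: `bytes` holds the fetched range [start-1..end]; the return
// value is the offset into `bytes` where this partition's first complete
// line begins, or None if no line boundary exists in the fetched range.
fn align_start(bytes: &[u8], start: u64) -> Option<usize> {
    if start == 0 {
        return Some(0); // beginning of file: use as-is
    }
    match bytes.first() {
        // Byte at file position start-1 is a newline, so `start` already
        // begins a line; skip only the single lookback byte.
        Some(&b'\n') => Some(1),
        // Otherwise the boundary fell mid-record: scan forward in memory
        // for the first newline and start just after it.
        Some(_) => bytes.iter().position(|&b| b == b'\n').map(|p| p + 1),
        // Empty fetch window: nothing to align.
        None => None,
    }
}
```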

End boundary alignment (see the sketch after this list):

  • Fast path: if end >= file_size or the last byte is a newline, return immediately (zero-copy)
  • Slow path: extend in small chunks (4 KB default) until a newline is found
  • Pre-allocate capacity to reduce reallocations
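A hedged sketch of this fast/slow path split; the signature is illustrative, scan_window is the name used in the diff reviewed below, and a recent object_store (get_range taking Range<u64>) is assumed:

```rust
use bytes::{Bytes, BytesMut};
use object_store::{path::Path, ObjectStore};

// Illustrative sketch of the end-boundary logic, not the PR's actual code.
async fn align_end(
    store: &dyn ObjectStore,
    location: &Path,
    bytes: Bytes,     // data already fetched for [fetch_start..end]
    end: u64,
    file_size: u64,
    scan_window: u64, // e.g. 4096; must be > 0 (see the review comment below)
) -> object_store::Result<Bytes> {
    // Fast path: at EOF, or the range already ends on a newline -> zero-copy.
    if end >= file_size || bytes.last() == Some(&b'\n') {
        return Ok(bytes);
    }
    // Slow path: pre-allocate, then extend in small chunks until a newline.
    let mut buf = BytesMut::with_capacity(bytes.len() + scan_window as usize);
    buf.extend_from_slice(&bytes);
    let mut cursor = end;
    while cursor < file_size {
        let chunk_end = std::cmp::min(cursor + scan_window, file_size);
        let chunk = store.get_range(location, cursor..chunk_end).await?;
        if let Some(pos) = chunk.iter().position(|&b| b == b'\n') {
            buf.extend_from_slice(&chunk[..=pos]); // keep through the newline
            return Ok(buf.freeze());
        }
        buf.extend_from_slice(&chunk);
        cursor = chunk_end;
    }
    Ok(buf.freeze()) // reached EOF: the final line has no trailing newline
}
```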

Key optimization:

  • 95%+ of cases hit the fast path (boundaries already aligned)
  • Uses Bytes::slice() for zero-copy when possible
  • Only allocates a Vec when extension is actually needed

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions bot added the datasource label (Changes to the datasource crate) Jan 7, 2026

A reviewer (Member) commented on this hunk:

```rust
if let Some(file_range) = file_range.as_ref() {
    let raw_start = file_range.start as usize;
    let raw_end = file_range.end as usize;
```

Why don't you use u64 for these? Here you cast them to usize, and at https://github.com/apache/datafusion/pull/19687/changes#diff-d0a9c47dbd0bdb20995b4a685f0f7551bebf22287035c99636f2b98013f203b0R52 you cast them back to u64. The casts here may lead to problems on 32-bit systems.
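For illustration, a cast-free version along the lines the reviewer suggests might look like this; the conversion shown is an assumption, not the actual fix that landed:

```rust
// Illustrative: keep offsets as u64 throughout instead of round-tripping
// through usize, which is only 32 bits wide on 32-bit targets.
if let Some(file_range) = file_range.as_ref() {
    let raw_start = file_range.start as u64;
    let raw_end = file_range.end as u64;
    // ... do all range arithmetic in u64; convert to usize only where an
    // in-memory buffer is indexed, e.g. via usize::try_from(offset)?
}
```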

Weijun-H (Member, Author) replied:

update in c188abc

Replying to a further review comment, Weijun-H (Member, Author) wrote:

fixed in 5eec399

A reviewer (Member) commented on this hunk:

```rust
let mut cursor = fetch_end as u64;

while cursor < file_size as u64 {
    let chunk_end = std::cmp::min(cursor + scan_window as u64, file_size as u64);
```

It would be good to add a check that scan_window is bigger than 0; otherwise the get_range() below will use an empty range.
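For illustration, such a guard might look like the following; exec_err! is DataFusion's error macro, but the exact placement and wording are assumptions:

```rust
// Illustrative guard: reject a zero-sized scan window up front, so the
// chunked get_range() calls in the loop never request an empty range
// (which would also prevent `cursor` from ever advancing).
if scan_window == 0 {
    return exec_err!("scan_window must be greater than 0");
}
```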
