Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Oct 29, 2025

Which issue does this PR close?

Closes #2661

Rationale for this change

The current approach to passing batches from the JVM to native code creates GC pressure because the JVM retains ArrowArray and ArrowSchema wrapper objects for each batch that is being accumulated on the native side in operators such as SortExec and ShuffleWriterExec.

What changes are included in this PR?

  • Always take a deep copy in ScanExec to that JVM resources can be gargage-collected
  • Add FFI guide to contributors guide

How are these changes tested?

@codecov-commenter
Copy link

codecov-commenter commented Oct 29, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.19%. Comparing base (f09f8af) to head (34baeed).
⚠️ Report is 650 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2662      +/-   ##
============================================
+ Coverage     56.12%   59.19%   +3.07%     
- Complexity      976     1447     +471     
============================================
  Files           119      147      +28     
  Lines         11743    13747    +2004     
  Branches       2251     2361     +110     
============================================
+ Hits           6591     8138    +1547     
- Misses         4012     4387     +375     
- Partials       1140     1222      +82     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andygrove
Copy link
Member Author

@EmilyMatt fyi

@andygrove andygrove marked this pull request as draft October 29, 2025 19:05
@andygrove
Copy link
Member Author

I am moving this to draft until #2664 is complete

@andygrove andygrove changed the title feat: Add config for forcing deep copy in native ScanExec feat: Fix GC pressure when fetching batches from JVM Oct 29, 2025
@andygrove andygrove marked this pull request as ready for review October 29, 2025 20:05
@mbutrovich
Copy link
Contributor

I know you're quite busy @andygrove, but is it possible to use your awesome memory profiling feature to see if there's a meaningful difference with this PR?

@andygrove andygrove marked this pull request as draft October 30, 2025 14:41
@andygrove
Copy link
Member Author

I know you're quite busy @andygrove, but is it possible to use your awesome memory profiling feature to see if there's a meaningful difference with this PR?

Yes, I'll focus on this today. I'm also considering moving the documentation into a separate PR. Another consideration is whether to add yet another config option to choose the behavior here, but I am wary of adding more configs that we have to explain.

@andygrove
Copy link
Member Author

I moved the docs to a new PR: #2668

@andygrove
Copy link
Member Author

@EmilyMatt Do you have any tips for finding a good repro for the GC pressure issue? I am trying to reproduce this locally so that I can demonstrate the benefit.

@EmilyMatt
Copy link
Contributor

@EmilyMatt Do you have any tips for finding a good repro for the GC pressure issue? I am trying to reproduce this locally so that I can demonstrate the benefit.

Unfortunately I was also unable to reproduce this locally.
The images I sent previously were saved on my machine from a while back^^
I do have the following pointers:

  1. Use multiple sequential scan operators with something that ends with a loop that consumes fully (I.e., IcebergCompat -> Union -> Shuffle Write)
  2. Use a lot of data with a lot of RAM, but few CPU cores.
  3. Use an unbounded memory pool, I think this issue is more prevalent without spilling, so the operators will accumulate a lot of data without returning.

@andygrove
Copy link
Member Author

@EmilyMatt Do you have any tips for finding a good repro for the GC pressure issue? I am trying to reproduce this locally so that I can demonstrate the benefit.

Unfortunately I was also unable to reproduce this locally. The images I sent previously were saved on my machine from a while back^^ I do have the following pointers:

  1. Use multiple sequential scan operators with something that ends with a loop that consumes fully (I.e., IcebergCompat -> Union -> Shuffle Write)
  2. Use a lot of data with a lot of RAM, but few CPU cores.
  3. Use an unbounded memory pool, I think this issue is more prevalent without spilling, so the operators will accumulate a lot of data without returning.

Thanks @EmilyMatt.

Yes, with the unified pool, we will spill to disk and that will release the JVM wrapper objects, so maybe this is not an issue now. Thanks for helping me understand the issue. This has resulted in improved documentation in the contributor guide that explains this issue.

https://datafusion.apache.org/comet/contributor-guide/ffi.html

I will close this PR and will close issue #2661 but feel free to reopen if this is still an issue for you.

@andygrove andygrove closed this Nov 3, 2025
@andygrove andygrove deleted the deep-copy-config branch November 3, 2025 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Current FFI approach causes GC pressure

4 participants