Conversation

@andygrove (Member) commented Oct 23, 2025

Which issue does this PR close?

This follows on from the work in #2635 where we copy and/or unpack dictionary-encoded arrays in the scan, so we no longer need to insert CopyExec into the plan.

Rationale for this change

  • Reduce complexity
  • Remove redundant code

What changes are included in this PR?

How are these changes tested?

@codecov-commenter commented Oct 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.19%. Comparing base (f09f8af) to head (5ef1014).
⚠️ Report is 645 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2639      +/-   ##
============================================
+ Coverage     56.12%   59.19%   +3.06%     
- Complexity      976     1449     +473     
============================================
  Files           119      147      +28     
  Lines         11743    13746    +2003     
  Branches       2251     2362     +111     
============================================
+ Hits           6591     8137    +1546     
- Misses         4012     4387     +375     
- Partials       1140     1222      +82     


@EmilyMatt (Contributor) commented:

This could be very problematic for performance.
I'm AFK but will explain further in a bit.

@EmilyMatt (Contributor) commented Oct 23, 2025

OK, so the previous time I wanted to make this PR, I encountered something interesting.
Imagine something like the following code in DataFusion's SortExec:

futures::stream::once(async move {
    while let Some(batch) = input.next().await {
        let batch = batch?;
        sorter.insert_batch(batch).await?;
    }
    sorter.sort().await
})
.try_flatten(),

This will essentially consume the entire input iterator, all the way to the operator producing the ColumnarBatch on the Scala side.
The underlying Arrow arrays can be produced off-heap; it doesn't matter, because the wrapping ArrowArray and ArrowSchema objects are created using heap memory, so any ArrayRef created from the ArrayData of a passed Arrow array will use a small amount of heap memory.
The issue is that code like the above, which consumes the entire input iterator, causes such high GC pressure that performance can decrease by up to 10x compared to Spark.
It will not always show up in local performance runs (I saw the horrible performance when running on an EC2 cluster with a huge amount of data).
The only solution I've found is to do a deep copy of the ArrayData itself.
I know this seems paradoxical, but it's a real-life issue.
The best way I know to handle this is to just do a full copy of the ArrayData before make_array in the scan, for all arrays.
The unpacking can happen before that, I guess, but hopefully DataFusion will have enough support for dictionaries that this could be ignored completely.

When we use off-heap memory, naturally we want to reduce the executor memory, which exacerbates the issue.
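
A minimal sketch of the kind of deep copy being described here, assuming stock arrow-rs APIs (the helper name deep_copy is made up, and this is not necessarily how Comet's copy_array is implemented): MutableArrayData rebuilds the buffers in Rust-owned memory, so the FFI-backed original and its JVM-side wrapper objects can be released right away.

use arrow::array::{make_array, Array, ArrayRef, MutableArrayData};

/// Hypothetical helper: rebuild an array's buffers in Rust-owned memory so the
/// FFI-backed original (and its JVM-side wrapper objects) can be dropped immediately.
fn deep_copy(array: &ArrayRef) -> ArrayRef {
    let data = array.to_data();
    // MutableArrayData copies the underlying buffers instead of sharing them.
    let mut mutable = MutableArrayData::new(vec![&data], false, data.len());
    mutable.extend(0, 0, data.len());
    make_array(mutable.freeze())
}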

@EmilyMatt (Contributor) commented:

Comparison with and without copying the underlying ArrayData:

With: [screenshot]

Without: [screenshot]

@andygrove (Member, Author) commented:

> OK, so the previous time I wanted to make this PR, I encountered something interesting. […] When we use off-heap memory, naturally we want to reduce the executor memory, which exacerbates the issue.

Thanks, this is really helpful. I will experiment with this.

@andygrove changed the title from "chore: Remove CopyExec" to "chore: Remove CopyExec [WIP]" on Oct 24, 2025
@andygrove (Member, Author) commented:

Some native_datafusion tests are failing:

2025-10-23T20:23:31.2697381Z - join (native_datafusion, native shuffle) *** FAILED *** (463 milliseconds)
2025-10-23T20:23:31.2702227Z   org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 4382.0 failed 1 times, most recent failure: Lost task 3.0 in stage 4382.0 (TID 10847) (localhost executor driver): org.apache.comet.CometNativeException: Invalid HashJoinExec, partition count mismatch 1!=5,consider using RepartitionExec.

@andygrove (Member, Author) commented:

> OK, so the previous time I wanted to make this PR, I encountered something interesting. […] When we use off-heap memory, naturally we want to reduce the executor memory, which exacerbates the issue.

Claude's analysis of the PR and this comment:

The Core Problem

Memory Model Mismatch

When Arrow arrays are passed from Spark (JVM) to DataFusion (Rust):

  • Data buffers: Can be allocated off-heap (native memory)
  • Wrapper objects (ArrowArray, ArrowSchema): Always allocated on Java heap
  • Even though the actual data is off-heap, each ArrayRef needs these small heap-allocated wrappers
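
For orientation, this is roughly what the import on the native side looks like with the stock arrow-rs C Data Interface (a sketch, not Comet's actual scan code; the function name import_column is made up). The ArrayData produced here borrows the buffers the JVM exported rather than copying them, which is why those heap-allocated wrapper objects have to stay alive as long as the resulting ArrayRef does.

use arrow::array::{make_array, ArrayRef};
use arrow::error::Result;
use arrow::ffi::{from_ffi, FFI_ArrowArray, FFI_ArrowSchema};

/// Sketch only: build an ArrayRef from the C Data Interface structs that the
/// JVM side filled in. No buffers are copied here, so the JVM-side export
/// (and its heap wrappers) must outlive the returned array.
unsafe fn import_column(array: FFI_ArrowArray, schema: &FFI_ArrowSchema) -> Result<ArrayRef> {
    let data = from_ffi(array, schema)?; // borrows the exported buffers
    Ok(make_array(data))
}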

GC Pressure from Buffering Operators

The issue arises with operators like SortExec that consume their entire input:

  futures::stream::once(async move {
      while let Some(batch) = input.next().await {
          sorter.insert_batch(batch).await?;  // Accumulates all batches
      }
      sorter.sort().await  // Only then produces output
  })

This pattern:

  1. Consumes the entire input iterator before producing output
  2. Creates many wrapper objects (one set per batch)
  3. Keeps all wrappers alive until sorting completes
  4. Causes severe GC pressure as thousands of small objects accumulate on the heap

Why It's Worse with Off-Heap Memory

When using off-heap memory for performance, users typically:

  • Reduce executor heap size (since data is off-heap)
  • This makes the heap smaller → GC pressure from wrapper objects becomes catastrophic
  • Can cause 10x performance degradation on clusters with large data

Why Deep Copy Solves It (The Paradox)

The comment suggests doing a deep copy of ArrayData before make_array in the scan. This seems counterintuitive but works because:

  1. Immediate Materialization: Deep copy fully materializes data into new arrays
  2. Immediate GC: Original wrapper objects can be garbage collected right away
  3. Clean Boundaries: Each batch owns its data completely
  4. Less GC Thrashing: Even though copying costs CPU, it's cheaper than continuous GC pauses

Without copy: many small wrapper objects → constant GC pressure
With copy: upfront copy cost → clean memory lifecycle → smooth execution

Current PR Context

Looking at your PR, you've:

  • Removed the separate CopyExec operator
  • Moved copy_array and copy_or_unpack_array functions into scan.rs
  • The copy happens at scan time (lines 270-278 in scan.rs):
  let array = if arrow_ffi_safe {
      copy_or_unpack_array(&array, &CopyMode::UnpackOrClone)?
  } else {
      copy_array(&array)  // Deep copy
  };

The commenter is saying: Make sure this deep copy happens for all arrays to prevent the GC pressure issue, especially for operators that buffer entire inputs.
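
For illustration only, a rough sketch of what an unpack-then-copy step could look like with stock arrow-rs APIs. This is not the PR's actual copy_or_unpack_array / copy_array code, and the name unpack_and_copy is made up; it just shows the two steps discussed above (unpack dictionaries, then rebuild the buffers in Rust-owned memory).

use arrow::array::{make_array, Array, ArrayRef, MutableArrayData};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

/// Hypothetical sketch: unpack dictionary-encoded arrays to their value type,
/// then deep-copy the result so downstream operators retain no FFI-backed buffers.
fn unpack_and_copy(array: &ArrayRef) -> Result<ArrayRef> {
    // Unpack dictionaries by casting to the dictionary's value type.
    let array: ArrayRef = match array.data_type() {
        DataType::Dictionary(_, value_type) => cast(array, value_type)?,
        _ => array.clone(),
    };
    // Deep copy: rebuild the buffers in Rust-owned memory.
    let data = array.to_data();
    let mut mutable = MutableArrayData::new(vec![&data], false, data.len());
    mutable.extend(0, 0, data.len());
    Ok(make_array(mutable.freeze()))
}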

@andygrove (Member, Author) commented:

> Some native_datafusion tests are failing: […]

CopyExec was masking an issue when using native_datafusion scans: #2660

I am not sure what to do about this yet, but I have skipped some tests in this PR for now.

@mbutrovich fyi

@andygrove (Member, Author) commented:

I have created two new PRs to replace this one:

@andygrove closed this on Oct 29, 2025