feat: CometNativeScan per-partition plan serde#3511

Merged
andygrove merged 4 commits into apache:main from mbutrovich:native_datafusion_per_partition_serde
Feb 13, 2026

@mbutrovich
Contributor

@mbutrovich mbutrovich commented Feb 13, 2026

Which issue does this PR close?

Partially addresses #3510. This is a subset of what #3446 was.

Rationale for this change

CometNativeScan currently serializes all file partitions upfront in the driver and sends the entire list to every executor task, which then indexes by partition ID. This approach has several drawbacks:

  1. Memory overhead: For scans with many partitions, serializing all files for all partitions consumes significant driver memory
  2. Network waste: Each executor task receives data for all partitions but only uses its own partition's files
  3. Inconsistent with Iceberg: PR #3349 (feat: CometExecRDD supports per-partition plan data, reduce Iceberg native scan serialization, add DPP [iceberg]) introduced split-mode serialization for Iceberg scans, but NativeScan still uses the old pattern
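
To make the first two drawbacks concrete, here is a minimal sketch of the upfront pattern. It is not Comet's code: `FilePartition`, `serializeAll`, `filesFor`, and `totalBytesShipped` are hypothetical names, and a newline/comma text encoding stands in for the real protobuf serialization.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object UpfrontSerdeSketch {
  // Hypothetical stand-in for a scan partition: an index plus its file paths.
  final case class FilePartition(index: Int, files: Seq[String])

  // Old pattern: the driver builds ONE payload containing every partition's files.
  // One line per partition, comma-separated paths (illustrative encoding only).
  def serializeAll(partitions: Seq[FilePartition]): Array[Byte] =
    partitions.sortBy(_.index).map(_.files.mkString(",")).mkString("\n").getBytes(UTF_8)

  // Each task receives the full payload and indexes into it by partition ID.
  def filesFor(payload: Array[Byte], partitionId: Int): Seq[String] =
    new String(payload, UTF_8).split("\n", -1)(partitionId).split(",").toSeq

  // Every task gets the whole payload, so shipped bytes grow with
  // (payload size) x (number of tasks), i.e. roughly quadratically in partitions.
  def totalBytesShipped(partitions: Seq[FilePartition]): Long =
    serializeAll(partitions).length.toLong * partitions.size
}
```

Under these assumptions, a task for partition 1 decodes the entire payload just to read one line of it, which is the waste the list above describes.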

This PR extends the PlanDataInjector framework to CometNativeScan, bringing it in line with the Iceberg implementation.

What changes are included in this PR?

  • Split NativeScan serialization into common data (schemas, filters, projections) and per-partition data (file lists)
  • Common data serialized once at planning time and embedded in the native operator
  • Per-partition data lazily serialized at execution time via @transient lazy val
  • Each executor task receives only its partition's files via runtime injection
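
The split described in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `ScanPlan`, `taskPayload`, `encode`, `decode`, and the `perPartitionBuilds` counter are hypothetical, and the actual PlanDataInjector API is not shown because the description does not spell it out.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object SplitSerdeSketch {
  // Hypothetical stand-in for a scan partition.
  final case class FilePartition(index: Int, files: Seq[String])

  // Illustrative newline-separated encoding; the real code uses protobuf.
  def encode(items: Seq[String]): Array[Byte] = items.mkString("\n").getBytes(UTF_8)
  def decode(bytes: Array[Byte]): Seq[String] = new String(bytes, UTF_8).split("\n").toSeq

  // Hypothetical scan operator holding both kinds of plan data.
  final class ScanPlan(schema: Seq[String], filters: Seq[String], partitions: Seq[FilePartition]) {
    // Counts how often the per-partition data is built (for demonstration only).
    var perPartitionBuilds: Int = 0

    // Common data: serialized once, eagerly, at construction ("planning") time.
    val commonBytes: Array[Byte] = encode(schema ++ filters)

    // Per-partition data: built on first access only. @transient mirrors the PR
    // description; it would keep the field out of Java serialization if this
    // class were Serializable.
    @transient lazy val perPartitionBytes: Array[Array[Byte]] = {
      perPartitionBuilds += 1
      partitions.sortBy(_.index).map(p => encode(p.files)).toArray
    }

    // What a single task receives: the shared common data plus only its own files.
    def taskPayload(partitionId: Int): (Array[Byte], Array[Byte]) =
      (commonBytes, perPartitionBytes(partitionId))
  }
}
```

In this sketch, calling `taskPayload(1)` hands back the common bytes and the file list for partition 1 only, and the per-partition array is built a single time no matter how many tasks ask for their slice.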

How are these changes tested?

Existing tests.

@mbutrovich mbutrovich changed the title from "feat: Native datafusion per partition serde" to "feat: CometNativeScan per-partition plan serde" Feb 13, 2026
@mbutrovich mbutrovich marked this pull request as ready for review February 13, 2026 15:40
Member

@andygrove andygrove left a comment

LGTM. Thanks @mbutrovich!

@andygrove andygrove merged commit a6741e8 into apache:main Feb 13, 2026
146 of 147 checks passed
@mbutrovich mbutrovich deleted the native_datafusion_per_partition_serde branch February 13, 2026 17:45