feat: Iceberg scan based serializing FileScanTasks to iceberg-rust #2528

mbutrovich · 2025-10-06T02:37:21Z

This is mostly for discussion at the moment. I presented this effort at the 10/9/25 Iceberg-Rust community call; the slides are here.

Rationale for this change

I was inspired by @RussellSpitzer's recent talk and wanted to revisit the abstraction layer at which Comet integrates with Iceberg. We have the iceberg_compat codepath for Iceberg integration, but this requires code changes in Iceberg Java to integrate with Parquet reader instantiation. Instead, this prototype works at the FileScanTask layer after planning. This prototype starts us toward fully-native Iceberg scans to match our Parquet logic with native_datafusion scans without any changes in upstream Iceberg Java code.

What changes are included in this PR?

New CometIcebergNativeScanExec node on the Scala side.
Use reflection to extract scan properties (mostly FileScanTasks) and serialize them to native code.
New IcebergScanExec on native side that uses FileScanTasks to perform reads in iceberg-rust.
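
To make the "plan on the JVM, read natively" hand-off concrete, here is a minimal sketch of a task message crossing the boundary. Everything in it is an assumption for illustration: `ScanTaskMsg`, its field names, and the hand-rolled length-prefixed encoding are hypothetical and are not the wire format this PR uses. The only grounded part is that an Iceberg FileScanTask identifies a data file, a byte range within it, and the delete files that apply.

```rust
// Hypothetical, simplified stand-in for a serialized Iceberg FileScanTask.
// Names and layout are illustrative only, not this PR's actual schema.
#[derive(Debug, Clone, PartialEq)]
struct ScanTaskMsg {
    data_file_path: String,
    start: u64,  // byte offset of the split within the data file
    length: u64, // number of bytes in the split
    delete_file_paths: Vec<String>,
}

// Append a u32 length prefix followed by the UTF-8 bytes of `s`.
fn put_str(buf: &mut Vec<u8>, s: &str) {
    buf.extend_from_slice(&(s.len() as u32).to_le_bytes());
    buf.extend_from_slice(s.as_bytes());
}

// Borrow the next `n` bytes and advance `pos`; None if the buffer is too short.
fn take<'a>(buf: &'a [u8], pos: &mut usize, n: usize) -> Option<&'a [u8]> {
    let end = (*pos).checked_add(n)?;
    let bytes = buf.get(*pos..end)?;
    *pos = end;
    Some(bytes)
}

fn get_u32(buf: &[u8], pos: &mut usize) -> Option<u32> {
    let mut a = [0u8; 4];
    a.copy_from_slice(take(buf, pos, 4)?);
    Some(u32::from_le_bytes(a))
}

fn get_u64(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let mut a = [0u8; 8];
    a.copy_from_slice(take(buf, pos, 8)?);
    Some(u64::from_le_bytes(a))
}

fn get_str(buf: &[u8], pos: &mut usize) -> Option<String> {
    let len = get_u32(buf, pos)? as usize;
    let bytes = take(buf, pos, len)?;
    String::from_utf8(bytes.to_vec()).ok()
}

// "JVM side" in this sketch: flatten one task into bytes.
fn encode(task: &ScanTaskMsg) -> Vec<u8> {
    let mut buf = Vec::new();
    put_str(&mut buf, &task.data_file_path);
    buf.extend_from_slice(&task.start.to_le_bytes());
    buf.extend_from_slice(&task.length.to_le_bytes());
    buf.extend_from_slice(&(task.delete_file_paths.len() as u32).to_le_bytes());
    for p in &task.delete_file_paths {
        put_str(&mut buf, p);
    }
    buf
}

// "Native side" in this sketch: rebuild the task, rejecting truncated input.
fn decode(buf: &[u8]) -> Option<ScanTaskMsg> {
    let mut pos = 0;
    let data_file_path = get_str(buf, &mut pos)?;
    let start = get_u64(buf, &mut pos)?;
    let length = get_u64(buf, &mut pos)?;
    let n = get_u32(buf, &mut pos)?;
    let mut delete_file_paths = Vec::new();
    for _ in 0..n {
        delete_file_paths.push(get_str(buf, &mut pos)?);
    }
    Some(ScanTaskMsg { data_file_path, start, length, delete_file_paths })
}
```

Calling `decode(&encode(&task))` should return the original task. The point of the sketch is that only planned tasks cross the boundary, so the native reader does not need to call back into Iceberg Java to decide what to read.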

How are these changes tested?

New CometIcebergNativeSuite.

Benefits over `iceberg_compat`?

No upstream code changes needed in Iceberg Java, no references to Comet needed in Iceberg anymore.
Better parallelism for file reading, more similar to native_datafusion.
No separate DataFusion runtime; these scans run in the same context as other operators (unlike iceberg_compat).
Better testing for iceberg-rust. I think I already found a shortcoming with row group pruning logic.
Tested with Iceberg 1.5, 1.7, 1.10.

Current Limitations/Concerns?

I lied about no upstream changes. I need one line changed in iceberg-rust and will open a PR there to make an API public. Currently this PR relies on my fork of iceberg-rust.
Need to try running Iceberg Java tests with this. I need to look at our current pipelines, since in theory we don’t want to apply the diff for iceberg_compat to Iceberg.
Need to explore/validate OpenDAL support for credential providers.
We'd need to try to keep iceberg-rust in sync with Comet's DataFusion dependency. I also had to bump my iceberg-rust fork to DataFusion 50.
We've already entangled Comet and Iceberg Java code, what would the deprecation of that code look like?
iceberg-rust uses RecordBatchTransformer instead of SchemaAdapter/PhysicalExprAdapter. Need to understand the compatibility gap there.
Don't have access to ArrowReaderOptions yet (needed for proper Spark-compatible INT96 handling). See https://github.com/apache/iceberg-rust/blob/dc349284a4204c1a56af47fb3177ace6f9e899a0/crates/iceberg/src/arrow/reader.rs#L1384

codecov-commenter · 2025-10-06T02:54:32Z

Codecov Report

❌ Patch coverage is 3.36134% with 575 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.14%. Comparing base (f09f8af) to head (d9a5a1e).
⚠️ Report is 649 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../scala/org/apache/comet/serde/QueryPlanSerde.scala | 2.79% | 344 Missing and 4 partials ⚠️ |
| ...e/spark/sql/comet/CometIcebergNativeScanExec.scala | 0.00% | 111 Missing ⚠️ |
| ...n/scala/org/apache/comet/rules/CometScanRule.scala | 0.97% | 101 Missing and 1 partial ⚠️ |
| ...n/scala/org/apache/comet/rules/CometExecRule.scala | 7.69% | 11 Missing and 1 partial ⚠️ |
| ...la/org/apache/comet/objectstore/NativeConfig.scala | 0.00% | 1 Missing ⚠️ |
| ...n/scala/org/apache/spark/sql/comet/operators.scala | 75.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2528      +/-   ##
============================================
- Coverage     56.12%   55.14%   -0.99%     
- Complexity      976     1386     +410     
============================================
  Files           119      148      +29     
  Lines         11743    14348    +2605     
  Branches       2251     2474     +223     
============================================
+ Hits           6591     7912    +1321     
- Misses         4012     5218    +1206     
- Partials       1140     1218      +78


comphead · 2025-10-06T15:22:35Z

It is promising!


## Which issue does this PR close?

- Part of #1749.

## What changes are included in this PR?

- Change `ArrowReaderBuilder::new` to be `pub` instead of `pub(crate)`.

## Are these changes tested?

- No new tests for this. Currently being used in DataFusion Comet: apache/datafusion-comet#2528


mbutrovich · 2025-10-27T14:40:13Z

This morning's progress:

After fixing custom scheme fallback:

SO CLOSE...

mbutrovich · 2025-10-27T14:44:22Z

I had to turn off countDeletes in the TestSparkReaderDeletes suite because iceberg-rust (rightly) merges the equality deletes with table filters to evaluate them together in arrow-rs's Parquet reader. This makes filtered rows and deleted rows indistinguishable, so the counts won't match. We still get correctness checks after skipping the counts, though, so I'm confident in the tests. They still assert that:

Deletes are correctly applied (rows filtered)
The _deleted metadata column works
Equality and positional deletes function properly
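
To illustrate why the counts stop matching, here is a toy model of the two read strategies. It is a sketch under stated assumptions, not how iceberg-rust or arrow-rs implement this: `Row`, `scan_separately`, and `scan_merged` are hypothetical names, an equality delete is modeled as a set of deleted `id` values, and the table filter is a simple `value >= min_value` check.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Row {
    id: i64,
    value: i64,
}

// Applies deletes first and the table filter second, so the number of
// deleted rows can be reported on its own.
fn scan_separately(rows: &[Row], deleted_ids: &[i64], min_value: i64) -> (Vec<Row>, usize) {
    let mut delete_count = 0;
    let mut out = Vec::new();
    for r in rows {
        if deleted_ids.contains(&r.id) {
            delete_count += 1;
            continue;
        }
        if r.value >= min_value {
            out.push(*r);
        }
    }
    (out, delete_count)
}

// Folds the delete condition and the table filter into one predicate.
// The reader only learns how many rows were dropped in total, not why.
fn scan_merged(rows: &[Row], deleted_ids: &[i64], min_value: i64) -> (Vec<Row>, usize) {
    let keep = |r: &Row| !deleted_ids.contains(&r.id) && r.value >= min_value;
    let out: Vec<Row> = rows.iter().copied().filter(|r| keep(r)).collect();
    let dropped = rows.len() - out.len();
    (out, dropped)
}
```

Both functions return the same surviving rows, which is what the remaining assertions check. Only the first one can say how many rows were removed specifically by deletes, which is the number countDeletes compares against.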


mbutrovich · 2025-10-28T10:53:20Z

Sure, I added fallbacks for two scenarios that iceberg-rust doesn't support yet (pushdown filters with complex types, and a few unsupported Hive partitioning data types), but still...

mbutrovich added 3 commits October 5, 2025 21:53

CometNativeIcebergScan with iceberg-rust using FileScanTasks.

cded0ad

Clean up tests a little.

4f3004b

Remove old comment.

4afec43

mbutrovich added 6 commits October 6, 2025 06:58

Fix machete and missing suite CI failures.

fc97ce9

Fix unused variables.

cca4911

Spark 4.0 needs Iceberg 1.10, let's see if that works in CI.

93f466d

Remove errant println.

970b692

Remove old path() code path.

c44973b

Update old comment.

0f83fd4

mbutrovich added 2 commits October 6, 2025 11:49

Iceberg 1.5.x compatible reflection. Use 1.5.2 for Spark 3.4 and 3.5.

6cbbd09

Fix scalastyle issues.

6966a12

mbutrovich changed the title ~~feat: Iceberg scan based serializing FileScanTasks to iceberg-rust~~ feat: [iceberg] Scan based serializing FileScanTasks to iceberg-rust Oct 6, 2025

mbutrovich force-pushed the iceberg-rust branch from 227332c to 6966a12 Compare October 6, 2025 20:03

mbutrovich changed the title ~~feat: [iceberg] Scan based serializing FileScanTasks to iceberg-rust~~ feat: Iceberg scan based serializing FileScanTasks to iceberg-rust Oct 6, 2025

mbutrovich added 7 commits October 7, 2025 13:03

Merge branch 'main' into iceberg-rust

1153d71

# Conflicts: # native/Cargo.lock # spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala

Remove unused import.

a0f4d63

Clean up docs a bit.

a9cebfd

Refactor and cleanup.

6b2175a

Refactor and cleanup.

3618407

Add IcebergFileStream based on DataFusion, add benchmark. Bump the Iceberg version back to 1.8.1 after hitting known segfaults with old versions.

8091a81

Fix CometReadBenchmark.

880599e

This was referenced Oct 15, 2025

feat(reader): Make ArrowReaderBuilder::new public apache/iceberg-rust#1748

Merged

ArrowReader enhancements for Apache DataFusion Comet apache/iceberg-rust#1749

Open

mbutrovich added 4 commits October 16, 2025 16:04

Merge branch 'main' into iceberg-rust

5127e1c

# Conflicts: # docs/source/user-guide/latest/configs.md # native/Cargo.lock # native/Cargo.toml # native/core/Cargo.toml

Fixes after bringing in upstream/main.

878c971

Basic complex type support.

e66799e

CometFuzzIceberg stuff.

4f2f3b8

mbutrovich added a commit that referenced this pull request Oct 25, 2025

feat: cherry-pick UUID conversion logic from #2528. (#2648)

8078e09

mbutrovich added 11 commits October 25, 2025 10:45

Merge branch 'main' into iceberg-rust

78591fa

# Conflicts: # native/Cargo.lock # native/core/Cargo.toml

Dump DF 50.3 and df50 iceberg-rust commit.

50a60ee

Update metrics recording for iceberg_scan.rs.

3611b8a

FileStreamMetrics for iceberg_scan.rs

6361943

Fix format.

b3c88b9

numSplits metric.

b359171

more filtering tests.

f0b2d54

Change num_splits to be a runtime count instead of serialization time.

a5129d8

Fix Spark 4 with ImmutableSQLMetric.

861a575

New 1.9.1.diff

27a1a75

New 1.8.1.diff

7ca2cd4

mbutrovich added 8 commits October 27, 2025 11:29

Fall back on unsupported file schemes, but add new tests to verify partition pruning works okay. This fixes TestPartitionPruning Iceberg Java tests.

eb09e43

Fix partitioning test in CometIcebergNativeSuite

591ff74

Fix schema evolution with snapshots.

2311d60

Fix schemas for delete files.

0c9a78d

Fall back for now for unsupported partitioning types and filter expressions with complex types.

87f436a

Fix compilation

5a88d19

date32 schema change test.

b0e6452

bump df50

5485508

mbutrovich added 2 commits October 29, 2025 11:55

adjust fallback logic for complex types, add new tests.

eb3b93d

Bump df50.

1740f18

This was referenced Oct 30, 2025

feat(reader): Date32 from days since epoch for Literal:try_from_json apache/iceberg-rust#1803

Open

feat(reader): handle field ID conflicts in RecordBatchTransformer apache/iceberg-rust#1804

Draft

mbutrovich added 3 commits October 29, 2025 20:24

Bump df50.

d9a5a1e

Bump df50.

f76cc99

Bump df50.

f33fb38
