Skip to content

Conversation

@mbutrovich
Copy link
Contributor

@mbutrovich mbutrovich commented Oct 6, 2025

This is mostly for discussion at the moment. There are slides from the 10/9/25 Iceberg-Rust community call here where I presented this effort here.

Rationale for this change

I was inspired by @RussellSpitzer's recent talk and wanted to revisit the abstraction layer at which Comet integrates with Iceberg. We have the iceberg_compat codepath for Iceberg integration, but this requires code changes in Iceberg Java to integrate with Parquet reader instantiation. Instead, this prototype works at the FileScanTask layer after planning. This prototype starts us toward fully-native Iceberg scans to match our Parquet logic with native_datafusion scans without any changes in upstream Iceberg Java code.

What changes are included in this PR?

  • New CometIcebergNativeScanExec node on the Scala side.
  • Use reflection to extract scan properties, mostly FileScanTasks and serialize to native code.
  • New IcebergScanExec on native side that uses FileScanTasks to perform reads in iceberg-rust.

How are these changes tested?

New CometIcebergNativeSuite.

Benefits over iceberg_compat?

  • No upstream code changes needed in Iceberg Java, no references to Comet needed in Iceberg anymore.
  • Better parallelism for file reading, more similar to native_datafusion.
  • No separate DataFusion runtime, these run in the same context as other operators (compared to iceberg_compat).
  • Better testing for iceberg-rust. I think I already found a shortcoming with row group pruning logic.
  • Tested with Iceberg 1.5, 1.7, 1.10.

Current Limitations/Concerns?

  • I lied about no upstream changes. I need one line changed in iceberg-rust and will open a PR there to make an API public. Currently this PR relies on my fork of iceberg-rust.
  • Need to try running Iceberg Java tests with this. I need to look at our current pipelines, since in theory we don’t want to apply the diff for iceberg_compat to Iceberg.
  • Need to explore/validate OpenDAL support for credential providers.
  • We'd need to try to keep iceberg-rust in sync with Comet's DataFusion dependency. I also had to bump my iceberg-rust fork to DataFusion 50.
  • We've already entangled Comet and Iceberg Java code, what would the deprecation of that code look like?
  • RecordBatchTransformer instead of SchemaAdapter/PhysicalExprAdapter. Need to understand the compatibility gap there.
  • Don't have access to ArrowReaderOptions yet (needed for proper Spark-compatible INT96 handling) https://github.com/apache/iceberg-rust/blob/dc349284a4204c1a56af47fb3177ace6f9e899a0/crates/iceberg/src/arrow/reader.rs#L1384.

@codecov-commenter
Copy link

codecov-commenter commented Oct 6, 2025

Codecov Report

❌ Patch coverage is 3.36134% with 575 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.14%. Comparing base (f09f8af) to head (d9a5a1e).
⚠️ Report is 649 commits behind head on main.

Files with missing lines Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 2.79% 344 Missing and 4 partials ⚠️
...e/spark/sql/comet/CometIcebergNativeScanExec.scala 0.00% 111 Missing ⚠️
...n/scala/org/apache/comet/rules/CometScanRule.scala 0.97% 101 Missing and 1 partial ⚠️
...n/scala/org/apache/comet/rules/CometExecRule.scala 7.69% 11 Missing and 1 partial ⚠️
...la/org/apache/comet/objectstore/NativeConfig.scala 0.00% 1 Missing ⚠️
...n/scala/org/apache/spark/sql/comet/operators.scala 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2528      +/-   ##
============================================
- Coverage     56.12%   55.14%   -0.99%     
- Complexity      976     1386     +410     
============================================
  Files           119      148      +29     
  Lines         11743    14348    +2605     
  Branches       2251     2474     +223     
============================================
+ Hits           6591     7912    +1321     
- Misses         4012     5218    +1206     
- Partials       1140     1218      +78     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@comphead
Copy link
Contributor

comphead commented Oct 6, 2025

It is promising!

@mbutrovich mbutrovich changed the title feat: Iceberg scan based serializing FileScanTasks to iceberg-rust feat: [iceberg] Scan based serializing FileScanTasks to iceberg-rust Oct 6, 2025
@mbutrovich mbutrovich changed the title feat: [iceberg] Scan based serializing FileScanTasks to iceberg-rust feat: Iceberg scan based serializing FileScanTasks to iceberg-rust Oct 6, 2025
# Conflicts:
#	native/Cargo.lock
#	spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala
…eberg version back to 1.8.1 after hitting known segfaults with old versions.
liurenjie1024 pushed a commit to apache/iceberg-rust that referenced this pull request Oct 16, 2025
## Which issue does this PR close?


- Part of #1749.

## What changes are included in this PR?

- Change `ArrowReaderBuilder::new` to be `pub` instead of `pub(crate)`.

## Are these changes tested?

- No new tests for this. Currently being used in DataFusion Comet:
apache/datafusion-comet#2528
# Conflicts:
#	docs/source/user-guide/latest/configs.md
#	native/Cargo.lock
#	native/Cargo.toml
#	native/core/Cargo.toml
@mbutrovich
Copy link
Contributor Author

mbutrovich commented Oct 27, 2025

This morning's progress:
Screenshot 2025-10-27 at 10 39 41 AM
After fixing custom scheme fallback:
Screenshot 2025-10-27 at 12 28 16 PM
SO CLOSE...
Screenshot 2025-10-27 at 8 23 51 PM

@mbutrovich
Copy link
Contributor Author

I had to turn off countDeletes in the TestSparkReaderDeletes suite because iceberg-rust (rightly) merges the equality deletes with table filters to evaluate them together in Arrow-rs's Parquet reader. This makes filtered rows and deleted rows indistinguishable, so the counts won't match. We still get correctness checks after skipped the counts though, so I'm confident in the tests. They still assert that:

  • Deletes are correctly applied (rows filtered)
  • The _deleted metadata column works
  • Equality and positional deletes function properly

@mbutrovich
Copy link
Contributor Author

mbutrovich commented Oct 28, 2025

Sure I added fallbacks for 2 scenarios that iceberg-rust doesn't support yet (pushdown filters with complex types, and a few unsupported hive partitioning data types), but still...
Screenshot 2025-10-28 at 2 09 29 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants