102 commits
cded0ad
CometNativeIcebergScan with iceberg-rust using FileScanTasks.
mbutrovich Oct 6, 2025
4f3004b
Clean up tests a little.
mbutrovich Oct 6, 2025
4afec43
Remove old comment.
mbutrovich Oct 6, 2025
fc97ce9
Fix machete and missing suite CI failures.
mbutrovich Oct 6, 2025
cca4911
Fix unused variables.
mbutrovich Oct 6, 2025
93f466d
Spark 4.0 needs Iceberg 1.10, let's see if that works in CI.
mbutrovich Oct 6, 2025
970b692
Remove errant println.
mbutrovich Oct 6, 2025
c44973b
Remove old path() code path.
mbutrovich Oct 6, 2025
0f83fd4
Update old comment.
mbutrovich Oct 6, 2025
6cbbd09
Iceberg 1.5.x compatible reflection. Use 1.5.2 for Spark 3.4 and 3.5.
mbutrovich Oct 6, 2025
6966a12
Fix scalastyle issues.
mbutrovich Oct 6, 2025
1153d71
Merge branch 'main' into iceberg-rust
mbutrovich Oct 7, 2025
a0f4d63
Remove unused import.
mbutrovich Oct 7, 2025
a9cebfd
Clean up docs a bit.
mbutrovich Oct 7, 2025
6b2175a
Refactor and cleanup.
mbutrovich Oct 7, 2025
3618407
Refactor and cleanup.
mbutrovich Oct 7, 2025
8091a81
Add IcebergFileStream based on DataFusion, add benchmark. Bump the Ic…
mbutrovich Oct 8, 2025
880599e
Fix CometReadBenchmark.
mbutrovich Oct 8, 2025
5127e1c
Merge branch 'main' into iceberg-rust
mbutrovich Oct 16, 2025
878c971
Fixes after bringing in upstream/main.
mbutrovich Oct 16, 2025
e66799e
Basic complex type support.
mbutrovich Oct 16, 2025
4f2f3b8
CometFuzzIceberg stuff.
mbutrovich Oct 20, 2025
71df65c
Merge branch 'main' into iceberg-rust
mbutrovich Oct 21, 2025
3371cc1
format and fix conflicts.
mbutrovich Oct 21, 2025
1c40d43
Basic S3 test and properties support
mbutrovich Oct 21, 2025
40c9a07
Fix NPE.
mbutrovich Oct 21, 2025
19797f3
Merge branch 'main' into iceberg-rust
mbutrovich Oct 21, 2025
236b339
Support migrated tables via https://github.com/apache/iceberg-rust/pu…
mbutrovich Oct 22, 2025
ce367cc
Update df50 commit based on field ID fix.
mbutrovich Oct 22, 2025
bd6c609
Bump df50 commit.
mbutrovich Oct 22, 2025
33fa891
Support hive-partitioned Parquet files migrated to Iceberg tables wit…
mbutrovich Oct 22, 2025
ca13cc6
Bump df50.
mbutrovich Oct 22, 2025
b4e829f
Merge branch 'main' into iceberg-rust
mbutrovich Oct 22, 2025
e19e201
Fix after merging main.
mbutrovich Oct 22, 2025
52019a9
update df50.
mbutrovich Oct 23, 2025
e62a1ee
fall back for table format v3, ORC, and Avro scans.
mbutrovich Oct 23, 2025
b97f36a
Fix TestFilterPushDown Iceberg Java suite by including filters in exp…
mbutrovich Oct 23, 2025
08bfd70
Fix format.
mbutrovich Oct 23, 2025
a3bf186
Fix format.
mbutrovich Oct 23, 2025
a51652f
Fix UUID Iceberg type.
mbutrovich Oct 24, 2025
b06800c
Fix UUID Iceberg test.
mbutrovich Oct 24, 2025
905dc97
Bump df50.
mbutrovich Oct 24, 2025
bdb5029
Merge branch 'main' into iceberg-rust
mbutrovich Oct 24, 2025
f8714bc
Iceberg planning and output_rows metrics.
mbutrovich Oct 25, 2025
5f8256e
more output_rows tests.
mbutrovich Oct 25, 2025
78591fa
Merge branch 'main' into iceberg-rust
mbutrovich Oct 25, 2025
50a60ee
Dump DF 50.3 and df50 iceberg-rust commit.
mbutrovich Oct 25, 2025
3611b8a
Update metrics recording for iceberg_scan.rs.
mbutrovich Oct 25, 2025
6361943
FileStreamMetrics for iceberg_scan.rs
mbutrovich Oct 25, 2025
b3c88b9
Fix format.
mbutrovich Oct 25, 2025
b359171
numSplits metric.
mbutrovich Oct 26, 2025
f0b2d54
more filtering tests.
mbutrovich Oct 26, 2025
a5129d8
Change num_splits to be a runtime count instead of serialization time.
mbutrovich Oct 26, 2025
861a575
Fix Spark 4 with ImmutableSQLMetric.
mbutrovich Oct 26, 2025
27a1a75
New 1.9.1.diff
mbutrovich Oct 27, 2025
7ca2cd4
New 1.8.1.diff
mbutrovich Oct 27, 2025
eb09e43
Fall back on unsupported file schemes, but add new tests to verify pa…
mbutrovich Oct 27, 2025
591ff74
Fix partitioning test in CometIcebergNativeSuite
mbutrovich Oct 27, 2025
2311d60
Fix schema evolution with snapshots.
mbutrovich Oct 27, 2025
0c9a78d
Fix schemas for delete files.
mbutrovich Oct 28, 2025
87f436a
Fall back for now for unsupported partitioning types and filter expre…
mbutrovich Oct 28, 2025
5a88d19
Fix compilation
mbutrovich Oct 28, 2025
b0e6452
date32 schema change test.
mbutrovich Oct 28, 2025
5485508
bump df50
mbutrovich Oct 28, 2025
eb3b93d
adjust fallback logic for complex types, add new tests.
mbutrovich Oct 29, 2025
1740f18
Bump df50.
mbutrovich Oct 29, 2025
d9a5a1e
Bump df50.
mbutrovich Oct 30, 2025
f76cc99
Bump df50.
mbutrovich Oct 30, 2025
f33fb38
Bump df50.
mbutrovich Oct 30, 2025
133772d
Serialize PartitionSpec stuff. Fixes ~50 spark-extensions tests from …
mbutrovich Oct 30, 2025
bf1342f
Bump df50.
mbutrovich Oct 30, 2025
a719a95
Merge branch 'main' into iceberg-rust
mbutrovich Oct 30, 2025
caf21c5
Bump df50.
mbutrovich Oct 31, 2025
a2021b5
Fall back on InMemoryFileIO tables (views).
mbutrovich Oct 31, 2025
03afbbd
Fall back on truncate function.
mbutrovich Oct 31, 2025
9ae3605
Add fuzz iceberg suite to CI again (it got lost when updating main)
mbutrovich Nov 3, 2025
30a27e1
Merge branch 'main' into iceberg-rust
mbutrovich Nov 3, 2025
e3b0806
Apply #2675's partitioning fix to IcebergScanExec.
mbutrovich Nov 3, 2025
2497ead
move IcebergScan serialization logic to a new file.
mbutrovich Nov 3, 2025
cf09648
separate checks and serialization logic, reduce redundant checks
mbutrovich Nov 3, 2025
1f86a8e
remove num_partitions serialization
mbutrovich Nov 3, 2025
c5ce759
clean up planner.rs deserialization and comments
mbutrovich Nov 3, 2025
b53fa78
clean up iceberg_scan.rs comments
mbutrovich Nov 3, 2025
58e3b3a
clean up CometIcebergNativeScanExec comments
mbutrovich Nov 3, 2025
fca2dd7
clean up more scala comments
mbutrovich Nov 3, 2025
6f77912
Clean up planner.rs comments.
mbutrovich Nov 3, 2025
b88facf
clean up more planner.rs comments
mbutrovich Nov 3, 2025
b37a8cb
Merge branch 'main' into iceberg-rust
mbutrovich Nov 3, 2025
47894e7
fix conflicts with main
mbutrovich Nov 3, 2025
fdc149e
Fix TestForwardCompatibility
mbutrovich Nov 3, 2025
d63829d
Fix serialization of partitionData, bump df50 to fix deserialization …
mbutrovich Nov 3, 2025
f2f1807
Format
mbutrovich Nov 3, 2025
32c35b9
Fix format
mbutrovich Nov 4, 2025
1a169b3
Fix format for realsies
mbutrovich Nov 4, 2025
c58d2ce
name mapping changes for iceberg-rust #1821.
mbutrovich Nov 4, 2025
c962714
clean up stray comments, format
mbutrovich Nov 4, 2025
7277365
Merge branch 'main' into iceberg-rust
mbutrovich Nov 4, 2025
a52c69d
Update 1.8.1.diff with spotlessApply.
mbutrovich Nov 6, 2025
95f6e24
Merge branch 'main' into iceberg-rust
mbutrovich Nov 6, 2025
1b82ac3
Merge branch 'main' into iceberg-rust
mbutrovich Nov 6, 2025
2cd4d7d
No longer inject partition default-values, it's redundant now that we…
mbutrovich Nov 6, 2025
d88c911
Fix format.
mbutrovich Nov 6, 2025
2 changes: 2 additions & 0 deletions .github/workflows/pr_build_linux.yml
@@ -103,6 +103,7 @@ jobs:
value: |
org.apache.comet.CometFuzzTestSuite
org.apache.comet.CometFuzzAggregateSuite
org.apache.comet.CometFuzzIcebergSuite
org.apache.comet.CometFuzzMathSuite
org.apache.comet.DataGeneratorSuite
- name: "shuffle"
@@ -124,6 +125,7 @@ jobs:
org.apache.spark.sql.comet.ParquetDatetimeRebaseV2Suite
org.apache.spark.sql.comet.ParquetEncryptionITCase
org.apache.comet.exec.CometNativeReaderSuite
org.apache.comet.CometIcebergNativeSuite
- name: "exec"
value: |
org.apache.comet.exec.CometAggregateSuite
2 changes: 2 additions & 0 deletions .github/workflows/pr_build_macos.yml
@@ -68,6 +68,7 @@ jobs:
value: |
org.apache.comet.CometFuzzTestSuite
org.apache.comet.CometFuzzAggregateSuite
org.apache.comet.CometFuzzIcebergSuite
org.apache.comet.CometFuzzMathSuite
org.apache.comet.DataGeneratorSuite
- name: "shuffle"
@@ -89,6 +90,7 @@ jobs:
org.apache.spark.sql.comet.ParquetDatetimeRebaseV2Suite
org.apache.spark.sql.comet.ParquetEncryptionITCase
org.apache.comet.exec.CometNativeReaderSuite
org.apache.comet.CometIcebergNativeSuite
- name: "exec"
value: |
org.apache.comet.exec.CometAggregateSuite
10 changes: 10 additions & 0 deletions common/src/main/scala/org/apache/comet/CometConf.scala
@@ -123,6 +123,16 @@ object CometConf extends ShimCometConf {
      .getOrElse("COMET_PARQUET_SCAN_IMPL", SCAN_AUTO)
      .toLowerCase(Locale.ROOT))

  val COMET_ICEBERG_NATIVE_ENABLED: ConfigEntry[Boolean] =
    conf("spark.comet.scan.icebergNative.enabled")
      .category(CATEGORY_SCAN)
      .doc(
        "Whether to enable native Iceberg table scan using iceberg-rust. When enabled, " +
          "Iceberg tables are read directly through native execution, bypassing Spark's " +
          "DataSource V2 API for better performance.")
      .booleanConf
      .createWithDefault(false)

  val COMET_RESPECT_PARQUET_FILTER_PUSHDOWN: ConfigEntry[Boolean] =
    conf("spark.comet.parquet.respectFilterPushdown")
      .category(CATEGORY_PARQUET)
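
The `spark.comet.scan.icebergNative.enabled` entry in this hunk is a boolean that defaults to `false`, so native Iceberg scans are opt-in. The sketch below is plain Scala with no Spark or Comet dependency. `IcebergNativeFlag`, `isEnabled`, and `toSubmitArgs` are hypothetical helpers invented for this illustration and are not part of Comet; only the key and its default come from the diff.

```scala
object IcebergNativeFlag {
  // Key and default copied from the CometConf entry in this diff.
  val Key: String = "spark.comet.scan.icebergNative.enabled"
  val Default: Boolean = false

  // Hypothetical helper: resolve the flag from plain key/value settings,
  // falling back to the default when the key is absent.
  def isEnabled(settings: Map[String, String]): Boolean =
    settings.get(Key).map(_.trim.toBoolean).getOrElse(Default)

  // Hypothetical helper: render settings as `--conf key=value` argument pairs,
  // sorted by key so the output is deterministic.
  def toSubmitArgs(settings: Map[String, String]): Seq[String] =
    settings.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

With an empty settings map `isEnabled` returns `false`, which matches `createWithDefault(false)`; adding `Key -> "true"` turns it on. Any other Comet or Iceberg catalog settings a real job needs are assumed to be configured separately.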
@@ -56,7 +56,7 @@ object NativeConfig {
    * consistent and standardized cloud storage support across all providers.
    */
   def extractObjectStoreOptions(hadoopConf: Configuration, uri: URI): Map[String, String] = {
-    val scheme = uri.getScheme.toLowerCase(Locale.ROOT)
+    val scheme = Option(uri.getScheme).map(_.toLowerCase(Locale.ROOT)).getOrElse("file")

     import scala.jdk.CollectionConverters._
     val options = scala.collection.mutable.Map[String, String]()
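
The `val scheme` change in this hunk matters because `java.net.URI.getScheme` returns `null` for a URI without a scheme, such as a bare local path. In that case the old `uri.getScheme.toLowerCase(...)` would throw a `NullPointerException`. The following is a minimal standalone sketch of the new expression. `SchemeFallback` and `schemeOrFile` are names made up for this example, and the sample URIs are illustrative only.

```scala
import java.net.URI
import java.util.Locale

object SchemeFallback {
  // Same expression as the updated line: wrap the possibly-null scheme in Option,
  // lower-case it when present, and treat a missing scheme as a local "file" path.
  def schemeOrFile(uri: URI): String =
    Option(uri.getScheme).map(_.toLowerCase(Locale.ROOT)).getOrElse("file")

  def main(args: Array[String]): Unit = {
    // A URI with an explicit (upper-case) scheme is normalized to lower case.
    println(schemeOrFile(new URI("S3A://bucket/warehouse/db/table"))) // prints "s3a"
    // A bare path has no scheme, so getScheme is null and the fallback applies.
    println(schemeOrFile(new URI("/tmp/warehouse/db/table"))) // prints "file"
  }
}
```

Defaulting to `"file"` is the choice made in the diff itself. The sketch only isolates that expression so the null case can be exercised without a Hadoop `Configuration`.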
1 change: 1 addition & 0 deletions dev/ci/check-suites.py
@@ -34,6 +34,7 @@ def file_to_class_name(path: Path) -> str | None:
ignore_list = [
"org.apache.comet.parquet.ParquetReadSuite", # abstract
"org.apache.comet.parquet.ParquetReadFromS3Suite", # manual test suite
"org.apache.comet.IcebergReadFromS3Suite", # manual test suite
"org.apache.spark.sql.comet.CometPlanStabilitySuite", # abstract
"org.apache.spark.sql.comet.ParquetDatetimeRebaseSuite", # abstract
"org.apache.comet.exec.CometColumnarShuffleSuite" # abstract
170 changes: 90 additions & 80 deletions dev/diffs/iceberg/1.8.1.diff

Large diffs are not rendered by default.

170 changes: 90 additions & 80 deletions dev/diffs/iceberg/1.9.1.diff

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/source/user-guide/latest/compatibility.md
@@ -56,8 +56,8 @@ and sorting on floating-point data can be enabled by setting `spark.comet.expres
## Incompatible Expressions

Expressions that are not 100% Spark-compatible will fall back to Spark by default and can be enabled by setting
`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See
the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting.
`spark.comet.expression.EXPRNAME.allowIncompatible=true`, where `EXPRNAME` is the Spark expression class name. See
the [Comet Supported Expressions Guide](expressions.md) for more information on this configuration setting.

It is also possible to specify `spark.comet.expression.allowIncompatible=true` to enable all
incompatible expressions.
1 change: 1 addition & 0 deletions docs/source/user-guide/latest/configs.md
@@ -32,6 +32,7 @@ Comet provides the following configuration settings.
| `spark.comet.convert.parquet.enabled` | When enabled, data from Spark (non-native) Parquet v1 and v2 scans will be converted to Arrow format. Note that to enable native vectorized execution, both this config and `spark.comet.exec.enabled` need to be enabled. | false |
| `spark.comet.scan.allowIncompatible` | Some Comet scan implementations are not currently fully compatible with Spark for all datatypes. Set this config to true to allow them anyway. For more information, refer to the [Comet Compatibility Guide](https://datafusion.apache.org/comet/user-guide/compatibility.html). | false |
| `spark.comet.scan.enabled` | Whether to enable native scans. When this is turned on, Spark will use Comet to read supported data sources (currently only Parquet is supported natively). Note that to enable native vectorized execution, both this config and `spark.comet.exec.enabled` need to be enabled. | true |
| `spark.comet.scan.icebergNative.enabled` | Whether to enable native Iceberg table scan using iceberg-rust. When enabled, Iceberg tables are read directly through native execution, bypassing Spark's DataSource V2 API for better performance. | false |
| `spark.comet.scan.preFetch.enabled` | Whether to enable pre-fetching feature of CometScan. | false |
| `spark.comet.scan.preFetch.threadNum` | The number of threads running pre-fetching for CometScan. Effective if spark.comet.scan.preFetch.enabled is enabled. Note that more pre-fetching threads means more memory requirement to store pre-fetched row groups. | 2 |
| `spark.comet.sparkToColumnar.enabled` | Whether to enable Spark to Arrow columnar conversion. When this is turned on, Comet will convert operators in `spark.comet.sparkToColumnar.supportedOperatorList` into Arrow columnar format before processing. | false |