Skip to content

feat: Iceberg V2 delete file support in druid-iceberg-extensions#19266

Open
Shekharrajak wants to merge 51 commits into
apache:masterfrom
Shekharrajak:feature/iceberg-v2-delete-support
Open

feat: Iceberg V2 delete file support in druid-iceberg-extensions#19266
Shekharrajak wants to merge 51 commits into
apache:masterfrom
Shekharrajak:feature/iceberg-v2-delete-support

Conversation

@Shekharrajak

@Shekharrajak Shekharrajak commented Apr 6, 2026

Copy link
Copy Markdown
Contributor

Ref #19190

Description

In IcebergCatalog.extractSnapshotDataFiles(), line:

dataFilePaths.add(task.file().location());

discards task.deletes() entirely. Every FileScanTask from tableScan.planFiles() carries a List that must
be applied for correct v2 reads. The current code passes only raw file paths to warehouseSource, and Druid's
ParquetReader has zero awareness of Iceberg delete files.

Changes

DeleteFileInfo.java Serializable POJO: path, contentType (POSITION/EQUALITY), equalityFieldIds
IcebergFileTaskInputSource.java Per-task InputSource carrying data file + delete metadata + schema JSON +
warehouseSource
IcebergNativeRecordReader.java Manual positional + equality delete application with streaming reads via
Parquet.read()
IcebergRecordConverter.java Iceberg Record to Map with full type coverage

Release note

Iceberg V2 Delete File Support: When FileScanTask.deletes() returns non-empty, the extension creates per-task IcebergFileTaskInputSource objects
carrying serializable metadata (data file path, delete file paths/types/equality field IDs, schema JSON). Workers
apply deletes at read time via IcebergNativeRecordReader which reads position-delete and equality-delete Parquet
files, builds filter sets, and streams the data file while skipping deleted rows. V1 tables (no delete files) continue
to use the existing warehouseSource path unchanged.


Key changed/added classes in this PR
  • DeleteFileInfo.java
  • IcebergFileTaskInputSource.java
  • IcebergNativeRecordReader.java
  • IcebergRecordConverter.java
  • IcebergInputSource.java
  • IcebergCatalog.java
  • IcebergDruidModule.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@Shekharrajak Shekharrajak changed the title Iceberg V2 Delete File Support Iceberg V2 delete file support in druid-iceberg-extensions Apr 6, 2026
@Shekharrajak Shekharrajak changed the title Iceberg V2 delete file support in druid-iceberg-extensions feat: Iceberg V2 delete file support in druid-iceberg-extensions Apr 6, 2026

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings that could not be attached inline:

  • extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergInputSource.java:164 - [P1] V2 tables with deletes produce zero splits. When any delete file is present, retrieveIcebergDatafiles() sets delegateInputSource to EmptyInputSource, so createSplits() returns an empty stream and estimateNumSplits() returns 0. MSQ and parallel ingestion slice SplittableInputSource inputs exclusively through createSplits()/withSplit(), so an Iceberg v2 table with deletes will schedule no readable input slices and ingest no rows instead of using the native reader path.

@Shekharrajak

Copy link
Copy Markdown
Contributor Author
  • extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergInputSource.java:164 - [P1] V2 tables with deletes produce zero splits. When any delete file is present, retrieveIcebergDatafiles() sets delegateInputSource to EmptyInputSource, so createSplits() returns an empty stream and estimateNumSplits() returns 0. MSQ and parallel ingestion slice SplittableInputSource inputs exclusively through createSplits()/withSplit(), so an Iceberg v2 table with deletes will schedule no readable input slices and ingest no rows instead of using the native reader path.

updated : 309c97b#diff-1b9776e43fa17c32a610eee043a6fd6cbf32b29ae4026d24ea798ac9a6f638bcR168

@Shekharrajak

Copy link
Copy Markdown
Contributor Author

Noted gaps in iceberg v2 spec support #19471

@Shekharrajak Shekharrajak force-pushed the feature/iceberg-v2-delete-support branch from 6c98a82 to bab8237 Compare May 18, 2026 11:53

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity Findings
P0 0
P1 0
P2 1
P3 0
Total 1
Severity Findings
P0 0
P1 0
P2 1
P3 0
Total 1

Reviewed 15 of 15 changed files. The earlier FileIO metadata follow-up is addressed, so no inline reply is needed; the new finding is below.


This is an automated review by Codex GPT-5.5

@Shekharrajak

Copy link
Copy Markdown
Contributor Author

Flaky test reported #19491

Please help in triggering the one failed CI check run. We can work on this flaky test separately.

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 15 of 15 changed files.


This is an automated review by Codex GPT-5.5

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity Findings
P0 0
P1 1
P2 1
P3 0
Total 2

Reviewed 15 of 15 changed files.


This is an automated review by Codex GPT-5.5

Comment thread extensions-contrib/druid-iceberg-extensions/pom.xml Outdated
this.tableSchemaJson = tableSchemaJson;
this.warehouseSource = warehouseSource;
this.inputRowSchema = inputRowSchema;
this.hadoopConf = new Configuration();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] V2 reader drops configured warehouse/Hadoop access

The v2 path constructs a fresh Hadoop Configuration and then opens data and delete files through a recreated Iceberg FileIO, while the configured warehouseSource is only stored and never used. Existing HDFS/S3 warehouse ingestion relies on warehouseSource and injected Hadoop configuration, so v2 tables with deletes can plan successfully on the controller but fail on workers when those files require the Druid warehouse source or cluster Hadoop credentials.

@Shekharrajak Shekharrajak force-pushed the feature/iceberg-v2-delete-support branch from fe5c0c9 to 2cbaeb6 Compare May 27, 2026 17:43
@Shekharrajak Shekharrajak force-pushed the feature/iceberg-v2-delete-support branch from 2cbaeb6 to 5ecc28c Compare May 27, 2026 17:44

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity Findings
P0 0
P1 2
P2 0
P3 0
Total 2

The original iceberg-data dependency follow-up is handled; I found two remaining v2 read-path issues.

Reviewed 16 of 16 changed files.


This is an automated review by Codex GPT-5.5

@JsonProperty("fileIOImpl") @Nullable final String fileIOImpl,
@JsonProperty("fileIOProperties") @Nullable final Map<String, String> fileIOProperties,
@JsonProperty("fileFormat") @Nullable final String fileFormat,
final Configuration hadoopConf

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Unbound Configuration breaks v2 split deserialization

IcebergFileTaskInputSource is registered as a Jackson subtype and returned from withSplit for v2 tables, so workers need to deserialize it. The @JsonCreator leaves hadoopConf as an unannotated constructor parameter, unlike the catalog classes that use @JacksonInject @hiveconf, so Druid's ObjectMapper has no property or injectable value to bind. This can make v2 delete splits fail before reading. Inject it, annotate it, or create a safe default, and add a Jackson round-trip test for this input source.


// Step 3: Stream data file with delete application
requireParquet(dataFilePath);
final InputFile dataInputFile = fileIO.newInputFile(dataFilePath);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] V2 path bypasses warehouseSource file access

When delete files are present, the new reader opens data and delete files through Iceberg FileIO instead of the configured warehouseSource/inputFormat. Existing Iceberg specs use warehouseSource for S3/GCS/local access settings, endpoints, and credentials, so a table that only adds delete files can silently switch file-access mechanisms and fail or read from the wrong filesystem. The v2 path should preserve or translate the warehouseSource-backed access configuration for native reads.

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the incremental update and full changed-file set for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 17 of 17 changed files.


This is an automated review by Codex GPT-5.5

…lasspath collision with druid-orc-extensions
@jtuglu1 jtuglu1 requested a review from a2l007 June 1, 2026 18:21
@jtuglu1

jtuglu1 commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Related: #19534

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the incremental update and full changed-file set for correctness, edge cases, lifecycle, API compatibility, and integration risks; no issues found.

Reviewed 17 of 17 changed files.


This is an automated review by Codex GPT-5.5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants