feat: Iceberg V2 delete file support in druid-iceberg-extensions by Shekharrajak · Pull Request #19266 · apache/druid

Shekharrajak · 2026-04-06T10:26:35Z

Ref #19190

Description

In IcebergCatalog.extractSnapshotDataFiles(), the line

dataFilePaths.add(task.file().location());

discards task.deletes() entirely. Every FileScanTask from tableScan.planFiles() carries a List<DeleteFile> that must
be applied for correct v2 reads. The current code passes only raw file paths to warehouseSource, and Druid's
ParquetReader has no awareness of Iceberg delete files.
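
To make the gap concrete, here is a minimal sketch of the difference between collecting only data file paths and keeping each task's delete files alongside its data file. DeleteFileStub, ScanTaskStub, PlannedFile, and DeleteAwarePlanning are hypothetical stand-ins invented for illustration; they are not the Iceberg interfaces or the classes added in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for Iceberg's DeleteFile and FileScanTask.
// The real interfaces live in org.apache.iceberg and expose much more.
record DeleteFileStub(String location, String content) {}

record ScanTaskStub(String dataFileLocation, List<DeleteFileStub> deletes) {}

// Hypothetical per-task result that keeps a data file with its deletes.
record PlannedFile(String dataFilePath, List<DeleteFileStub> deletes) {}

final class DeleteAwarePlanning {
  // Mirrors the behavior described above: only data file paths survive,
  // so any delete files attached to the task are lost.
  static List<String> pathsOnly(List<ScanTaskStub> tasks) {
    List<String> paths = new ArrayList<>();
    for (ScanTaskStub task : tasks) {
      paths.add(task.dataFileLocation());
    }
    return paths;
  }

  // Keeps each task's delete files next to its data file.
  static List<PlannedFile> withDeletes(List<ScanTaskStub> tasks) {
    List<PlannedFile> planned = new ArrayList<>();
    for (ScanTaskStub task : tasks) {
      planned.add(new PlannedFile(task.dataFileLocation(), List.copyOf(task.deletes())));
    }
    return planned;
  }

  // True when at least one task carries a delete file.
  static boolean hasAnyDeletes(List<ScanTaskStub> tasks) {
    return tasks.stream().anyMatch(t -> !t.deletes().isEmpty());
  }
}
```

A check like hasAnyDeletes is one plausible way to decide whether a scan can stay on the existing path or needs delete-aware reading.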

Changes

File	Description
DeleteFileInfo.java	Serializable POJO: path, contentType (POSITION/EQUALITY), equalityFieldIds
IcebergFileTaskInputSource.java	Per-task InputSource carrying data file + delete metadata + schema JSON + warehouseSource
IcebergNativeRecordReader.java	Manual positional + equality delete application with streaming reads via Parquet.read()
IcebergRecordConverter.java	Iceberg Record to Map with full type coverage

Release note

Iceberg V2 Delete File Support: When FileScanTask.deletes() returns non-empty, the extension creates per-task IcebergFileTaskInputSource objects
carrying serializable metadata (data file path, delete file paths/types/equality field IDs, schema JSON). Workers
apply deletes at read time via IcebergNativeRecordReader which reads position-delete and equality-delete Parquet
files, builds filter sets, and streams the data file while skipping deleted rows. V1 tables (no delete files) continue
to use the existing warehouseSource path unchanged.
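
As a rough illustration of "builds filter sets, and streams the data file while skipping deleted rows", the following sketch applies a set of deleted row positions and a set of deleted equality keys to in-memory rows. It is a simplification under stated assumptions: DeleteFilterSketch and its method names are hypothetical, rows are plain maps, the result is collected into a list rather than streamed, and details of the Iceberg spec such as sequence-number scoping of equality deletes are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DeleteFilterSketch {
  // deletedPositions: 0-based row ordinals within one data file (position deletes).
  // deletedKeys: value tuples over equalityColumns (equality deletes).
  static List<Map<String, Object>> applyDeletes(
      Iterable<Map<String, Object>> dataRows,
      Set<Long> deletedPositions,
      List<String> equalityColumns,
      Set<List<Object>> deletedKeys
  ) {
    List<Map<String, Object>> kept = new ArrayList<>();
    long position = 0;
    for (Map<String, Object> row : dataRows) {
      boolean deletedByPosition = deletedPositions.contains(position);
      position++; // advance for every row read, including skipped ones
      if (deletedByPosition) {
        continue;
      }
      if (!equalityColumns.isEmpty() && deletedKeys.contains(keyOf(row, equalityColumns))) {
        continue;
      }
      kept.add(row);
    }
    return kept;
  }

  // Projects a row onto the equality columns, in column order.
  static List<Object> keyOf(Map<String, Object> row, List<String> columns) {
    List<Object> key = new ArrayList<>(columns.size());
    for (String column : columns) {
      key.add(row.get(column));
    }
    return key;
  }
}
```

The position counter has to advance for skipped rows too, otherwise later position deletes would be matched against the wrong rows.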

Key changed/added classes in this PR

DeleteFileInfo.java
IcebergFileTaskInputSource.java
IcebergNativeRecordReader.java
IcebergRecordConverter.java
IcebergInputSource.java
IcebergCatalog.java
IcebergDruidModule.java

FrankChen021

Findings that could not be attached inline:

extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergInputSource.java:164 - [P1] V2 tables with deletes produce zero splits. When any delete file is present, retrieveIcebergDatafiles() sets delegateInputSource to EmptyInputSource, so createSplits() returns an empty stream and estimateNumSplits() returns 0. MSQ and parallel ingestion slice SplittableInputSource inputs exclusively through createSplits()/withSplit(), so an Iceberg v2 table with deletes will schedule no readable input slices and ingest no rows instead of using the native reader path.
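
A minimal sketch of the property this finding asks for: split creation and split estimation should both be driven by the planned tasks, so a table with deletes still yields work. PlannedTask and V2SplitSketch are hypothetical names for illustration and do not reflect Druid's SplittableInputSource signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planned unit: one data file plus the delete files scoped to it.
record PlannedTask(String dataFilePath, List<String> deleteFilePaths) {}

final class V2SplitSketch {
  // One split per planned task, rather than delegating to an empty source.
  static List<PlannedTask> createSplits(List<PlannedTask> tasks) {
    return new ArrayList<>(tasks);
  }

  // Derived from createSplits so schedulers see a consistent count.
  static int estimateNumSplits(List<PlannedTask> tasks) {
    return createSplits(tasks).size();
  }
}
```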

Shekharrajak · 2026-05-17T07:31:41Z

Updated in 309c97b#diff-1b9776e43fa17c32a610eee043a6fd6cbf32b29ae4026d24ea798ac9a6f638bcR168

Shekharrajak · 2026-05-17T08:30:43Z

Noted gaps in iceberg v2 spec support #19471

FrankChen021

Severity	Findings
P0	0
P1	0
P2	1
P3	0
Total	1

Reviewed 15 of 15 changed files. The earlier FileIO metadata follow-up is addressed, so no inline reply is needed; the new finding is below.

This is an automated review by Codex GPT-5.5

Shekharrajak · 2026-05-20T17:33:01Z

Flaky test reported #19491

Please help re-trigger the one failed CI check. We can work on this flaky test separately.

FrankChen021

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 15 of 15 changed files.

This is an automated review by Codex GPT-5.5

FrankChen021

Severity	Findings
P0	0
P1	1
P2	1
P3	0
Total	2

Reviewed 15 of 15 changed files.

This is an automated review by Codex GPT-5.5

FrankChen021 · 2026-05-27T16:29:24Z

+    this.tableSchemaJson = tableSchemaJson;
+    this.warehouseSource = warehouseSource;
+    this.inputRowSchema = inputRowSchema;
+    this.hadoopConf = new Configuration();


[P2] V2 reader drops configured warehouse/Hadoop access

The v2 path constructs a fresh Hadoop Configuration and then opens data and delete files through a recreated Iceberg FileIO, while the configured warehouseSource is only stored and never used. Existing HDFS/S3 warehouse ingestion relies on warehouseSource and injected Hadoop configuration, so v2 tables with deletes can plan successfully on the controller but fail on workers when those files require the Druid warehouse source or cluster Hadoop credentials.

FrankChen021

Severity	Findings
P0	0
P1	2
P2	0
P3	0
Total	2

The original iceberg-data dependency follow-up is handled; I found two remaining v2 read-path issues.

Reviewed 16 of 16 changed files.

This is an automated review by Codex GPT-5.5

FrankChen021 · 2026-05-29T13:18:05Z

+      @JsonProperty("fileIOImpl") @Nullable final String fileIOImpl,
+      @JsonProperty("fileIOProperties") @Nullable final Map<String, String> fileIOProperties,
+      @JsonProperty("fileFormat") @Nullable final String fileFormat,
+      final Configuration hadoopConf


[P1] Unbound Configuration breaks v2 split deserialization

IcebergFileTaskInputSource is registered as a Jackson subtype and returned from withSplit for v2 tables, so workers need to deserialize it. The @JsonCreator leaves hadoopConf as an unannotated constructor parameter, unlike the catalog classes that use @JacksonInject @HiveConf, so Druid's ObjectMapper has no property or injectable value to bind. This can make v2 delete splits fail before reading. Inject it, annotate it, or create a safe default, and add a Jackson round-trip test for this input source.
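
One way to read the "inject it ... or create a safe default" suggestion: keep only a serializable map of overrides on the split, and combine it with a base configuration supplied on the worker. The sketch below uses plain maps as stand-ins for Hadoop's Configuration; EffectiveConfSketch and effectiveConf are hypothetical names, and this is not the code from the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class EffectiveConfSketch {
  // baseConf stands in for the worker-side injected configuration (may be null).
  // overrides stands in for the serializable map carried by the split (may be null).
  static Map<String, String> effectiveConf(
      Map<String, String> baseConf,
      Map<String, String> overrides
  ) {
    Map<String, String> effective = new LinkedHashMap<>();
    if (baseConf != null) {
      effective.putAll(baseConf);
    }
    if (overrides != null) {
      effective.putAll(overrides); // overrides win over base values
    }
    return effective;
  }
}
```

Because only the overrides map travels with the split, the serialized form stays a plain JSON object, and a missing base configuration degrades to an empty default instead of a deserialization failure.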

FrankChen021 · 2026-05-29T13:18:05Z

+
+    // Step 3: Stream data file with delete application
+    requireParquet(dataFilePath);
+    final InputFile dataInputFile = fileIO.newInputFile(dataFilePath);


[P1] V2 path bypasses warehouseSource file access

When delete files are present, the new reader opens data and delete files through Iceberg FileIO instead of the configured warehouseSource/inputFormat. Existing Iceberg specs use warehouseSource for S3/GCS/local access settings, endpoints, and credentials, so a table that only adds delete files can silently switch file-access mechanisms and fail or read from the wrong filesystem. The v2 path should preserve or translate the warehouseSource-backed access configuration for native reads.

FrankChen021

I reviewed the incremental update and full changed-file set for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 17 of 17 changed files.

This is an automated review by Codex GPT-5.5

jtuglu1 · 2026-06-01T18:21:04Z

Related: #19534

FrankChen021

I reviewed the incremental update and full changed-file set for correctness, edge cases, lifecycle, API compatibility, and integration risks; no issues found.

Reviewed 17 of 17 changed files.

This is an automated review by Codex GPT-5.5

github-actions Bot added the Area - Dependencies label Apr 6, 2026

Shekharrajak changed the title ~~Iceberg V2 Delete File Support~~ Iceberg V2 delete file support in druid-iceberg-extensions Apr 6, 2026

github-advanced-security AI found potential problems Apr 6, 2026

Comment thread ...id-iceberg-extensions/src/test/java/org/apache/druid/iceberg/input/V2DeleteHandlingTest.java Fixed

github-actions Bot added the Area - Documentation label Apr 6, 2026

Shekharrajak changed the title ~~Iceberg V2 delete file support in druid-iceberg-extensions~~ feat: Iceberg V2 delete file support in druid-iceberg-extensions Apr 6, 2026

github-advanced-security AI found potential problems Apr 6, 2026

Comment thread ...ts/src/test/java/org/apache/druid/testing/embedded/iceberg/IcebergV2DeleteIngestionTest.java Fixed

Comment thread ...ts/src/test/java/org/apache/druid/testing/embedded/iceberg/IcebergV2DeleteIngestionTest.java Fixed

FrankChen021 reviewed Apr 25, 2026

Comment thread ...eberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergNativeRecordReader.java Outdated

Shekharrajak mentioned this pull request May 17, 2026

Align Iceberg dependency version across iceberg-extensions and embedded-tests modules #19469

Closed

Shekharrajak force-pushed the feature/iceberg-v2-delete-support branch from 8e510d3 to 309c97b Compare May 17, 2026 07:30

Shekharrajak force-pushed the feature/iceberg-v2-delete-support branch from 6c98a82 to bab8237 Compare May 18, 2026 11:53

FrankChen021 reviewed May 18, 2026

Comment thread ...eberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergNativeRecordReader.java

Shekharrajak mentioned this pull request May 18, 2026

Add ORC and Avro file format support to Druid's Iceberg input source #19472

Open

jtuglu1 self-requested a review May 19, 2026 08:33

Shekharrajak mentioned this pull request May 20, 2026

Flaky Test: ITLocalInputSourceAllInputFormatTest #19491

Closed

FrankChen021 reviewed May 21, 2026

Shekharrajak mentioned this pull request May 25, 2026

fix: OrcInputFormat concurrent FileSystem init race condition (#19491) #19497

Merged

Shekharrajak force-pushed the feature/iceberg-v2-delete-support branch from 8a22a95 to c757d50 Compare May 26, 2026 03:30

Shekharrajak mentioned this pull request May 26, 2026

Flaky Test: CostBasedAutoScalerIntegrationTest #19517

Open

FrankChen021 reviewed May 27, 2026

Shekharrajak added 8 commits May 27, 2026 23:04

add V2DeleteHandling enum for iceberg v2 delete file support (0285cac)
add iceberg-data dependency and promote iceberg-parquet to compile scope (3cd0903)
implement IcebergRecordConverter for iceberg Record to Druid map conversion (282eb5b)
implement streaming IcebergNativeRecordReader with v2 delete support (24d8178)
add FileScanResult and extractFileScanTasks to IcebergCatalog for v2 delete detection (aaf7527)
wire v2DeleteHandling into IcebergInputSource with native reader routing (6ae9a92)
add unit tests for IcebergRecordConverter and V2DeleteHandling (75a386a)
fix compilation: use public Iceberg APIs and correct Druid method signatures (bfc327b)

Shekharrajak added 13 commits May 27, 2026 23:04

Iceberg: extract delete-file mapping into testable helper with rejection coverage (6717a9e)
Iceberg: test partitioned v2 table delete scoping (2942e5f)
Iceberg: make v2 path splittable so MSQ and parallel batch ingest rows (80b2c8a)
Iceberg: make withSplit and getSplitHintSpecOrDefault self-loading (0d5cb9e)
Fix missing @Override on setup/tearDown in IcebergV2DeleteIngestionTest (dacbb05)
Call super.setup/tearDown in IcebergV2DeleteIngestionTest to start cluster (0b8775e)
Fix unclosed Iceberg writer before toDataFile/toDeleteFile calls (3af8a59)
Fix GenericParquetWriter::buildWriter -> ::create for Iceberg 1.10 (fa883f6)
Guard non-Parquet Iceberg formats with UnsupportedOperationException (apache#19472) (72d44bd)
Fix MSQ ClassCastException, null StructLike, and ORDER BY issues in Iceberg V2 delete tests (a56c177)
Fix positional delete writer: use withSpec instead of forTable to avoid row data requirement (bbcb23f)
Guard delete file formats independently and document Parquet-only limitation (apache#19472) (604e496)
Fix DeleteFileInfo constructor calls in V2DeleteHandlingTest after adding format field (386a37f)

Shekharrajak force-pushed the feature/iceberg-v2-delete-support branch from fe5c0c9 to 2cbaeb6 Compare May 27, 2026 17:43

fix(iceberg): promote iceberg-data to compile scope and drop stale unused-dep suppression (5ecc28c)

Shekharrajak force-pushed the feature/iceberg-v2-delete-support branch from 2cbaeb6 to 5ecc28c Compare May 27, 2026 17:44

fix(iceberg): plumb catalog Hadoop Configuration through to V2 native reader (5962d62)

FrankChen021 reviewed May 29, 2026

Shekharrajak added 5 commits May 31, 2026 22:24

revert: undo Configuration plumbing through IcebergFileTaskInputSource (5962d62) (d935f2a)
iceberg: expose getHadoopConfigOverrides on catalog SPI (db5ca12)
iceberg: carry hadoopConfigOverrides on FileScanResult and v2 input source (d8298f5)
iceberg: inject @HiveConf into v2 input source and build effective conf for reader (fe61dd5)
iceberg: add v2 input source serde and override-prop tests (477db74)

FrankChen021 reviewed Jun 1, 2026

iceberg: exclude transitive orc-core from iceberg-data to avoid ORC classpath collision with druid-orc-extensions (e9b3b82)

jtuglu1 requested a review from a2l007 June 1, 2026 18:21

Shekharrajak mentioned this pull request Jun 2, 2026

Flaky Test: KafkaBoundedSupervisorTest #19542

Open

FrankChen021 reviewed Jun 2, 2026

Shekharrajak mentioned this pull request Jun 2, 2026

fix: KafkaBoundedSupervisorTest flaky test #19543

Open
