feat: Iceberg V2 delete file support in druid-iceberg-extensions#19266
feat: Iceberg V2 delete file support in druid-iceberg-extensions#19266Shekharrajak wants to merge 51 commits into
Conversation
FrankChen021
left a comment
There was a problem hiding this comment.
Findings that could not be attached inline:
- extensions-contrib/druid-iceberg-extensions/src/main/java/org/apache/druid/iceberg/input/IcebergInputSource.java:164 - [P1] V2 tables with deletes produce zero splits. When any delete file is present, retrieveIcebergDatafiles() sets delegateInputSource to EmptyInputSource, so createSplits() returns an empty stream and estimateNumSplits() returns 0. MSQ and parallel ingestion slice SplittableInputSource inputs exclusively through createSplits()/withSplit(), so an Iceberg v2 table with deletes will schedule no readable input slices and ingest no rows instead of using the native reader path.
8e510d3 to
309c97b
Compare
updated : 309c97b#diff-1b9776e43fa17c32a610eee043a6fd6cbf32b29ae4026d24ea798ac9a6f638bcR168 |
|
Noted gaps in iceberg v2 spec support #19471 |
6c98a82 to
bab8237
Compare
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 1 |
| P3 | 0 |
| Total | 1 |
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 1 |
| P3 | 0 |
| Total | 1 |
Reviewed 15 of 15 changed files. The earlier FileIO metadata follow-up is addressed, so no inline reply is needed; the new finding is below.
This is an automated review by Codex GPT-5.5
|
Flaky test reported #19491 Please help in triggering the one failed CI check run. We can work on this flaky test separately. |
FrankChen021
left a comment
There was a problem hiding this comment.
I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.
Reviewed 15 of 15 changed files.
This is an automated review by Codex GPT-5.5
8a22a95 to
c757d50
Compare
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 1 |
| P2 | 1 |
| P3 | 0 |
| Total | 2 |
Reviewed 15 of 15 changed files.
This is an automated review by Codex GPT-5.5
| this.tableSchemaJson = tableSchemaJson; | ||
| this.warehouseSource = warehouseSource; | ||
| this.inputRowSchema = inputRowSchema; | ||
| this.hadoopConf = new Configuration(); |
There was a problem hiding this comment.
[P2] V2 reader drops configured warehouse/Hadoop access
The v2 path constructs a fresh Hadoop Configuration and then opens data and delete files through a recreated Iceberg FileIO, while the configured warehouseSource is only stored and never used. Existing HDFS/S3 warehouse ingestion relies on warehouseSource and injected Hadoop configuration, so v2 tables with deletes can plan successfully on the controller but fail on workers when those files require the Druid warehouse source or cluster Hadoop credentials.
…ceberg V2 delete tests
…id row data requirement
…ding format field
fe5c0c9 to
2cbaeb6
Compare
…used-dep suppression
2cbaeb6 to
5ecc28c
Compare
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 2 |
| P2 | 0 |
| P3 | 0 |
| Total | 2 |
The original iceberg-data dependency follow-up is handled; I found two remaining v2 read-path issues.
Reviewed 16 of 16 changed files.
This is an automated review by Codex GPT-5.5
| @JsonProperty("fileIOImpl") @Nullable final String fileIOImpl, | ||
| @JsonProperty("fileIOProperties") @Nullable final Map<String, String> fileIOProperties, | ||
| @JsonProperty("fileFormat") @Nullable final String fileFormat, | ||
| final Configuration hadoopConf |
There was a problem hiding this comment.
[P1] Unbound Configuration breaks v2 split deserialization
IcebergFileTaskInputSource is registered as a Jackson subtype and returned from withSplit for v2 tables, so workers need to deserialize it. The @JsonCreator leaves hadoopConf as an unannotated constructor parameter, unlike the catalog classes that use @JacksonInject @hiveconf, so Druid's ObjectMapper has no property or injectable value to bind. This can make v2 delete splits fail before reading. Inject it, annotate it, or create a safe default, and add a Jackson round-trip test for this input source.
|
|
||
| // Step 3: Stream data file with delete application | ||
| requireParquet(dataFilePath); | ||
| final InputFile dataInputFile = fileIO.newInputFile(dataFilePath); |
There was a problem hiding this comment.
[P1] V2 path bypasses warehouseSource file access
When delete files are present, the new reader opens data and delete files through Iceberg FileIO instead of the configured warehouseSource/inputFormat. Existing Iceberg specs use warehouseSource for S3/GCS/local access settings, endpoints, and credentials, so a table that only adds delete files can silently switch file-access mechanisms and fail or read from the wrong filesystem. The v2 path should preserve or translate the warehouseSource-backed access configuration for native reads.
FrankChen021
left a comment
There was a problem hiding this comment.
I reviewed the incremental update and full changed-file set for correctness, edge cases, concurrency, and integration risks; no issues found.
Reviewed 17 of 17 changed files.
This is an automated review by Codex GPT-5.5
…lasspath collision with druid-orc-extensions
|
Related: #19534 |
FrankChen021
left a comment
There was a problem hiding this comment.
I reviewed the incremental update and full changed-file set for correctness, edge cases, lifecycle, API compatibility, and integration risks; no issues found.
Reviewed 17 of 17 changed files.
This is an automated review by Codex GPT-5.5
Ref #19190
Description
In IcebergCatalog.extractSnapshotDataFiles(), line:
discards task.deletes() entirely. Every FileScanTask from tableScan.planFiles() carries a List that must
be applied for correct v2 reads. The current code passes only raw file paths to warehouseSource, and Druid's
ParquetReader has zero awareness of Iceberg delete files.
Changes
DeleteFileInfo.java Serializable POJO: path, contentType (POSITION/EQUALITY), equalityFieldIds
IcebergFileTaskInputSource.java Per-task InputSource carrying data file + delete metadata + schema JSON +
warehouseSource
IcebergNativeRecordReader.java Manual positional + equality delete application with streaming reads via
Parquet.read()
IcebergRecordConverter.java Iceberg Record to Map with full type coverage
Release note
Iceberg V2 Delete File Support: When FileScanTask.deletes() returns non-empty, the extension creates per-task IcebergFileTaskInputSource objects
carrying serializable metadata (data file path, delete file paths/types/equality field IDs, schema JSON). Workers
apply deletes at read time via IcebergNativeRecordReader which reads position-delete and equality-delete Parquet
files, builds filter sets, and streams the data file while skipping deleted rows. V1 tables (no delete files) continue
to use the existing warehouseSource path unchanged.
Key changed/added classes in this PR
DeleteFileInfo.javaIcebergFileTaskInputSource.javaIcebergNativeRecordReader.javaIcebergRecordConverter.javaIcebergInputSource.javaIcebergCatalog.javaThis PR has: