[kernel-spark] Add getFileChanges() to support Kernel-based DSv2 streaming (Part I) #5313
Conversation
Hi @huan233usc @gengliangwang @jerrypeng, could you please help review this PR?
 *
 * <p>Indexed: refers to the index in DeltaSourceOffset, assigned by the streaming engine.
 */
public class KernelIndexedFile {
Nit: just call it IndexedFile? Kernel is just an impl detail.
+1
Done.
 * from the first row (rowId=0).
 */
public static long getVersion(ColumnarBatch batch) {
  assert batch.getSize() > 0;
let's follow https://github.com/delta-io/delta/blob/master/kernel/kernel-api/src/main/java/io/delta/kernel/internal/actions/RowBackedAction.java#L46 and create a new helper function here
protected int getFieldIndex(String fieldName) {
int index = row.getSchema().indexOf(fieldName);
checkArgument(index >= 0, "Field '%s' not found in schema: %s", fieldName, row.getSchema());
return index;
}
Done. Thank you!
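As a reference, a minimal self-contained sketch (not the exact PR code) of how StreamingHelper.getVersion could adopt that helper pattern; the "version" column name and the class name below are assumptions for illustration:

import io.delta.kernel.data.ColumnarBatch;

// Sketch only: resolve the field by name and fail with a descriptive message instead of
// relying on a bare assert, mirroring RowBackedAction#getFieldIndex.
public final class StreamingHelperSketch {
  private StreamingHelperSketch() {}

  /** Reads the commit version from the first row of a non-empty batch. */
  public static long getVersion(ColumnarBatch batch) {
    if (batch.getSize() <= 0) {
      throw new IllegalArgumentException("Expected a non-empty batch");
    }
    return batch.getColumnVector(getFieldIndex(batch, "version")).getLong(0);
  }

  /** Resolves a top-level field index by name, throwing if the field is missing. */
  private static int getFieldIndex(ColumnarBatch batch, String fieldName) {
    int index = batch.getSchema().indexOf(fieldName);
    if (index < 0) {
      throw new IllegalArgumentException(
          String.format("Field '%s' not found in schema: %s", fieldName, batch.getSchema()));
    }
    return index;
  }
}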
int id = i * 100 + j;
insertValues.append(String.format("(%d, 'User%d')", id, id));
}
spark.sql(String.format("INSERT INTO %s VALUES %s", testTableName, insertValues.toString()));
I wonder if we can follow https://github.com/delta-io/delta/blob/master/kernel-spark/src/test/java/io/delta/kernel/spark/read/SparkGoldenTableTest.java#L579 and test many of the golden tables.
I think it might be overkill at this point, especially when there are so many unsupported table types. I agree we should eventually add a test like this. Added a TODO.
  continue;
}
long version = StreamingHelper.getVersion(batch);
validateCommit(batch, version, endOffset);
Should validation happen after processing the previous version?
Done, thanks!
}
CommitRange commitRange = builder.build(engine);
// Required by kernel: perform protocol validation by creating a snapshot at startVersion.
Snapshot startSnapshot =
Why do you need to get a snapshot even if we start reading from a specific delta log version?
It's required by the kernel to fetch actions:
* @param startSnapshot the snapshot for startVersion, required to ensure the table is readable by …
Snapshot startSnapshot =
    TableManager.loadSnapshot(tablePath).atVersion(startVersion).build(engine);
// TODO(M1): This is not working with ccv2 table
Set<DeltaAction> actionSet = new HashSet<>(Arrays.asList(DeltaAction.ADD, DeltaAction.REMOVE));
Ideally this would be a static class variable so it is only allocated once per query run.
Why do we also need to get the "REMOVE" actions?
Done.
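For illustration, a minimal sketch of that change (the constant name is an assumption; DeltaAction is the enum used in the diff above):

import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch only: hoist the action set into a class-level constant so it is allocated once
// per query run rather than on every getFileChanges() call.
private static final Set<DeltaAction> ACTION_SET =
    Collections.unmodifiableSet(
        new HashSet<>(Arrays.asList(DeltaAction.ADD, DeltaAction.REMOVE)));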
See validateCommit() -- the current behavior of the delta connector is that we fail the pipeline if any commit contains a REMOVE (unless skipDeletes or skipChangeCommits are specified). Streaming jobs are meant to process append-only data.
}
long version = StreamingHelper.getVersion(batch);
// TODO(M1): migrate to kernel's commit-level iterator (WIP).
// The current one-pass algorithm assumes REMOVE actions precede ADD actions
Where are you filtering out the "REMOVE" actions?
We throw an error whenever we encounter a REMOVE -- because ETL jobs should process append-only data. We fail explicitly to avoid correctness issues.
In M2, we'll also support ignoreChangedCommits and ignoreDeletes to skip these commits silently.
To properly handle update actions, users would need to use CDF.
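A hedged sketch of that fail-fast check, assuming the batch exposes a top-level "remove" column (the class name, method name, column name, and messages are illustrative, not the PR's exact code):

import io.delta.kernel.data.ColumnVector;
import io.delta.kernel.data.ColumnarBatch;

// Sketch only: abort the stream when a commit carries a REMOVE action, since this
// streaming source only supports append-only tables; M2 is expected to add options
// for silently skipping such commits.
public final class RemoveActionCheck {
  private RemoveActionCheck() {}

  public static void failIfCommitHasRemoves(ColumnarBatch batch, long version) {
    int removeIdx = batch.getSchema().indexOf("remove"); // column name is an assumption
    if (removeIdx < 0) {
      return; // the scan did not select the remove column
    }
    ColumnVector removeVector = batch.getColumnVector(removeIdx);
    for (int rowId = 0; rowId < batch.getSize(); rowId++) {
      if (!removeVector.isNullAt(rowId)) {
        throw new UnsupportedOperationException(
            "Commit version " + version + " contains a REMOVE action; deletes and updates"
                + " are not supported by this source (use CDF for change data).");
      }
    }
  }
}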
.dependsOn(kernelApi)
.dependsOn(kernelDefaults)
.dependsOn(spark % "test->test")
.dependsOn(spark % "compile->compile;test->test")
is this necessary?
Without this, compilation would fail because the program cannot find org.apache.spark.sql.delta.DeltaErrors and org.apache.spark.sql.delta.sources.DeltaSourceOffset.
The dependency between V2 & V1 is still under discussion. However, I don't want to block your development because of this. We can still create a new copy of DeltaErrors and DeltaSourceOffset if we decide not to have code reuse.
This comment is generated by AI. Yes, this change is necessary. The main source code in …
These are production dependencies used in the implementation, so we need …
}

Row addFileRow = StructRow.fromStructVector(addVector, rowId);
if (addFileRow == null) {
Should we throw here, given that addVector.isNullAt(rowId) is false?
We call this method even on REMOVE rows in extractIndexedFilesFromBatch.
if this is a remove row, iiuc the method will return on L175? Did I miss something?
Ah yes. You are right. Done.
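For clarity, a small fragment sketching that resolution (identifiers taken from the diff above): since REMOVE rows already returned earlier, a null add struct here indicates a malformed batch and should fail loudly rather than be skipped.

// Sketch only: addVector.isNullAt(rowId) is known to be false for this row, so a null
// result would indicate a malformed batch; fail fast instead of silently skipping.
Row addFileRow = StructRow.fromStructVector(addVector, rowId);
if (addFileRow == null) {
  throw new IllegalStateException(
      "Unexpected null 'add' struct at row " + rowId + " despite a non-null add vector entry");
}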
// A version can be split across multiple batches.
long currentVersion = -1;
long currentIndex = 0;
List<IndexedFile> currentVersionFiles = new ArrayList<>();
It's more performant to use a linked list if we don't actually know the size the list will be.
I don't think so. The per-node overhead of a linked list outweighs the cost of resizing an ArrayList, especially for my use case (addAll(), add(), clear()). We would maybe get a performance benefit if we did a lot of deletes and inserts at the beginning or in the middle (which we do not).
// TODO(#5319): check trackingMetadataChange flag and compare with stream metadata.

result.addAll(dataFiles);
This is not very efficient. "dataFiles" should just be a linked list and you can append and prepend in constant time.
ditto -- I don't think linked lists would help here.
// The current one-pass algorithm assumes REMOVE actions precede ADD actions
// in a commit; we should implement a proper two-pass approach once kernel API is ready.

if (currentVersion != -1 && version != currentVersion) {
This logic here is kind of confusing. All you are trying to do is sandwich the indexed files between the BASE_INDEX sentinel file and the END_INDEX sentinel file, right? Why not simplify the logic to be:
allIndexedFiles.add(beginSentinelFile)
allIndexedFiles.addAll(allIndexFilesInBatch)
allIndexedFiles.add(endSentinelFile)
We only insert sentinels before and after a version. The code is complex because the kernel breaks up a commit into batches (ColumnarBatch) to avoid overwhelming memory. I reorganized the code a bit to make this clear. Could you take another look?
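To make the batching subtlety concrete, here is a self-contained toy sketch of the sentinel sandwiching described above, using plain strings instead of Kernel types (all names are illustrative, not the PR's API):

import java.util.ArrayList;
import java.util.List;

// A commit version may be split across several batches, so files are buffered per version
// and only wrapped in BASE/END sentinels once the version changes or the input ends.
public class SentinelSandwichSketch {
  public static void main(String[] args) {
    long[] versions = {1, 1, 1, 2, 2, 3}; // e.g. version 1 spans two "batches"
    String[] files = {"a", "b", "c", "d", "e", "f"};

    List<String> result = new ArrayList<>();
    List<String> currentVersionFiles = new ArrayList<>();
    long currentVersion = -1;

    for (int i = 0; i < versions.length; i++) {
      if (currentVersion != -1 && versions[i] != currentVersion) {
        flush(result, currentVersion, currentVersionFiles);
      }
      currentVersion = versions[i];
      currentVersionFiles.add(files[i]);
    }
    if (currentVersion != -1) {
      flush(result, currentVersion, currentVersionFiles);
    }
    // Prints: [v1:BASE, a, b, c, v1:END, v2:BASE, d, e, v2:END, v3:BASE, f, v3:END]
    System.out.println(result);
  }

  private static void flush(List<String> result, long version, List<String> current) {
    result.add("v" + version + ":BASE"); // begin sentinel, emitted once per version
    result.addAll(current);
    result.add("v" + version + ":END"); // end sentinel, emitted once per version
    current.clear();
  }
}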
 */
private List<IndexedFile> extractIndexedFilesFromBatch(
    ColumnarBatch batch, long version, long startIndex) {
  List<IndexedFile> indexedFiles = new ArrayList<>();
Use LinkedList.
Same rationale as above -- we are only doing addAll() and add(); ArrayList would be faster and more memory-efficient.
Arguments.of(
    0L, BASE_INDEX, isInitialSnapshot, Optional.of(2L), Optional.of(5L), "v0 to v2 id:5"),
Arguments.of(
    1L, 5L, isInitialSnapshot, Optional.of(3L), Optional.of(10L), "v1 id:5 to v3 id:10"),
What about to and from END_INDEX?
Done. Thanks!
 */
@ParameterizedTest
@MethodSource("getFileChangesParameters")
public void testGetFileChanges(
Should we also test with other types of actions in the delta log?
We do test REMOVEs & ADDs; I added METADATA too (which will now yield empty commits).
"Index mismatch at index %d: dsv1=%d, dsv2=%d", | ||
i, deltaFile.index(), kernelFile.getIndex())); | ||
|
||
String deltaPath = deltaFile.add() != null ? deltaFile.add().path() : null; |
nit: document that deltaFile.add() could be null when it is the starting/ending index
Done.
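For context, a hedged fragment of what the documented comparison might look like (the kernel-side getter names here are assumptions, not the test's actual API):

// Sketch only: deltaFile.add() is legitimately null for the BASE_INDEX/END_INDEX sentinel
// entries, so both sides are compared as nullable paths rather than dereferenced directly.
String deltaPath = deltaFile.add() != null ? deltaFile.add().path() : null;
String kernelPath =
    kernelFile.getAddFile() != null ? kernelFile.getAddFile().getPath() : null; // assumed getters
Assertions.assertEquals(
    deltaPath, kernelPath, String.format("Path mismatch at index %d", i));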
I think it is good as a starting point.
Which Delta project/connector is this regarding?

Description

This PR is Part I of implementing SparkMicroBatchStream.getFileChanges() to support Kernel-based DSv2 Delta streaming (M1 milestone). Follow-ups include schema evolution support and initial snapshot support (marked TODO(M1) in code).

How was this patch tested?

Parameterized tests verifying parity between DSv1 (DeltaSource) and DSv2 (SparkMicroBatchStream).

Does this PR introduce any user-facing changes?

No