[Kernel][Clustering #4] add ClusteringUtils #4322

KaiqiJinWow · 2025-03-26T04:11:39Z

Which Delta project/connector is this regarding?

Description

Split out of the main PR #4265 for faster review.

This PR adds ClusteringUtils, a set of helper methods for the clustering feature:

getClusteringDomainMetadata: Generate the domain metadata for the clustering columns.
getClusteringColumnsOptional: Extract ClusteringColumns from a given snapshot.
validateDataFileStatus: Validate the per-file statistics and per-column statistics for the clustering columns.
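
As an illustration of what generating the clustering domain metadata could involve, here is a minimal sketch. It assumes the domain is named `delta.clustering` and that its configuration is a JSON object holding a list of column paths; the class and method names (`ClusteringDomainSketch`, `toConfigJson`) are hypothetical and are not taken from this PR.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch; not the ClusteringUtils code added in this PR. */
final class ClusteringDomainSketch {
  // Assumed domain name for clustering metadata.
  static final String DOMAIN_NAME = "delta.clustering";

  private ClusteringDomainSketch() {}

  /**
   * Serializes clustering columns, each given as a path of name parts,
   * into an assumed JSON layout: {"clusteringColumns":[["a"],["b","c"]]}.
   */
  static String toConfigJson(List<List<String>> clusteringColumns) {
    String columns =
        clusteringColumns.stream()
            .map(
                path ->
                    path.stream()
                        .map(ClusteringDomainSketch::quote)
                        .collect(Collectors.joining(",", "[", "]")))
            .collect(Collectors.joining(",", "[", "]"));
    return "{\"clusteringColumns\":" + columns + "}";
  }

  // Minimal JSON string quoting: escapes backslashes and double quotes only.
  private static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }
}
```

A nested column is passed as a multi-part path, for example `Arrays.asList("address", "city")`, so that a dot inside a column name is not confused with nesting.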

How was this patch tested?

Unit tests. More coverage will be added later in integration tests.

Does this PR introduce any user-facing changes?

KaiqiJinWow · 2025-03-26T21:56:33Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/DeltaErrors.java

@@ -18,6 +18,7 @@
 import static java.lang.String.format;

 import io.delta.kernel.exceptions.*;


To Reviewers,

This is a stacked PR; please review only the latest commit, which includes all the relevant code changes.

vkorukanti · 2025-03-27T21:58:44Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/clustering/ClusteringUtils.java

+      if (!minValues.containsKey(column)
+          || !maxValues.containsKey(column)
+          || !nullCounts.containsKey(column)) {
+        throw DeltaErrors.missingColumnStatsForClustering(column, dataFileStatus);


Is there a case where, if all values are null, we don't have min and max?

@raveeram-db do we have any utility that can validate stats are valid? @KaiqiJinWow is asking.

I don't think we have any. I think the plan was to leave it up to the engines to validate, but we could perhaps refactor basic checks like the one here into a helper?

delta/kernel/kernel-defaults/src/main/java/io/delta/kernel/defaults/internal/parquet/ParquetStatsReader.java

Line 106 in 1517688

if (numNulls != null && rowCount == numNulls) {
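
To make the all-null question concrete, here is a minimal sketch of a per-column check that requires a null count for every clustering column but tolerates missing min/max when every row in the file is null for that column. The class `ClusteringStatsSketch`, its method, and the plain-map representation of statistics are assumptions for illustration; this is not the Kernel `DataFileStatistics` API or the check merged in this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a stats check that allows all-null columns. */
final class ClusteringStatsSketch {
  private ClusteringStatsSketch() {}

  /**
   * Returns the first clustering column whose statistics are incomplete,
   * or Optional.empty() if every clustering column passes.
   */
  static Optional<String> firstColumnMissingStats(
      List<String> clusteringColumns,
      long numRecords,
      Map<String, Object> minValues,
      Map<String, Object> maxValues,
      Map<String, Long> nullCounts) {
    for (String column : clusteringColumns) {
      Long nullCount = nullCounts.get(column);
      if (nullCount == null) {
        // The null count is always required in this sketch.
        return Optional.of(column);
      }
      // Assumption: a column with only nulls has no meaningful min or max.
      boolean allNull = nullCount == numRecords;
      if (allNull) {
        continue;
      }
      if (!minValues.containsKey(column) || !maxValues.containsKey(column)) {
        return Optional.of(column);
      }
    }
    return Optional.empty();
  }
}
```

A caller could throw an error such as `DeltaErrors.missingColumnStatsForClustering` when the returned Optional is present. Whether an all-null column should be accepted at all is the open question in this thread.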

raveeram-db · 2025-03-31T23:36:08Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/clustering/ClusteringUtils.java

+      throw DeltaErrors.missingFileStatsForClustering(clusteringColumns, dataFileStatus);
+    }
+
+    DataFileStatistics dataFileStatistics = dataFileStatus.getStatistics().get();


Should we throw if missing? dataFileStatus.getStatistics().orElseThrow(..)
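
The suggestion above folds the presence check and the unwrap into one call. A minimal sketch of that pattern with `java.util.Optional` follows; `RequireStatsSketch`, `requireStatistics`, and the use of `IllegalStateException` are hypothetical stand-ins, whereas the PR itself throws via `DeltaErrors.missingFileStatsForClustering`.

```java
import java.util.Optional;

/** Hypothetical sketch of unwrapping optional statistics or failing fast. */
final class RequireStatsSketch {
  private RequireStatsSketch() {}

  /**
   * Returns the statistics if present; otherwise throws with the file path
   * in the message. The exception type is a placeholder for a Delta error.
   */
  static <T> T requireStatistics(Optional<T> statistics, String filePath) {
    return statistics.orElseThrow(
        () ->
            new IllegalStateException(
                "Missing file statistics for clustering: " + filePath));
  }
}
```

Compared with an `isPresent()` check followed by `get()`, this keeps the failure path next to the unwrap, so the two cannot drift apart.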

KaiqiJinWow force-pushed the stack/add_cluster_utils branch 2 times, most recently from 6b652bb to 7e9e1b6 Compare March 26, 2025 17:13

huan233usc requested review from vkorukanti, huan233usc, scottsand-db, allisonport-db and raveeram-db March 26, 2025 21:58

KaiqiJinWow force-pushed the stack/add_cluster_utils branch 2 times, most recently from 72dc84b to 37dbd23 Compare March 27, 2025 01:56

vkorukanti requested changes Mar 27, 2025

KaiqiJinWow added 3 commits March 31, 2025 10:44

add metadata domain

3c75942

update

c248167

add ClusteringUtils

08de273

KaiqiJinWow force-pushed the stack/add_cluster_utils branch from 37dbd23 to 08de273 Compare March 31, 2025 23:24

raveeram-db reviewed Mar 31, 2025
