Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel][Clustering #4] add ClusteringUtils #4322

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

KaiqiJinWow
Copy link
Contributor

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Split the main PR #4265 for faster review

This PR adds the ClusteringUtils which includes a few utils would could be used for clustering feature

  • getClusteringDomainMetadata: Generate the domain metadata for the clustering columns.
  • getClusteringColumnsOptional: Extract ClusteringColumns from a given snapshot.
  • validateDataFileStatus: validate the per-file statistics and per-column statistics for clustering columns

How was this patch tested?

Unit tests, more tests would be added later in integration tests.

Does this PR introduce any user-facing changes?

Sorry, something went wrong.

@KaiqiJinWow KaiqiJinWow force-pushed the stack/add_cluster_utils branch 2 times, most recently from 6b652bb to 7e9e1b6 Compare March 26, 2025 17:13
@@ -18,6 +18,7 @@
import static java.lang.String.format;

import io.delta.kernel.exceptions.*;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To Reviewers,

This is a stack PR and please only review the latests commit which includes all the relevant code changes.

@KaiqiJinWow KaiqiJinWow force-pushed the stack/add_cluster_utils branch 2 times, most recently from 72dc84b to 37dbd23 Compare March 27, 2025 01:56
Comment on lines +82 to +85
if (!minValues.containsKey(column)
|| !maxValues.containsKey(column)
|| !nullCounts.containsKey(column)) {
throw DeltaErrors.missingColumnStatsForClustering(column, dataFileStatus);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a case where if all values are null, we don't have min and max?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raveeram-db do we have any utility that can validate stats are valid? @KaiqiJinWow is asking.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think we have any, I think we thought we'll leave it up to the engines to validate but we could perhaps refactor out basic checks like the one here into a helper?

@KaiqiJinWow KaiqiJinWow force-pushed the stack/add_cluster_utils branch from 37dbd23 to 08de273 Compare March 31, 2025 23:24
throw DeltaErrors.missingFileStatsForClustering(clusteringColumns, dataFileStatus);
}

DataFileStatistics dataFileStatistics = dataFileStatus.getStatistics().get();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw if missing? dataFileStatus.getStatistics().orElseThrow(..)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants