-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kernel][Clustering #4] add ClusteringUtils #4322
base: master
Are you sure you want to change the base?
Conversation
6b652bb
to
7e9e1b6
Compare
@@ -18,6 +18,7 @@ | |||
import static java.lang.String.format; | |||
|
|||
import io.delta.kernel.exceptions.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To Reviewers,
This is a stack PR and please only review the latests commit which includes all the relevant code changes.
72dc84b
to
37dbd23
Compare
if (!minValues.containsKey(column) | ||
|| !maxValues.containsKey(column) | ||
|| !nullCounts.containsKey(column)) { | ||
throw DeltaErrors.missingColumnStatsForClustering(column, dataFileStatus); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a case where if all values are null, we don't have min and max?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@raveeram-db do we have any utility that can validate stats are valid? @KaiqiJinWow is asking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think we have any, I think we thought we'll leave it up to the engines to validate but we could perhaps refactor out basic checks like the one here into a helper?
Line 106 in 1517688
if (numNulls != null && rowCount == numNulls) { |
37dbd23
to
08de273
Compare
throw DeltaErrors.missingFileStatsForClustering(clusteringColumns, dataFileStatus); | ||
} | ||
|
||
DataFileStatistics dataFileStatistics = dataFileStatus.getStatistics().get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we throw if missing? dataFileStatus.getStatistics().orElseThrow(..)
Which Delta project/connector is this regarding?
Description
Split the main PR #4265 for faster review
This PR adds the ClusteringUtils which includes a few utils would could be used for clustering feature
How was this patch tested?
Unit tests, more tests would be added later in integration tests.
Does this PR introduce any user-facing changes?