[core][spark] Add statistics sidecar for NDV sketches #8382
Closed
kerwin-zk wants to merge 1 commit into
04bf948 to dc9db43
JingsongLi (Contributor) reviewed on Jun 30, 2026:
Can this PR be fully implemented? We will first determine whether this feature is overall reasonable through a complete implementation.
dc9db43 to cdc91da
Contributor
Author
Thanks for the suggestion and the review. I have updated this PR from a metadata-only proposal to a complete core implementation. The PR now includes:
I kept Spark
cdc91da to 3d4abb2
Contributor
@kerwin-zk I think you can merge ALL changes in this single PR. If we reach a consensus, you can then break it down into smaller PRs to continue the progress.
3d4abb2 to dd12c96
dd12c96 to 1e108e2
Purpose
For #8381.
This PR implements a core + Spark statistics sidecar vertical slice for mergeable NDV sketches.
It adds:
- `StatisticsBlobMetadata` for typed blob references, field ids, optional snapshot / sequence numbers, properties, sidecar file location, offset, and length.
- `StatisticsBlob` and `StatisticsSidecarFile` to write multiple statistics blob payloads into one sidecar file and read them back by metadata.
- `Statistics.blobMetadata` JSON serialization with backward-compatible reads when the field is absent.
- `StatsFileHandler` sidecar read/write APIs and sidecar cleanup when deleting statistics files.
- `FileStorePathFactory.statsSidecarFileFactory()` and the `stat-sidecar-` prefix, so binary sidecars are distinguishable from JSON statistics files under the statistics directory.
- `StatisticsNdvSketch` for Paimon-specific `paimon-ndv-theta-sketch-v1` blobs. The payload is an Apache DataSketches compact Theta sketch, while the container is Paimon's own statistics sidecar format.
- `ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS` sidecar production behind `spark.paimon.analyze.ndv-sketch.enabled` (default: `false`). When enabled, Spark writes one NDV sketch blob per analyzed column and attaches the sidecar metadata to the committed `Statistics` file.

Existing scalar column statistics remain unchanged; the sidecar metadata is additional optional statistics metadata.
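The sidecar idea above (several blob payloads in one file, each located later through its recorded offset and length) can be sketched roughly as follows. This is an illustrative Python sketch only, not the PR's Java implementation: `BlobMeta`, `write_sidecar`, and `read_blob` are hypothetical names, the payloads are placeholder bytes rather than real Theta sketches, and the actual sidecar layout in the PR may include headers or other fields not modeled here.

```python
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class BlobMeta:
    """Hypothetical stand-in for per-blob metadata (type, field id, location)."""
    blob_type: str
    field_id: int
    offset: int
    length: int


def write_sidecar(out, blobs):
    """Append each (blob_type, field_id, payload) to `out`.

    Returns one BlobMeta per blob, recording where its payload starts
    and how many bytes it occupies.
    """
    metas = []
    for blob_type, field_id, payload in blobs:
        offset = out.tell()  # position before this payload is written
        out.write(payload)
        metas.append(BlobMeta(blob_type, field_id, offset, len(payload)))
    return metas


def read_blob(src, meta):
    """Read exactly one payload back using only its metadata."""
    src.seek(meta.offset)
    payload = src.read(meta.length)
    if len(payload) != meta.length:
        raise EOFError("sidecar is shorter than the metadata claims")
    return payload


# Blob type name taken from the PR description; payloads are placeholders.
NDV_TYPE = "paimon-ndv-theta-sketch-v1"

sidecar = io.BytesIO()  # in-memory stand-in for the sidecar file
metas = write_sidecar(sidecar, [
    (NDV_TYPE, 0, b"sketch-bytes-for-field-0"),
    (NDV_TYPE, 1, b"sketch-bytes-for-field-1-longer"),
])

# One analyzed column's blob can be fetched without reading the others.
second = read_blob(sidecar, metas[1])
```

Keeping offset and length in the metadata, rather than in the binary file, means a reader that only understands the JSON statistics file can ignore the sidecar entirely, which is consistent with the description's point that the sidecar metadata is optional.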
Tests
CI