[core][spark] Add statistics sidecar for NDV sketches#8382

Closed
kerwin-zk wants to merge 1 commit into apache:master from kerwin-zk:feature/statistics-blob-metadata

@kerwin-zk kerwin-zk commented Jun 29, 2026

Contributor

Purpose

For #8381.

This PR implements a core + Spark statistics sidecar vertical slice for mergeable NDV sketches.

It adds:

  • StatisticsBlobMetadata for typed blob references, field ids, optional snapshot / sequence numbers, properties, sidecar file location, offset, and length.
  • StatisticsBlob and StatisticsSidecarFile to write multiple statistics blob payloads into one sidecar file and read them back by metadata.
  • Statistics.blobMetadata JSON serialization with backward-compatible reads when the field is absent.
  • StatsFileHandler sidecar read/write APIs and sidecar cleanup when deleting statistics files.
  • Orphan file cleanup protection for sidecar files referenced by live statistics files.
  • FileStorePathFactory.statsSidecarFileFactory() and the stat-sidecar- prefix, so binary sidecars are distinguishable from JSON statistics files under the statistics directory.
  • StatisticsNdvSketch for Paimon-specific paimon-ndv-theta-sketch-v1 blobs. The payload is an Apache DataSketches compact Theta sketch, while the container is Paimon's own statistics sidecar format.
  • Spark ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS sidecar production behind spark.paimon.analyze.ndv-sketch.enabled (default: false). When enabled, Spark writes one NDV sketch blob per analyzed column and attaches the sidecar metadata to the committed Statistics file.
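
To make the sidecar idea in the list above concrete, here is a minimal sketch of packing several blob payloads into one byte container and reading each back from its recorded offset and length. The names `SidecarSketch` and `BlobRef` are hypothetical stand-ins, not the PR's `StatisticsSidecarFile` or `StatisticsBlobMetadata`, and the layout (payloads simply concatenated) is an assumption made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical stand-in for a sidecar container; not Paimon's actual classes or file format.
final class SidecarSketch {

    // Hypothetical blob reference: type, field id, and position inside the sidecar.
    static final class BlobRef {
        final String type;
        final int fieldId;
        final long offset;
        final long length;

        BlobRef(String type, int fieldId, long offset, long length) {
            this.type = type;
            this.fieldId = fieldId;
            this.offset = offset;
            this.length = length;
        }
    }

    private final ByteArrayOutputStream out = new ByteArrayOutputStream();

    // Appends one payload and returns metadata describing where it landed.
    BlobRef append(String type, int fieldId, byte[] payload) {
        long offset = out.size();
        out.write(payload, 0, payload.length);
        return new BlobRef(type, fieldId, offset, payload.length);
    }

    // Returns the bytes that would be written to the sidecar file.
    byte[] toBytes() {
        return out.toByteArray();
    }

    // Reads one payload back using only its metadata (no bounds validation here).
    static byte[] read(byte[] sidecar, BlobRef ref) {
        int from = Math.toIntExact(ref.offset);
        int to = Math.toIntExact(ref.offset + ref.length);
        return Arrays.copyOfRange(sidecar, from, to);
    }
}
```

In this sketch a writer would call `append` once per analyzed column and store the returned references next to the JSON statistics, so a reader can fetch a single column's blob without scanning the whole file.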

Existing scalar column statistics remain unchanged; the sidecar metadata is additional optional statistics metadata.
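
For readers unfamiliar with what "mergeable" means for the NDV sketches described above, the following is a small illustration using a k-minimum-values (KMV) sketch, a simplified relative of the Theta sketch. It is not Apache DataSketches and not code from this PR; the class `KmvSketch`, its hash function, and the choice of `long` items are all assumptions made to keep the example free of dependencies.

```java
import java.util.TreeSet;

// Illustrative KMV (k-minimum-values) distinct-count sketch. A simplified relative
// of a Theta sketch; not Apache DataSketches and not code from this PR.
final class KmvSketch {
    private static final long GAMMA = 0x9E3779B97F4A7C15L;

    final int k;
    // The k smallest distinct hashes seen so far, ordered as unsigned 64-bit values.
    final TreeSet<Long> mins = new TreeSet<Long>(Long::compareUnsigned);

    KmvSketch(int k) {
        if (k < 2) {
            throw new IllegalArgumentException("k must be at least 2");
        }
        this.k = k;
    }

    // SplitMix64-style mixer: bijective on 64 bits, so distinct inputs never collide.
    static long hash(long x) {
        long z = x + GAMMA;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void update(long item) {
        offer(hash(item));
    }

    // Merging keeps the k smallest hashes of the union, the same state as sketching the union directly.
    void merge(KmvSketch other) {
        if (other.k != k) {
            throw new IllegalArgumentException("sketches must share k");
        }
        for (long h : other.mins) {
            offer(h);
        }
    }

    private void offer(long h) {
        if (mins.size() < k) {
            mins.add(h);
        } else if (Long.compareUnsigned(h, mins.last()) < 0 && mins.add(h)) {
            mins.pollLast(); // evict the largest retained hash to stay at k entries
        }
    }

    // Exact below k distinct values; otherwise (k - 1) / (k-th smallest hash mapped into (0, 1]).
    double estimate() {
        if (mins.size() < k) {
            return mins.size();
        }
        double kth = ((mins.last() >>> 11) + 1) * 0x1.0p-53;
        return (k - 1) / kth;
    }
}
```

The property that matters for statistics files is that two sketches built independently, for example over different partitions or snapshots, can be combined without rescanning the data, and the merged sketch has the same state as one built over all the rows at once.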

Tests

CI

@kerwin-zk kerwin-zk force-pushed the feature/statistics-blob-metadata branch from 04bf948 to dc9db43 Compare June 29, 2026 09:37
@kerwin-zk kerwin-zk marked this pull request as ready for review June 29, 2026 09:43

@JingsongLi JingsongLi left a comment

Contributor

Could this PR include the full implementation? A complete implementation would let us judge first whether the feature is reasonable overall.

@kerwin-zk kerwin-zk force-pushed the feature/statistics-blob-metadata branch from dc9db43 to cdc91da Compare June 30, 2026 10:01
@kerwin-zk kerwin-zk changed the title from "[core] Add statistics blob metadata" to "[core] Add statistics sidecar for NDV sketches" Jun 30, 2026
kerwin-zk commented Jun 30, 2026

Contributor Author

Thanks for the suggestion and the review.

I have updated this PR from a metadata-only proposal to a complete core implementation. The PR now includes:

  • statistics sidecar read / write support for multiple typed binary blob payloads;
  • Statistics.blobMetadata JSON lifecycle with backward compatibility;
  • StatsFileHandler sidecar APIs, sidecar cleanup on statistics deletion, and orphan-file cleanup protection for sidecars still referenced by live statistics files;
  • mergeable NDV blob support based on Apache DataSketches Theta sketch;
  • a Paimon-specific NDV blob type, paimon-ndv-theta-sketch-v1, since this PR uses Paimon's own sidecar container rather than an Iceberg Puffin file;
  • a separate stat-sidecar- file prefix via statsSidecarFileFactory(), so binary sidecars do not share the JSON statistics file prefix;
  • validation for sidecar writes and reads, including rejecting empty sidecar writes and checking offset + length against the actual sidecar file length before reading;
  • documentation that the ndv metadata property is a cached estimate.
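
The validation item in the list above (checking offset + length against the actual file length, and rejecting empty writes) could look roughly like the sketch below. `SidecarRangeCheck` and its method names are hypothetical, and treating a zero-length blob as valid is an assumption; the PR's real checks may differ.

```java
import java.util.List;

// Hypothetical helper illustrating sidecar validation; not the PR's actual code.
final class SidecarRangeCheck {

    // True when [offset, offset + length) lies fully inside a file of fileLength bytes.
    static boolean inBounds(long offset, long length, long fileLength) {
        if (offset < 0 || length < 0 || fileLength < 0) {
            return false;
        }
        // Written as a subtraction so that offset + length can never overflow.
        return offset <= fileLength && length <= fileLength - offset;
    }

    // Throws before any bytes are read if the metadata points outside the file.
    static void checkReadable(long offset, long length, long fileLength) {
        if (!inBounds(offset, length, fileLength)) {
            throw new IllegalArgumentException(
                    "blob range [" + offset + ", +" + length + ") exceeds sidecar length " + fileLength);
        }
    }

    // Rejects a write request that carries no blobs at all.
    static void requireNonEmpty(List<byte[]> blobs) {
        if (blobs == null || blobs.isEmpty()) {
            throw new IllegalArgumentException("refusing to write an empty sidecar");
        }
    }
}
```

Comparing `length` with `fileLength - offset` rather than computing `offset + length` keeps the check correct even when corrupted metadata carries values near `Long.MAX_VALUE`.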

I kept Spark ANALYZE / CBO producer wiring out of this PR intentionally, so that we can first review the core format and lifecycle behavior. If this direction looks reasonable, Spark can be wired to produce and consume the NDV sidecar in a follow-up PR.

@kerwin-zk kerwin-zk force-pushed the feature/statistics-blob-metadata branch from cdc91da to 3d4abb2 Compare June 30, 2026 11:02
@JingsongLi

Contributor

@kerwin-zk I think you can put all of the changes into this single PR. Once we reach consensus, you can split it into smaller PRs to move the work forward.

@kerwin-zk kerwin-zk force-pushed the feature/statistics-blob-metadata branch from 3d4abb2 to dd12c96 Compare July 1, 2026 06:48
@kerwin-zk kerwin-zk changed the title from "[core] Add statistics sidecar for NDV sketches" to "[core][spark] Add statistics sidecar for NDV sketches" Jul 1, 2026
@kerwin-zk kerwin-zk force-pushed the feature/statistics-blob-metadata branch from dd12c96 to 1e108e2 Compare July 1, 2026 07:46
@kerwin-zk kerwin-zk marked this pull request as draft July 1, 2026 08:30
@kerwin-zk kerwin-zk closed this Jul 2, 2026
2 participants