@cboumalh (Contributor) commented Nov 5, 2025

What changes were proposed in this pull request?

Implement support for tuple sketches in Apache Spark, enabling efficient approximate set cardinality, frequency, and similarity computations over multiple dimensions.

Why are the changes needed?

Spark currently lacks support for tuple sketches, which allow efficient approximate computations over key–value data.
Adding them enables fast, memory-efficient estimates of distinct counts, frequencies, and set similarities across multiple dimensions.
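
For readers unfamiliar with the idea, here is a minimal toy sketch of what a tuple sketch does, assuming the general design of a theta-style (k-minimum-values) sample that stores a summary value next to each retained key. It is not the code in this PR, and it is not any library's API: the class `ToyTupleSketch`, its methods, the SHA-256 hash, and the sum-based summary are all hypothetical choices made for illustration.

```python
import hashlib


class ToyTupleSketch:
    """Toy KMV-style tuple sketch (illustrative only, hypothetical API).

    Keeps the k smallest 64-bit key hashes and a running sum per retained key.
    """

    HASH_SPACE = 2 ** 64

    def __init__(self, k=1024):
        self.k = k
        self.entries = {}  # key hash -> summed value for that key

    @staticmethod
    def _hash(key):
        # Assumption: any well-mixed 64-bit hash works; SHA-256 is used for simplicity.
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def update(self, key, value):
        h = self._hash(key)
        if h in self.entries:
            # Key already retained: merge the new value into its summary.
            self.entries[h] += value
        elif len(self.entries) < self.k:
            self.entries[h] = value
        else:
            # Sample is full: keep only the k smallest hashes.
            # A linear scan for the maximum is fine for a toy; a heap would be faster.
            largest = max(self.entries)
            if h < largest:
                del self.entries[largest]
                self.entries[h] = value

    def _theta_and_sample(self):
        # Below capacity every key is retained, so results are exact (theta = 1).
        if len(self.entries) < self.k:
            return 1.0, list(self.entries.values())
        # At capacity the largest retained hash acts as the sampling threshold.
        cutoff = max(self.entries)
        theta = cutoff / self.HASH_SPACE
        return theta, [v for h, v in self.entries.items() if h < cutoff]

    def distinct_estimate(self):
        theta, sample = self._theta_and_sample()
        return len(sample) / theta

    def sum_estimate(self):
        theta, sample = self._theta_and_sample()
        return sum(sample) / theta


sketch = ToyTupleSketch(k=16)
for key, value in [("a", 1), ("b", 2), ("a", 3)]:
    sketch.update(key, value)
print(sketch.distinct_estimate(), sketch.sum_estimate())
```

Because the retained keys are a uniform sample of the hash space, the same sample yields both a distinct-count estimate and an estimate of any additive summary, which is what makes the structure useful for key–value data. Real implementations also support merging sketches and set operations, which this toy omits.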

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

WIP

Was this patch authored or co-authored using generative AI tooling?

Yes

@cboumalh cboumalh marked this pull request as draft November 5, 2025 00:42
@github-actions github-actions bot added the SQL label Nov 5, 2025
@cboumalh (Contributor, Author) commented Nov 5, 2025

cc @dtenedor @mkaravel @gengliangwang (still WIP)
