
feat: estimate cardinality for semi and anti-joins using distinct counts#20904

Open
buraksenn wants to merge 3 commits into apache:main from buraksenn:use-ndv-for-semi-and-anti-join

Conversation

@buraksenn
Contributor

Which issue does this PR close?

Does not close an issue, but is part of #20766.

Rationale for this change

Details are in #20766. The main idea is to use existing distinct count information to optimize joins, similar to what Spark and Trino do.

What changes are included in this PR?

This PR extends cardinality estimation for semi and anti joins to use distinct counts.

Are these changes tested?

I've added test cases, but I'm not sure whether I should also have added benchmarks for this.

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Mar 12, 2026
Member

@asolimando asolimando left a comment

LGTM, a couple of minor points and a few tests to be added. The only change I'd like to see is bailing out when either side has no stats for a column pair.

None
}

/// Estimates the number of outer rows that have at least one matching
Member

The math looks sound to me, and coherent with that of #20846.

I was wondering whether you checked other notable systems that use a CBO, such as Trino or Spark.

If so, consider adding a note: it will help reviewers trust the change, since the approach is already battle-tested elsewhere.

Contributor Author

This builds on the same assumption as the inner join estimate in the same file, estimate_inner_join_cardinality. I saw a similar approach in Postgres: https://github.com/postgres/postgres/blob/02976b0a1718037f73fded250411b013e81fdafa/src/backend/utils/adt/selfuncs.c#L2718. I may need to check Spark and Trino again. The epic mentions them, but I'm not sure whether they do this specifically.

If you have any reservations about this, I can close the PR or try to be more conservative.

Contributor

Yes, I think we can look into what Trino did. I think they had something for this, but the Postgres approach makes sense.
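
For readers following the thread, here is a minimal sketch of the kind of NDV-based estimate being discussed. It is not the code from this PR: the function names are hypothetical, and it assumes (as the Postgres heuristic linked above does) that the smaller set of distinct key values is contained in the larger one, so at most inner_ndv of the outer_ndv distinct outer values can find a match.

```rust
/// Hypothetical helper: fraction of outer rows expected to find a match.
/// Assumption: key values are uniformly distributed and the smaller set of
/// distinct values is contained in the larger one.
fn semi_join_selectivity(outer_ndv: usize, inner_ndv: usize) -> f64 {
    if outer_ndv == 0 {
        // No distinct outer values means no outer row can match.
        return 0.0;
    }
    // At most `inner_ndv` of the `outer_ndv` distinct values can match.
    (inner_ndv as f64 / outer_ndv as f64).min(1.0)
}

/// Estimated output rows of a left semi join (outer rows with a match).
fn estimate_semi_join_rows(outer_rows: usize, outer_ndv: usize, inner_ndv: usize) -> usize {
    let selectivity = semi_join_selectivity(outer_ndv, inner_ndv);
    (outer_rows as f64 * selectivity).ceil() as usize
}

/// Estimated output rows of a left anti join (outer rows without a match).
fn estimate_anti_join_rows(outer_rows: usize, outer_ndv: usize, inner_ndv: usize) -> usize {
    // The semi estimate never exceeds `outer_rows`, so this cannot underflow.
    outer_rows - estimate_semi_join_rows(outer_rows, outer_ndv, inner_ndv)
}
```

Under these assumptions, 100 outer rows with outer_ndv=20 and inner_ndv=10 give a selectivity of 0.5, so the semi join is estimated at 50 rows and the anti join at the remaining 50.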

}

let mut selectivity = 1.0_f64;
let mut has_ndv = false;
Member

Nit: this is more about having a selectivity estimate than about NDV (judging from line 774, which sets it, and line 778, which consumes it). How would has_selectivity_estimate sound?

let inner_has_stats = inner_stat.distinct_count.get_value().is_some()
|| (inner_stat.min_value.get_value().is_some()
&& inner_stat.max_value.get_value().is_some());
if !outer_has_stats && !inner_has_stats {
Member

I'd rather be even more conservative and turn the AND into an OR: with missing stats (both NDV and min/max), the number of rows is used as a fallback, and mixing it with NDV would probably make the estimate too inaccurate to be useful. My suggestion is as follows:

Suggested change
- if !outer_has_stats && !inner_has_stats {
+ if !outer_has_stats || !inner_has_stats {
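
To make the difference between the two conditions concrete, here is a small sketch. ColumnStats below is a hypothetical, simplified stand-in (plain Option fields, not DataFusion's actual statistics types), and both function names are invented for illustration.

```rust
/// Hypothetical, simplified stand-in for per-column statistics.
#[derive(Clone, Copy, Default)]
struct ColumnStats {
    distinct_count: Option<usize>,
    min_value: Option<i64>,
    max_value: Option<i64>,
}

impl ColumnStats {
    /// A column counts as "having stats" if it has an NDV,
    /// or both a min and a max from which a range can be derived.
    fn has_stats(&self) -> bool {
        self.distinct_count.is_some()
            || (self.min_value.is_some() && self.max_value.is_some())
    }
}

/// Original condition: give up only when BOTH sides lack statistics.
fn bail_out_lenient(outer: &ColumnStats, inner: &ColumnStats) -> bool {
    !outer.has_stats() && !inner.has_stats()
}

/// Suggested condition: give up as soon as EITHER side lacks statistics.
fn bail_out_conservative(outer: &ColumnStats, inner: &ColumnStats) -> bool {
    !outer.has_stats() || !inner.has_stats()
}
```

The two conditions only disagree when exactly one side has statistics: the lenient check proceeds and ends up combining a real NDV with a row-count fallback, while the conservative check declines to estimate.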

(10, Inexact(30), Absent, Absent, Absent),
Some(50),
),
// NDV-based semi join: outer_ndv=20, inner_ndv=10
Member

Good test coverage, but I'd also add test cases for:

  • Multi-column join keys (to exercise the multiplicative selectivity path, which is new code)
  • Mixed stats availability (one column has NDV, another doesn't)
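
As an illustration of what those two cases would exercise, here is a hedged sketch of a multiplicative selectivity over several key columns. It is not the PR's implementation: KeyNdv and multi_column_semi_selectivity are hypothetical names, the columns are assumed to be independent, and the sketch declines to estimate at all when any column pair is missing an NDV, whereas the actual code may handle that case differently (for example, by skipping only that column).

```rust
/// Hypothetical per-key-pair NDVs: (outer_ndv, inner_ndv), `None` when unknown.
type KeyNdv = (Option<usize>, Option<usize>);

/// Multiplies per-column selectivities, treating key columns as independent.
/// Assumption for this sketch: returns `None` if there are no key columns or
/// if any column pair is missing an NDV on either side.
fn multi_column_semi_selectivity(keys: &[KeyNdv]) -> Option<f64> {
    if keys.is_empty() {
        return None;
    }
    let mut selectivity = 1.0_f64;
    for &(outer, inner) in keys {
        // `?` bails out of the whole estimate on the first missing NDV.
        let (outer_ndv, inner_ndv) = (outer?, inner?);
        if outer_ndv == 0 {
            // No distinct outer values on this key: nothing can match.
            return Some(0.0);
        }
        selectivity *= (inner_ndv as f64 / outer_ndv as f64).min(1.0);
    }
    Some(selectivity)
}
```

With two key columns whose per-column selectivities are each 0.5, the combined selectivity is 0.25. With one of the NDVs unknown, the sketch yields None and the caller would fall back to its default estimate.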

Contributor Author

Of course, let me add them.

@buraksenn buraksenn force-pushed the use-ndv-for-semi-and-anti-join branch from 79dcc2b to ee530c3 Compare March 12, 2026 20:44
