feat: estimate cardinality for semi and anti-joins using distinct counts#20904
feat: estimate cardinality for semi and anti-joins using distinct counts#20904buraksenn wants to merge 3 commits intoapache:mainfrom
Conversation
asolimando
left a comment
There was a problem hiding this comment.
LGTM, a couple of minor points and a few tests to be added. The only change I'd like to see is bailing out when either side has no stats for a column pair.
| None | ||
| } | ||
|
|
||
| /// Estimates the number of outer rows that have at least one matching |
There was a problem hiding this comment.
The math looks sound to me, and coherent with that of #20846.
I was wondering if you did check other notable systems using CBO like Trino or Spark.
If so, consider adding a note, this will help reviewers trust the change, as already battle-tested elsewhere.
There was a problem hiding this comment.
This builds up on same assumption in the inner join in the same file estimate_inner_join_cardinality. I saw similar thing in postgres https://github.com/postgres/postgres/blob/02976b0a1718037f73fded250411b013e81fdafa/src/backend/utils/adt/selfuncs.c#L2718. I may need to check Spark and Trino again. In the epic it said about them but not sure about this.
If you have any reservations about I can close or maybe try to be more conservative on this
There was a problem hiding this comment.
Yes I think we can look into what trino did, I think they had something for this, but the postgres approach makes sense
| } | ||
|
|
||
| let mut selectivity = 1.0_f64; | ||
| let mut has_ndv = false; |
There was a problem hiding this comment.
Nit: this is more about having a selectivity estimate than NDV (judging on lines 774 which set it, and 778 that consumes it), how would has_selectivity_estimate sound?
| let inner_has_stats = inner_stat.distinct_count.get_value().is_some() | ||
| || (inner_stat.min_value.get_value().is_some() | ||
| && inner_stat.max_value.get_value().is_some()); | ||
| if !outer_has_stats && !inner_has_stats { |
There was a problem hiding this comment.
I'd rather be even more conservative, and turn the AND into an OR: with missing stats (both NDV and min/max), the number of rows is used as fallback, mixing it NDV would make the estimation probably too inaccurate to be useful, so my suggestion is as-follows:
| if !outer_has_stats && !inner_has_stats { | |
| if !outer_has_stats || !inner_has_stats { |
| (10, Inexact(30), Absent, Absent, Absent), | ||
| Some(50), | ||
| ), | ||
| // NDV-based semi join: outer_ndv=20, inner_ndv=10 |
There was a problem hiding this comment.
Good test coverage, but I'd also add test cases for:
- Multi-column join keys (to exercise the multiplicative selectivity path, which is new code)
- Mixed stats availability (one column has NDV, another doesn't)
There was a problem hiding this comment.
Of course let me add it
a82d83e to
79dcc2b
Compare
79dcc2b to
ee530c3
Compare
Which issue does this PR close?
Does not close but part of #20766
Rationale for this change
Details are in #20766. But main idea is to use existing distinct count information to optimize joins similar to how Spark/Trino does
What changes are included in this PR?
This PR extends cardinality estimation for semi/anti joins using distinct counts
Are these changes tested?
I've added cases but not sure if I should've added benchmarks on this.
Are there any user-facing changes?
No