feat: estimate cardinality for semi and anti-joins using distinct counts by buraksenn · Pull Request #20904 · apache/datafusion

buraksenn · 2026-03-12T12:46:35Z

Which issue does this PR close?

Does not close an issue, but is part of #20766.

Rationale for this change

Details are in #20766. The main idea is to use existing distinct-count information to improve join cardinality estimates, similar to how Spark and Trino do.

What changes are included in this PR?

This PR extends cardinality estimation for semi/anti joins using distinct counts

Are these changes tested?

I've added test cases, but I'm not sure whether I should also have added benchmarks for this.

Are there any user-facing changes?

No

asolimando

LGTM, a couple of minor points and a few tests to be added. The only change I'd like to see is bailing out when either side has no stats for a column pair.

asolimando · 2026-03-12T13:54:12Z

datafusion/physical-plan/src/joins/utils.rs

    None
 }

+/// Estimates the number of outer rows that have at least one matching


The math looks sound to me, and coherent with that of #20846.

I was wondering if you did check other notable systems using CBO like Trino or Spark.

If so, consider adding a note, this will help reviewers trust the change, as already battle-tested elsewhere.

This builds on the same assumption as the inner join estimate in the same file, estimate_inner_join_cardinality. I saw a similar approach in Postgres: https://github.com/postgres/postgres/blob/02976b0a1718037f73fded250411b013e81fdafa/src/backend/utils/adt/selfuncs.c#L2718. I may need to check Spark and Trino again. The epic mentions them, but I'm not sure they cover this case.

If you have any reservations about this, I can close the PR or try to make the estimate more conservative.

Yes, I think we can look into what Trino did; I think they had something for this. But the Postgres approach makes sense.
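For readers following the thread, here is a minimal single-column sketch of the kind of NDV-based estimate being discussed. It is not the PR's code: the function names are hypothetical, and it assumes the common containment simplification (the distinct keys of the smaller side all appear in the larger side), so the fraction of outer rows with a match is taken as `min(outer_ndv, inner_ndv) / outer_ndv`.

```rust
/// Hypothetical helper (not from the PR): estimated fraction of outer rows
/// that find at least one match on a single equi-join column.
/// Assumes the smaller set of distinct keys is contained in the larger one.
fn semi_join_selectivity(outer_ndv: u64, inner_ndv: u64) -> f64 {
    if outer_ndv == 0 {
        // No distinct outer keys, so nothing can match.
        return 0.0;
    }
    inner_ndv.min(outer_ndv) as f64 / outer_ndv as f64
}

/// Estimated output rows of a semi join: outer rows scaled by the selectivity.
fn estimate_semi_join_rows(outer_rows: u64, outer_ndv: u64, inner_ndv: u64) -> u64 {
    let sel = semi_join_selectivity(outer_ndv, inner_ndv);
    // Round up and clamp so the estimate never exceeds the outer row count.
    ((outer_rows as f64 * sel).ceil() as u64).min(outer_rows)
}

/// Estimated output rows of an anti join: the outer rows that are NOT matched.
fn estimate_anti_join_rows(outer_rows: u64, outer_ndv: u64, inner_ndv: u64) -> u64 {
    outer_rows.saturating_sub(estimate_semi_join_rows(outer_rows, outer_ndv, inner_ndv))
}
```

With 100 outer rows, `outer_ndv = 20` and `inner_ndv = 10`, this sketch keeps half of the outer rows for the semi join and the other half for the anti join. When the inner side has at least as many distinct keys as the outer side, every outer row is assumed to match.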

asolimando · 2026-03-12T14:32:00Z

datafusion/physical-plan/src/joins/utils.rs

+    }
+
+    let mut selectivity = 1.0_f64;
+    let mut has_ndv = false;


Nit: this flag is more about having a selectivity estimate than about NDV (judging by line 774, which sets it, and line 778, which consumes it). How would has_selectivity_estimate sound?

asolimando · 2026-03-12T14:42:03Z

datafusion/physical-plan/src/joins/utils.rs

+        let inner_has_stats = inner_stat.distinct_count.get_value().is_some()
+            || (inner_stat.min_value.get_value().is_some()
+                && inner_stat.max_value.get_value().is_some());
+        if !outer_has_stats && !inner_has_stats {


I'd rather be even more conservative and turn the AND into an OR: when stats are missing (both NDV and min/max), the number of rows is used as a fallback, and mixing that with an NDV from the other side would probably make the estimate too inaccurate to be useful. So my suggestion is as follows:

Suggested change

-        if !outer_has_stats && !inner_has_stats {
+        if !outer_has_stats || !inner_has_stats {
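To make the difference between the two conditions concrete, the sketch below models the check with a simplified, hypothetical `ColStats` struct using plain `Option` fields. This is an assumption for illustration only; the real code reads `distinct_count`, `min_value` and `max_value` through `get_value()` on DataFusion's column statistics.

```rust
/// Hypothetical, simplified stand-in for per-column statistics.
#[derive(Clone, Copy, Default)]
struct ColStats {
    distinct_count: Option<u64>,
    min_value: Option<i64>,
    max_value: Option<i64>,
}

impl ColStats {
    /// A column is usable if it has an NDV, or a min/max range to derive one from.
    fn has_stats(&self) -> bool {
        self.distinct_count.is_some()
            || (self.min_value.is_some() && self.max_value.is_some())
    }
}

/// Original condition: bail out only when BOTH sides lack stats.
fn bail_if_both_missing(outer: &ColStats, inner: &ColStats) -> bool {
    !outer.has_stats() && !inner.has_stats()
}

/// Suggested condition: bail out when EITHER side lacks stats.
fn bail_if_either_missing(outer: &ColStats, inner: &ColStats) -> bool {
    !outer.has_stats() || !inner.has_stats()
}
```

The two conditions differ only when exactly one side has stats: the original keeps estimating and has to mix a row-count fallback with an NDV, while the suggested version bails out for that column pair.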

asolimando · 2026-03-12T14:45:08Z

datafusion/physical-plan/src/joins/utils.rs

                (10, Inexact(30), Absent, Absent, Absent),
                Some(50),
            ),
+            // NDV-based semi join: outer_ndv=20, inner_ndv=10


Good test coverage, but I'd also add test cases for:

Multi-column join keys (to exercise the multiplicative selectivity path, which is new code)

Mixed stats availability (one column has NDV, another doesn't)

Of course let me add it
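As a rough illustration of what those two test cases would exercise, here is a sketch of a multi-column estimate. It is an assumption, not the PR's implementation: the function name and input shape are hypothetical, per-column selectivities are multiplied as if the join columns were independent, and a column pair with missing NDV on either side is skipped rather than aborting the whole estimate.

```rust
/// Hypothetical sketch: each entry is (outer_ndv, inner_ndv) for one join
/// column; `None` means that side has no usable distinct-count estimate.
/// Returns `None` when no column pair yields a selectivity.
fn multi_column_semi_selectivity(cols: &[(Option<u64>, Option<u64>)]) -> Option<f64> {
    let mut selectivity = 1.0_f64;
    let mut has_selectivity_estimate = false;

    for &(outer, inner) in cols {
        match (outer, inner) {
            (Some(o), Some(i)) if o > 0 => {
                // Per-column match fraction, multiplied in under an
                // independence assumption between join columns.
                selectivity *= i.min(o) as f64 / o as f64;
                has_selectivity_estimate = true;
            }
            // Stats missing on either side (or zero outer NDV): skip this pair.
            _ => continue,
        }
    }

    has_selectivity_estimate.then_some(selectivity)
}
```

Under these assumptions, two columns with NDV pairs (20, 10) and (8, 2) combine to 0.5 × 0.25 = 0.125, while a pair with a missing NDV leaves the product unchanged.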

feat: use ndv to estimate cardinality

6b5a8d2

github-actions bot added the physical-plan Changes to the physical-plan crate label Mar 12, 2026

fall back if both sides dont have stats

6c989d1

asolimando reviewed Mar 12, 2026


jonathanc-n mentioned this pull request Mar 12, 2026

EPIC: Making use of NDVs (number of distinct values) in DataFusion #20766

Open

buraksenn force-pushed the use-ndv-for-semi-and-anti-join branch from a82d83e to 79dcc2b Compare March 12, 2026 20:40

address review comments

ee530c3

buraksenn force-pushed the use-ndv-for-semi-and-anti-join branch from 79dcc2b to ee530c3 Compare March 12, 2026 20:44

buraksenn wants to merge 3 commits into apache:main from buraksenn:use-ndv-for-semi-and-anti-join
