Skip to content

Use exact distinct_count from statistics if exists for COUNT(DISTINCT column)) calculations#20845

Open
buraksenn wants to merge 4 commits intoapache:mainfrom
buraksenn:use-distinct-count-from-stats-in-distinct-count
Open

Use exact distinct_count from statistics if exists for COUNT(DISTINCT column)) calculations#20845
buraksenn wants to merge 4 commits intoapache:mainfrom
buraksenn:use-distinct-count-from-stats-in-distinct-count

Conversation

@buraksenn
Copy link
Contributor

Which issue does this PR close?

Does not close but part of #20766

Rationale for this change

If distinct_count is known for the column from statistics we can directly use it in COUNT(DISTINCT column)) expressions instead of calculating.

What changes are included in this PR?

  • Return distinct_count value if exists from statistics
  • Unit tests

Are these changes tested?

Added additional tests

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate functions Changes to functions implementation labels Mar 10, 2026
// expressions like casts or literals are not supported.
let col_expr = expr.as_any().downcast_ref::<expressions::Column>()?;
if let Precision::Exact(dc) = col_stats[col_expr.index()].distinct_count {
return Some(ScalarValue::Int64(Some(dc as i64)));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes NULL is not counted in distinct_count column. AFAIS from the repository it is not counted so this should be ok but wanted to put a comment here

Copy link
Contributor

@jonathanc-n jonathanc-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @buraksenn couple comments

Ok(())
}

fn mock_data_with_distinct_count(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we able to combine these two functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not exactly understand but changed it to one table driven test

// expressions like casts or literals are not supported.
let col_expr = expr.as_any().downcast_ref::<expressions::Column>()?;
if let Precision::Exact(dc) = col_stats[col_expr.index()].distinct_count {
return Some(ScalarValue::Int64(Some(dc as i64)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Distinct count is usize, we should deal with the case where it could overflow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since other paths were casting as well I did not do it actually but let me do it for all paths

Copy link
Contributor

@jonathanc-n jonathanc-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @buraksenn

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants