Skip to content

Replace distinct with aggregate for where-in/exists subquery #5430

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

ygf11
Copy link
Contributor

@ygf11 ygf11 commented Feb 28, 2023

Which issue does this PR close?

Part of #5429.

Rationale for this change

What changes are included in this PR?

  1. Find where-in/exists and optimize them.
  2. Support Aggregate subquery for where-exists to join, when the aggregate comes from distinct.

Are these changes tested?

Yes.

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Feb 28, 2023
@ygf11 ygf11 marked this pull request as ready for review February 28, 2023 11:39
@ygf11
Copy link
Contributor Author

ygf11 commented Feb 28, 2023

Please take a look @mingmwang @jackwener @alamb

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ygf11 -- I think this code looks quite good

I am not sure if this will work with expressions (not just column references) but a few more tests I think will demonstrate one way or the other

cc @mingmwang

Comment on lines +1780 to +1781
/// Check whether it is a Distinct.
/// A Distinct means all fields of the schema are the expressions of group by.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure what this means -- is this check designed to check the input expressions to the GroupBy or the output expressions?

If the output expressions, here would be an alternate description

Suggested change
/// Check whether it is a Distinct.
/// A Distinct means all fields of the schema are the expressions of group by.
/// Return true if the output values are distinct (have no duplicates)
///
/// In order for this to return true, all fields of the output schema must be expressions of the group by

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this correct? Or is this check designed to check the input expressions only?

Copy link
Contributor

@mingmwang mingmwang Mar 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also feel confusing to the naming and the implementation of this method. Maybe we should call this is_group_by_only() ?
And as alamb mentioned, SELECT a ... GROUP BY a, a should be the group by only aggregation also.

/// A Distinct means all fields of the schema are the expressions of group by.
pub fn is_distinct(&self) -> datafusion_common::Result<bool> {
let group_expr_size = self.group_expr.len();
if !self.aggr_expr.is_empty() || group_expr_size != self.schema.fields().len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't SELECT a ... GROUP BY a, a be distinct even though the number of group exprs didn't match? Maybe this case isn't important to handle now

}

#[test]
fn exists_subquery_aggragte_distinct() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn exists_subquery_aggragte_distinct() -> Result<()> {
fn exists_subquery_aggregate_distinct() -> Result<()> {


let subquery = LogicalPlanBuilder::from(subquery_scan)
.filter(col("sq.a").gt(col("test.b")))?
.project(vec![col("sq.a"), col("sq.c")])?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for:

  1. When there is no projection in the subquery?
  2. When the projection in the subquery is an expression (like sq.a + sq.b)?

// distinct in where-in subquery
let subquery = LogicalPlanBuilder::from(subquery_scan)
.filter(col("test.a").eq(col("sq.a")))?
.project(vec![col("sq.b"), col("sq.c")])?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these tests too I recommend projecting an expression (not just columns)

@jackwener jackwener self-requested a review February 28, 2023 17:19
@mingmwang
Copy link
Contributor

mingmwang commented Mar 1, 2023

@ygf11 @jackwener
I'm not sure whether we should make the rule ReplaceDistinctWithAggregate to handle the where-in/exists subquery specifically. My original thinking was let the rules DecorrelateWhereExists and DecorrelateWhereIn rewrite the subqueries to Joins and in the second pass the rule ReplaceDistinctWithAggregate will rewrite the Distinct to Aggregate since we already run those rules multiple times, so that we can keep a relatively simple ReplaceDistinctWithAggregate rule.

For those expr subqueries that can not be decorrelated, we can create another optimization task and apply all the existing rules to them

@ygf11
Copy link
Contributor Author

ygf11 commented Mar 1, 2023

I'm not sure whether we should make the rule ReplaceDistinctWithAggregate to handle the where-in/exists subquery specifically. My original thinking was let the rules DecorrelateWhereExists and DecorrelateWhereIn rewrite the subqueries to Joins and in the second pass the rule ReplaceDistinctWithAggregate will rewrite the Distinct to Aggregate since we already run those rules multiple times, so that we can keep a relatively simple ReplaceDistinctWithAggregate rule.
For those expr subqueries that can not be decorrelated, we can create another optimization task and apply all the existing rules to them

Thanks @mingmwang. Rewriting distinct in second pass is ok to me.

@jackwener
Copy link
Member

jackwener commented Mar 1, 2023

@ygf11 @jackwener I'm not sure whether we should make the rule ReplaceDistinctWithAggregate to handle the where-in/exists subquery specifically. My original thinking was let the rules DecorrelateWhereExists and DecorrelateWhereIn rewrite the subqueries to Joins and in the second pass the rule ReplaceDistinctWithAggregate will rewrite the Distinct to Aggregate since we already run those rules multiple times, so that we can keep a relatively simple ReplaceDistinctWithAggregate rule.

For those expr subqueries that can not be decorrelated, we can create another optimization task and apply all the existing rules to them

Agree with it.
This is exactly what confuse me at first, I think we don't need handle it in this PR because it will be done in other rule.

@mingmwang
Copy link
Contributor

Yes, I think most of the rules do not need to handle subqueries specifically, we need to apply PushDownFilter, PushDownLimit... to subqueries also, if the subqueries can not be decorrelated. we can have another optimization task/process and apply the existing rules to subqueries.

There is another PR related to subqueries and distinct, I think let's get this PR merge first.
#5345

@mingmwang
Copy link
Contributor

@ygf11
Can we close this PR now as this is not required ?.

@ygf11 ygf11 closed this Mar 14, 2023
@ygf11 ygf11 deleted the replace-distinct branch March 15, 2023 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants