Replace distinct with aggregate for where-in/exists subquery #5430

ygf11 · 2023-02-28T10:05:12Z

Which issue does this PR close?

Part of #5429.

Rationale for this change

What changes are included in this PR?

Find where-in/exists subqueries and replace distinct with aggregate in them.
Support rewriting a where-exists subquery that contains an Aggregate to a join, when the aggregate comes from a distinct.

Are these changes tested?

Yes.

Are there any user-facing changes?

ygf11 · 2023-02-28T11:39:53Z

Please take a look @mingmwang @jackwener @alamb

alamb

Thanks @ygf11 -- I think this code looks quite good

I am not sure if this will work with expressions (not just column references), but I think a few more tests will demonstrate it one way or the other

cc @mingmwang

alamb · 2023-02-28T14:58:00Z

datafusion/expr/src/logical_plan/plan.rs

+    /// Check whether it is a Distinct.
+    /// A Distinct means all fields of the schema are the expressions of group by.


I am not quite sure what this means -- is this check designed to check the input expressions to the GroupBy or the output expressions?

If the output expressions, here would be an alternate description

Suggested change

-    /// Check whether it is a Distinct.
-    /// A Distinct means all fields of the schema are the expressions of group by.
+    /// Return true if the output values are distinct (have no duplicates)
+    ///
+    /// In order for this to return true, all fields of the output schema must be expressions of the group by

is this correct? Or is this check designed to check the input expressions only?

I also find the naming and the implementation of this method confusing. Maybe we should call it is_group_by_only()?
And as alamb mentioned, SELECT a ... GROUP BY a, a should also count as a group-by-only aggregation.

alamb · 2023-02-28T14:59:36Z

datafusion/expr/src/logical_plan/plan.rs

+    /// A Distinct means all fields of the schema are the expressions of group by.
+    pub fn is_distinct(&self) -> datafusion_common::Result<bool> {
+        let group_expr_size = self.group_expr.len();
+        if !self.aggr_expr.is_empty() || group_expr_size != self.schema.fields().len() {


Wouldn't SELECT a ... GROUP BY a, a be distinct even though the number of group exprs doesn't match the number of schema fields? Maybe this case isn't important to handle now.
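To illustrate the GROUP BY a, a case, here is a minimal, self-contained sketch. It does not use DataFusion's types: ToyAggregate, toy, is_distinct_by_len and the body of is_group_by_only are all hypothetical, expressions are modelled as plain strings, and the output schema is assumed to end up with a single field a. Only the first condition of is_distinct is visible in the diff above, so the length-based check here mirrors just that condition.

```rust
use std::collections::HashSet;

/// Toy stand-in for an aggregate plan node (hypothetical, NOT DataFusion's `Aggregate`).
/// Expressions and output fields are modelled as plain strings.
pub struct ToyAggregate {
    pub group_expr: Vec<String>,
    pub aggr_expr: Vec<String>,
    /// Names of the output schema fields.
    pub schema_fields: Vec<String>,
}

/// Helper to build a `ToyAggregate` from string slices (hypothetical).
pub fn toy(groups: &[&str], aggrs: &[&str], fields: &[&str]) -> ToyAggregate {
    let own = |xs: &[&str]| xs.iter().map(|s| s.to_string()).collect::<Vec<_>>();
    ToyAggregate {
        group_expr: own(groups),
        aggr_expr: own(aggrs),
        schema_fields: own(fields),
    }
}

impl ToyAggregate {
    /// Length-based check: no aggregate expressions, and as many group
    /// expressions as output fields. Mirrors only the condition shown in the diff.
    pub fn is_distinct_by_len(&self) -> bool {
        self.aggr_expr.is_empty() && self.group_expr.len() == self.schema_fields.len()
    }

    /// Set-based check: no aggregate expressions, and every output field is one
    /// of the group expressions. Duplicate group expressions do not matter.
    pub fn is_group_by_only(&self) -> bool {
        if !self.aggr_expr.is_empty() {
            return false;
        }
        let groups: HashSet<&str> = self.group_expr.iter().map(String::as_str).collect();
        self.schema_fields.iter().all(|f| groups.contains(f.as_str()))
    }
}
```

Under these assumptions, toy(&["a", "a"], &[], &["a"]) is rejected by the length-based check but accepted by the set-based one, while an aggregate that carries any aggregate expression is rejected by both.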

alamb · 2023-02-28T16:47:28Z

datafusion/optimizer/src/decorrelate_where_exists.rs

+    }
+
+    #[test]
+    fn exists_subquery_aggragte_distinct() -> Result<()> {


Suggested change

-    fn exists_subquery_aggragte_distinct() -> Result<()> {
+    fn exists_subquery_aggregate_distinct() -> Result<()> {

alamb · 2023-02-28T16:48:49Z

datafusion/optimizer/src/decorrelate_where_exists.rs

+
+        let subquery = LogicalPlanBuilder::from(subquery_scan)
+            .filter(col("sq.a").gt(col("test.b")))?
+            .project(vec![col("sq.a"), col("sq.c")])?


Can you add a test for:

- When there is no projection in the subquery?
- When the projection in the subquery is an expression (like sq.a + sq.b)?

alamb · 2023-02-28T16:51:33Z

datafusion/optimizer/src/replace_distinct_aggregate.rs

+        // distinct in where-in subquery
+        let subquery = LogicalPlanBuilder::from(subquery_scan)
+            .filter(col("test.a").eq(col("sq.a")))?
+            .project(vec![col("sq.b"), col("sq.c")])?


In these tests too I recommend projecting an expression (not just columns)

mingmwang · 2023-03-01T08:01:49Z

@ygf11 @jackwener
I'm not sure whether we should make the rule ReplaceDistinctWithAggregate handle the where-in/exists subquery specifically. My original thinking was to let the rules DecorrelateWhereExists and DecorrelateWhereIn rewrite the subqueries to Joins; in the second pass, the rule ReplaceDistinctWithAggregate will then rewrite the Distinct to Aggregate, since we already run those rules multiple times. That way we can keep the ReplaceDistinctWithAggregate rule relatively simple.

For those expr subqueries that cannot be decorrelated, we can create another optimization task and apply all the existing rules to them.
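The two-pass idea can be sketched with a toy model. Everything below is hypothetical and is not DataFusion's API: the Plan enum, map_children, and the three functions are simplified stand-ins, and the rule order (replace distinct first, then decorrelate) is an assumption chosen to match the "second pass" wording. The distinct rule never looks inside the EXISTS subquery; it only sees the Distinct once decorrelation has moved it into the main plan tree.

```rust
/// Toy logical plan (hypothetical; NOT DataFusion's `LogicalPlan`).
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan(String),
    Distinct(Box<Plan>),
    /// Group-by-only aggregate standing in for a rewritten Distinct.
    Aggregate(Box<Plan>),
    /// Filter with an EXISTS subquery: (input, subquery).
    FilterExists(Box<Plan>, Box<Plan>),
    /// Semi join produced by decorrelation: (left, right).
    SemiJoin(Box<Plan>, Box<Plan>),
}

/// Apply `f` to every child in the main plan tree and rebuild the node.
/// The subquery of `FilterExists` is deliberately left untouched, modelling
/// rules that do not handle subqueries specifically.
fn map_children(plan: Plan, f: &dyn Fn(Plan) -> Plan) -> Plan {
    match plan {
        Plan::Scan(name) => Plan::Scan(name),
        Plan::Distinct(i) => Plan::Distinct(Box::new(f(*i))),
        Plan::Aggregate(i) => Plan::Aggregate(Box::new(f(*i))),
        Plan::FilterExists(i, sq) => Plan::FilterExists(Box::new(f(*i)), sq),
        Plan::SemiJoin(l, r) => Plan::SemiJoin(Box::new(f(*l)), Box::new(f(*r))),
    }
}

/// Toy version of a decorrelation rule: EXISTS filter becomes a semi join.
pub fn decorrelate_exists(plan: Plan) -> Plan {
    match map_children(plan, &decorrelate_exists) {
        Plan::FilterExists(input, subquery) => Plan::SemiJoin(input, subquery),
        other => other,
    }
}

/// Toy version of a distinct-to-aggregate rule, unaware of subqueries.
pub fn replace_distinct(plan: Plan) -> Plan {
    match map_children(plan, &replace_distinct) {
        Plan::Distinct(input) => Plan::Aggregate(input),
        other => other,
    }
}

/// Run both rules in a fixed order for the given number of passes.
pub fn optimize(mut plan: Plan, passes: usize) -> Plan {
    for _ in 0..passes {
        plan = replace_distinct(plan);
        plan = decorrelate_exists(plan);
    }
    plan
}
```

In this model, one pass over an EXISTS filter whose subquery is a Distinct leaves a SemiJoin with the Distinct still on its right side, and a second pass turns that Distinct into an Aggregate, without replace_distinct ever needing subquery-specific logic.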

ygf11 · 2023-03-01T11:35:03Z

> I'm not sure whether we should make the rule ReplaceDistinctWithAggregate to handle the where-in/exists subquery specifically. My original thinking was let the rules DecorrelateWhereExists and DecorrelateWhereIn rewrite the subqueries to Joins and in the second pass the rule ReplaceDistinctWithAggregate will rewrite the Distinct to Aggregate since we already run those rules multiple times, so that we can keep a relatively simple ReplaceDistinctWithAggregate rule.
> For those expr subqueries that can not be decorrelated, we can create another optimization task and apply all the existing rules to them

Thanks @mingmwang. Rewriting distinct in second pass is ok to me.

jackwener · 2023-03-01T13:32:27Z

> @ygf11 @jackwener I'm not sure whether we should make the rule ReplaceDistinctWithAggregate to handle the where-in/exists subquery specifically. My original thinking was let the rules DecorrelateWhereExists and DecorrelateWhereIn rewrite the subqueries to Joins and in the second pass the rule ReplaceDistinctWithAggregate will rewrite the Distinct to Aggregate since we already run those rules multiple times, so that we can keep a relatively simple ReplaceDistinctWithAggregate rule.
>
> For those expr subqueries that can not be decorrelated, we can create another optimization task and apply all the existing rules to them

Agreed.
This is exactly what confused me at first. I think we don't need to handle it in this PR because it will be done in another rule.

mingmwang · 2023-03-02T03:17:21Z

Yes, I think most of the rules do not need to handle subqueries specifically. We also need to apply PushDownFilter, PushDownLimit... to subqueries if they cannot be decorrelated; for that we can have another optimization task/process that applies the existing rules to subqueries.

There is another PR related to subqueries and distinct; let's get this PR merged first: #5345

mingmwang · 2023-03-13T10:18:21Z

@ygf11
Can we close this PR now, as it is not required?

Replace distinct with aggregate for subquery

0322f8c

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Feb 28, 2023

ygf11 marked this pull request as ready for review February 28, 2023 11:39

alamb reviewed Feb 28, 2023

jackwener self-requested a review February 28, 2023 17:19

ygf11 closed this Mar 14, 2023

ygf11 deleted the replace-distinct branch March 15, 2023 07:30
