-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Replace distinct with aggregate for where-in/exists subquery #5430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Please take a look @mingmwang @jackwener @alamb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ygf11 -- I think this code looks quite good
I am not sure if this will work with expressions (not just column references) but a few more tests I think will demonstrate one way or the other
cc @mingmwang
/// Check whether it is a Distinct. | ||
/// A Distinct means all fields of the schema are the expressions of group by. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure what this means -- is this check designed to check the input expressions to the GroupBy or the output expressions?
If the output expressions, here would be an alternate description
/// Check whether it is a Distinct. | |
/// A Distinct means all fields of the schema are the expressions of group by. | |
/// Return true if the output values are distinct (have no duplicates) | |
/// | |
/// In order for this to return true, all fields of the output schema must be expressions of the group by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this correct? Or is this check designed to check the input expressions only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also feel confusing to the naming and the implementation of this method. Maybe we should call this is_group_by_only()
?
And as alamb mentioned, SELECT a ... GROUP BY a, a
should be the group by only aggregation also.
/// A Distinct means all fields of the schema are the expressions of group by. | ||
pub fn is_distinct(&self) -> datafusion_common::Result<bool> { | ||
let group_expr_size = self.group_expr.len(); | ||
if !self.aggr_expr.is_empty() || group_expr_size != self.schema.fields().len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't SELECT a ... GROUP BY a, a
be distinct even though the number of group exprs didn't match? Maybe this case isn't important to handle now
} | ||
|
||
#[test] | ||
fn exists_subquery_aggragte_distinct() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn exists_subquery_aggragte_distinct() -> Result<()> { | |
fn exists_subquery_aggregate_distinct() -> Result<()> { |
|
||
let subquery = LogicalPlanBuilder::from(subquery_scan) | ||
.filter(col("sq.a").gt(col("test.b")))? | ||
.project(vec![col("sq.a"), col("sq.c")])? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for:
- When there is no projection in the subquery?
- When the projection in the subquery is an expression (like
sq.a + sq.b
)?
// distinct in where-in subquery | ||
let subquery = LogicalPlanBuilder::from(subquery_scan) | ||
.filter(col("test.a").eq(col("sq.a")))? | ||
.project(vec![col("sq.b"), col("sq.c")])? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In these tests too I recommend projecting an expression (not just columns)
@ygf11 @jackwener For those expr subqueries that can not be decorrelated, we can create another optimization task and apply all the existing rules to them |
Thanks @mingmwang. Rewriting distinct in second pass is ok to me. |
Agree with it. |
Yes, I think most of the rules do not need to handle subqueries specifically, we need to apply PushDownFilter, PushDownLimit... to subqueries also, if the subqueries can not be decorrelated. we can have another optimization task/process and apply the existing rules to subqueries. There is another PR related to subqueries and distinct, I think let's get this PR merge first. |
@ygf11 |
Which issue does this PR close?
Part of #5429.
Rationale for this change
What changes are included in this PR?
where-in/exists
and optimize them.Are these changes tested?
Yes.
Are there any user-facing changes?