-
Notifications
You must be signed in to change notification settings - Fork 1.7k
perf: Optimize CASE for any WHEN false #17835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
if i == 0 { | ||
return Ok(Transformed::yes(*then_)); | ||
let mut remove_indices = Vec::with_capacity(when_then_expr.len()); | ||
let out_type = info.get_data_type(&when_then_expr[0].1)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Introducing this get_data_type
call made some of the existing tests fail because it was trying to get the data type of a column that didn't exist in the schema. I updated the existing tests to use the actual column names e.g (col("c1")
, col("c3")
) or string literals (e.g lit("a")) instead of the invalid column names (e.g col("a")
) hence why so many random changes in the old tests. When I ran queries in the CLI, it seemed like Datafusion was catching the invalid column names before it got to this code, so I think this should be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me moment to convince myself that we did not need to gate on !when_then_expr.empty()
to ensure when_then_expr[0]
doesn't panic -- and that is because .any()
needs at least one expr to evaluate to true.
TLDR this is fine, I am just recording my thought process for anyone else who is interested
cc @alamb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me 👍
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
Outdated
Show resolved
Hide resolved
// Remove any CASE false statements | ||
for i in remove_indices.iter().rev() { | ||
when_then_expr.remove(*i); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we need to consider any weird cases like if there's a very large case statement with lots of false; if removing one by one like this could be a performance issue or if we shouldn't worry about that too much 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the logic to instead move all valid entries to a new new_when_then_expr
vector. Do you think this is better? I figured it should at least be more predictable, since it's at most O(n) moves. Whereas the old deletion logic can end up involving multiple O(n) left shifts. I believe that would technically be O(n^2).
…oving from the original vector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if i == 0 { | ||
return Ok(Transformed::yes(*then_)); | ||
let mut remove_indices = Vec::with_capacity(when_then_expr.len()); | ||
let out_type = info.get_data_type(&when_then_expr[0].1)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me moment to convince myself that we did not need to gate on !when_then_expr.empty()
to ensure when_then_expr[0]
doesn't panic -- and that is because .any()
needs at least one expr to evaluate to true.
TLDR this is fine, I am just recording my thought process for anyone else who is interested
); | ||
|
||
// CASE WHEN true THEN col("a") ELSE col("b") END --> col("a") | ||
// CASE WHEN true THEN col('a') ELSE col('b') END --> col('a') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was worried about this change, as it seems to potentially change the intent of the test -- to use literals rather than column references.
However, I see the issue is that the DataTypes need to match and after spending some time rewriting these tests to use column references rather than literals I think the literals are fine
Which issue does this PR close?
CASE WHEN false THEN ...
#17590Expr::Case
expressions #1693Rationale for this change
Extends the case simplification by simplifying the expr when any of the when statements are false.
What changes are included in this PR?
Implements the following simplifications:
Are these changes tested?
Yes
Are there any user-facing changes?
No