Skip to content

Conversation

petern48
Copy link
Contributor

@petern48 petern48 commented Sep 30, 2025

Which issue does this PR close?

Rationale for this change

Extends the case simplification by simplifying the expr when any of the when statements are false.

What changes are included in this PR?

Implements the following simplifications:

CASE WHEN false THEN A END --> NULL
CASE WHEN false THEN A ELSE B END --> B
CASE WHEN X THEN A WHEN false THEN B END --> CASE WHEN X THEN A ELSE B END

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the optimizer Optimizer rules label Sep 30, 2025
if i == 0 {
return Ok(Transformed::yes(*then_));
let mut remove_indices = Vec::with_capacity(when_then_expr.len());
let out_type = info.get_data_type(&when_then_expr[0].1)?;
Copy link
Contributor Author

@petern48 petern48 Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introducing this get_data_type call made some of the existing tests fail because it was trying to get the data type of a column that didn't exist in the schema. I updated the existing tests to use the actual column names e.g (col("c1"), col("c3")) or string literals (e.g lit("a")) instead of the invalid column names (e.g col("a")) hence why so many random changes in the old tests. When I ran queries in the CLI, it seemed like Datafusion was catching the invalid column names before it got to this code, so I think this should be safe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me moment to convince myself that we did not need to gate on !when_then_expr.empty() to ensure when_then_expr[0] doesn't panic -- and that is because .any() needs at least one expr to evaluate to true.

TLDR this is fine, I am just recording my thought process for anyone else who is interested

@petern48 petern48 marked this pull request as ready for review September 30, 2025 03:01
@petern48
Copy link
Contributor Author

cc @alamb

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me 👍

Comment on lines 1470 to 1473
// Remove any CASE false statements
for i in remove_indices.iter().rev() {
when_then_expr.remove(*i);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need to consider any weird cases like if there's a very large case statement with lots of false; if removing one by one like this could be a performance issue or if we shouldn't worry about that too much 🤔

Copy link
Contributor Author

@petern48 petern48 Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the logic to instead move all valid entries to a new new_when_then_expr vector. Do you think this is better? I figured it should at least be more predictable, since it's at most O(n) moves. Whereas the old deletion logic can end up involving multiple O(n) left shifts. I believe that would technically be O(n^2).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @petern48 and @Jefffrey -- this looks great

if i == 0 {
return Ok(Transformed::yes(*then_));
let mut remove_indices = Vec::with_capacity(when_then_expr.len());
let out_type = info.get_data_type(&when_then_expr[0].1)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me moment to convince myself that we did not need to gate on !when_then_expr.empty() to ensure when_then_expr[0] doesn't panic -- and that is because .any() needs at least one expr to evaluate to true.

TLDR this is fine, I am just recording my thought process for anyone else who is interested

);

// CASE WHEN true THEN col("a") ELSE col("b") END --> col("a")
// CASE WHEN true THEN col('a') ELSE col('b') END --> col('a')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was worried about this change, as it seems to potentially change the intent of the test -- to use literals rather than column references.

However, I see the issue is that the DataTypes need to match and after spending some time rewriting these tests to use column references rather than literals I think the literals are fine

@alamb alamb added this pull request to the merge queue Oct 1, 2025
Merged via the queue into apache:main with commit f1246a9 Oct 1, 2025
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify CASE WHEN false THEN ... Expression Simplification forExpr::Case expressions
3 participants