-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-52575][SQL] Introduce contextIndependentFoldable attribute for Expressions #51282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-52575][SQL] Introduce contextIndependentFoldable attribute for Expressions #51282
Conversation
I will create a follow-up to use this new method in the V2 expression conversion
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be super helpful to simplify expressions before converting them to DSv2. I only have a question about marking expressions context-free by default for all unary and binary expressions.
* | ||
* Default is false to ensure explicit marking of context independence. | ||
*/ | ||
def contextIndependentFoldable: Boolean = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, we could consider splitting contextIndependent
and foldable
into separate methods. Are there use cases where we want to know if the expression is context independent without checking it is foldable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't think of such special cases. I think we can always check foldable
before checking contextIndependentFoldable
, which seems very safe to me. The syntax for this method is when it is foldable, does it depend on context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess some unfoldable expression could be context independent. Such as: 1 + a', the type of attribute a' is int.
But I don't know should we consider these cases.
I'm +1 for @aokolnychyi if we must consider the case I mentioned.
I'm +1 for @gengliangwang if we not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Such as: 1 + a', the type of attribute a' is int
@beliefer in such a case, column a'
is not contextIndependent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I means the context that is related to the Spark context or Session context. The value of attribute is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
lazy val contextIndependentFoldable: Boolean = children.forall(_. contextIndependentFoldable)
here ?
So we can avoid change everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lazy val contextIndependentFoldable: Boolean = children.forall(_. contextIndependentFoldable)
here ?
So we can avoid change everywhere.
@beliefer we are trying to make it accurate. It is easy to miss overrides on the extended expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's a problem. How is deterministic
, _references
and throwable
works ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@beliefer This was already discussed.
Check the discussion in
#51282 (comment)
#51282 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@beliefer I am going to merge this PR since this thread is not a blocker. Feel free to follow-up and enable more overrides on contextIndependentFoldable
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/DataTypeUtils.scala
Outdated
Show resolved
Hide resolved
@@ -493,6 +493,11 @@ case class Cast( | |||
|
|||
final override def nodePatternsInternal(): Seq[TreePattern] = Seq(CAST) | |||
|
|||
override def contextIndependentFoldable: Boolean = { | |||
child.contextIndependentFoldable && !DataTypeUtils.containsTimestampOrUDT(dataType) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think UDT matters here. The CAST implementation only allows pass-through of UDT casting, there is no operation and no context is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, let's be more specific about LTZ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wait, can't we simply call needsTimeZone
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, updated
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Outdated
Show resolved
Hide resolved
*/ | ||
def containsTimestampOrUDT(dataType: DataType): Boolean = { | ||
matchesPattern(dataType, | ||
dt => dt.isInstanceOf[TimestampType] || dt.isInstanceOf[UserDefinedType[_]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add another function containsUserDefinedType
and remove the dt.isInstanceOf[UserDefinedType[_]]
from here. It looks weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed containsTimestampOrUDT in the latest code. In the current code, the matchesPattern
method is only used here. I prefer to make it more general.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
Outdated
Show resolved
Hide resolved
…tFoldable
@@ -559,4 +559,100 @@ class LiteralExpressionSuite extends SparkFunSuite with ExpressionEvalHelper { | |||
assert(lit.sql === s"TIME '$str'") | |||
} | |||
} | |||
|
|||
test("context independent foldable literals") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still need these tests? literals are always context independent now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need tests to avoid regressions. Running these tests are fast anyway.
@cloud-fan @aokolnychyi @beliefer thanks for the review, merging to master |
… Expressions ### What changes were proposed in this pull request? Introduces a method to determine whether an expression can be folded without relying on any external context (e.g., time zone, session configurations, or catalogs). If an expression is context-independent foldable, it can be safely evaluated during DDL operations such as creating tables, views, or constraints. This enables systems to store the computed value rather than the original expression, simplifying implementation and improving performance. By default, expressions are not considered context-independent foldable to ensure explicit annotation of context independence. **Examples of context-independent foldable:** * 1 + 1: Always produces 2, regardless of context. * 1 > 0: Always evaluates to true. * CAST('2025-08-01' AS DATE): Always gives the same date value. **Examples of not context-independent foldable:** * CURRENT_DATE: Result depends on the current date and time zone. * CURRENT_USER: Result depends on the session/user context. * CAST('2025-08-01 00:00:00' AS TIMESTAMP): Result can vary depending on the session time zone setting ### Why are the changes needed? This allows catalogs and connectors to store the computed value rather than the expression itself, improving both simplicity and performance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#51282 from gengliangwang/contextIndependentFoldable. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Gengliang Wang <[email protected]>
… context-independent-foldable ### What changes were proposed in this pull request? If the input to V2 expression translation is context-independent and foldable (see [PR #51282](#51282)), we perform constant folding on the input and use the evaluated result. For example: * `1 + 1` becomes 2 * `a < log2(8)` becomes a < 3.0 ### Why are the changes needed? This change broadens the coverage of V2 expression translation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #51569 from gengliangwang/v2Fold. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Gengliang Wang <[email protected]>
What changes were proposed in this pull request?
Introduces a method to determine whether an expression can be folded without relying on any external context (e.g., time zone, session configurations, or catalogs). If an expression is context-independent foldable, it can be safely evaluated during DDL operations such as creating tables, views, or constraints. This enables systems to store the computed value rather than the original expression, simplifying implementation and improving performance. By default, expressions are not considered context-independent foldable to ensure explicit annotation of context independence.
Examples of context-independent foldable:
Examples of not context-independent foldable:
Why are the changes needed?
This allows catalogs and connectors to store the computed value rather than the expression itself, improving both simplicity and performance.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
No