Skip to content

[SPARK-52575][SQL] Introduce contextIndependentFoldable attribute for Expressions #51282

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

gengliangwang
Copy link
Member

What changes were proposed in this pull request?

Introduces a method to determine whether an expression can be folded without relying on any external context (e.g., time zone, session configurations, or catalogs). If an expression is context-independent foldable, it can be safely evaluated during DDL operations such as creating tables, views, or constraints. This enables systems to store the computed value rather than the original expression, simplifying implementation and improving performance. By default, expressions are not considered context-independent foldable to ensure explicit annotation of context independence.

Examples of context-independent foldable:

  • 1 + 1: Always produces 2, regardless of context.
  • 1 > 0: Always evaluates to true.
  • CAST('2025-08-01' AS DATE): Always gives the same date value.

Examples of not context-independent foldable:

  • CURRENT_DATE: Result depends on the current date and time zone.
  • CURRENT_USER: Result depends on the session/user context.
  • CAST('2025-08-01 00:00:00' AS TIMESTAMP): Result can vary depending on the session time zone setting

Why are the changes needed?

This allows catalogs and connectors to store the computed value rather than the expression itself, improving both simplicity and performance.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@gengliangwang
Copy link
Member Author

I will create a follow-up to use this new method in the V2 expression conversion

 class V2ExpressionBuilder(e: Expression, isPredicate: Boolean = false) extends Logging  {
 
-  def build(): Option[V2Expression] = generateExpression(e, isPredicate)
+  def build(): Option[V2Expression] = {
+    val exprToBuild = if (e.foldable && e.contextIndependentFoldable) {
+      ConstantFolding.constantFolding(e)
+    } else {
+      e
+    }
+    generateExpression(exprToBuild, isPredicate)
+  }

cc @cloud-fan @beliefer @aokolnychyi

Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be super helpful to simplify expressions before converting them to DSv2. I only have a question about marking expressions context-free by default for all unary and binary expressions.

*
* Default is false to ensure explicit marking of context independence.
*/
def contextIndependentFoldable: Boolean = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, we could consider splitting contextIndependent and foldable into separate methods. Are there use cases where we want to know if the expression is context independent without checking it is foldable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of such special cases. I think we can always check foldable before checking contextIndependentFoldable, which seems very safe to me. The syntax for this method is when it is foldable, does it depend on context?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess some unfoldable expression could be context independent. Such as: 1 + a', the type of attribute a' is int.
But I don't know should we consider these cases.
I'm +1 for @aokolnychyi if we must consider the case I mentioned.
I'm +1 for @gengliangwang if we not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such as: 1 + a', the type of attribute a' is int

@beliefer in such a case, column a' is not contextIndependent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I means the context that is related to the Spark context or Session context. The value of attribute is not.

Copy link
Contributor

@beliefer beliefer Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about
lazy val contextIndependentFoldable: Boolean = children.forall(_. contextIndependentFoldable)
here ?
So we can avoid change everywhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lazy val contextIndependentFoldable: Boolean = children.forall(_. contextIndependentFoldable)
here ?
So we can avoid change everywhere.

@beliefer we are trying to make it accurate. It is easy to miss overrides on the extended expressions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a problem. How is deterministic, _references and throwable works ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beliefer This was already discussed.
Check the discussion in
#51282 (comment)
#51282 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beliefer I am going to merge this PR since this thread is not a blocker. Feel free to follow-up and enable more overrides on contextIndependentFoldable

@@ -493,6 +493,11 @@ case class Cast(

final override def nodePatternsInternal(): Seq[TreePattern] = Seq(CAST)

override def contextIndependentFoldable: Boolean = {
child.contextIndependentFoldable && !DataTypeUtils.containsTimestampOrUDT(dataType) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think UDT matters here. The CAST implementation only allows pass-through of UDT casting, there is no operation and no context is needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, let's be more specific about LTZ

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wait, can't we simply call needsTimeZone here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, updated

*/
def containsTimestampOrUDT(dataType: DataType): Boolean = {
matchesPattern(dataType,
dt => dt.isInstanceOf[TimestampType] || dt.isInstanceOf[UserDefinedType[_]])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can add another function containsUserDefinedType and remove the dt.isInstanceOf[UserDefinedType[_]] from here. It looks weird.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed containsTimestampOrUDT in the latest code. In the current code, the matchesPattern method is only used here. I prefer to make it more general.

…tFoldable
@@ -559,4 +559,100 @@ class LiteralExpressionSuite extends SparkFunSuite with ExpressionEvalHelper {
assert(lit.sql === s"TIME '$str'")
}
}

test("context independent foldable literals") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need these tests? literals are always context independent now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need tests to avoid regressions. Running these tests are fast anyway.

@gengliangwang
Copy link
Member Author

@cloud-fan @aokolnychyi @beliefer thanks for the review, merging to master

haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jul 22, 2025
… Expressions

### What changes were proposed in this pull request?

Introduces a method to determine whether an expression can be folded without relying on any external context (e.g., time zone, session configurations, or catalogs). If an expression is context-independent foldable, it can be safely evaluated during DDL operations such as creating tables, views, or constraints. This enables systems to store the computed value rather than the original expression, simplifying implementation and improving performance. By default, expressions are not considered context-independent foldable to ensure explicit annotation of context independence.

**Examples of context-independent foldable:**
* 1 + 1: Always produces 2, regardless of context.
* 1 > 0: Always evaluates to true.
* CAST('2025-08-01' AS DATE): Always gives the same date value.

**Examples of not context-independent foldable:**

* CURRENT_DATE: Result depends on the current date and time zone.
* CURRENT_USER: Result depends on the session/user context.
* CAST('2025-08-01 00:00:00' AS TIMESTAMP): Result can vary depending on the session time zone setting

### Why are the changes needed?

This allows catalogs and connectors to store the computed value rather than the expression itself, improving both simplicity and performance.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests
### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#51282 from gengliangwang/contextIndependentFoldable.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
gengliangwang added a commit that referenced this pull request Jul 22, 2025
… context-independent-foldable

### What changes were proposed in this pull request?

If the input to V2 expression translation is context-independent and foldable (see [PR #51282](#51282)), we perform constant folding on the input and use the evaluated result.

 For example:

* `1 + 1` becomes 2
* `a < log2(8)` becomes a < 3.0

### Why are the changes needed?

This change broadens the coverage of V2 expression translation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UT
### Was this patch authored or co-authored using generative AI tooling?

No

Closes #51569 from gengliangwang/v2Fold.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants