Skip to content

[SPARK-52095][SQL] Alter table alter column to pass V2Expression to DSV2 #50864

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

szehon-ho
Copy link
Contributor

What changes were proposed in this pull request?

"Alter table alter column" to pass in V2 Expression to DSV2.
Like the similar changes (#50593) and (#50701), the existing logic is rewritten to use the main Analyzer loop to get this expression, instead of manual call to ResolveDefaultColumns to analyze.

We enhance the UpdateColumnValue (TableChanges API) to return DefaultValue (which contains the V2 Expression), in addition to the existing API returning String representation of the default value.

Why are the changes needed?

DSV2 (example, Iceberg/Delta) should get modeled V2 Expression to set default value.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added tests in DataSourceV2DataFrameSuite

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label May 13, 2025
@@ -233,4 +246,13 @@ case class DefaultValueExpression(
case _ =>
throw QueryCompilationErrors.defaultValueNotConstantError(statement, colName, originalSQL)
}

// Convert the default expression to DefaultValue, which is required by DS v2 APIs.
def toV2CurrentDefault(statement: String, colName: String): DefaultValue = child match {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Alter table alter column" doesnt change 'exists default', so we can use the narrower 'DefaultValue' class here that doesnt have that field.

@szehon-ho szehon-ho force-pushed the replace_col_default_val branch 2 times, most recently from 824bd0d to b96182e Compare May 13, 2025 05:38
} else if (action.dropDefault != null) {
Some("")
Some(DefaultValueExpression(Literal(""), ""))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we use DefaultValueExpression(null, "") to indicate dropping default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I was initially thinking Literal("") to match "" that was used before, but that makes sense too

Copy link
Contributor Author

@szehon-ho szehon-ho May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, i think in DefaultValueExpression, child is not Option, so its expected to be always set?

Cannot invoke "org.apache.spark.sql.catalyst.trees.TreeNode.exists(scala.Function1)" because "x$6" is null
java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.catalyst.trees.TreeNode.exists(scala.Function1)" because "x$6" is null
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$exists$1(TreeNode.scala:240)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$exists$1$adapted(TreeNode.scala:240)
	at scala.collection.IterableOnceOps.exists(IterableOnce.scala:647)
	at scala.collection.IterableOnceOps.exists$(IterableOnce.scala:644)
	at scala.collection.AbstractIterable.exists(Iterable.scala:935)
	at org.apache.spark.sql.catalyst.trees.TreeNode.exists(TreeNode.scala:240)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$exists$1(TreeNode.scala:240)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$exists$1$adapted(TreeNode.scala:240)
	at scala.collection.immutable.List.exists(List.scala:396)
	at org.apache.spark.sql.catalyst.trees.TreeNode.exists(TreeNode.scala:240)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$isLazyAnalysis$2(QueryExecution.scala:74)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$isLazyAnalysis$2$adapted(QueryExecution.scala:74)
	at scala.collection.immutable.List.exists(List.scala:396)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$isLazyAnalysis$1(QueryExecution.scala:74)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$isLazyAnalysis$1$adapted(QueryExecution.scala:74)
	at org.apache.spark.sql.catalyst.trees.TreeNode.exists(TreeNode.scala:237)
	at org.apache.spark.sql.execution.QueryExecution.isLazyAnalysis$lzycompute(QueryExecution.scala:74)
	at org.apache.spark.sql.execution.QueryExecution.isLazyAnalysis(QueryExecution.scala:72)
	at org.apache.spark.sql.classic.Dataset$.$anonfun$ofRows$5(Dataset.scala:139)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
	at org.apache.spark.sql.classic.Dataset$.ofRows(Dataset.scala:136)
	at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$4(SparkSession.scala:570)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, how about Literal(null)? My point is that Literal("") means an empty string literal, which is "''", not ""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done

@@ -176,6 +177,18 @@ object ColumnDefinition {
}
}

case cmd: AlterColumns if cmd.specs.exists(_.newDefaultExpression.isDefined) =>
// Wrap analysis errors for default values in a more user-friendly message.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to this PR but we can probably do the same error message improvement for V2CreateTablePlan

override protected def withNewChildrenInternal(
newChildren: IndexedSeq[Expression]): Expression = {
val newColumn = newChildren(0).asInstanceOf[FieldName]
val newPosition = newChildren collectFirst {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can avoid instance checking

val newPos = if (newPosition.isDefined) {
  Some(newChildren(1).asInstanceOf[FieldPosition])
} else {
  None
}
val newDefault = if (newDefaultExpression.isDefined) {
  Some(newChildren.last.asInstanceOf[DefaultValueExpression])
} else {
  None
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice idea, done

@szehon-ho szehon-ho force-pushed the replace_col_default_val branch from 95a159e to 0341236 Compare May 14, 2025 18:55
@szehon-ho
Copy link
Contributor Author

szehon-ho commented May 14, 2025

also, latest change uses the new exception type added in #50879

@szehon-ho szehon-ho force-pushed the replace_col_default_val branch from 0341236 to 492122f Compare May 14, 2025 19:01
@szehon-ho szehon-ho force-pushed the replace_col_default_val branch from 492122f to f53ecec Compare May 14, 2025 21:15
* is not provided as updating column default values does not need to back-fill existing data.
* Empty string and Null literal means dropping the column default value.
*/
public DefaultValue newModelDefaultValue() { return newDefaultValue; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not interpret null for DefaultValue as drop?

Copy link
Contributor Author

@szehon-ho szehon-ho May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like for original ALTER COLUMN, null string meant no default was set, it was different than "" meaning drop default.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L5246

I was trying to be consistent with that, I could take a look, but it would be more code to check if we are "" and set null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null string meant no default was set

Shouldn't we avoid generating UpdateColumnDefaultValue at all in that case?

Copy link
Contributor Author

@szehon-ho szehon-ho May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes thats the current mechanism here:

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2AlterTableCommands.scala#L245

if its null, we dont make UpdateColumnDefaultValue. If drop , we make it with "". Else, we make it with the user-specified value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan, do you by any chance remember the context on why we used "" instead of null?

Copy link
Contributor Author

@szehon-ho szehon-ho May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed a bit offline, we can probably change catalyst side and add a boolean/enum to AlterColumnSpec to indicate whether its Drop or Unchanged.

In DSV2 side, we should ideally have no UpdateColumnDefaultValue for unchanged, have UpdateColumnDefaultValue new API return null for Drop case, and have it return the default if there's actually one. Returning DefaultValue(Literal(null), "") is a bit hard for connectors

We will preserve the old UpdateColumnDefaultValue string API to return "" though for backward compatibility.

@cloud-fan if you have thoughts

* is not provided as updating column default values does not need to back-fill existing data.
* Empty string and Null literal means dropping the column default value.
*/
public DefaultValue newModelDefaultValue() { return newDefaultValue; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I like the name here. I get that the ideal name newDefaultValue is already taken, but let's brainstorm a bit more. What about any of these?

  • newValue()
  • updatedDefaultValue()
  • newColumnDefaultValue()
  • newCurrentDefaultValue()

Any other suggestions, @szehon-ho @cloud-fan @gengliangwang?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea agree its not the greatest name. Renamed it newCurrentDefault, let me know if strong opinion otherwise.

} else if (action.dropDefault != null) {
Some("")
Some(DefaultValueExpression(Literal(null), ""))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit fragile. I wonder whether there is a better way to handle it. We can have an enum or something with unchanged, drop, and set options. It only makes sense, however, if we want to pass null for DefaultValue to indicate drop in the public connector API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the latest pr, add a boolean dropDefault in AlterColumnSpec

@@ -726,22 +746,29 @@ public String[] fieldNames() {
* is not provided as updating column default values does not need to back-fill existing data.
* Empty string means dropping the column default value.
*/
public String newDefaultValue() { return newDefaultValue; }
public String newDefaultValue() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we mark this function as deprecated and encourage users to call newCurrentDefault?

* is not provided as updating column default values does not need to back-fill existing data.
* Null means dropping the column default value.
*/
public DefaultValue newCurrentDefault() { return newCurrentDefault; }

@Override
public boolean equals(Object o) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you generate equals and hashCode using IDEA?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, whats the recommendation?

Copy link
Contributor Author

@szehon-ho szehon-ho May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah sorry i used wrong template in Intellij, fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants