[SPARK-52095][SQL] Alter table alter column to pass V2Expression to DSV2 #50864
Conversation
@@ -233,4 +246,13 @@ case class DefaultValueExpression(
    case _ =>
      throw QueryCompilationErrors.defaultValueNotConstantError(statement, colName, originalSQL)
  }

  // Convert the default expression to DefaultValue, which is required by DS v2 APIs.
  def toV2CurrentDefault(statement: String, colName: String): DefaultValue = child match {
"Alter table alter column" doesnt change 'exists default', so we can use the narrower 'DefaultValue' class here that doesnt have that field.
      } else if (action.dropDefault != null) {
-       Some("")
+       Some(DefaultValueExpression(Literal(""), ""))
shall we use DefaultValueExpression(null, "") to indicate dropping default?
sure, I was initially thinking Literal("") to match the "" that was used before, but that makes sense too
hm, I think in DefaultValueExpression, child is not an Option, so it's expected to always be set?
Cannot invoke "org.apache.spark.sql.catalyst.trees.TreeNode.exists(scala.Function1)" because "x$6" is null
java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.catalyst.trees.TreeNode.exists(scala.Function1)" because "x$6" is null
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$exists$1(TreeNode.scala:240)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$exists$1$adapted(TreeNode.scala:240)
at scala.collection.IterableOnceOps.exists(IterableOnce.scala:647)
at scala.collection.IterableOnceOps.exists$(IterableOnce.scala:644)
at scala.collection.AbstractIterable.exists(Iterable.scala:935)
at org.apache.spark.sql.catalyst.trees.TreeNode.exists(TreeNode.scala:240)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$exists$1(TreeNode.scala:240)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$exists$1$adapted(TreeNode.scala:240)
at scala.collection.immutable.List.exists(List.scala:396)
at org.apache.spark.sql.catalyst.trees.TreeNode.exists(TreeNode.scala:240)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$isLazyAnalysis$2(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$isLazyAnalysis$2$adapted(QueryExecution.scala:74)
at scala.collection.immutable.List.exists(List.scala:396)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$isLazyAnalysis$1(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$isLazyAnalysis$1$adapted(QueryExecution.scala:74)
at org.apache.spark.sql.catalyst.trees.TreeNode.exists(TreeNode.scala:237)
at org.apache.spark.sql.execution.QueryExecution.isLazyAnalysis$lzycompute(QueryExecution.scala:74)
at org.apache.spark.sql.execution.QueryExecution.isLazyAnalysis(QueryExecution.scala:72)
at org.apache.spark.sql.classic.Dataset$.$anonfun$ofRows$5(Dataset.scala:139)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
at org.apache.spark.sql.classic.Dataset$.ofRows(Dataset.scala:136)
at org.apache.spark.sql.classic.SparkSession.$anonfun$sql$4(SparkSession.scala:570)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:804)
oh, how about Literal(null)? My point is that Literal("") means an empty string literal, which is "''", not "".
ok, done
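For reference, a small illustration of the distinction being discussed, using plain catalyst literals (not code from this PR):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal

// Literal("") is an empty-string literal, i.e. the SQL value ''.
val emptyString = Literal("")
emptyString.sql // "''"

// Literal(null) is an untyped null literal, i.e. SQL NULL.
val nullLiteral = Literal(null)
nullLiteral.sql // "NULL"
```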
@@ -176,6 +177,18 @@ object ColumnDefinition {
      }
    }

    case cmd: AlterColumns if cmd.specs.exists(_.newDefaultExpression.isDefined) =>
      // Wrap analysis errors for default values in a more user-friendly message.
not related to this PR but we can probably do the same error message improvement for V2CreateTablePlan
  override protected def withNewChildrenInternal(
      newChildren: IndexedSeq[Expression]): Expression = {
    val newColumn = newChildren(0).asInstanceOf[FieldName]
    val newPosition = newChildren collectFirst {
we can avoid instance checking:

    val newPos = if (newPosition.isDefined) {
      Some(newChildren(1).asInstanceOf[FieldPosition])
    } else {
      None
    }
    val newDefault = if (newDefaultExpression.isDefined) {
      Some(newChildren.last.asInstanceOf[DefaultValueExpression])
    } else {
      None
    }
nice idea, done
also, the latest change uses the new exception type added in #50879
   * is not provided as updating column default values does not need to back-fill existing data.
   * Empty string and Null literal means dropping the column default value.
   */
  public DefaultValue newModelDefaultValue() { return newDefaultValue; }
Why not interpret null for DefaultValue as drop?
Looks like for the original ALTER COLUMN, a null string meant no default was set; it was different from "" meaning drop default.
I was trying to be consistent with that. I could take a look, but it would be more code to check if it is "" and set it to null.
Regarding "null string meant no default was set": shouldn't we avoid generating UpdateColumnDefaultValue at all in that case?
yes, that's the current mechanism here:
if it's null, we don't make UpdateColumnDefaultValue. If it's a drop, we make it with "". Else, we make it with the user-specified value.
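In other words, something like the following hedged sketch of the mechanism described above (names are illustrative, not the exact PR code):

```scala
import org.apache.spark.sql.connector.catalog.TableChange

// Map the default-related part of an ALTER COLUMN action to at most one table change:
// nothing specified -> no UpdateColumnDefaultValue at all; DROP DEFAULT -> the legacy
// "" marker; SET DEFAULT <expr> -> the user-specified default.
def defaultValueChange(
    fieldNames: Array[String],
    setDefaultSql: Option[String],
    dropDefault: Boolean): Option[TableChange] = {
  setDefaultSql match {
    case Some(sql)           => Some(TableChange.updateColumnDefaultValue(fieldNames, sql))
    case None if dropDefault => Some(TableChange.updateColumnDefaultValue(fieldNames, ""))
    case None                => None
  }
}
```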
@cloud-fan, do you by any chance remember the context on why we used "" instead of null?
Discussed a bit offline: we can probably change the catalyst side and add a boolean/enum to AlterColumnSpec to indicate whether it's Drop or Unchanged.
On the DSV2 side, we should ideally have no UpdateColumnDefaultValue for the unchanged case, have the new UpdateColumnDefaultValue API return null for the drop case, and have it return the default if there actually is one. Returning DefaultValue(Literal(null), "") is a bit hard for connectors to interpret.
We will preserve the old UpdateColumnDefaultValue string API returning "" for backward compatibility, though.
@cloud-fan if you have thoughts
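A rough sketch of what that proposal would look like from a connector's point of view (method names follow the discussion above and the new API in this PR; treat the details as assumptions rather than the final signature):

```scala
import org.apache.spark.sql.connector.catalog.TableChange.UpdateColumnDefaultValue

// Unchanged default: Spark emits no UpdateColumnDefaultValue at all, so this is never called.
def applyDefaultChange(update: UpdateColumnDefaultValue): Unit = {
  Option(update.newCurrentDefault()) match {
    case None =>
      // DROP DEFAULT: the new V2 API returns null
      // (the legacy update.newDefaultValue() string API still returns "" for compatibility).
      println(s"drop default on ${update.fieldNames.mkString(".")}")
    case Some(default) =>
      // SET DEFAULT: the DefaultValue carries the SQL text and the V2 expression.
      println(s"set default ${default.getSql} on ${update.fieldNames.mkString(".")}")
  }
}
```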
   * is not provided as updating column default values does not need to back-fill existing data.
   * Empty string and Null literal means dropping the column default value.
   */
  public DefaultValue newModelDefaultValue() { return newDefaultValue; }
I am not sure I like the name here. I get that the ideal name newDefaultValue is already taken, but let's brainstorm a bit more. What about any of these?
newValue()
updatedDefaultValue()
newColumnDefaultValue()
newCurrentDefaultValue()
Any other suggestions, @szehon-ho @cloud-fan @gengliangwang?
yeah, agree it's not the greatest name. Renamed it newCurrentDefault; let me know if you have a strong opinion otherwise.
      } else if (action.dropDefault != null) {
-       Some("")
+       Some(DefaultValueExpression(Literal(null), ""))
This seems a bit fragile. I wonder whether there is a better way to handle it. We can have an enum or something with unchanged, drop, and set options. It only makes sense, however, if we want to pass null for DefaultValue to indicate drop in the public connector API.
Fixed in the latest PR: added a boolean dropDefault to AlterColumnSpec.
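For context, a hedged sketch of the resulting shape (field list, types, and package paths are from memory and only for illustration; the real class carries more and may differ):

```scala
import org.apache.spark.sql.catalyst.analysis.{FieldName, FieldPosition}
import org.apache.spark.sql.catalyst.plans.logical.DefaultValueExpression
import org.apache.spark.sql.types.DataType

// Sketch only: the point is the dedicated dropDefault flag, which replaces the
// DefaultValueExpression(Literal(null), "") sentinel questioned above. "Unchanged"
// is newDefaultExpression == None && !dropDefault; "drop" is dropDefault == true;
// "set" is newDefaultExpression == Some(...).
case class AlterColumnSpecSketch(
    column: FieldName,
    newDataType: Option[DataType] = None,
    newNullability: Option[Boolean] = None,
    newComment: Option[String] = None,
    newPosition: Option[FieldPosition] = None,
    newDefaultExpression: Option[DefaultValueExpression] = None,
    dropDefault: Boolean = false)
```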
@@ -726,22 +746,29 @@ public String[] fieldNames() {
     * is not provided as updating column default values does not need to back-fill existing data.
     * Empty string means dropping the column default value.
     */
-   public String newDefaultValue() { return newDefaultValue; }
+   public String newDefaultValue() {
shall we mark this function as deprecated and encourage users to call newCurrentDefault?
     * is not provided as updating column default values does not need to back-fill existing data.
     * Null means dropping the column default value.
     */
    public DefaultValue newCurrentDefault() { return newCurrentDefault; }

    @Override
    public boolean equals(Object o) {
did you generate equals and hashCode using IDEA?
yes, what's the recommendation?
ah sorry, I used the wrong template in IntelliJ, fixed
What changes were proposed in this pull request?
"Alter table alter column" to pass in V2 Expression to DSV2.
Like the similar changes (#50593) and (#50701), the existing logic is rewritten to use the main Analyzer loop to get this expression, instead of manual call to ResolveDefaultColumns to analyze.
We enhance the UpdateColumnValue (TableChanges API) to return DefaultValue (which contains the V2 Expression), in addition to the existing API returning String representation of the default value.
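As a rough illustration of what this buys a connector (a sketch assuming DefaultValue exposes getSql() and getExpression() accessors; not code from this PR):

```scala
import org.apache.spark.sql.connector.catalog.DefaultValue
import org.apache.spark.sql.connector.expressions.{Expression => V2Expression}

// Previously a connector only got the SQL string and had to re-parse it itself.
// With this change it can read the analyzed V2 expression directly and translate
// it into its own default-value model, falling back to the SQL text if needed.
def translateDefault(default: DefaultValue): Either[String, V2Expression] = {
  Option(default.getExpression) match {
    case Some(expr) => Right(expr)
    case None       => Left(default.getSql)
  }
}
```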
Why are the changes needed?
DSV2 connectors (for example, Iceberg or Delta) should get a modeled V2 Expression to set the default value.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added tests in DataSourceV2DataFrameSuite
Was this patch authored or co-authored using generative AI tooling?
No