[SPARK-54172][SQL] Merge Into Schema Evolution should only add referenced columns #52866

szehon-ho · 2025-11-04T01:36:12Z

What changes were proposed in this pull request?

Narrow the scope of MERGE INTO schema evolution: only add the columns and nested fields that the MERGE INTO query references in its UPDATE or INSERT actions.

Why are the changes needed?

#51698 added schema evolution support for MERGE INTO statements, but its scope is too broad. The source table may have many more fields than the target table, while the user may need only a few of them added to the target for a given MERGE INTO statement.
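To make the intended scope concrete, here is a minimal sketch in plain Scala, not the PR's implementation. It assumes a flat schema modeled as a list of column names, and the object `ReferencedColumnsSketch` and the function `columnsToAdd` are hypothetical names introduced only for illustration.

```scala
// Hypothetical, simplified model: a schema is an ordered list of column names.
object ReferencedColumnsSketch {

  /** Columns that schema evolution would add under the narrowed scope:
    * source columns that the merge actions reference and that the target lacks.
    */
  def columnsToAdd(
      targetColumns: Seq[String],
      sourceColumns: Seq[String],
      referencedColumns: Set[String]): Seq[String] =
    sourceColumns.filter { c =>
      referencedColumns.contains(c) && !targetColumns.contains(c)
    }
}
```

With a target of `pk, dep` and a source of `pk, dep, salary, bonus, region`, referencing only `salary` yields just `salary`, and referencing only `dep` yields nothing, so no evolution would be needed in that case.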

Does this PR introduce any user-facing change?

No. MERGE INTO schema evolution has not been released yet; it is planned for Spark 4.1.

How was this patch tested?

Added unit tests to MergeIntoTableSuiteBase.

Was this patch authored or co-authored using generative AI tooling?

No

szehon-ho · 2025-11-04T20:02:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

              notMatchedActions = newNotMatchedActions,
-              notMatchedBySourceActions = newNotMatchedBySourceActions)
+              notMatchedBySourceActions = newNotMatchedBySourceActions,
+              originalSourceActions = newMatchedActions ++ newNotMatchedActions)


Note: because the matched and not-matched actions are rewritten by the ResolveRowLevelCommandAssignments rule, I need to preserve the original user actions/assignments so that MergeIntoTable.referencedSourceSchema and needsSchemaEvolution are idempotent.
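The idempotence concern in the note above can be sketched with a toy model. Everything here is an assumption made for illustration: `Assignment`, `Merge`, and `alignAssignments` are hypothetical stand-ins, and `alignAssignments` only mimics the general idea of a rule rewriting assignments, not the real ResolveRowLevelCommandAssignments.

```scala
object OriginalActionsSketch {
  // Hypothetical stand-ins for a merge assignment and the merge plan node.
  final case class Assignment(targetColumn: String, value: String)
  final case class Merge(
      targetColumns: Seq[String],
      actions: Seq[Assignment],         // may be rewritten by analyzer rules
      originalActions: Seq[Assignment]) // what the user actually wrote

  // Stand-in rule: produce exactly one assignment per existing target column,
  // defaulting to the column's current value. User assignments to columns
  // that the target does not have yet disappear from `actions`.
  def alignAssignments(m: Merge): Merge = {
    val byColumn = m.actions.map(a => a.targetColumn -> a).toMap
    val aligned =
      m.targetColumns.map(c => byColumn.getOrElse(c, Assignment(c, s"t.$c")))
    m.copy(actions = aligned)
  }

  // Derived from the preserved originals, so rewriting `actions` cannot change it.
  def referencedColumns(m: Merge): Set[String] =
    m.originalActions.map(_.targetColumn).toSet
}
```

In this sketch, `referencedColumns` returns the same set before and after `alignAssignments`, whereas a version computed from `actions` would lose the user's reference to a new column once the rule has run.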

let's turn it into a code comment.

shall we keep referencedSourceSchema directly?

Hm, I tried referencedSchema, like:

def apply(
    targetTable: LogicalPlan,
    sourceTable: LogicalPlan,
    mergeCondition: Expression,
    matchedActions: Seq[MergeAction],
    notMatchedActions: Seq[MergeAction],
    notMatchedBySourceActions: Seq[MergeAction],
    withSchemaEvolution: Boolean): MergeIntoTable = {
  MergeIntoTable(
    targetTable,
    sourceTable,
    mergeCondition,
    matchedActions,
    notMatchedActions,
    notMatchedBySourceActions,
    withSchemaEvolution,
    referencedSourceSchema(
      matchedActions ++ notMatchedActions,
      sourceTable.schema))
}

However, it tries to call the schema method a bit too early, before the "stars" are resolved:

[INTERNAL_ERROR] Invalid call to toAttribute on unresolved object SQLSTATE: XX000
org.apache.spark.sql.catalyst.analysis.UnresolvedException: [INTERNAL_ERROR] Invalid call to toAttribute on unresolved object SQLSTATE: XX000
    at org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:460)
    at org.apache.spark.sql.catalyst.analysis.Star.toAttribute$(unresolved.scala:460)
    at org.apache.spark.sql.catalyst.analysis.UnresolvedStar.toAttribute(unresolved.scala:864)
    at org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:75)
    at scala.collection.immutable.List.map(List.scala:236)
    at scala.collection.immutable.List.map(List.scala:79)
    at org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:75)
    at org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias.output(basicLogicalOperators.scala:1739)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$_schema$1(QueryPlan.scala:467)
    at org.apache.spark.util.BestEffortLazyVal.apply(BestEffortLazyVal.scala:53)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:464)
    at org.apache.spark.sql.catalyst.plans.logical.MergeIntoTable$.apply(v2Commands.scala:939)
    at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitMergeIntoTable$1(AstBuilder.scala:1152)
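The failure above comes from asking for the source schema at construction time, while the plan still contains an unresolved star. A rough sketch of the difference between eager and deferred evaluation follows; `Source` and `Merge` are hypothetical toy classes, and this is one possible way to avoid the early call rather than a description of what the PR does.

```scala
object LazySchemaSketch {
  // Hypothetical plan node: asking for the schema fails until "*" is expanded.
  final case class Source(columns: Seq[String]) {
    def schema: Seq[String] =
      if (columns.contains("*")) throw new IllegalStateException("unresolved star")
      else columns
  }

  final class Merge(val source: Source, val referenced: Set[String]) {
    // Deferred: evaluated on first access, not while the node is being built,
    // so a parser can construct the node before the source is resolved.
    lazy val referencedSourceSchema: Seq[String] =
      source.schema.filter(referenced.contains)
  }
}
```

Constructing `Merge` over an unresolved source succeeds here; only reading `referencedSourceSchema` before resolution fails, which is the point at which a caller can reasonably be expected to have a resolved plan.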

szehon-ho · 2025-11-04T20:05:42Z

@cloud-fan @aokolnychyi can you take a look? I think this is an important improvement to get in before we release the MERGE INTO WITH SCHEMA EVOLUTION feature in Spark 4.1. Thanks!

cloud-fan · 2025-11-04T23:48:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala

+      case _ => false
+    }
+
+    def filterSchema(sourceSchema: StructType, basePath: Seq[String]): StructType =


@viirya do you know of any existing util functions from nested column pruning that do this work?

No, this looks specific to merge actions, right?
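For readers following the filterSchema discussion, here is a self-contained sketch of pruning a nested schema down to referenced field paths. It deliberately uses a tiny hypothetical schema model (`Leaf`, `Struct`, `Field`) instead of Spark's StructType, and the `referenced` parameter is an assumption added for illustration; the signature in the diff takes only the schema and a base path.

```scala
object FilterSchemaSketch {
  // Hypothetical minimal schema model, standing in for StructType/StructField.
  sealed trait DataType
  case object Leaf extends DataType
  final case class Struct(fields: Seq[Field]) extends DataType
  final case class Field(name: String, dataType: DataType)

  /** Keeps only the fields that are referenced, or that contain a referenced
    * nested field. `basePath` is the path from the root to `schema`.
    */
  def filterSchema(
      schema: Struct,
      referenced: Set[Seq[String]],
      basePath: Seq[String] = Nil): Struct =
    Struct(schema.fields.flatMap { field =>
      val path = basePath :+ field.name
      field.dataType match {
        // The field itself is referenced: keep it with everything underneath.
        case _ if referenced.contains(path) => Some(field)
        // Part of the struct may be referenced: recurse, keep if anything survives.
        case nested: Struct =>
          val kept = filterSchema(nested, referenced, path)
          if (kept.fields.nonEmpty) Some(field.copy(dataType = kept)) else None
        // Unreferenced leaf: drop it.
        case _ => None
      }
    })
}
```

Given a source with `pk`, `info.salary`, `info.status`, and `region`, referencing `pk` and `info.salary` keeps `pk` and an `info` struct that contains only `salary`.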

cloud-fan · 2025-11-05T04:43:00Z

sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala

+             |USING source s
+             |ON t.pk = s.pk
+             |WHEN MATCHED THEN
+             | UPDATE SET dep='software'


This test is weird: dep is an existing column in the target table, so we certainly do not need schema evolution. What was the behavior before this PR?

Oh, it's because the source table has more columns, but they are not used.

cloud-fan · 2025-11-05T18:37:00Z

sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala

+             |ON t.pk = s.pk
+             |WHEN NOT MATCHED THEN
+             | INSERT (pk, info, dep) VALUES (s.pk,
+             |   named_struct('salary', s.info.salary, 'status', 'active'), 'marketing')


why do we trigger schema evolution for this case?

[SPARK-54172][SQL] Merge Into Schema Evolution should only add refere…

3daaf68

…nced columns

github-actions bot added the SQL label Nov 4, 2025

szehon-ho force-pushed the merge_schema_evolution_limit_cols branch 3 times, most recently from 41731d2 to 6c6de51 Compare November 4, 2025 20:02

Refactor and add more test

24b1a51

szehon-ho force-pushed the merge_schema_evolution_limit_cols branch from 6c6de51 to 24b1a51 Compare November 4, 2025 20:06

cloud-fan reviewed Nov 4, 2025

cloud-fan reviewed Nov 5, 2025
