[SPARK-48091][SQL] Preserve aliases inside lambda when ExtractGenerator restructures plan #55892

Open
shrirangmhalgi wants to merge 2 commits into apache:master from shrirangmhalgi:SPARK-48091-explode-transform-alias

@shrirangmhalgi

What changes were proposed in this pull request?

Fix ExtractGenerator to preserve aliases inside lambda functions when restructuring the plan.

Previously, ExtractGenerator called trimNonTopLevelAliases on all expressions in the project list before extracting the generator. This stripped aliases inside lambda functions (e.g., struct(x.as("data"))) before CreateStruct could resolve them into struct field names.

The fix uses trimNonTopLevelAliases only for pattern matching (to detect generators via AliasedGenerator), but preserves the original untrimmed expression for non-generator project items.

Why are the changes needed?

When using explode together with transform in the same select statement, aliases used inside the transformed column's struct() are ignored. Field names become auto-generated (x_1, x_2) instead of the user-specified alias. This only happens with the DataFrame/Dataset API, not with SQL.

Does this PR introduce any user-facing change?

Yes. Struct field aliases inside transform lambdas are now correctly preserved when explode (or any generator) is in the same select.

How was this patch tested?

Added a test in GeneratorFunctionSuite verifying that struct field aliases are preserved when explode and transform are used together, including single and multiple aliases.

Was this patch authored or co-authored using generative AI tooling?

Yes.

…or restructures plan

ExtractGenerator called trimNonTopLevelAliases on all project list items before extracting the generator. This stripped aliases inside lambda functions (e.g., struct(x.as("data"))) before they could be resolved into struct field names by CreateStruct.

The rule now uses trimNonTopLevelAliases only for pattern matching to detect generators, and preserves the original untrimmed expression for non-generator project items.

shrirangmhalgi commented May 15, 2026

@cloud-fan / @dongjoon-hyun / @sarutak could you please review this PR?

cloud-fan left a comment

Prior state and problem. When a project list contains a generator (e.g., explode) alongside transform(arr, x => struct(x.as("data"))), the resulting struct field comes out as col1 instead of data. The root cause is a timing interaction: ExtractGenerator runs at resolution-rule position 530, before ResolveFunctions (532), so the inner struct(...) is still UnresolvedFunction("struct", Seq(Alias(x, "data"))). The pre-PR code applied .map(trimNonTopLevelAliases) to the entire project list before pattern matching. trimAliases has a special case for CreateNamedStruct that preserves alias-carried metadata, but that case requires the expression to already be CreateNamedStruct. While it is still UnresolvedFunction, the generic branch, case other => other.mapChildren(trimAliases), descends into the lambda body and strips Alias(x, "data"). The alias is the only carrier of the name "data" at this stage (the Literal("data") field-name slot inside CreateNamedStruct is produced by CreateStruct.apply later, during ResolveFunctions). Once the alias is stripped, the resolved form becomes CreateNamedStruct(Seq(Literal("col1"), x)).
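
The ordering problem described above can be sketched with a toy expression tree. This is an illustration under stated assumptions, not Catalyst code: the case classes, trimAliases, and resolve below are simplified, hypothetical stand-ins that only borrow the names used in this thread, and the positional fallback name col1 follows the description in this comment.

```scala
// Toy model only: these case classes are hypothetical stand-ins, not Catalyst's.
sealed trait Expr
case class Var(name: String) extends Expr                       // lambda variable, e.g. x
case class Lit(value: String) extends Expr                      // field-name literal
case class Alias(child: Expr, name: String) extends Expr
case class UnresolvedFunction(name: String, args: Seq[Expr]) extends Expr
case class CreateNamedStruct(children: Seq[Expr]) extends Expr  // name, value, name, value, ...

object TrimOrderDemo {
  // Simplified trimAliases: drop every Alias and recurse into children.
  // Name literals inside CreateNamedStruct survive because they are not aliases.
  def trimAliases(e: Expr): Expr = e match {
    case Alias(child, _)             => trimAliases(child)
    case UnresolvedFunction(n, args) => UnresolvedFunction(n, args.map(trimAliases))
    case CreateNamedStruct(children) => CreateNamedStruct(children.map(trimAliases))
    case other                       => other
  }

  // Simplified resolution of struct(...): an aliased argument supplies the field
  // name, anything else falls back to a positional name.
  def resolve(e: Expr): Expr = e match {
    case UnresolvedFunction("struct", args) =>
      CreateNamedStruct(args.zipWithIndex.flatMap {
        case (Alias(child, name), _) => Seq(Lit(name), child)
        case (other, i)              => Seq(Lit(s"col${i + 1}"), other)
      })
    case other => other
  }

  val unresolved: Expr = UnresolvedFunction("struct", Seq(Alias(Var("x"), "data")))

  // Trim before resolution: the alias is gone before its name can be captured.
  val trimThenResolve: Expr = resolve(trimAliases(unresolved))

  // Resolve before trimming: the name already lives in a literal and survives.
  val resolveThenTrim: Expr = trimAliases(resolve(unresolved))

  def main(args: Array[String]): Unit = {
    println(s"trim then resolve: $trimThenResolve")
    println(s"resolve then trim: $resolveThenTrim")
  }
}
```

In this sketch only the order of the two steps differs, which mirrors the point that trimming is harmless once the struct call has been resolved.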

Design approach. Localized workaround in ExtractGenerator's Project case: trim only for AliasedGenerator pattern detection, and splice the original (untrimmed) e into the new project list. CleanupAliases at end-of-analysis still trims later, after ResolveFunctions has captured the alias name.

Concern -- the fix is local, the bug is in trimAliases. The same upfront .map(trimNonTopLevelAliases) exists in the sibling Aggregate-with-generator branch at Analyzer.scala:3211-3253, with the same case (other, idx) shape that propagates the trimmed other. The same struct-field-name regression is reachable for queries that route through that branch. More generally, the root cause is that trimAliases (and via it trimNonTopLevelAliases) descends into unresolved subtrees and strips aliases whose semantic role has not been determined yet -- UnresolvedFunction("struct", ...) is one case, but the pattern is broader.

Would you consider an alternate fix that addresses the timing issue directly in trimAliases? Adding a leading case e if !e.resolved => e clause would (a) cover the Aggregate path without a parallel edit, (b) leave the existing .map(trimNonTopLevelAliases) call sites in place, and (c) protect future callers that hand trimAliases a partially-resolved tree from the same trap. Curious whether you tried that direction and ran into issues, or whether the local fix was preferred for risk containment.

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala Outdated
Comment thread sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala Outdated
…r workaround

Per cloud-fan's suggestion, moved the fix from ExtractGenerator to AliasHelper.trimAliases. Added an UnresolvedFunction skip case to preserve alias children that carry struct field names.

Also fixed ArrayType import nit in test.
@shrirangmhalgi

Thanks for the suggestion! I tried case e if !e.resolved => e first - it broke 5 posexplode alias-chaining tests + 1 AnalysisSuite test. The issue is that posexplode(col).as("a").as("b") creates Alias(Alias(unresolved_generator, "a"), "b") where the inner alias is unresolved but still needs trimming for chained-alias handling to work correctly.

Narrowed it to case u: UnresolvedFunction => u - this targets the actual problem class (unresolved function calls like struct(...) whose alias children carry field names for CreateStruct.apply) without affecting alias trimming on other unresolved expressions. Reverted the local ExtractGenerator workaround - the fix now lives entirely in AliasHelper.trimAliases. Also addressed the ArrayType import nit. Pushed.
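
The difference between the two guards can be sketched on a toy tree. This is a hedged illustration, not the code in the PR: the case classes, trimWith, and trimNonTopLevel are hypothetical simplifications, and UnresolvedGenerator here simply stands in for the unresolved_generator mentioned above.

```scala
// Toy model only: hypothetical stand-ins, not Catalyst classes.
sealed trait Expr { def resolved: Boolean }
case class Var(name: String) extends Expr { val resolved = true }
case class UnresolvedGenerator(name: String) extends Expr { val resolved = false }
case class UnresolvedFunction(name: String, args: Seq[Expr]) extends Expr { val resolved = false }
case class Alias(child: Expr, name: String) extends Expr { def resolved: Boolean = child.resolved }

object GuardDemo {
  // Builds a simplified trimAliases whose leading case returns any subtree
  // matched by `skip` untouched.
  def trimWith(skip: Expr => Boolean): Expr => Expr = {
    def trim(e: Expr): Expr = e match {
      case x if skip(x)                => x
      case Alias(child, _)             => trim(child)
      case UnresolvedFunction(n, args) => UnresolvedFunction(n, args.map(trim))
      case other                       => other
    }
    e => trim(e)
  }

  // Simplified trimNonTopLevelAliases: keep the outermost alias, trim below it.
  def trimNonTopLevel(trim: Expr => Expr)(e: Expr): Expr = e match {
    case Alias(child, name) => Alias(trim(child), name)
    case other              => trim(other)
  }

  val noGuard: Expr => Expr         = trimWith(_ => false)
  val unresolvedGuard: Expr => Expr = trimWith(e => !e.resolved)                   // case e if !e.resolved => e
  val functionGuard: Expr => Expr   = trimWith(_.isInstanceOf[UnresolvedFunction]) // case u: UnresolvedFunction => u

  val structCall: Expr = UnresolvedFunction("struct", Seq(Alias(Var("x"), "data")))
  val chained: Expr    = Alias(Alias(UnresolvedGenerator("posexplode"), "a"), "b")

  def main(args: Array[String]): Unit = {
    for ((label, trim) <- Seq("none" -> noGuard, "!resolved" -> unresolvedGuard, "UnresolvedFunction" -> functionGuard)) {
      println(s"guard=$label struct=${trim(structCall)} chained=${trimNonTopLevel(trim)(chained)}")
    }
  }
}
```

In this toy model both guards keep the alias inside struct(...), but only the UnresolvedFunction guard still collapses the chained alias down to the outer name, which is the behaviour the posexplode tests rely on.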
