-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Fix struct casts to align fields by name (prevent positional mis-casts) #19674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
kosiew
wants to merge
11
commits into
apache:main
Choose a base branch
from
kosiew:struct-casting-17285
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Route ColumnarValue::cast_to through a name-based struct casting path for both array and scalar struct values. Introduce a helper to reorder struct children by target field names, insert nulls for missing fields, and recursively cast each child with Arrow options. Add unit tests to verify struct field reordering and null-filling for missing fields when casting between struct schemas.
…::cast_to Add comprehensive documentation explaining that struct casting uses field name matching rather than positional matching. This clarifies the behavior change for struct types while preserving existing documentation for other types. Addresses PR review recommendation #4 about documenting public API changes.
Replace .clone() with Arc::clone() to address clippy warning about clone_on_ref_ptr. This makes the ref-counting operation explicit and follows Rust best practices. Fixes clippy error from rust_lint.sh.
## Problem The PR to fix struct casting (issue apache#14396) introduced regressions where struct casting with field additions/removals was failing, and name-based field matching wasn't working correctly in all scenarios. ## Root Causes Identified 1. **Field index mismatch**: cast_struct_array_by_name was using field indices from the DataType instead of the actual StructArray, causing wrong column access when field names didn't match physical layout. 2. **Missing fallback logic**: When source and target had no overlapping field names (e.g. {c0, c1} → {a, b}), name-based matching failed silently, returning NULLs. Added fallback to positional casting for non-overlapping fields. 3. **Optimizer const-folding issue**: ScalarValue::cast_to_with_options was calling Arrow's cast_with_options directly, which doesn't support struct field count changes. The optimizer's simplify_expressions rule would fail when trying to fold struct casts at compile time. 4. **Validation rejection**: The logical planner's can_cast_types check rejected struct-to-struct casts with mismatched field counts before execution. Added special handling to allow all struct-to-struct casts (validation at runtime). ## Solution - Created datafusion/common/src/struct_cast.rs with shared name-based struct casting logic for both runtime (ColumnarValue) and optimization-time (ScalarValue) - Updated ScalarValue::cast_to_with_options to use the name-based struct casting - Updated ColumnarValue::cast_to to use the shared logic - Updated Expr::cast_to validation to allow struct-to-struct casts - Added fallback to positional casting when field names don't overlap - Fixed struct array field access to use actual StructArray fields, not DataType fields - Updated tests to reflect new behavior and correct syntax issues ## Behavior Changes Struct casts now work correctly with: - Field reordering: {b: 3, a: 4} → STRUCT(a INT, b INT) → {a: 4, b: 3} - Field additions: {a: 1} → STRUCT(a INT, b INT) → {a: 1, b: NULL} - Field removals: {a: 1, b: 2, c: 3} → STRUCT(a INT, b INT) → {a: 1, b: 2} - Fallback to positional casting when no field names overlap ## Files Modified - datafusion/common/src/lib.rs: Added struct_cast module - datafusion/common/src/struct_cast.rs: New shared struct casting logic - datafusion/common/src/scalar/mod.rs: Use name-based struct casting - datafusion/expr-common/src/columnar_value.rs: Delegate to shared casting logic - datafusion/expr/src/expr_schema.rs: Allow struct-to-struct casts through validation - datafusion/sqllogictest/test_files/struct.slt: Fixed tests and added new ones
…ype comparison This aligns the type coercion logic with the new name-based struct casting semantics introduced for explicit CAST operations in issue apache#14396. Changes: - struct_coercion in binary.rs now attempts name-based field matching first - Falls back to positional matching when no field names overlap - Requires matching field counts for successful coercion - Preserves left-side field names and order when using name-based matching This fixes cases where CASE expressions with structs having the same fields in different orders now correctly match fields by name rather than position. Note: Several CASE expression tests with struct field reordering are disabled due to const-folding optimizer hang when evaluating struct literals with different field orders. This requires separate investigation and fix.
…-folding
This fixes the TODO tests in struct.slt that were causing optimizer hangs
when attempting to const-fold struct literal casts with field count mismatches.
Changes:
1. Modified expr_simplifier.rs can_evaluate() to skip const-folding for
struct CAST/TryCAST expressions where source and target have different
field counts. This prevents the optimizer from attempting to evaluate
these at plan time, deferring to execution time instead.
2. Modified cast.rs cast_with_options() to allow all struct-to-struct casts
at physical planning time, even when Arrow's can_cast_types rejects them.
These casts are handled by name-based casting at execution time via
ColumnarValue::cast_to.
3. Uncommented and fixed TODO tests in struct.slt:
- CAST({a: 1} AS STRUCT(a INT, b INT)) - adds NULL for missing field b
- CAST({a: 1, b: 2, extra: 3} AS STRUCT(a INT, b INT)) - ignores extra field
The fix ensures that:
- Optimizer doesn't hang trying to const-fold unsupported struct casts
- Physical planner accepts struct-to-struct casts with field count changes
- Execution uses name-based casting to handle field reordering and NULLs
This uncomments and fixes the TODO tests for struct coercion in CASE expressions
that were previously disabled due to concerns about optimizer hangs.
Changes:
1. Uncommented the TODO test section for struct coercion with different field orders.
Tests now verify that name-based struct coercion works correctly in CASE expressions.
2. Updated test expectations to match actual behavior:
- When THEN branch executes, result uses THEN branch's field order
- When ELSE branch executes, result uses ELSE branch's field order
- Struct coercion requires equal field counts - mismatch causes planning error
3. Added explicit test for field count mismatch case:
- Verifies that coercing structs with different field counts (2 fields vs 1 field)
correctly fails during type coercion with appropriate error message
The tests now pass because:
- The optimizer fix from the previous commit prevents const-folding hangs
- Name-based struct coercion in struct_coercion function handles field reordering
- Type coercion correctly rejects field count mismatches during planning
Remove duplicate struct_cast.rs module and use the existing nested_struct::cast_struct_column implementation instead. This eliminates code duplication and provides a single source of truth for struct field-by-name casting logic. Changes: - Add public cast_struct_array_by_name wrapper in nested_struct.rs - Update columnar_value.rs to use nested_struct::cast_struct_array_by_name - Update scalar/mod.rs to use nested_struct::cast_struct_array_by_name - Remove struct_cast module from lib.rs - Delete datafusion/common/src/struct_cast.rs Benefits: - Single implementation to maintain and test - Consistent behavior across all struct casting operations - Reduced maintenance burden for future bug fixes - Better code cohesion in nested_struct module
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
common
Related to common crate
logical-expr
Logical plan and expressions
optimizer
Optimizer rules
physical-expr
Changes to the physical-expr crates
sqllogictest
SQL Logic Tests (.slt)
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
Casting between
STRUCTtypes in DataFusion previously relied on physical field order. When a struct literal or expression had the same fields but in a different order than the target schema, DataFusion could silently assign values to the wrong fields (positional matching), causing incorrect results and potential data corruption.This PR changes struct-to-struct casting behavior to match and reorder fields by name, ensuring that:
{b: 3, a: 4}::STRUCT(a INT, b INT)yields{a: 4, b: 3}This addresses the bug described in #17285 (and aligns with the discussion in #14396 / #17281).
What changes are included in this PR?
Name-based struct casting at runtime
cast_struct_array_by_nameindatafusion_common::nested_structas a small wrapper around existing struct casting logic.ScalarValue::cast_to_with_optionsto use name-based struct casting when both source and target are structs.ColumnarValue::cast_to(array path) to route struct casts through name-based matching/reordering and fall back to Arrow casting for non-struct types.Struct type coercion improvements for binary expressions
struct_coercionto attempt name-based field alignment first (when there is name overlap), and to fallback to positional coercion when names don’t match (preserves backward compatibility for unnamed/positional patterns).Planner / physical cast permissiveness for struct-to-struct
ExprSchemablecast checks to allow struct-to-struct casts during planning even when Arrow’scan_cast_typeswould reject them, deferring detailed matching to runtime.cast_with_optionsto similarly allow struct-to-struct casts (including field-count mismatches) so execution can apply name-based casting.Optimizer safety: avoid const-folding problematic struct casts
simplify_expressionsto skip const-folding struct casts when field counts mismatch, preventing potential optimizer hangs.Tests and sqllogictest coverage
Added unit tests validating:
Updated/extended SQL logic tests (
struct.slt,case.slt) to cover:Are these changes tested?
Yes.
Added Rust unit tests in
datafusion/expr-common/src/columnar_value.rs:cast_struct_by_field_namecast_struct_missing_field_inserts_nullsUpdated and expanded sqllogictest files:
datafusion/sqllogictest/test_files/struct.sltdatafusion/sqllogictest/test_files/case.sltThese tests cover field reordering, missing/extra fields, nested structs, and ensure behavior matches expectations across planner + execution.
Are there any user-facing changes?
Yes.
Potential considerations:
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.