Support recursive schema compatibility validation for container types wrapping evolved structs #20840
Open
kosiew wants to merge 5 commits into apache:main
Enhance validation for nested structures by adding a recursive compatibility validator in nested_struct.rs. The validator now handles Struct, List, LargeList, FixedSizeList, and Map wrappers around evolved structs. Update schema_rewriter.rs to use this validator when deciding whether a CastColumnExpr is permissible for schema adaptation, extending support beyond top-level Structs. Include focused tests for:
- List<Struct> additive nullable-field evolution acceptance
- Rejection of nested incompatible changes
- Fixed-size list size mismatch handling
- Map entries containing evolved structs
Replace validation-only logic in nested_struct.rs with runtime container adaptation for List, LargeList, FixedSizeList, and Map. Utilize a shared helper for recursive compatibility validation to avoid code duplication. Update schema_rewriter.rs to leverage the shared helper, ensuring alignment between planner validation and runtime behavior. Add execution-level Parquet regression tests in expr_adapter.rs for List<Struct> additive evolution, and introduce a direct runtime unit test for cast_column on List<Struct> in nested_struct.rs.
…proved clarity and efficiency
Extract a generic `cast_container` helper to unify the previously duplicated logic among `cast_list_column`, `cast_fixed_size_list_column`, and `cast_map_column`. This streamlines `nested_struct.rs` and simplifies the process for adding future container types.
Inline concrete cases for improved clarity and maintainability. Remove redundant Ok(()) tail in validate_field_compatibility. Preserve downcast errors, recursive casts, rebuilt offsets/nulls, and map entry handling.
Which issue does this PR close?
List<Struct>/ nested container types in Parquet scans #20835
Rationale for this change
DataFusion currently implements additive schema-evolution compatibility for `Struct` columns using `validate_struct_compatibility`. However, this logic only applies when the column itself is a `Struct`. When a `Struct` is wrapped inside container types such as `List`, `LargeList`, `FixedSizeList`, or `Map`, the planner falls back to Arrow's `can_cast_types`.
This behavior treats the container as opaque and causes legitimate schema evolutions (for example, adding a nullable field to a struct) to be rejected if the struct is nested inside a container.
This PR introduces a recursive datatype compatibility validator that recognizes when container types wrap a `Struct` and applies the same additive schema-evolution semantics to the nested structure. This ensures consistent behavior between top-level structs and structs nested inside supported container types.
What changes are included in this PR?
Recursive compatibility validation
- Introduce `validate_data_type_compatibility` to recursively validate compatibility between source and target datatypes.
- Support recursive validation for:
  - `Struct`
  - `List`
  - `LargeList`
  - `FixedSizeList`
  - `Map`
- Add helper `requires_recursive_compatibility_validation` to determine when recursive validation should be applied.
Planner integration
- Update `schema_rewriter.rs` to use the new recursive compatibility validation during physical expression adaptation.
Runtime casting support for containers
- Extend nested column casting to support:
  - `List`
  - `LargeList`
  - `FixedSizeList`
  - `Map`
- Introduce reusable helper `cast_container` to unify container casting logic.
- Add specific casting helpers:
  - `cast_list_column`
  - `cast_fixed_size_list_column`
  - `cast_map_column`
Struct casting refactor
- Refactor the `cast_struct_column` implementation by iterating over target fields and mapping source fields by name.
Error message improvements
- Change `"Cannot cast struct field"` to `"Cannot cast field"` for consistency across nested contexts.
Tests
Added comprehensive tests covering:
Recursive compatibility validation
- `List<Struct>` with additive schema evolution
- `Map<_, Struct>` nested struct compatibility
- `FixedSizeList` size mismatch detection
Nested casting behavior
- `List<Struct>` where the target struct adds a nullable field
Planner integration
- `List<Struct>` compatibility
Parquet integration tests
- `List<Struct>` from Parquet
Are these changes tested?
Yes.
The PR adds both unit tests and integration tests:
- Unit tests in `nested_struct.rs` validating recursive compatibility logic and container casting behavior.
- Tests in `schema_rewriter.rs` ensuring correct expression rewriting.
- Parquet integration tests for `List<Struct>`.
These tests cover both compatible additive schema evolution and failure scenarios.
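To make the accept and reject cases concrete, here is a minimal, self-contained sketch of a recursive compatibility check. It is not the code in this PR: `Ty`, `Field`, `field`, and `validate` are hypothetical stand-ins for Arrow's `DataType`/`Field` and for `validate_data_type_compatibility`, and leaf types are compared by plain equality here rather than through Arrow's `can_cast_types`.

```rust
// Toy stand-in for Arrow's DataType; NOT the real arrow-rs API.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int64,
    Utf8,
    Struct(Vec<Field>),
    List(Box<Field>),
    FixedSizeList(Box<Field>, usize),
    Map(Box<Field>, Box<Field>), // (key field, value field)
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    ty: Ty,
    nullable: bool,
}

// Hypothetical convenience constructor used only in this sketch.
fn field(name: &str, ty: Ty, nullable: bool) -> Field {
    Field { name: name.to_string(), ty, nullable }
}

/// Hypothetical recursive check: `source` is the type found in the file,
/// `target` is the type the table schema expects.
fn validate(source: &Ty, target: &Ty) -> Result<(), String> {
    match (source, target) {
        // Structs: match fields by name; a field missing from the source
        // is acceptable only if the target declares it nullable.
        (Ty::Struct(src), Ty::Struct(tgt)) => {
            for t in tgt {
                match src.iter().find(|s| s.name == t.name) {
                    Some(s) => validate(&s.ty, &t.ty)
                        .map_err(|e| format!("field '{}': {}", t.name, e))?,
                    None if t.nullable => {}
                    None => {
                        return Err(format!("missing non-nullable field '{}'", t.name))
                    }
                }
            }
            Ok(())
        }
        // Containers: recurse into the element type instead of treating
        // the container as opaque.
        (Ty::List(s), Ty::List(t)) => validate(&s.ty, &t.ty),
        (Ty::FixedSizeList(s, n), Ty::FixedSizeList(t, m)) => {
            if n != m {
                return Err(format!("fixed-size list size mismatch: {} vs {}", n, m));
            }
            validate(&s.ty, &t.ty)
        }
        (Ty::Map(sk, sv), Ty::Map(tk, tv)) => {
            validate(&sk.ty, &tk.ty)?;
            validate(&sv.ty, &tv.ty)
        }
        // Leaves: this sketch requires identical types (an assumption).
        (s, t) if s == t => Ok(()),
        (s, t) => Err(format!("cannot cast {:?} to {:?}", s, t)),
    }
}
```

With this model, a `List` whose struct element gains a nullable field validates, while the same change with a non-nullable field, or a `FixedSizeList` whose size differs, is rejected.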
Are there any user-facing changes?
Yes, but they are improvements to existing functionality rather than breaking changes.
DataFusion will now correctly support additive schema evolution for structs nested inside container types such as:
- `List<Struct>`
- `LargeList<Struct>`
- `FixedSizeList<Struct>`
- `Map<_, Struct>`
Previously these cases could fail validation even when the schema change was valid.
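As a rough illustration of what the additive case means for the data a user sees, the sketch below adapts a list of structs to a target struct that adds one field. It is a row-oriented toy, not the PR's implementation: the real casting works on columnar Arrow arrays, and `Value`, `cast_struct`, and `cast_list_of_struct` are hypothetical names introduced only for this example.

```rust
// Row-oriented toy values; the real code operates on Arrow arrays.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Str(String),
    Struct(Vec<(String, Value)>),
    List(Vec<Value>),
}

/// Hypothetical helper: rebuild a struct value in target field order,
/// looking up source fields by name and filling missing ones with Null.
fn cast_struct(value: &Value, target_fields: &[&str]) -> Value {
    match value {
        Value::Struct(src) => Value::Struct(
            target_fields
                .iter()
                .map(|name| {
                    let v = src
                        .iter()
                        .find(|(n, _)| n.as_str() == *name)
                        .map(|(_, v)| v.clone())
                        .unwrap_or(Value::Null);
                    ((*name).to_string(), v)
                })
                .collect(),
        ),
        // Non-struct values (including Null) pass through unchanged.
        other => other.clone(),
    }
}

/// Hypothetical helper: apply the struct adaptation to every list element.
fn cast_list_of_struct(value: &Value, target_fields: &[&str]) -> Value {
    match value {
        Value::List(items) => Value::List(
            items.iter().map(|item| cast_struct(item, target_fields)).collect(),
        ),
        other => other.clone(),
    }
}
```

Under these assumptions, a list element written as `{a: 1}` and read against a target struct with fields `a` and `b` comes back as `{a: 1, b: null}`, and null elements stay null.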
No API changes are introduced.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.