
Support recursive schema compatibility validation for container types wrapping evolved structs #20840

Open
kosiew wants to merge 5 commits into apache:main from kosiew:schema-01-20835

Contributor

@kosiew kosiew commented Mar 10, 2026

Which issue does this PR close?

Rationale for this change

DataFusion currently implements additive schema-evolution compatibility for Struct columns using validate_struct_compatibility. However, this logic only applies when the column itself is a Struct. When a Struct is wrapped inside container types such as List, LargeList, FixedSizeList, or Map, the planner falls back to Arrow's can_cast_types.

This behavior treats the container as opaque and causes legitimate schema evolutions (for example, adding a nullable field to a struct) to be rejected if the struct is nested inside a container.

This PR introduces a recursive datatype compatibility validator that recognizes when container types wrap a Struct and applies the same additive schema-evolution semantics to the nested structure. This ensures consistent behavior between top-level structs and structs nested inside supported container types.

What changes are included in this PR?

Recursive compatibility validation

  • Introduce validate_data_type_compatibility to recursively validate compatibility between source and target datatypes.

  • Support recursive validation for:

    • Struct
    • List
    • LargeList
    • FixedSizeList
    • Map
  • Add helper requires_recursive_compatibility_validation to determine when recursive validation should be applied.
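The recursion described above can be illustrated with a minimal sketch. It does not use the PR's code or arrow-rs types: `ToyType`, `ToyField`, `needs_recursive_validation`, `validate_toy_compatibility`, and `leaf_castable` are hypothetical stand-ins, and the exact rules (extra source fields ignored, missing target fields must be nullable, `Int32` to `Int64` as the only promotion) are assumptions chosen for illustration.

```rust
// Toy stand-ins for Arrow's DataType and Field (hypothetical, not the arrow-rs types).
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int32,
    Int64,
    Utf8,
    Struct(Vec<ToyField>),
    List(Box<ToyField>),
    LargeList(Box<ToyField>),
    FixedSizeList(Box<ToyField>, usize),
    Map(Box<ToyField>, Box<ToyField>), // (key, value)
}

#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: String,
    data_type: ToyType,
    nullable: bool,
}

fn field(name: &str, data_type: ToyType, nullable: bool) -> ToyField {
    ToyField { name: name.to_string(), data_type, nullable }
}

// True when both sides are structs, or matching containers that eventually wrap structs.
fn needs_recursive_validation(source: &ToyType, target: &ToyType) -> bool {
    use ToyType::*;
    match (source, target) {
        (Struct(_), Struct(_)) => true,
        (List(s), List(t)) | (LargeList(s), LargeList(t)) => {
            needs_recursive_validation(&s.data_type, &t.data_type)
        }
        (FixedSizeList(s, _), FixedSizeList(t, _)) => {
            needs_recursive_validation(&s.data_type, &t.data_type)
        }
        (Map(sk, sv), Map(tk, tv)) => {
            needs_recursive_validation(&sk.data_type, &tk.data_type)
                || needs_recursive_validation(&sv.data_type, &tv.data_type)
        }
        _ => false,
    }
}

// Assumed stand-in for a leaf-level castability check: identity plus one promotion.
fn leaf_castable(source: &ToyType, target: &ToyType) -> bool {
    source == target || matches!((source, target), (ToyType::Int32, ToyType::Int64))
}

fn validate_toy_field(source: &ToyField, target: &ToyField) -> Result<(), String> {
    if source.nullable && !target.nullable {
        return Err(format!("Cannot cast field '{}': nullable to non-nullable", target.name));
    }
    validate_toy_compatibility(&source.data_type, &target.data_type)
}

// Additive evolution: every target field must exist in the source or be nullable.
fn validate_toy_compatibility(source: &ToyType, target: &ToyType) -> Result<(), String> {
    use ToyType::*;
    match (source, target) {
        (Struct(src), Struct(tgt)) => {
            for t in tgt {
                match src.iter().find(|s| s.name == t.name) {
                    Some(s) => validate_toy_field(s, t)?,
                    None if t.nullable => {} // new nullable field: filled with nulls later
                    None => return Err(format!("Cannot cast field '{}': missing in source", t.name)),
                }
            }
            Ok(())
        }
        (List(s), List(t)) | (LargeList(s), LargeList(t)) => validate_toy_field(s, t),
        (FixedSizeList(s, n), FixedSizeList(t, m)) => {
            if n != m {
                return Err(format!("fixed-size list length mismatch: {n} vs {m}"));
            }
            validate_toy_field(s, t)
        }
        (Map(sk, sv), Map(tk, tv)) => {
            validate_toy_field(sk, tk)?;
            validate_toy_field(sv, tv)
        }
        (s, t) if leaf_castable(s, t) => Ok(()),
        (s, t) => Err(format!("cannot cast {s:?} to {t:?}")),
    }
}
```

In this sketch, a `List` of `{id: Int32}` validates against a `List` of `{id: Int64, note: Utf8 (nullable)}`, while two `FixedSizeList` types with different lengths are rejected before their element types are compared.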

Planner integration

  • Update schema_rewriter.rs to use the new recursive compatibility validation during physical expression adaptation.
  • Preserve existing Arrow casting behavior for datatypes that do not require recursive validation.

Runtime casting support for containers

  • Extend nested column casting to support:

    • List
    • LargeList
    • FixedSizeList
    • Map
  • Introduce reusable helper cast_container to unify container casting logic.

  • Add specific casting helpers:

    • cast_list_column
    • cast_fixed_size_list_column
    • cast_map_column
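One way to picture a shared container-casting helper is the sketch below. It is an assumption-laden toy, not the PR's `cast_container`: `ToyListColumn` and `cast_toy_container` are hypothetical names, and a list column is modeled as plain vectors of offsets, validity flags, and flattened child values rather than Arrow arrays. The idea shown is that only the child values are cast, while the container's offsets and null markers are carried over unchanged.

```rust
// Toy list column (hypothetical): row i spans values[offsets[i]..offsets[i + 1]].
#[derive(Debug, Clone, PartialEq)]
struct ToyListColumn<T> {
    offsets: Vec<usize>,
    validity: Vec<bool>, // false marks a null row
    values: Vec<T>,      // flattened child values
}

// Cast only the child values; reuse the container's offsets and validity.
// `cast_child` stands in for the recursive cast of the element type.
fn cast_toy_container<S, T>(
    source: &ToyListColumn<S>,
    cast_child: impl Fn(&[S]) -> Result<Vec<T>, String>,
) -> Result<ToyListColumn<T>, String> {
    let values = cast_child(&source.values)?;
    // The offsets are only valid if the child cast preserved the number of values.
    if values.len() != source.values.len() {
        return Err(format!(
            "child cast changed length: {} -> {}",
            source.values.len(),
            values.len()
        ));
    }
    Ok(ToyListColumn {
        offsets: source.offsets.clone(),
        validity: source.validity.clone(),
        values,
    })
}
```

Under this model, list, fixed-size list, and map variants would differ only in how they extract the child and rebuild the outer column, which is the kind of duplication a shared helper removes.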

Struct casting refactor

  • Simplify cast_struct_column implementation by iterating over target fields and mapping source fields by name.
  • Preserve struct null buffers while filling missing fields with null arrays when allowed.
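The target-driven struct cast described above can be sketched as follows. This is a simplified model rather than the PR's `cast_struct_column`: `ToyStructColumn`, `ToyTargetField`, and `cast_toy_struct` are hypothetical, child columns are plain `Vec<Option<i64>>`, and per-field type casting is omitted. It shows only the name-based mapping, the null fill for missing nullable fields, and the preserved struct-level validity.

```rust
use std::collections::HashMap;

type ToyColumn = Vec<Option<i64>>;

// Toy struct column (hypothetical): named child columns plus struct-level validity.
#[derive(Debug, Clone, PartialEq)]
struct ToyStructColumn {
    fields: Vec<(String, ToyColumn)>,
    validity: Vec<bool>, // false marks a null struct row
}

struct ToyTargetField {
    name: String,
    nullable: bool,
}

fn cast_toy_struct(
    source: &ToyStructColumn,
    target: &[ToyTargetField],
) -> Result<ToyStructColumn, String> {
    let num_rows = source.validity.len();
    // Index source children by name so field order in the source does not matter.
    let by_name: HashMap<&str, &ToyColumn> = source
        .fields
        .iter()
        .map(|(name, column)| (name.as_str(), column))
        .collect();

    let mut fields = Vec::with_capacity(target.len());
    // Iterate over the target fields; source fields absent from the target are dropped.
    for t in target {
        let column = match by_name.get(t.name.as_str()) {
            Some(col) => (*col).clone(),
            // A missing field is filled with nulls only when the target allows nulls.
            None if t.nullable => vec![None; num_rows],
            None => {
                return Err(format!(
                    "Cannot cast field '{}': missing in source and not nullable",
                    t.name
                ))
            }
        };
        fields.push((t.name.clone(), column));
    }
    // The struct-level null markers are carried over unchanged.
    Ok(ToyStructColumn { fields, validity: source.validity.clone() })
}
```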

Error message improvements

  • Standardize error messages from "Cannot cast struct field" to "Cannot cast field" for consistency across nested contexts.

Tests

Added comprehensive tests covering:

  • Recursive compatibility validation

    • List<Struct> with additive schema evolution
    • Map<_, Struct> nested struct compatibility
    • FixedSizeList size mismatch detection
  • Nested casting behavior

    • Casting List<Struct> where the target struct adds a nullable field
    • Ensuring new fields are filled with null values
    • Numeric type promotion inside nested structs
  • Planner integration

    • Expression rewrite behavior for List<Struct> compatibility
    • Failure scenarios for incompatible nested field casts
  • Parquet integration tests

    • End-to-end schema evolution validation when reading List<Struct> from Parquet

Are these changes tested?

Yes.

The PR adds both unit tests and integration tests:

  • Unit tests in nested_struct.rs validating recursive compatibility logic and container casting behavior.
  • Planner-level tests in schema_rewriter.rs ensuring correct expression rewriting.
  • Parquet integration tests verifying that schema evolution works correctly when reading nested container types such as List<Struct>.

These tests cover both compatible additive schema evolution and failure scenarios.

Are there any user-facing changes?

Yes, but they are improvements to existing functionality rather than breaking changes.

DataFusion will now correctly support additive schema evolution for structs nested inside container types such as:

  • List<Struct>
  • LargeList<Struct>
  • FixedSizeList<Struct>
  • Map<_, Struct>

Previously these cases could fail validation even when the schema change was valid.

No API changes are introduced.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 5 commits March 10, 2026 14:56
Enhance validation for nested structures by adding a recursive
compatibility validator in nested_struct.rs. This update allows
the validator to comprehend Struct, List, LargeList,
FixedSizeList, and Map wrappers around evolved structs.
Update schema_rewriter.rs to utilize this validator for
determining permissible CastColumnExpr for schema adaptation,
extending support beyond just top-level Structs.

Include focused tests for:
- List<Struct> additive nullable-field evolution acceptance
- Rejection of nested incompatible changes
- Fixed-size list size mismatch handling
- Map entries containing evolved structs

Replace validation-only logic in nested_struct.rs with
runtime container adaptation for List, LargeList,
FixedSizeList, and Map. Utilize a shared helper for
recursive compatibility validation to avoid code duplication.
Update schema_rewriter.rs to leverage the shared helper,
ensuring alignment between planner validation and runtime
behavior. Add execution-level Parquet regression tests in
expr_adapter.rs for List<Struct> additive evolution,
and introduce a direct runtime unit test for cast_column
on List<Struct> in nested_struct.rs.

Extract a generic `cast_container` helper to unify the
previously duplicated logic among `cast_list_column`,
`cast_fixed_size_list_column`, and `cast_map_column`.
This streamlines `nested_struct.rs` and simplifies
the process for adding future container types.

Inline concrete cases for improved clarity and maintainability.
Remove redundant Ok(()) tail in validate_field_compatibility.
Preserve downcast errors, recursive casts, rebuilt offsets/nulls,
and map entry handling.
github-actions bot added the core (Core DataFusion crate) and common (Related to common crate) labels on Mar 10, 2026
kosiew marked this pull request as ready for review on March 10, 2026 09:09