
Support additive schema evolution for List<Struct> / nested container types in Parquet scans #20835

@TheBuilderJR


Describe the bug

Summary

DataFusion currently supports additive schema evolution reasonably well for plain Struct columns, but it fails when the evolved struct is nested inside a container type such as List<Struct>.

This shows up in Parquet scans where the logical schema is newer than some of the physical files. If a nested struct inside a list gains a new nullable field, DataFusion fails during planning or execution instead of adapting the older files by filling the new field with nulls.

Version

Observed on DataFusion 52.1.0.

Problem

Given:

  • older Parquet files with a field shaped like List(Struct(...))
  • newer Parquet files where the struct inside that list has additional nullable fields
  • a scan using the latest logical schema across both old and new files

DataFusion fails with an error like:

Cannot cast struct field 'messages' from type List(Struct(...old shape...)) to type List(Struct(...new shape...))

In my case, the concrete drift is:

  • old physical files:
    • inputAsset: Struct(type, token, amount)
    • outputAsset: Struct(type, token)
  • new logical schema:
    • inputAsset: Struct(type, token, amount, chain)
    • outputAsset: Struct(type, token, chain)

where both chain fields are nullable additions.

Expected behavior

For additive schema evolution, DataFusion should treat nested container cases similarly to plain Struct evolution:

  • missing fields in older files should be filled with nulls if the target field is nullable
  • fields that a file contains but the target schema does not should be ignored
  • recursive adaptation should work through:
    • List
    • LargeList
    • FixedSizeList
    • Map
    • combinations like Struct -> List(Struct) -> Struct
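
To make the expected behavior concrete, here is a minimal Python sketch that works on decoded rows (plain dicts and lists) rather than Arrow arrays. The tuple-based type model and the `adapt` function are hypothetical and are not DataFusion APIs. The sketch also ignores nullability and leaf casts; it only shows the recursion filling missing fields with nulls and dropping extra ones.

```python
# Toy type model (hypothetical, not DataFusion's API):
#   ("struct", {name: type})  struct with named fields
#   ("list", item_type)       stands in for List / LargeList / FixedSizeList
#   ("map", value_type)       map whose values may contain structs
#   "str", "int", ...         leaf types, passed through unchanged

def adapt(value, target):
    """Reshape one decoded value so that it matches the target type."""
    if value is None or isinstance(target, str):
        return value
    kind, inner = target
    if kind == "struct":
        # Keep only target fields; fields absent from the source become None.
        return {name: adapt(value.get(name), t) for name, t in inner.items()}
    if kind == "list":
        return [adapt(item, inner) for item in value]
    if kind == "map":
        return {key: adapt(v, inner) for key, v in value.items()}
    raise TypeError(f"unknown type kind: {kind}")

asset = ("struct", {"type": "str", "token": "str", "chain": "str"})
target = ("struct", {"swaps": ("list", ("struct", {"outputAsset": asset}))})

# A row from an older file: no `chain`, plus a field the target does not know.
old_row = {"swaps": [{"outputAsset": {"type": "erc20", "token": "USDC", "legacy": 1}}]}
new_row = adapt(old_row, target)
print(new_row)
```

The same recursion covers Struct -> List(Struct) -> Struct, because each container simply delegates to its child type.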

This should allow both narrow projections and SELECT * across schema-drifted Parquet files without application-side rewriting.

Actual behavior

DataFusion succeeds for some plain Struct evolution scenarios, but fails when the evolved struct is nested in a list or map-like container.

The failure appears during schema rewriting or cast validation for Parquet scan expressions.

Why this seems like a gap in the current implementation

From reading the current code:

  • DefaultPhysicalExprAdapterRewriter::rewrite_column special-cases (Struct, Struct) compatibility and otherwise falls back to generic can_cast_types
  • datafusion_common::nested_struct::cast_column special-cases target Struct and otherwise falls back to generic Arrow casting
  • as a result, Struct evolution gets custom handling, but List<Struct> does not

So the current behavior looks like:

  • supported: Struct -> Struct with missing or extra fields
  • not supported: List<Struct> -> List<Struct> with additive nested fields
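
The dispatch described above can be modeled in a few lines of Python. This is a toy model of the behavior as I understand it, not the actual DataFusion code: `rewrite_column_ok`, `struct_compatible`, and `generic_can_cast` are hypothetical names, and strict equality stands in for the generic cast check. It also does not model whatever recursion the real struct handling already does for directly nested Struct fields.

```python
# Toy type model (hypothetical): ("struct", {name: type}), ("list", item),
# or a leaf type given as a string.

def is_struct(t):
    return isinstance(t, tuple) and t[0] == "struct"

def generic_can_cast(source, target):
    # Stand-in for the generic cast check. Assumption: it has no notion of
    # struct evolution, so nested shapes must match exactly.
    return source == target

def struct_compatible(source, target):
    # Struct special case: fields missing from the source or extra in the
    # source are tolerated; fields present on both sides use the generic check.
    src_fields, tgt_fields = source[1], target[1]
    return all(
        generic_can_cast(src_fields[name], tgt_type)
        for name, tgt_type in tgt_fields.items()
        if name in src_fields
    )

def rewrite_column_ok(source, target):
    if is_struct(source) and is_struct(target):
        return struct_compatible(source, target)
    return generic_can_cast(source, target)

old_asset = ("struct", {"type": "utf8", "token": "utf8"})
new_asset = ("struct", {"type": "utf8", "token": "utf8", "chain": "utf8"})

plain = rewrite_column_ok(old_asset, new_asset)
in_list = rewrite_column_ok(("list", old_asset), ("list", new_asset))
nested = rewrite_column_ok(
    ("struct", {"messages": ("list", old_asset)}),
    ("struct", {"messages": ("list", new_asset)}),
)
print(plain, in_list, nested)  # True False False
```

The third case mirrors the error above: the struct special case is entered, but the `messages` field is a list, so it falls through to the generic check and is rejected.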

Relevant code paths

These are the places that seem most relevant:

  • datafusion-common/src/nested_struct.rs
    • cast_column
    • validate_struct_compatibility
  • datafusion-physical-expr-adapter/src/schema_rewriter.rs
    • DefaultPhysicalExprAdapterRewriter::rewrite_column
  • datafusion-physical-expr/src/expressions/cast_column.rs
    • CastColumnExpr::evaluate

Minimal shape of the repro

Logical schema:

data: Struct(
  messages: List(
    Struct(
      kwargs: Struct(
        tool_calls: List(
          Struct(
            args: Struct(
              swaps: List(
                Struct(
                  inputAsset: Struct(
                    amount: Struct(type, value),
                    token: Struct(identifier_type, value),
                    type,
                    chain
                  ),
                  outputAsset: Struct(
                    token: Struct(identifier_type, value),
                    type,
                    chain
                  )
                )
              )
            )
          )
        )
      )
    )
  )
)

Older physical files have the same shape except inputAsset.chain and outputAsset.chain are absent.
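
For reference, the drift between the two shapes can be listed mechanically. The sketch below is a hypothetical helper, not part of DataFusion: it models a struct as a dict, a list as a one-element Python list, and uses `"utf8"` as a placeholder leaf type, since the actual leaf types are not important here.

```python
def missing_paths(physical, logical, prefix=""):
    """Yield paths of fields present in `logical` but absent from `physical`.

    Assumes both schemas have the same shape apart from added struct fields.
    """
    if isinstance(logical, dict):  # struct: {field_name: type}
        for name, ltype in logical.items():
            path = f"{prefix}.{name}" if prefix else name
            if name not in physical:
                yield path
            else:
                yield from missing_paths(physical[name], ltype, path)
    elif isinstance(logical, list):  # list: [item_type]
        yield from missing_paths(physical[0], logical[0], prefix + "[]")

def schema(extra):
    # `extra` holds the fields added to both asset structs.
    token = {"identifier_type": "utf8", "value": "utf8"}
    swap = {
        "inputAsset": {"amount": {"type": "utf8", "value": "utf8"},
                       "token": token, "type": "utf8", **extra},
        "outputAsset": {"token": token, "type": "utf8", **extra},
    }
    return {"data": {"messages": [{"kwargs": {"tool_calls": [{"args": {"swaps": [swap]}}]}}]}}

old, new = schema({}), schema({"chain": "utf8"})
paths = list(missing_paths(old, new))
for p in paths:
    print(p)
# data.messages[].kwargs.tool_calls[].args.swaps[].inputAsset.chain
# data.messages[].kwargs.tool_calls[].args.swaps[].outputAsset.chain
```

Both added fields sit below three levels of List, which is why the plain Struct handling never reaches them.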

Suggested fix direction

A clean fix seems to be:

  1. Generalize compatibility checking from plain struct fields to recursive nested type compatibility.
  2. Extend cast_column to recursively adapt container types whose child or value type contains evolved structs.
  3. Use that recursive compatibility logic from the default physical expression adapter as well.

Concretely, this likely means adding support for recursive adaptation of:

  • List
  • LargeList
  • FixedSizeList
  • Map

instead of only Struct.
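
As a rough illustration of step 1, a recursive compatibility check could look like the Python sketch below. The type encoding and the `compatible` function are hypothetical and only model the idea; a real implementation would operate on Arrow `DataType` values, consult the existing cast rules for leaf types, and also take nullability into account, which this sketch ignores.

```python
# Toy type model (hypothetical):
#   ("struct", {name: type})
#   ("list", item) / ("large_list", item) / ("fixed_size_list", item, size)
#   ("map", key_type, value_type)
#   leaf types are plain strings such as "utf8"

LIST_KINDS = {"list", "large_list", "fixed_size_list"}

def compatible(source, target):
    """Return True if `source` can be adapted to `target`."""
    if isinstance(source, str) or isinstance(target, str):
        # Leaf: equality stands in for the existing cast rules.
        return source == target
    if source[0] != target[0]:
        return False
    kind = source[0]
    if kind == "struct":
        # Missing and extra fields are tolerated; shared fields recurse.
        return all(
            compatible(source[1][name], tgt_type)
            for name, tgt_type in target[1].items()
            if name in source[1]
        )
    if kind in LIST_KINDS:
        # Any trailing size (fixed_size_list) must match, then recurse into items.
        return source[2:] == target[2:] and compatible(source[1], target[1])
    if kind == "map":
        return compatible(source[1], target[1]) and compatible(source[2], target[2])
    return False

old = ("struct", {"type": "utf8", "token": "utf8"})
new = ("struct", {"type": "utf8", "token": "utf8", "chain": "utf8"})

print(compatible(("list", old), ("list", new)))                          # True
print(compatible(("map", "utf8", old), ("map", "utf8", new)))            # True
print(compatible(("fixed_size_list", old, 2), ("fixed_size_list", new, 3)))  # False
```

The default physical expression adapter and `cast_column` could then share one such routine instead of each special-casing Struct.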

Proposed semantics

For nested container evolution:

  • matching fields should still be cast using existing cast rules
  • fields missing from the source should be filled with null arrays when the target field is nullable
  • nullable source to non-nullable target should still fail
  • extra source fields should still be ignored
  • incompatible primitive type changes should still error
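
The rules above can be written down as a per-field decision. The Python sketch below is only an illustration under stated assumptions: `Field`, `plan_struct`, and the small `CASTABLE` table are hypothetical, the field names and leaf types are made up, and the table is a placeholder for whatever the existing cast rules allow.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    dtype: str
    nullable: bool = True

# Placeholder for the existing cast rules (assumption, deliberately tiny).
CASTABLE = {("int32", "int64"), ("float32", "float64")}

def plan_struct(source, target):
    """Return an action per target field, or raise ValueError."""
    plan = {}
    for name, tgt in target.items():
        src = source.get(name)
        if src is None:
            if not tgt.nullable:
                raise ValueError(f"missing non-nullable field '{name}'")
            plan[name] = "fill_null"
        elif src.nullable and not tgt.nullable:
            raise ValueError(f"nullable source for non-nullable field '{name}'")
        elif src.dtype == tgt.dtype:
            plan[name] = "keep"
        elif (src.dtype, tgt.dtype) in CASTABLE:
            plan[name] = "cast"
        else:
            raise ValueError(f"cannot cast '{name}' from {src.dtype} to {tgt.dtype}")
    # Source fields that are not in the target never enter the plan,
    # which is how extra fields end up ignored.
    return plan

source = {"type": Field("utf8"), "value": Field("int32"), "legacy": Field("utf8")}
target = {"type": Field("utf8"), "value": Field("int64"), "chain": Field("utf8")}
plan = plan_struct(source, target)
print(plan)  # {'type': 'keep', 'value': 'cast', 'chain': 'fill_null'}
```

For container types, the same decision would be applied to the struct found after recursing through the list item or map value.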

Tests that would be useful

I think the missing coverage is around:

  • List<Struct> where target adds a nullable nested field
  • LargeList<Struct> with the same pattern
  • FixedSizeList<Struct> with the same pattern
  • Map<_, Struct> or map entries containing evolved structs
  • recursive case like Struct(messages: List(Struct(...)))

Impact

This currently forces application-level workarounds such as preprocessing or rewriting Parquet files to the latest schema before querying, even though the evolution only adds nullable fields.

It would be much better if the default Parquet scan path handled this directly, the same way plain Struct evolution is already handled.


Labels: bug