Support row group pruning for struct field predicates #20871

@friendlymatthew

Description

Related to the work on struct array handling:

When filtering on struct fields (e.g. WHERE s['value'] > 5), DataFusion currently cannot prune row groups using Parquet column statistics, even though the underlying leaf columns have valid min/max statistics stored in the Parquet metadata.

The issue is in the pruning predicate system. When it encounters a GetField expr like GetField(Column("s"), "value"), the column extraction logic only sees the parent struct Column("s") and doesn't resolve through to the nested field, so the predicate is never matched against the leaf column's statistics.
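To make the gap concrete, here is a minimal sketch using a simplified, hypothetical expression tree. The `Expr` enum and the `root_column` and `leaf_path` functions are illustrative assumptions, not DataFusion's actual types or API. `root_column` mimics the behavior described above (only the parent struct is reported), while `leaf_path` shows the kind of resolution that would be needed.

```rust
// Hypothetical, simplified expression tree; NOT DataFusion's actual types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    // GetField(base expression, field name)
    GetField(Box<Expr>, String),
}

/// Mimics the behavior described in the issue: walking the expression
/// only ever reports the root (parent struct) column.
fn root_column(expr: &Expr) -> &str {
    match expr {
        Expr::Column(name) => name.as_str(),
        Expr::GetField(base, _) => root_column(base),
    }
}

/// Resolves the full path from the root column down to the accessed field,
/// e.g. GetField(Column("s"), "value") -> ["s", "value"].
fn leaf_path(expr: &Expr) -> Vec<String> {
    match expr {
        Expr::Column(name) => vec![name.clone()],
        Expr::GetField(base, field) => {
            let mut path = leaf_path(base);
            path.push(field.clone());
            path
        }
    }
}
```

For `GetField(Column("s"), "value")`, `root_column` yields only `s`, whereas `leaf_path` yields the two segments `s` and `value`, which is what a lookup against per-leaf statistics would need.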

Fixing this would mean teaching the pruning system to resolve GetField expressions down to their leaf columns, then look up the corresponding Parquet column stats. Note that the stats themselves are already present in the Parquet metadata; they're just never consulted for nested field access.
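As a rough illustration of the lookup step, the sketch below prunes row groups for a simple comparison against a resolved leaf path. Everything here is an assumption made for the example: `MinMax`, `RowGroup`, `Op`, and `row_groups_to_scan` are hypothetical names, statistics are modeled as `i64` values keyed by a dotted leaf path such as `s.value`, and nulls are ignored. It does not reflect how DataFusion's pruning code is actually structured.

```rust
use std::collections::HashMap;

/// Hypothetical min/max statistics for one leaf column in one row group.
#[derive(Debug, Clone, Copy)]
struct MinMax {
    min: i64,
    max: i64,
}

/// Hypothetical row group: statistics keyed by dotted leaf path, e.g. "s.value".
struct RowGroup {
    stats: HashMap<String, MinMax>,
}

/// The comparison operators this sketch understands.
#[derive(Debug, Clone, Copy)]
enum Op {
    Gt, // leaf > threshold
    Lt, // leaf < threshold
}

/// Returns the indices of row groups that may contain matching rows.
/// A row group is skipped only when its statistics prove no row can match;
/// row groups without statistics for the leaf are always kept.
fn row_groups_to_scan(groups: &[RowGroup], leaf: &[&str], op: Op, threshold: i64) -> Vec<usize> {
    let key = leaf.join(".");
    groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| match rg.stats.get(&key) {
            Some(s) => match op {
                Op::Gt => s.max > threshold,
                Op::Lt => s.min < threshold,
            },
            None => true, // no statistics: cannot prune safely
        })
        .map(|(i, _)| i)
        .collect()
}
```

With the predicate `s['value'] > 5`, a row group whose `s.value` maximum is 3 can be skipped, while one whose maximum is 9, or one with no statistics at all, still has to be read.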

On tables with many row groups, this could significantly reduce the amount of data read for queries with struct field predicates.
