Description
Related to the work on struct array handling:
- Teach Datafusion to project only accessed struct leaves in row filter pushdown #20854
- Allow filters on struct fields to be pushed down into Parquet scan #20822
- Add benchmark for struct field filter pushdown in Parquet #20829
When filtering on struct fields (e.g. `WHERE s['value'] > 5`), DataFusion currently cannot prune row groups using Parquet column statistics, even though the underlying leaf columns have valid min/max statistics stored in the Parquet metadata.
The issue is in the pruning predicate system. When it encounters a GetField expression such as `GetField(Column("s"), "value")`, the column extraction logic only sees the parent struct `Column("s")` and does not resolve through to the nested field.
Fixing this would mean teaching the pruning system to resolve GetField expressions down to their leaf columns and then look up the corresponding Parquet column statistics. Note that the statistics themselves are already present in the Parquet metadata; they are just never consulted for nested field access.
On tables with many row groups, this could significantly reduce the amount of data read for predicates on struct fields.
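
To make the proposed resolution step concrete, here is a minimal, self-contained sketch. It is not DataFusion code: `Expr`, `MinMax`, `leaf_path`, and `can_prune_gt` are hypothetical names invented for illustration, and the statistics are modelled as a plain map keyed by a dotted leaf path (e.g. `s.value`), on the assumption that the real lookup would go through the Parquet column metadata instead.

```rust
use std::collections::HashMap;

/// Toy expression tree. Hypothetical: this is not DataFusion's `Expr` type.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    /// Reference to a top-level column, e.g. `s`.
    Column(String),
    /// Access to a named field of a struct-valued expression, e.g. `s['value']`.
    GetField(Box<Expr>, String),
}

/// Min/max statistics for one leaf column within one row group.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

/// Resolve an expression to a dotted leaf path such as "s.value".
/// Nested accesses are handled by recursing through the parent expression.
pub fn leaf_path(expr: &Expr) -> String {
    match expr {
        Expr::Column(name) => name.clone(),
        Expr::GetField(parent, field) => format!("{}.{}", leaf_path(parent), field),
    }
}

/// Returns true when `expr > literal` cannot match any row in the row group,
/// i.e. the row group can be skipped. If no statistics exist for the resolved
/// path, this conservatively returns false (the row group must be read).
pub fn can_prune_gt(expr: &Expr, literal: i64, stats: &HashMap<String, MinMax>) -> bool {
    match stats.get(&leaf_path(expr)) {
        Some(s) => s.max <= literal,
        None => false,
    }
}
```

With this model, `GetField(Column("s"), "value")` resolves to the key `s.value`, so a row group whose maximum for that leaf is at most 5 can be skipped for `s['value'] > 5`. Looking up the parent `s` alone finds no statistics, which mirrors the current behaviour where no pruning happens.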