Description
Is your feature request related to a problem or challenge?
As we continue to land dynamic filters, new optimizations become possible.
This one deals with late evaluation of file-level statistics.
In particular, we may have file-level statistics available at planning time (see datafusion.execution.collect_statistics;
our system does a similar thing in a different way).
Before dynamic filters there was no point in re-evaluating these right before scanning a file. Now it is possible that, for example, a TopK operator has pushed down a ts > '2025-05-08T00:00:00Z' filter.
Combining that filter with the file-level statistics may let us exclude the entire file, so we avoid reading any Parquet metadata, etc.
Concretely, change:
trait FileSource {
fn open(file_meta: FileMeta) ...
To:
trait FileSource {
fn open(file_meta: FileMeta, file: &PartitionedFile) ...
And call it as so from here:
Then we can implement PruningStatistics for Statistics, et voilà!
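To make the idea concrete, here is a minimal, self-contained sketch of the late pruning check. It does not use DataFusion's real types: `ColumnRange`, `FileEntry`, `Filter`, `can_skip` and `files_to_open` are all hypothetical stand-ins (for the file-level statistics, `PartitionedFile`, the dynamic filter, and the check that would run inside `open`), and timestamps are assumed to be plain integers.

```rust
/// Hypothetical stand-in for file-level min/max statistics of the `ts` column.
/// This is NOT DataFusion's `Statistics` type.
#[derive(Debug, Clone, Copy)]
struct ColumnRange {
    min: i64,
    max: i64,
}

/// Hypothetical stand-in for `PartitionedFile`: a path plus the statistics
/// that may have been collected at planning time.
#[derive(Debug, Clone)]
struct FileEntry {
    path: String,
    ts_stats: Option<ColumnRange>,
}

/// Hypothetical dynamic filter on `ts`. An operator such as TopK could
/// tighten the threshold while the query is running.
#[derive(Debug, Clone, Copy)]
enum Filter {
    /// ts > threshold
    Gt(i64),
    /// ts < threshold
    Lt(i64),
}

/// Returns true only when the statistics prove that no row in the file can
/// satisfy the filter. Missing statistics never allow skipping.
fn can_skip(file: &FileEntry, filter: Filter) -> bool {
    match (file.ts_stats, filter) {
        // Every value is <= max, so nothing can be > threshold.
        (Some(range), Filter::Gt(threshold)) => range.max <= threshold,
        // Every value is >= min, so nothing can be < threshold.
        (Some(range), Filter::Lt(threshold)) => range.min >= threshold,
        (None, _) => false,
    }
}

/// Evaluated right before scanning, using the filter's current value.
/// Files that can be skipped are never opened, so no metadata is read.
fn files_to_open(files: &[FileEntry], filter: Filter) -> Vec<&str> {
    files
        .iter()
        .filter(|file| !can_skip(file, filter))
        .map(|file| file.path.as_str())
        .collect()
}
```

The point of running the check this late is that the filter may be much tighter than it was at planning time. In the real change, this logic would presumably come from evaluating the pushed-down predicate against a PruningStatistics implementation rather than from hand-written comparisons like these.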
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response