
Add late pruning of file based on file level statistics #16014


Open

wants to merge 12 commits into main from late-pruning-files

Conversation

adriangb
Contributor

@adriangb adriangb commented May 10, 2025

@github-actions github-actions bot added the optimizer, core and datasource labels May 10, 2025
@adriangb
Contributor Author

A couple of thoughts:

  1. Needs cleanup.
  2. Not sure how to construct the empty stream.
  3. It might be nice to implement pruning for Vec<Statistics> where each statistic represents an arbitrary container (e.g. partition or file).

@alamb
Contributor

alamb commented May 11, 2025

It might be nice to implement pruning for Vec<Statistics> where each statistic represents an arbitrary container (e.g. partition or file).

Yes this would be super nice -- the more we can do to consolidate statistics / pruning the better off the code will be I think. Right now it is kind of scattered in several places
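
(For illustration, a hedged sketch of what such a Vec<Statistics>-backed implementation of the existing PruningStatistics trait could look like; the struct and helper names here are made up and the import paths are approximate. Each Statistics is one container, and each container's exact min/max becomes one row of the pruning arrays.)

use std::collections::HashSet;

use arrow::array::{ArrayRef, BooleanArray};
use arrow::datatypes::SchemaRef;
use datafusion_common::stats::Precision;
use datafusion_common::{Column, ScalarValue, Statistics};
use datafusion_physical_optimizer::pruning::PruningStatistics;

/// One `Statistics` per container (e.g. per file or per partition).
struct VecStatistics {
    statistics: Vec<Statistics>,
    schema: SchemaRef,
}

impl VecStatistics {
    /// Collect one scalar per container for `column`, or None if any
    /// container lacks an exact value (conservatively disabling pruning).
    fn gather(
        &self,
        column: &Column,
        get: impl Fn(&Statistics, usize) -> Option<ScalarValue>,
    ) -> Option<ArrayRef> {
        let idx = self.schema.index_of(column.name()).ok()?;
        let values: Option<Vec<_>> =
            self.statistics.iter().map(|s| get(s, idx)).collect();
        ScalarValue::iter_to_array(values?).ok()
    }
}

impl PruningStatistics for VecStatistics {
    fn min_values(&self, column: &Column) -> Option<ArrayRef> {
        self.gather(column, |s, i| match &s.column_statistics[i].min_value {
            Precision::Exact(v) => Some(v.clone()),
            _ => None,
        })
    }

    fn max_values(&self, column: &Column) -> Option<ArrayRef> {
        self.gather(column, |s, i| match &s.column_statistics[i].max_value {
            Precision::Exact(v) => Some(v.clone()),
            _ => None,
        })
    }

    fn num_containers(&self) -> usize {
        self.statistics.len()
    }

    fn null_counts(&self, _column: &Column) -> Option<ArrayRef> {
        None // could be derived from ColumnStatistics::null_count
    }

    fn row_counts(&self, _column: &Column) -> Option<ArrayRef> {
        None
    }

    fn contained(&self, _c: &Column, _v: &HashSet<ScalarValue>) -> Option<BooleanArray> {
        None
    }
}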

@alamb
Contributor

alamb commented May 11, 2025

Not sure how to construct the empty stream.

You can use something like https://docs.rs/futures/latest/futures/stream/fn.iter.html perhaps -- like futures::stream::iter(vec![]) for example 🤔
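
A minimal sketch of that (the only subtlety is that an empty Vec gives the compiler no inference hints, so the item type must be spelled out):

use arrow::record_batch::RecordBatch;
use datafusion_common::Result;
use futures::stream::{self, BoxStream, StreamExt};

// Build an empty stream with the item type a FileOpenFuture stream yields.
fn empty_batches() -> BoxStream<'static, Result<RecordBatch>> {
    stream::iter(Vec::<Result<RecordBatch>>::new()).boxed()
    // futures::stream::empty() would work equally well here.
}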

@@ -367,7 +368,7 @@ impl Default for OnError {
pub trait FileOpener: Unpin + Send + Sync {
/// Asynchronously open the specified file and return a stream
/// of [`RecordBatch`]
fn open(&self, file_meta: FileMeta) -> Result<FileOpenFuture>;
fn open(&self, file_meta: FileMeta, file: PartitionedFile) -> Result<FileOpenFuture>;
Contributor

Isn't it sufficient to provide only file statistics? PartitionedFile seems like overkill to me

Contributor Author

@adriangb adriangb May 11, 2025

Maybe? But I feel like since we have the partitioned file we might as well pass it in. Maybe we use it in the future to enable optimizations that use the partition values (e.g. late pruning based on partition values, including partition values in the scan so that more filters can be evaluated, etc.)

Contributor

I think using PartitionedFile as the "data we have at plan time" including statistics and potentially information about size, encryption, special indexes, etc makes a lot of sense

Contributor

Maybe? But I feel like since we have the partitioned file we might as well pass it in. Maybe we use it in the future to enable optimizations that use the partition values (e.g. late pruning based on partition values, including partition values in the scan so that more filters can be evaluated, etc.)

I believe these can also be inferred from statistics in a more generalized fashion (I don't know whether partition columns exist in column_statistics now), but it's not a big deal, we can keep this 👍🏻

Contributor

Can you please update the documentation for open() to mention that file has plan-time per-file information (such as statistics) and leave a doc link back?

Contributor

@berkaysynnada berkaysynnada left a comment

The idea makes a lot of sense. I have one implementation suggestion. Thanks again @adriangb

@adriangb adriangb marked this pull request as ready for review May 11, 2025 23:09
@adriangb adriangb force-pushed the late-pruning-files branch from 0e03bdc to 94726cc on May 11, 2025 23:10
@adriangb
Contributor Author

@alamb please review again; I implemented it and added a test 😄

(Some(stats), Some(predicate)) => {
let pruning_predicate = build_pruning_predicate(
Arc::clone(predicate),
&self.table_schema,
Member

Should it use table_schema here?

Comment on lines 93 to 94
match (&file.statistics, &self.predicate) {
(Some(stats), Some(predicate)) => {
Member

Given that there is only one branch, I suggest using if let (Some(_), Some(_)) = xxx {} here.
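
i.e. something like the following (a fragment sketching the suggestion; this is the shape the code takes later in the thread):

if let (Some(stats), Some(predicate)) = (&file.statistics, &self.predicate) {
    // build the pruning predicate and skip the file if it prunes
}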

Contributor

@alamb alamb left a comment

Very cool -- I think this is very close

}
}

/// Returns [`BooleanArray`] where each row represents information known
Contributor

this comment can probably be trimmed with a link back to the original trait source

@@ -995,6 +996,184 @@ fn build_statistics_record_batch<S: PruningStatistics>(
})
}

/// Prune a set of containers represented by their statistics.
Contributor

This is a nice structure -- I think it makes lots of sense and I agree 100%.

Specifically, I thought there was already code that pruned individual files based on statistics but I could not find any in ListingTable (we have something like this in influxdb_iox).

My opinion is that if we are going to put this code into the DataFusion codebase we should:

  1. Ensure that it helps as many users as possible
  2. Make sure it is executed as much as possible (to ensure test coverage)

Thus, what do you think about using the PrunableStatistics to prune the FileGroup in ListingTable here:

https://github.com/apache/datafusion/blob/55ba4cadce5ea99de4361929226f1c99cfc94450/datafusion/core/src/datasource/listing/table.rs#L1117-L1116

?

Pruning on statistics during plan time would potentially be redundant with also trying to prune again during opening, but it would reduce the files earlier in the plan

Contributor Author

How about I bundle in the PartitionValues somehow and then we can re-use and compose that?
Specifically:

  • TableProviders use just the partition values
  • ParquetOpener combines both
  • Something else can use just the stats

Contributor Author

Pruning on statistics during plan time would potentially be redundant with also trying to prune again during opening, but it would reduce the files earlier in the plan

Yeah I don't think it's redundant: you either prune or you don't. If we prune earlier the files don't make it this far. If we don't we may now be able to prune them. What's redundant is if there are no changes to the filters (i.e. no dynamic filters), but that sounds both hard to track and like a possible future optimization 😄

Contributor

kk

/// [`Self::min_values`], [`Self::max_values`], [`Self::null_counts`],
/// and [`Self::row_counts`].
fn num_containers(&self) -> usize {
1
Contributor

this should be self.statistics.len(), right?

@adriangb
Contributor Author

@alamb I pushed 4607643 which adds some nice APIs for partition values. In particular I think it's important to have a way to prune based on partition values + file level statistics (#15935).

However I can't implement it for ListingTable since the trait is defined in physical-optimizer. Can we move the trait somewhere upstream?

@alamb
Contributor

alamb commented May 13, 2025

However I can't implement it for ListingTable since the trait is defined in physical-optimizer. Can we move the trait somewhere upstream?

Maybe it is time to make a datafusion-pruning crate that has all the PruningPredicate and related infrastructure 🤔

@alamb
Contributor

alamb commented May 13, 2025

FYI @xudong963 I think this is relevant to your work on statistics / partition pruning as well

@adriangb
Contributor Author

However I can't implement it for ListingTable since the trait is defined in physical-optimizer. Can we move the trait somewhere upstream?

Maybe it is time to make a datafusion-pruning crate that has all the PruningPredicate and related infrastructure 🤔

Seems reasonable to me. I guess it'd be at the same level as PhysicalExpr and such.

@adriangb
Contributor Author

Moving to datafusion_common works pretty well, I think that's easier than making a new crate.

Next hurdle: at this point we've long lost information on the actual table schema / partition files. ParquetOpener::table_schema is actually the file schema and we have no way to back out the partition columns.
Given that PartitionedFile carries around partition_values: Vec<ScalarValue> I'd recommend either:

  1. Changing PartitionedFile::partition_values to Vec<(String, ScalarValue)>.
  2. Adding PartitionedFile::partition_schema.
  3. Piping down table_schema into ParquetSource and later ParquetOpener.

I think any of these also sets us up to refactor how the partition filters actually get applied (i.e. we don't have to inject them in the FileScan). But maybe that's not desirable because every format would have to implement this on their own; in that case we pipe them into ParquetOpener for pruning and still inject them in the scan (it should be cheapish).

@alamb any preference?

@xudong963 xudong963 self-requested a review May 14, 2025 14:24
Member

@xudong963 xudong963 left a comment

Generally LGTM, thank you

if let (Some(stats), Some(predicate)) = (&file.statistics, &self.predicate) {
let pruning_predicate = build_pruning_predicate(
Arc::clone(predicate),
&self.table_schema,
Member

Is it reasonable to use table_schema here?

Contributor Author

It's the only schema we have. And it's not even really the table schema, the name is misleading for historical reasons.

Member

It'd be better to add some notes about it. (I often get confused when reading the parquet code, all the kinds of schemas, lol)

Contributor Author

// Note about schemas: we are actually dealing with **3 different schemas** here:
// - The table schema as defined by the TableProvider. This is what the user sees, what they get when they `SELECT * FROM table`, etc.
// - The "virtual" file schema: this is the table schema minus any hive partition columns and projections. This is what the file schema is coerced to.
// - The physical file schema: this is the schema as defined by the parquet file. This is what the parquet file actually contains.
😄
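
(To make that concrete, a small hypothetical example of the three schemas for a table hive-partitioned on a made-up date column; none of these names come from the PR:)

use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    // Table schema (what `SELECT *` returns): file columns + partition columns.
    let table_schema = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
        Field::new("date", DataType::Utf8, false), // hive partition column
    ]);

    // "Virtual" file schema: the table schema minus the partition columns;
    // each file's schema is coerced to this.
    let virtual_file_schema = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
    ]);

    // Physical file schema: whatever the parquet file actually stores, e.g.
    // an older file written before `name` was added.
    let physical_file_schema = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
    ]);

    let _ = (table_schema, virtual_file_schema, physical_file_schema);
}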

@adriangb
Contributor Author

I think the next step here is to resolve #16014 (comment)

In my mind it makes sense to both push down the information and continue to have the ability to do it after the scan.
The direction DataFusion seems to be heading in is to add whatever functionality is needed to specialize readers for the most optimal performance (in this case by doing late pruning of files / partitions and being able to evaluate filters that mix partition columns and file columns during the scan) while preserving the ability to fall back to more general approaches (FilterExec, evaluating mixed filters after the scan) for sources that don't support this advanced functionality.

@alamb
Contributor

alamb commented May 14, 2025

Moving to datafusion_common works pretty well, I think that's easier than making a new crate.

I think we should try and avoid moving everything to datafusion_common. Since the pruning stuff relies on PhysicalExpr I don't think we can directly put it in datafusion_common

Next hurdle: at this point we've long lost information on the actual table schema / partition files. ParquetOpener::table_schema is actually the file schema and we have no way to back out the partition columns. Given that PartitionedFile carries around partition_values: Vec<ScalarValue> I'd recommend either:

  1. Changing PartitionedFile::partition_values to Vec<(String, ScalarValue)>.
  2. Adding PartitionedFile::partition_schema.
  3. Piping down table_schema into ParquetSource and later ParquetOpener.

I think any of these also sets us up to refactor how the partition filters actually get applied (i.e. we don't have to inject them in the FileScan). But maybe that's not desirable because every format would have to implement this on their own; in that case we pipe them into ParquetOpener for pruning and still inject them in the scan (it should be cheapish).

@alamb any preference?

  1. Changing PartitionedFile::partition_values to Vec<(String, ScalarValue)>.

I think this sounds like the most straightforward thing to me and the easiest way to get the required information

Seems like FileScanConfig already has table_partition_cols.

Maybe we can do something like this (change to use a FieldRef rather than Field to avoid copies):

pub struct PartitionedFile {
...
    pub partition_values: Vec<ScalarValue>,
...
}

to

pub struct PartitionedFile {
...
    pub partition_values: Vec<(FieldRef, ScalarValue)>,
...
}
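
(For illustration, a hedged sketch of what carrying the field alongside the value enables; the `date` column is made up. The partition schema can be rebuilt from the file entry alone, which would otherwise require piping the table schema down:)

use std::sync::Arc;
use arrow::datatypes::{DataType, Field, FieldRef, Schema};
use datafusion_common::ScalarValue;

fn main() {
    // A file under a path like `.../date=2025-05-10/part-0.parquet` (hypothetical).
    let partition_values: Vec<(FieldRef, ScalarValue)> = vec![(
        Arc::new(Field::new("date", DataType::Utf8, false)),
        ScalarValue::from("2025-05-10"),
    )];

    // Recover the partition schema directly from the stored pairs.
    let partition_schema = Schema::new(
        partition_values
            .iter()
            .map(|(field, _)| Arc::clone(field))
            .collect::<Vec<_>>(),
    );
    assert_eq!(partition_schema.fields().len(), 1);
}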

@alamb
Contributor

alamb commented May 14, 2025

BTW the other thing I somewhat worry about with reapplying pruning during file opening is that it is in the critical path and will directly add to the query latency. I wonder if there is some way to ensure we hide it behind IO if possible (aka make sure we are applying the extra pruning while the next file is opened rather than waiting to do it before starting that IO).

@adriangb
Contributor Author

Since the pruning stuff relies on PhysicalExpr I don't think we can directly put it in datafusion_common

The stuff I'm moving doesn't 😄. It's basically just the PruningStatistics trait.

Maybe we can do something like this (change to use a FieldRef rather than Field to avoid copies):

That sounds good to me. It kinda makes sense that if you're carrying around partition values you'd carry around info on what columns they belong to. Maybe it will help resolve #13270 as well in the future.

BTW the other thing I somewhat worry about with reapplying pruning during file opening is that it is in the critical path and will directly add to the query latency. I wonder if there is some way to ensure we hide it behind IO if possible (aka make sure we are applying the extra pruning while the next file is opened rather than waiting to do it before starting that IO).

I think we can move it a couple lines lower into Ok(Box::pin(async move { and that will do the trick? As long as it happens before we load the Parquet metadata the overhead is minimal. There's probably other stuff we could move into there if that's a concern.
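
(A hedged sketch of that shape; `should_prune` and the local FileOpenFuture alias are stand-ins, not DataFusion's exact definitions:)

use arrow::record_batch::RecordBatch;
use datafusion_common::Result;
use futures::future::BoxFuture;
use futures::stream::{self, BoxStream, StreamExt};

// Local stand-in for DataFusion's FileOpenFuture.
type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch>>>>;

// The pruning check runs inside the async block, so open() returns
// immediately and the check overlaps with IO for other files. If the file
// is pruned, no parquet metadata is ever fetched.
fn open_sketch(should_prune: impl FnOnce() -> bool + Send + 'static) -> Result<FileOpenFuture> {
    Ok(Box::pin(async move {
        if should_prune() {
            // File pruned: return an empty stream before any IO happens.
            return Ok(stream::iter(Vec::<Result<RecordBatch>>::new()).boxed());
        }
        // ... fetch parquet metadata and build the real stream here ...
        Ok(stream::iter(Vec::<Result<RecordBatch>>::new()).boxed())
    }))
}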

@github-actions github-actions bot added the common and proto labels May 15, 2025
@adriangb adriangb force-pushed the late-pruning-files branch from e8eb87f to cc120d0 on May 15, 2025 03:30
@github-actions github-actions bot added the documentation label May 15, 2025
@adriangb
Contributor Author

@alamb @xudong963 I've pushed a change that:

  1. Moves PruningStatistics into common.
  2. Adds composable helpers to prune based on Vec<Statistics> (multiple files / partitions) and Vec<Vec<ScalarValue>> (multiple containers of partition values).
  3. Adds partition_fields: Vec<FieldRef> to ParquetOpener, with slight tweaks to FileScanConfig (the latter is a bit of a PITA because of how it's both a struct and its own builder).
  4. Implements the pruning inside the IO work so that it's deferred, as Andrew asked.
  5. Sets us up nicely to pipe the partition values into the other stages of pruning (row group stats, page stats and row filters). Leaving this for future work though.

@alamb
Contributor

alamb commented May 15, 2025

I will review this more carefully later today

@alamb alamb mentioned this pull request May 15, 2025
@adriangb
Contributor Author

@alamb I took a look at the test failures and it seems to me that the tests expected pruning at the row group stats level but it's now happening at the file level, which is a good thing 😄! But it makes the tests fail 😢. They're macro-generated tests, which is a bit wonky... it may need a more careful eye to figure out how to rejig the tests to accept the new pruning. I'm at a conference all of this week and on PTO next week so I'm not sure I'll be able to get to it, but I'll try; if someone else can look that'd be great.

Contributor

@alamb alamb left a comment

TLDR is I think this is really nice and very powerful @adriangb. The only thing I think is needed prior to merge is to figure out some way to avoid trying to prune files when we know an attempt has already been made at planning time.

Maybe we can break this PR up into some smaller chunks:

  1. Move PruningStatistics to datafusion-common (that is an easy one)
  2. Potentially one for CompositePruningStatistics and PrunableStatistics
  3. The final one that hooks it all up on the FileOpener

}
}

pub struct CompositePruningStatistics {
Contributor

This is a very fancy idea -- it probably needs some more comments about what it does (namely, it combines multiple sources together: if one pruning statistics source doesn't have information for a particular column, it tries the other PruningStatistics in turn).

Contributor Author

I added some more docs 😄
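
(For readers following the thread, a hedged sketch of that per-column fallback; this is not the PR's code and the import paths are approximate:)

use std::collections::HashSet;
use std::sync::Arc;

use arrow::array::{ArrayRef, BooleanArray};
use datafusion_common::{Column, ScalarValue};
use datafusion_physical_optimizer::pruning::PruningStatistics;

/// Combines multiple PruningStatistics sources; for each column, each
/// source is asked in turn and the first that has information wins.
pub struct CompositePruningStatistics {
    statistics: Vec<Arc<dyn PruningStatistics>>,
}

impl PruningStatistics for CompositePruningStatistics {
    fn min_values(&self, column: &Column) -> Option<ArrayRef> {
        self.statistics.iter().find_map(|s| s.min_values(column))
    }

    fn max_values(&self, column: &Column) -> Option<ArrayRef> {
        self.statistics.iter().find_map(|s| s.max_values(column))
    }

    fn num_containers(&self) -> usize {
        // All sources must describe the same set of containers.
        self.statistics.first().map_or(0, |s| s.num_containers())
    }

    fn null_counts(&self, column: &Column) -> Option<ArrayRef> {
        self.statistics.iter().find_map(|s| s.null_counts(column))
    }

    fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
        self.statistics.iter().find_map(|s| s.row_counts(column))
    }

    fn contained(&self, column: &Column, values: &HashSet<ScalarValue>) -> Option<BooleanArray> {
        self.statistics.iter().find_map(|s| s.contained(column, values))
    }
}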

let enable_page_index = self.enable_page_index;

Ok(Box::pin(async move {
// Prune this file using the file level statistics.
Contributor

I worry that in the case when there aren't any dynamic predicates, trying to prune again on file opening is simply going to be pure overhead / wasteful.

Therefore, I think it would be good if we could somehow control / disable trying to apply this extra filtering when it is known it would not help

For example, maybe we can have a field on ParquetOpener with something like prune_on_open which can be set to true if there are dynamic predicates present.

This would also likely ensure the tests can pass again

Contributor Author

which can be set to true if there are dynamic predicates present

the issue is: how do we know the filters are dynamic? we've hidden dynamic filters behind PhysicalExpr so that the system can treat them as normal filters. we could do any filter pushdown but that doesn't seem like much of an improvement.

I also think this pruning should be quite cheap / the record batches being filtered are just a couple rows

vec![vec![]; partition_schema.fields().len()];
for partition_value in partition_values.iter() {
for (i, value) in partition_value.iter().enumerate() {
partition_values_by_column[i].push(value.clone());
Contributor

it would be great to avoid these clones if possible
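
(One possible shape for that, assuming the caller can hand over the row-major values by value; the names mirror the snippet above but are hypothetical:)

use datafusion_common::ScalarValue;

// Transpose row-major partition values (one Vec per file) into column-major
// Vecs by moving each ScalarValue instead of cloning it.
fn transpose(partition_values: Vec<Vec<ScalarValue>>, num_columns: usize) -> Vec<Vec<ScalarValue>> {
    let mut by_column: Vec<Vec<ScalarValue>> = vec![Vec::new(); num_columns];
    for row in partition_values {
        for (i, value) in row.into_iter().enumerate() {
            by_column[i].push(value);
        }
    }
    by_column
}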

parquet_file_reader_factory: Arc::new(
DefaultParquetFileReaderFactory::new(Arc::clone(&store)),
),
partition_fields: vec![],
Contributor

we probably need a test for pruning on partition_fields as well

) -> Option<BooleanArray>;
}

pub struct PartitionPruningStatistics {
Contributor

I think we should document this struct, specifically including information about how the partition values are mapped to the main schema


/// Prune a set of containers represented by their statistics.
/// Each [`Statistics`] represents a container (e.g. a file or a partition of files).
pub struct PrunableStatistics {
Contributor

I think this is a good pattern -- it turns out we have something very similar in influxdb_iox:

https://github.com/influxdata/influxdb3_core/blob/af9fabea05e2135a094a69dc5b7d549e713420f9/iox_query/src/pruning.rs#L157

/// Each [`Statistics`] represents a container (e.g. a file or a partition of files).
pub struct PrunableStatistics {
/// Statistics for each container.
statistics: Vec<Arc<Statistics>>,
Contributor

I suspect we could just use references here and save a bunch of Arcs (not a big deal), something like:

Suggested change
statistics: Vec<Arc<Statistics>>,
statistics: Vec<&'a Statistics>,

Contributor Author

I tried this but it turned out to be tricky given that all of this ends up in a boxed future, etc.

@adriangb
Contributor Author

Let's start with #16069

@adriangb
Contributor Author

My plan for this PR now is to first resolve blockers. In particular:

And then come back here and resolve the rest of the points of discussion.


Successfully merging this pull request may close these issues.

Pass PartitionedFile into FileSource for late file stats based pruning