feat: adaptive filter selectivity tracking for Parquet row filters #19639
Conversation
This is probably not a good issue to pick up. This is a draft PR for an unproven idea.
Did you find any evidence that the selectivity of predicates changes over the course of the query (or, put another way, that reordering them during execution would help)?
One case where that definitely happens is dynamic filters 😉 (although we treat each version of one as a different filter, the point is that we need to be dynamic about new filters showing up mid-query). But I also think it's not unusual to have unevenly distributed data across files. I do agree that changing how we apply a filter between files probably captures 95% of the benefit (as opposed to changing it within the scan of a single file). But the main point is that we start from "we know nothing", which we treat as "nothing is selective", and then once we know a filter is selective enough we move it over to be a row filter.
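To make that concrete, here is a minimal sketch of the decision described above (illustrative only, not the PR's code; `FilterDisposition`, the function name, and the 0.8 threshold are made up for the example): a filter with no observed history is treated as non-selective and stays post-scan, and it is only promoted to a Parquet row filter once its observed effectiveness crosses the threshold.

```rust
/// Where a predicate should be evaluated for the next file.
#[derive(Debug, PartialEq)]
enum FilterDisposition {
    /// Pushed down as a Parquet row filter (evaluated during the scan).
    RowFilter,
    /// Applied after the scan, on already-decoded batches.
    PostScan,
}

/// Decide the disposition from the effectiveness observed so far.
/// `effectiveness` is `None` when the filter has never been evaluated.
fn disposition(effectiveness: Option<f64>, threshold: f64) -> FilterDisposition {
    match effectiveness {
        // "We know nothing" is treated as "nothing is selective": keep it post-scan.
        None => FilterDisposition::PostScan,
        Some(e) if e >= threshold => FilterDisposition::RowFilter,
        Some(_) => FilterDisposition::PostScan,
    }
}

fn main() {
    // A brand-new filter (e.g. a dynamic filter that just showed up) stays post-scan.
    assert_eq!(disposition(None, 0.8), FilterDisposition::PostScan);
    // Once previous files show it removes >= 80% of rows, promote it.
    assert_eq!(disposition(Some(0.93), 0.8), FilterDisposition::RowFilter);
    assert_eq!(disposition(Some(0.40), 0.8), FilterDisposition::PostScan);
}
```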
Hey @adriangb, I've been thinking about something like this since the New Year. It's really cool to see you putting together a draft for it. I haven't had a chance to give your code a full go, but I wanted to share some research I did earlier that might be relevant:
Before seeing your PR and comments in #3463 I was thinking about using simpler heuristics for sorting predicates.
From a quick skim of the original ClickHouse PR, they still rely on some simple heuristics when column statistics aren't available. I would like to give your PR a proper review once I'm home, but I already love the direction you're taking.
I've pushed a commit which has
@adriangb, upon further reading, PREWHERE is not performing row group pruning; it evaluates predicate expressions right after that, at the row filter stage. However, I don't think it decides which columns to push into the scan and which to demote to post-scan, so it's not that relevant here. Originally, I hadn't read your comments and PR carefully enough, so I was under the impression that you were trying to use row group statistics to improve filter ordering (what PREWHERE in ClickHouse is now doing). I now understand that your approach is more like dynamic programming, where we estimate selectivity at runtime per predicate per file, which is indeed very different from the ClickHouse implementation.
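To illustrate what "estimate selectivity at runtime per predicate per file" could look like, here is a rough sketch (hypothetical names, not the PR's actual `SelectivityStats`): accumulate, per predicate, the rows it saw and the rows it removed as files finish scanning, and derive an effectiveness ratio from the running totals.

```rust
/// Running counts for one predicate, accumulated as files finish scanning.
#[derive(Default)]
struct PredicateStats {
    rows_evaluated: u64,
    rows_filtered: u64, // rows the predicate removed
}

impl PredicateStats {
    /// Record the outcome of evaluating the predicate over one file.
    fn record_file(&mut self, rows_in: u64, rows_out: u64) {
        self.rows_evaluated += rows_in;
        self.rows_filtered += rows_in.saturating_sub(rows_out);
    }

    /// Fraction of rows removed so far; `None` until we have observations.
    fn effectiveness(&self) -> Option<f64> {
        (self.rows_evaluated > 0)
            .then(|| self.rows_filtered as f64 / self.rows_evaluated as f64)
    }
}

fn main() {
    let mut stats = PredicateStats::default();
    assert_eq!(stats.effectiveness(), None); // nothing observed yet
    stats.record_file(1_000, 50);  // file 1: removed 950 of 1000 rows
    stats.record_file(2_000, 400); // file 2: removed 1600 of 2000 rows
    // 2550 removed out of 3000 evaluated -> 0.85
    assert_eq!(stats.effectiveness(), Some(0.85));
}
```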
I wonder if a good strategy could be:
Another way of testing whether parallelization is the issue is decreasing
Running some tests in #19694
Running some tests locally, it seems that the bad performance is mostly related to join pushdown. Setting
Ok - that looks better (#19694 (comment)):
So I guess the main factor is expressions like this being super expensive to evaluate (query 9):
#19694 (comment)
So I would suggest taking the following steps
I wonder if it's the expression being expensive to evaluate, or whether evaluating it where it currently sits causes the issue. That is, if this were evaluated in a
I guess we need to test both of those to understand how each one impacts the results...
I tested both: the join filter pushdown has the dramatic (bad) impact on TPC-H performance, making it overall only slightly better than without predicate pushdown; DATAFUSION_OPTIMIZER_REPARTITION_FILE_MIN_SIZE has only minimal impact.
I am not sure that would help. I guess join pushdown mostly helps if a lot of IO can be avoided and the expression is not super expensive to evaluate, which I think won't get better by moving it inside
Summary
This PR implements cross-file tracking of filter selectivity in ParquetSource to adaptively reorder and demote low-selectivity filters, as discussed in #3463 (comment).
Key changes:
- `SelectivityTracker` to track filter effectiveness across files, using an `ExprKey` wrapper for structural equality
- `ParquetOpener` queries shared stats to partition filters into row filters (push down) vs post-scan filters (inline application)
- Post-scan filters are applied via `apply_post_scan_filters()`, then filter columns are removed from the output
- `SelectivityUpdatingStream` wrapper updates the tracker when the stream completes
- `build_row_filter_with_metrics()` returns per-filter metrics for selectivity tracking
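As a rough illustration of how these pieces could fit together (a hedged sketch, not the actual implementation: the real `ExprKey` presumably wraps the physical expression itself, whereas this sketch keys on a display string for simplicity, and the TPC-H-style predicates are arbitrary examples):

```rust
use std::collections::HashMap;

/// Simplified stand-in for `ExprKey`: identify a predicate by a canonical
/// string form so structurally equal predicates share one stats entry.
type ExprKey = String;

#[derive(Default)]
struct SelectivityTracker {
    /// Observed effectiveness (fraction of rows removed) per predicate.
    stats: HashMap<ExprKey, f64>,
}

impl SelectivityTracker {
    /// Split candidate predicates into row filters (pushed into the Parquet
    /// decoder) and post-scan filters, based on observed effectiveness.
    fn partition_filters(
        &self,
        candidates: Vec<ExprKey>,
        threshold: f64,
    ) -> (Vec<ExprKey>, Vec<ExprKey>) {
        candidates.into_iter().partition(|key| {
            // Unknown predicates default to post-scan until proven selective.
            self.stats.get(key).is_some_and(|e| *e >= threshold)
        })
    }
}

fn main() {
    let mut tracker = SelectivityTracker::default();
    tracker.stats.insert("l_shipdate <= '1998-09-02'".into(), 0.95);
    tracker.stats.insert("l_comment LIKE '%green%'".into(), 0.10);

    let (row_filters, post_scan) = tracker.partition_filters(
        vec![
            "l_shipdate <= '1998-09-02'".into(),
            "l_comment LIKE '%green%'".into(),
            "l_quantity < 24".into(), // never seen before
        ],
        0.8,
    );
    assert_eq!(row_filters, vec!["l_shipdate <= '1998-09-02'".to_string()]);
    assert_eq!(post_scan.len(), 2);
}
```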
Configuration:
- `parquet_options.filter_effectiveness_threshold` (default: 0.8)
Files added:
- `datafusion/datasource-parquet/src/selectivity.rs` - Core tracking infrastructure
Files modified:
- `opener.rs` - Filter partitioning, post-scan application, `SelectivityUpdatingStream`
- `row_filter.rs` - `FilterMetrics`, `RowFilterWithMetrics`, effectiveness-based reordering
- `source.rs` - `selectivity_tracker` field and builder methods
- `config.rs` - `filter_effectiveness_threshold` config option

Test plan
- `ExprKey` hash/eq consistency
- `SelectivityStats::effectiveness()` edge cases
- `SelectivityTracker::partition_filters()` threshold logic

🤖 Generated with Claude Code