
Add microbenchmark for spilling with compression #16512


Open · ding-young wants to merge 2 commits into main

Conversation

@ding-young (Contributor) commented on Jun 23, 2025

Which issue does this PR close?

What changes are included in this PR?

This PR adds microbenchmarks that compare the performance characteristics of different compression codecs. It generates 50 RecordBatches for each case, runs both write and read, and manually prints the compression ratio (mem_bytes / disk_bytes) after each run.

To make the benchmark more realistic, this PR generates RecordBatches that resemble the data spilled by AggregateExec (tpch) and SortExec (sort-tpch). It covers both thin batches consisting of primitive arrays and wide batches with complex data types.
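For context, a minimal sketch of what one benchmark case boils down to, assuming plain arrow-rs IPC APIs (`StreamWriter`, `StreamReader`, `IpcWriteOptions`) rather than the PR's actual spill plumbing; `roundtrip_ratio` is an illustrative name, not from this PR:

```rust
use std::fs::File;

use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::{IpcWriteOptions, StreamWriter};
use arrow::ipc::CompressionType;
use arrow::record_batch::RecordBatch;

/// Writes `batches` to a temp file with the given codec, reads them back,
/// and returns the compression ratio (mem_bytes / disk_bytes).
/// Assumes `batches` is non-empty.
fn roundtrip_ratio(
    batches: &[RecordBatch],
    codec: Option<CompressionType>, // None = uncompressed
) -> arrow::error::Result<f64> {
    let schema = batches[0].schema();
    let path = std::env::temp_dir().join("spill_bench.arrow");

    // Write all batches with the chosen codec.
    let options = IpcWriteOptions::default().try_with_compression(codec)?;
    let mut writer =
        StreamWriter::try_new_with_options(File::create(&path)?, &schema, options)?;
    let mut mem_bytes = 0usize;
    for batch in batches {
        mem_bytes += batch.get_array_memory_size();
        writer.write(batch)?;
    }
    writer.finish()?;

    // Read everything back so decompression cost is part of the measured run.
    for batch in StreamReader::try_new(File::open(&path)?, None)? {
        let _ = batch?;
    }

    Ok(mem_bytes as f64 / std::fs::metadata(&path)?.len() as f64)
}
```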

Rationale for this change

Benchmark Case Overview

Below are the RecordBatch schema and the original query for each benchmark case (a sketch of one generator follows the list).

  • Q2 [Int64(partkey), Decimal128(min(ps_supplycost))]

```sql
select
    ...
    p_partkey,
    ...
where
    p_partkey = ps_partkey
    ...
    and ps_supplycost = (
        select
            min(ps_supplycost)
```

  • Q16 [Utf8(p_brand), Utf8(p_type), Int32(p_size), Int64(supplier_cnt)]

```sql
select
    p_brand,
    p_type,
    p_size,
    count(distinct ps_suppkey) as supplier_cnt
...
group by
    p_brand, p_type, p_size
...
```

  • Q20 [Int64(suppkey), Int64(partkey), Decimal128(sum(l_quantity))]

```sql
...
select
    ps_suppkey
from
    partsupp
where
    ps_partkey in (
        select
            p_partkey
            ...
    )
    and ps_availqty > (
        select
            0.5 * sum(l_quantity)
```

  • Sort-tpch Q10 wide [Int32, Int64 * 3, Decimal128 * 4, Date * 3, Utf8 * 4]

```sql
SELECT l_orderkey, l_suppkey, l_linenumber, l_comment,
       l_partkey, l_quantity, l_extendedprice, l_discount, l_tax,
       l_returnflag, l_linestatus, l_shipdate, l_commitdate,
       l_receiptdate, l_shipinstruct, l_shipmode
FROM lineitem
ORDER BY l_orderkey, l_suppkey, l_linenumber, l_comment
```
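As a rough illustration of what such a generator produces, here is a hedged sketch of a Q2-like batch builder. The field names and value distribution are assumptions; the Decimal128(15, 2) type and the 8192-row batch size match the snippets reviewed further down.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Decimal128Array, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Builds one Q2-like batch: [Int64 (partkey), Decimal128(15, 2) (min(ps_supplycost))].
fn q2_like_batch(num_rows: usize) -> RecordBatch {
    let schema = Arc::new(Schema::new(vec![
        Field::new("p_partkey", DataType::Int64, false),
        Field::new("min_ps_supplycost", DataType::Decimal128(15, 2), false),
    ]));

    // Synthetic values; the PR's generator shapes these to resemble real spills.
    let partkeys = Int64Array::from_iter_values(0..num_rows as i64);
    let costs = Decimal128Array::from_iter_values((0..num_rows).map(|i| i as i128 * 100))
        .with_precision_and_scale(15, 2)
        .unwrap();

    RecordBatch::try_new(schema, vec![Arc::new(partkeys) as ArrayRef, Arc::new(costs)])
        .unwrap()
}
```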

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the physical-plan (Changes to the physical-plan crate) label on Jun 23, 2025
@ding-young (Contributor, Author) commented on Jun 23, 2025

To run the bench: `cargo bench --bench spill_io`

Results

| Case | Compression | Time (ms) | Memory (MB) | Disk (MB) | Compression Ratio |
|------|-------------|-----------|-------------|-----------|-------------------|
| Q2 | Uncompressed | 51.521 | 9.4 | 9.5 | 0.990 |
| Q2 | Zstd | 147.360 | 9.4 | 1.5461 | 6.215 |
| Q2 | Lz4Frame | 97.942 | 9.4 | 3.2 | 2.922 |
| Q16 | Uncompressed | 78.053 | 23.5 | 19.5 | 1.209 |
| Q16 | Zstd | 236.480 | 23.5 | 4.4 | 5.373 |
| Q16 | Lz4Frame | 145.180 | 23.5 | 7.8 | 3.007 |
| Q20 | Uncompressed | 64.233 | 12.5 | 12.7 | 0.989 |
| Q20 | Zstd | 190.570 | 12.5 | 2.4 | 5.282 |
| Q20 | Lz4Frame | 123.430 | 12.5 | 4.8 | 2.629 |
| Wide (Q10) | Uncompressed | 215.220 | 56.4 | 54.2 | 1.041 |
| Wide (Q10) | Zstd | 443.190 | 56.4 | 11.6 | 4.857 |
| Wide (Q10) | Lz4Frame | 255.530 | 56.4 | 19.9 | 2.834 |

@ding-young (Contributor, Author) commented:
As expected, although lz4_frame has a lower compression ratio than zstd, it runs faster, making it a reasonable tradeoff. However, since it's roughly 2x slower than the uncompressed case, we may need to explore optimization strategies further.

If there are other interesting directions for benchmarking, such as measuring write and read times separately, isolating compression and decompression overhead, or testing on different data types, I’d love to hear suggestions.

@ding-young ding-young marked this pull request as ready for review June 24, 2025 12:12
@2010YOUY01 (Contributor) left a comment:

Thank you for the micro-benches 💯 It's great to know these specific numbers.

Suggestion for this PR

To make the bench results easier to interpret, I suggest printing the bandwidth (calculated as raw-batch-size / time × 2):

This way we can compare it with typical compression bandwidths, and then figure out whether there is any room to optimize the speed in our implementation.

| Case | Compression | Time (ms) | Memory (MB) | Disk (MB) | Compression Ratio | Bandwidth (MB/s) |
|------|-------------|-----------|-------------|-----------|-------------------|------------------|
| Q2 | Uncompressed | 51.521 | 9.4 | 9.5 | 0.990 | ~400 |
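A hedged sketch of that calculation (the helper name is illustrative, not from the PR):

```rust
// Suggested metric: raw in-memory bytes divided by elapsed time, times 2
// because each measured run writes the data and then reads it back.
fn bandwidth_mb_per_sec(mem_bytes: usize, elapsed: std::time::Duration) -> f64 {
    (mem_bytes as f64 * 2.0) / elapsed.as_secs_f64() / (1024.0 * 1024.0)
}
```

Plugging in the Q2 uncompressed row: 9.4 MB × 2 / 51.521 ms ≈ 365 MB/s, in line with the ~400 MB/s figure above.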

For potential future optimizations (I set the number of batches to 1000 manually, to make the dataset larger than the cache), the bandwidth on my machine is roughly:

  • plain: 2 GB/s
  • lz4: 600 MB/s
  • zstd: 270 MB/s

The takeaway from this result: since plain encoding is a very simple operation that doesn't require much computation, its bandwidth should be close to disk bandwidth (~5 GB/s on my machine). The result suggests there might be some room to make the codec faster.

I'm not sure whether the bandwidths for the compression codecs are ideal; I'll try to find out later.

```rust
    .with_precision_and_scale(15, 2)
    .unwrap();

for _ in 0..8192 {
```
Review comment (Contributor):

I think we should make this 8192 a global variable or function arg
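For instance (a sketch; the constant and function names are assumptions, not from the PR):

```rust
// Illustrative refactor: lift the magic number into a shared constant
// so every benchmark case generates batches of the same size.
const ROWS_PER_BATCH: usize = 8192;

fn generate_rows() {
    for _ in 0..ROWS_PER_BATCH {
        // build one synthetic row per iteration
    }
}
```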

```rust
// using realistic input data inspired by TPC-H aggregate spill scenarios.
//
// This function prepares synthetic RecordBatches that mimic the schema and distribution
// of intermediate aggregate results from representative TPC-H queries (Q2, Q16, Q20) and sort-tpch Q10
```
Review comment (Contributor):

Maybe we can also copy the schemas of each workload into the implementation comments here.
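For example, a comment block like this near the generators, summarizing the schemas already listed in the PR description:

```rust
// Schemas of the generated batches (from the PR description):
//   Q2  -> [Int64 (partkey), Decimal128 (min(ps_supplycost))]
//   Q16 -> [Utf8 (p_brand), Utf8 (p_type), Int32 (p_size), Int64 (supplier_cnt)]
//   Q20 -> [Int64 (suppkey), Int64 (partkey), Decimal128 (sum(l_quantity))]
//   Q10 -> [Int32, Int64 * 3, Decimal128 * 4, Date * 3, Utf8 * 4] (sort-tpch, wide)
```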

```rust
let num_batches = 50;

// Q2 [Int64, Decimal128]
let (schema, batches) = create_q2_like_batches(50);
```
Review comment (Contributor):

Suggested change:

```diff
-let (schema, batches) = create_q2_like_batches(50);
+let (schema, batches) = create_q2_like_batches(num_batches);
```

and the same for the two functions below
