
Add microbenchmark for spilling with compression #16512


Open · ding-young wants to merge 2 commits into main

Conversation

@ding-young (Contributor) commented on Jun 23, 2025

Which issue does this PR close?

What changes are included in this PR?

This PR adds microbenchmarks that compare the performance characteristics of different compression codecs. It generates 50 RecordBatches for each case, runs both write and read, and manually prints the compression ratio (mem_bytes / disk_bytes) after each run.

To make the benchmark more realistic, this PR generates RecordBatches that resemble the data spilled by AggregateExec (tpch) and SortExec (sort-tpch). It covers both thin batches consisting of primitive arrays and wide batches with complex data types.
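For context, a minimal sketch of what one benchmark case boils down to, assuming plain arrow-rs IPC APIs (`StreamWriter`, `StreamReader`, `IpcWriteOptions`) rather than the PR's actual spill plumbing; `roundtrip_ratio` is an illustrative name, not from this PR:

```rust
use std::fs::File;

use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::{IpcWriteOptions, StreamWriter};
use arrow::ipc::CompressionType;
use arrow::record_batch::RecordBatch;

/// Writes `batches` to a temp file with the given codec, reads them back,
/// and returns the compression ratio (mem_bytes / disk_bytes).
/// Assumes `batches` is non-empty.
fn roundtrip_ratio(
    batches: &[RecordBatch],
    codec: Option<CompressionType>, // None = uncompressed
) -> arrow::error::Result<f64> {
    let schema = batches[0].schema();
    let path = std::env::temp_dir().join("spill_bench.arrow");

    // Write all batches with the chosen codec.
    let options = IpcWriteOptions::default().try_with_compression(codec)?;
    let mut writer =
        StreamWriter::try_new_with_options(File::create(&path)?, &schema, options)?;
    let mut mem_bytes = 0usize;
    for batch in batches {
        mem_bytes += batch.get_array_memory_size();
        writer.write(batch)?;
    }
    writer.finish()?;

    // Read everything back so decompression cost is part of the measured run.
    for batch in StreamReader::try_new(File::open(&path)?, None)? {
        let _ = batch?;
    }

    Ok(mem_bytes as f64 / std::fs::metadata(&path)?.len() as f64)
}
```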

Rationale for this change

Benchmark Case Overview

Below are the RecordBatch schema and the original query for each benchmark case (a sketch of one generator follows the list).

  • Q2 [Int64(partkey), Decimal128(min(ps_supplycost))]

```sql
select
    ...
    p_partkey,
    ...
where
    p_partkey = ps_partkey
    ...
    and ps_supplycost = (
        select
            min(ps_supplycost)
```

  • Q16 [Utf8(p_brand), Utf8(p_type), Int32(p_size), Int64(supplier_cnt)]

```sql
select
    p_brand,
    p_type,
    p_size,
    count(distinct ps_suppkey) as supplier_cnt
...
group by
    p_brand, p_type, p_size
...
```

  • Q20 [Int64(suppkey), Int64(partkey), Decimal128(sum(l_quantity))]

```sql
...
select
    ps_suppkey
from
    partsupp
where
    ps_partkey in (
        select
            p_partkey
            ...
    )
    and ps_availqty > (
        select
            0.5 * sum(l_quantity)
```

  • Sort-tpch Q10 wide [Int32, Int64 * 3, Decimal128 * 4, Date * 3, Utf8 * 4]

```sql
SELECT l_orderkey, l_suppkey, l_linenumber, l_comment,
       l_partkey, l_quantity, l_extendedprice, l_discount, l_tax,
       l_returnflag, l_linestatus, l_shipdate, l_commitdate,
       l_receiptdate, l_shipinstruct, l_shipmode
FROM lineitem
ORDER BY l_orderkey, l_suppkey, l_linenumber, l_comment
```
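As a rough illustration of what such a generator produces, here is a hedged sketch of a Q2-like batch builder. The field names and value distribution are assumptions; the Decimal128(15, 2) type and the 8192-row batch size match the snippets reviewed further down.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Decimal128Array, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Builds one Q2-like batch: [Int64 (partkey), Decimal128(15, 2) (min(ps_supplycost))].
fn q2_like_batch(num_rows: usize) -> RecordBatch {
    let schema = Arc::new(Schema::new(vec![
        Field::new("p_partkey", DataType::Int64, false),
        Field::new("min_ps_supplycost", DataType::Decimal128(15, 2), false),
    ]));

    // Synthetic values; the PR's generator shapes these to resemble real spills.
    let partkeys = Int64Array::from_iter_values(0..num_rows as i64);
    let costs = Decimal128Array::from_iter_values((0..num_rows).map(|i| i as i128 * 100))
        .with_precision_and_scale(15, 2)
        .unwrap();

    RecordBatch::try_new(schema, vec![Arc::new(partkeys) as ArrayRef, Arc::new(costs)])
        .unwrap()
}
```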

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the physical-plan (Changes to the physical-plan crate) label on Jun 23, 2025
@ding-young (Contributor, Author) commented on Jun 23, 2025

To run the bench: `cargo bench --bench spill_io`

Results

| Case | Compression | Time (ms) | Memory (MB) | Disk (MB) | Compression Ratio |
|------|-------------|-----------|-------------|-----------|-------------------|
| Q2 | Uncompressed | 51.521 | 9.4 | 9.5 | 0.990 |
| Q2 | Zstd | 147.360 | 9.4 | 1.5461 | 6.215 |
| Q2 | Lz4Frame | 97.942 | 9.4 | 3.2 | 2.922 |
| Q16 | Uncompressed | 78.053 | 23.5 | 19.5 | 1.209 |
| Q16 | Zstd | 236.480 | 23.5 | 4.4 | 5.373 |
| Q16 | Lz4Frame | 145.180 | 23.5 | 7.8 | 3.007 |
| Q20 | Uncompressed | 64.233 | 12.5 | 12.7 | 0.989 |
| Q20 | Zstd | 190.570 | 12.5 | 2.4 | 5.282 |
| Q20 | Lz4Frame | 123.430 | 12.5 | 4.8 | 2.629 |
| Wide (Q10) | Uncompressed | 215.220 | 56.4 | 54.2 | 1.041 |
| Wide (Q10) | Zstd | 443.190 | 56.4 | 11.6 | 4.857 |
| Wide (Q10) | Lz4Frame | 255.530 | 56.4 | 19.9 | 2.834 |

@ding-young (Contributor, Author) commented:
As expected, although lz4_frame has a lower compression ratio than zstd, it runs faster, making it a reasonable tradeoff. However, since it's roughly 2x slower than the uncompressed case, we may need to explore optimization strategies further.

If there are other interesting directions for benchmarking, such as measuring write and read times separately, isolating compression and decompression overhead, or testing on different data types, I’d love to hear suggestions.

@ding-young ding-young marked this pull request as ready for review June 24, 2025 12:12
@2010YOUY01 (Contributor) left a comment:

Thank you for the micro-benches 💯 It's great to know these specific numbers.

Suggestion for this PR

To make the bench results easier to interpret, I suggest printing the bandwidth (calculated as raw-batch-size / time × 2):

This way we can compare it with typical compression bandwidths, and then figure out whether there is any room to optimize the speed in our implementation.

| Case | Compression | Time (ms) | Memory (MB) | Disk (MB) | Compression Ratio | Bandwidth (MB/s) |
|------|-------------|-----------|-------------|-----------|-------------------|------------------|
| Q2 | Uncompressed | 51.521 | 9.4 | 9.5 | 0.990 | ~400 |
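A hedged sketch of that calculation (the helper name is illustrative, not from the PR):

```rust
// Suggested metric: raw in-memory bytes divided by elapsed time, times 2
// because each measured run writes the data and then reads it back.
fn bandwidth_mb_per_sec(mem_bytes: usize, elapsed: std::time::Duration) -> f64 {
    (mem_bytes as f64 * 2.0) / elapsed.as_secs_f64() / (1024.0 * 1024.0)
}
```

Plugging in the Q2 uncompressed row: 9.4 MB × 2 / 51.521 ms ≈ 365 MB/s, in line with the ~400 MB/s figure above.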

For potential future optimizations (I set the number of batches to 1000 manually, to make the dataset larger than the cache), the bandwidth on my machine is roughly:

  • plain: 2 GB/s
  • lz4: 600 MB/s
  • zstd: 270 MB/s

The takeaway from this result: since plain encoding is a very simple operation that doesn't require much computation, its bandwidth should be close to disk bandwidth (~5 GB/s on my machine). The result suggests there might be some room to make the codec faster.

I'm not sure whether the bandwidths for the compression codecs are ideal; I'll try to find out later.

```rust
    .with_precision_and_scale(15, 2)
    .unwrap();

for _ in 0..8192 {
```
Review comment (Contributor):

I think we should make this 8192 a global variable or function arg
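For instance (a sketch; the constant and function names are assumptions, not from the PR):

```rust
// Illustrative refactor: lift the magic number into a shared constant
// so every benchmark case generates batches of the same size.
const ROWS_PER_BATCH: usize = 8192;

fn generate_rows() {
    for _ in 0..ROWS_PER_BATCH {
        // build one synthetic row per iteration
    }
}
```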

```rust
// using realistic input data inspired by TPC-H aggregate spill scenarios.
//
// This function prepares synthetic RecordBatches that mimic the schema and distribution
// of intermediate aggregate results from representative TPC-H queries (Q2, Q16, Q20) and sort-tpch Q10
```
Review comment (Contributor):

Maybe we can also copy the schemas of each workload into the implementation comments here.
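For example, a comment block like this near the generators, summarizing the schemas already listed in the PR description:

```rust
// Schemas of the generated batches (from the PR description):
//   Q2  -> [Int64 (partkey), Decimal128 (min(ps_supplycost))]
//   Q16 -> [Utf8 (p_brand), Utf8 (p_type), Int32 (p_size), Int64 (supplier_cnt)]
//   Q20 -> [Int64 (suppkey), Int64 (partkey), Decimal128 (sum(l_quantity))]
//   Q10 -> [Int32, Int64 * 3, Decimal128 * 4, Date * 3, Utf8 * 4] (sort-tpch, wide)
```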

```rust
let num_batches = 50;

// Q2 [Int64, Decimal128]
let (schema, batches) = create_q2_like_batches(50);
```
Review comment (Contributor):

Suggested change:

```diff
-let (schema, batches) = create_q2_like_batches(50);
+let (schema, batches) = create_q2_like_batches(num_batches);
```

and the same for the two functions below
