[SPARK-56908][SQL] Parquet vectorized reader performance improvements (umbrella) #56011

@iemejia

Overview

This is an umbrella issue tracking a series of performance improvements to the Parquet vectorized reader in Spark SQL. The changes target allocation reduction, bulk-read optimizations, and JIT-friendly code patterns across multiple encoding paths.

All PRs are independent and can be reviewed and merged in any order. Together they yield throughput gains of roughly 1.1x to 7.2x for Parquet reads, depending on the encoding and data shape, with no user-facing behavioral changes.

Note: I know Spark tickets are managed in JIRA. This issue only serves as a central place to point other parties to the individual encoding performance improvements, and it avoids creating a parent ticket in JIRA.

Summary

| # | PR | Focus | Key Speedup |
|---|----|-------|-------------|
| 1 | #55919 | DELTA_BINARY_PACKED bulk reads | up to 7.2x |
| 2 | #55920 | Dictionary decode hasNull fast path | 1.24x |
| 3 | #55921 | BYTE_STREAM_SPLIT vectorized reader | 2.8-4.5x |
| 4 | #55922 | RLE PACKED batch ByteBuffer slice | 2.1-2.4x |
| 5 | #55923 | Timestamp/date updater bulk reads | up to 2.9x |
| 6 | #55924 | DELTA_BYTE_ARRAY allocation reduction | 1.1-1.9x |
| 7 | #55932 | DELTA_LENGTH_BYTE_ARRAY allocation reduction | 1.2-1.4x |

Pull Requests

1. DELTA_BINARY_PACKED bulk read optimization

PR: #55919 (SPARK-56892)

Replaces per-element lambda dispatch in readIntegers/readLongs with bulk paths that compute prefix sums in-place and write via putInts/putLongs. Also eliminates 3 allocations per value in readUnsignedLongs by replacing BigInteger(Long.toUnsignedString(v)) with a reusable ByteBuffer.

| Type | Speedup |
|------|---------|
| INT32 (monotonic) | 1.4x |
| INT64 (monotonic) | 3.8x |
| readUnsignedLongs | 7.2x |
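
To make the two ideas concrete, here is a minimal sketch, not the code from the PR. It assumes the unpacked deltas of a block already sit in a `long[]`, and that an unsigned 64-bit value can be handed to `BigInteger` as a big-endian magnitude taken from a reusable 8-byte buffer. The class and method names (`DeltaPrefixSumSketch`, `prefixSumInPlace`, `toUnsignedBigInteger`) are hypothetical.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;

// Hypothetical sketch; names and signatures are assumptions, not Spark APIs.
final class DeltaPrefixSumSketch {
  // Reusable 8-byte scratch buffer (ByteBuffer is big-endian by default).
  private final ByteBuffer scratch = ByteBuffer.allocate(8);

  /**
   * Converts unpacked deltas into absolute values in place.
   * After the call, buf[i] = previous + sum over k <= i of (minDelta + delta[k]).
   * Returns the last decoded value so the next block can continue from it.
   */
  static long prefixSumInPlace(long previous, long minDelta, long[] buf, int n) {
    long running = previous;
    for (int i = 0; i < n; i++) {
      running += minDelta + buf[i];
      buf[i] = running;
    }
    return running;
  }

  /** Interprets v as an unsigned 64-bit value without building a decimal String. */
  BigInteger toUnsignedBigInteger(long v) {
    scratch.putLong(0, v); // absolute put, buffer position is untouched
    // signum = 1 with a big-endian magnitude; an all-zero magnitude yields zero.
    return new BigInteger(1, scratch.array());
  }
}
```

The array produced by `prefixSumInPlace` could then be handed to the column vector in a single bulk write, which is the role `putInts`/`putLongs` play in the description above.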

2. Dictionary decoding hasNull fast path + per-class updater overrides

PR: #55920 (SPARK-56893)

Adds a hasNull() fast path that skips per-element null checks when the column has no nulls (common case). Per-class decodeDictionaryIds overrides give C2 monomorphic call sites, enabling full inlining of type-specific decode expressions.

| Scenario | Speedup |
|----------|---------|
| No nulls (avg across 6 updaters) | 1.24x |
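
The shape of the fast path can be sketched as follows. This is an illustration over plain arrays, not the Spark column vector API: `DictionaryDecodeSketch`, its `decode` method, and the `isNull`/`hasNull` parameters are hypothetical stand-ins for the vector's null tracking.

```java
// Hypothetical sketch over plain arrays; not the Spark updater code.
final class DictionaryDecodeSketch {
  /**
   * Resolves dictionary ids into values.
   * When hasNull is false, the loop body carries no per-element branch.
   */
  static void decode(int[] ids, boolean[] isNull, boolean hasNull,
                     long[] dictionary, long[] out, int n) {
    if (!hasNull) {
      // Fast path: the column has no nulls, so every slot is decoded.
      for (int i = 0; i < n; i++) {
        out[i] = dictionary[ids[i]];
      }
    } else {
      // Slow path: null slots are left untouched.
      for (int i = 0; i < n; i++) {
        if (!isNull[i]) {
          out[i] = dictionary[ids[i]];
        }
      }
    }
  }
}
```

Hoisting the check out of the loop keeps the common no-null case as a tight gather loop, which is presumably easier for the JIT to optimize.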

3. Vectorized BYTE_STREAM_SPLIT reader

PR: #55921 (SPARK-56894)

Adds a new VectorizedByteStreamSplitValuesReader that decodes BSS-encoded pages (FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY) using batch byte-gathering instead of falling back to parquet-mr per-value reads.

| Type | Speedup vs parquet-mr |
|------|-----------------------|
| INT32 | 4.5x |
| INT64 | 2.8x |
| FLOAT | 4.2x |
| DOUBLE | 2.8x |
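
For readers unfamiliar with the encoding, BYTE_STREAM_SPLIT stores byte 0 of every value, then byte 1 of every value, and so on. A batch decoder therefore gathers byte `j` of value `i` from position `j * count + i`. The sketch below shows that gather for 4-byte values under the assumption that the whole page is available as a `byte[]`. The class `ByteStreamSplitSketch` is hypothetical and is not the `VectorizedByteStreamSplitValuesReader` from the PR.

```java
// Hypothetical sketch of the byte-gathering step for 4-byte values.
final class ByteStreamSplitSketch {
  /** Decodes count little-endian INT32 values from four concatenated byte streams. */
  static int[] decodeInts(byte[] page, int count) {
    int[] out = new int[count];
    for (int i = 0; i < count; i++) {
      out[i] = (page[i] & 0xFF)                      // stream 0: least significant byte
          | ((page[count + i] & 0xFF) << 8)          // stream 1
          | ((page[2 * count + i] & 0xFF) << 16)     // stream 2
          | ((page[3 * count + i] & 0xFF) << 24);    // stream 3: most significant byte
    }
    return out;
  }

  /** FLOAT uses the same layout; only the final bit reinterpretation differs. */
  static float[] decodeFloats(byte[] page, int count) {
    int[] bits = decodeInts(page, count);
    float[] out = new float[count];
    for (int i = 0; i < count; i++) {
      out[i] = Float.intBitsToFloat(bits[i]);
    }
    return out;
  }
}
```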

4. Batch ByteBuffer slice in RLE PACKED decode

PR: #55922 (SPARK-56895)

Replaces per-group in.slice(bitWidth) (one ByteBuffer allocation per 8 values) with a single bulk slice for the entire PACKED run. Eliminates ~128K short-lived ByteBuffer allocations per 1M-value page.

| bitWidth | Speedup (readIntegers) |
|----------|------------------------|
| 4 | 2.1x |
| 8 | 2.4x |
| 12 | 1.6x |
| 20 | 1.4x |
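
The allocation pattern can be illustrated with a small sketch. Each group of 8 bit-packed values occupies exactly `bitWidth` bytes, so a run of `groups` groups spans `groups * bitWidth` bytes and can be covered by one slice. The code below uses a plain `java.nio.ByteBuffer` as a stand-in for the reader's input stream, and a deliberately simple bit-by-bit unpacker rather than the optimized unpackers a real reader would use. `PackedRunSketch` and `unpackRun` are hypothetical names.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: one slice per PACKED run instead of one per 8-value group.
final class PackedRunSketch {
  /** Unpacks groups * 8 values of the given bit width and advances the input. */
  static int[] unpackRun(ByteBuffer in, int bitWidth, int groups) {
    if (bitWidth < 1 || bitWidth > 31) {
      throw new IllegalArgumentException("this sketch supports bitWidth in [1, 31]");
    }
    int runBytes = groups * bitWidth;

    // Single view over the whole run; this is the only buffer allocated here.
    ByteBuffer run = in.slice();
    run.limit(runBytes);
    in.position(in.position() + runBytes);

    int[] out = new int[groups * 8];
    int bitPos = 0;
    for (int i = 0; i < out.length; i++) {
      int v = 0;
      // Values are packed least-significant bit first.
      for (int b = 0; b < bitWidth; b++, bitPos++) {
        int bit = (run.get(bitPos >>> 3) >>> (bitPos & 7)) & 1;
        v |= bit << b;
      }
      out[i] = v;
    }
    return out;
  }
}
```

With a page of 2^20 values, slicing per group would create 2^20 / 8 = 131,072 views, which matches the ~128K figure quoted above.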

5. Bulk read paths for timestamp/date Parquet vector updaters

PR: #55923 (SPARK-56896)

Replaces per-element readValue loops with two-pass bulk read + in-place conversion for five updaters (LongAsMicrosUpdater, LongAsNanosUpdater, LongAsMicrosRebaseUpdater, DateToTimestampNTZUpdater, DateToTimestampNTZWithRebaseUpdater). Avoids per-element virtual dispatch through VectorizedValuesReader.

| Updater | Speedup |
|---------|---------|
| LongAsMicrosUpdater | 2.9x |
| DateToTimestampNTZUpdater | 1.2x |
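
The two-pass pattern can be sketched as below. The sketch assumes a millisecond-to-microsecond conversion as the per-value transform and uses a hypothetical `LongSource` interface in place of `VectorizedValuesReader`; neither the interface nor `BulkTimestampSketch` exists in Spark.

```java
// Hypothetical sketch of "bulk read, then convert in place".
final class BulkTimestampSketch {
  /** Stand-in for a reader that can fill a range of a long[] in one call. */
  interface LongSource {
    void readLongs(int total, long[] dst, int offset);
  }

  /** Test helper: a source that serves values from an in-memory array. */
  static LongSource fromArray(long[] data) {
    int[] pos = {0};
    return (total, dst, offset) -> {
      System.arraycopy(data, pos[0], dst, offset, total);
      pos[0] += total;
    };
  }

  static void readMillisAsMicros(LongSource src, long[] column, int offset, int total) {
    // Pass 1: one bulk call instead of one virtual call per value.
    src.readLongs(total, column, offset);
    // Pass 2: convert in place; multiplyExact fails loudly on overflow.
    for (int i = offset; i < offset + total; i++) {
      column[i] = Math.multiplyExact(column[i], 1000L);
    }
  }
}
```

The second pass is a simple loop over a primitive array with no calls into the reader, which is the kind of code a JIT compiler can typically unroll or vectorize.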

6. Reduce per-value allocations in DELTA_BYTE_ARRAY decoder

PR: #55924 (SPARK-56897)

Replaces ByteBuffer-based state tracking with a reusable byte[] buffer, eliminating 2 ByteBuffer allocations per decoded value (~8K objects per 4096-value page). Also rewrites skipBinary to avoid column vector reset/swap overhead.

| Operation | Speedup |
|-----------|---------|
| readBinary | 1.1-1.3x |
| skipBinary | 1.5-1.9x |
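
In DELTA_BYTE_ARRAY, each value is stored as the length of the prefix it shares with the previous value plus a suffix. Because the prefix is already in place from the previous value, a single growable `byte[]` is enough to carry state between values. The following is a minimal sketch of that idea with hypothetical names (`DeltaByteArraySketch`, `next`), not the decoder from the PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: the previous value lives in a reusable byte[].
final class DeltaByteArraySketch {
  private byte[] current = new byte[16];
  private int currentLen = 0;

  /** Applies one (prefixLength, suffix) pair; current[0..length()) is then the value. */
  void next(int prefixLen, byte[] suffix) {
    if (prefixLen < 0 || prefixLen > currentLen) {
      throw new IllegalArgumentException("prefix longer than previous value");
    }
    int newLen = prefixLen + suffix.length;
    if (newLen > current.length) {
      // Grow geometrically so reallocation stays rare; existing bytes are kept.
      current = Arrays.copyOf(current, Math.max(newLen, current.length * 2));
    }
    // The first prefixLen bytes are already correct; only the suffix is copied.
    System.arraycopy(suffix, 0, current, prefixLen, suffix.length);
    currentLen = newLen;
  }

  int length() {
    return currentLen;
  }

  String currentAsString() {
    return new String(current, 0, currentLen, StandardCharsets.UTF_8);
  }
}
```

Skipping values follows the same state update without materializing any output, which is one plausible reason skipBinary benefits more than readBinary.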

7. Reduce per-value allocation in DELTA_LENGTH_BYTE_ARRAY decoder

PR: #55932 (SPARK-56907)

Replaces per-value in.slice(length) with a single bulk slice for the entire batch. Replaces per-value skip loop with a single bulk skip.

| Operation | Speedup |
|-----------|---------|
| readBinary (small payloads) | 1.2x |
| skipBinary | 1.4x |
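
Since DELTA_LENGTH_BYTE_ARRAY stores all lengths first and then the concatenated payload bytes, the byte range of a whole batch is known once the lengths are summed. The sketch below shows both the bulk slice and the bulk skip using a plain `java.nio.ByteBuffer`. `DeltaLengthSketch`, `sliceBatch` and `skipBatch` are hypothetical names, and the sketch assumes the lengths have already been decoded into an `int[]`.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: one slice or one position update per batch.
final class DeltaLengthSketch {
  /**
   * Returns a single view covering n values starting at lengths[start].
   * offsetsOut[i] receives the start of value i inside the returned view.
   */
  static ByteBuffer sliceBatch(ByteBuffer in, int[] lengths, int start, int n,
                               int[] offsetsOut) {
    int total = 0;
    for (int i = 0; i < n; i++) {
      offsetsOut[i] = total;
      total += lengths[start + i];
    }
    ByteBuffer batch = in.slice(); // the only buffer allocated for the batch
    batch.limit(total);
    in.position(in.position() + total);
    return batch;
  }

  /** Skips n values with a single position update instead of n small skips. */
  static void skipBatch(ByteBuffer in, int[] lengths, int start, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
      total += lengths[start + i];
    }
    in.position(in.position() + total);
  }
}
```

Individual values are then addressed as `(offsetsOut[i], lengths[start + i])` pairs within the shared view, so no per-value buffer object is needed.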

Common Themes

  • Allocation reduction: Replace per-value ByteBuffer.slice() / ByteBuffer.wrap() with bulk reads into reusable buffers
  • Bulk vectorized reads: Replace per-element virtual dispatch with single batch calls backed by System.arraycopy
  • JIT-friendly patterns: Per-class method overrides for monomorphic call sites; avoiding megamorphic profile pollution from shared helpers
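
The JIT-related theme can be illustrated with a small sketch; the class names are hypothetical, and the actual effect depends on the JVM and its profiling. When many subclasses inherit one shared loop, the virtual call inside that loop is a single call site that observes every receiver type. Overriding the loop per class gives each copy its own call site and lets the type-specific expression be written inline.

```java
// Hypothetical sketch contrasting a shared helper with a per-class override.
abstract class UpdaterSketch {
  abstract long decodeOne(long raw);

  /** Shared loop: its decodeOne call site is profiled across all subclasses. */
  void decodeAll(long[] raw, long[] out, int n) {
    for (int i = 0; i < n; i++) {
      out[i] = decodeOne(raw[i]);
    }
  }
}

/** Inherits the shared loop, so it contributes to the shared call-site profile. */
final class IdentitySketch extends UpdaterSketch {
  @Override
  long decodeOne(long raw) {
    return raw;
  }
}

/** Overrides the loop, so the conversion is inlined by construction. */
final class MillisToMicrosSketch extends UpdaterSketch {
  @Override
  long decodeOne(long raw) {
    return raw * 1000L;
  }

  @Override
  void decodeAll(long[] raw, long[] out, int n) {
    for (int i = 0; i < n; i++) {
      out[i] = raw[i] * 1000L; // no virtual call inside the loop
    }
  }
}
```

Both variants produce the same results; the difference is only in how much type information the compiler sees at the hot call site.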

Benchmarking

All benchmarks were run on an AMD EPYC 9V45 with OpenJDK 17/25, comparing upstream master against the patched version on the same machine with identical JVM flags.
