[SPARK-56908][SQL] Parquet vectorized reader performance improvements (umbrella)

## Overview

This is an umbrella issue tracking a series of performance improvements to the Parquet vectorized reader in Spark SQL. The changes target allocation reduction, bulk-read optimizations, and JIT-friendly code patterns across multiple encoding paths.

All PRs are independent and can be reviewed and merged in any order. Together they yield throughput gains of 1.1x to 7.2x for Parquet reads, depending on the encoding and data shape, with no user-facing behavioral changes.

*Note:* Spark tickets are managed in JIRA. This issue exists only as a central place to point other parties to the individual encoding performance improvements, without creating a parent ticket in JIRA.

## Summary

| # | PR | Focus | Key Speedup |
|---|---|---|---|
| 1 | #55919 | DELTA_BINARY_PACKED bulk reads | up to 7.2x |
| 2 | #55920 | Dictionary decode hasNull fast path | 1.24x |
| 3 | #55921 | BYTE_STREAM_SPLIT vectorized reader | 2.8-4.5x |
| 4 | #55922 | RLE PACKED batch ByteBuffer slice | 1.4-2.4x |
| 5 | #55923 | Timestamp/date updater bulk reads | up to 2.9x |
| 6 | #55924 | DELTA_BYTE_ARRAY allocation reduction | 1.1-1.9x |
| 7 | #55932 | DELTA_LENGTH_BYTE_ARRAY allocation reduction | 1.2-1.4x |

## Pull Requests

### 1. DELTA_BINARY_PACKED bulk read optimization
**PR:** #55919 ([SPARK-56892](https://issues.apache.org/jira/browse/SPARK-56892))

Replaces per-element lambda dispatch in `readIntegers`/`readLongs` with bulk paths that compute prefix sums in-place and write via `putInts`/`putLongs`. Also eliminates 3 allocations per value in `readUnsignedLongs` by replacing `BigInteger(Long.toUnsignedString(v))` with a reusable `ByteBuffer`.

| Type | Speedup |
|------|---------|
| INT32 (monotonic) | 1.4x |
| INT64 (monotonic) | 3.8x |
| readUnsignedLongs | 7.2x |
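
The two ideas above can be illustrated with a minimal sketch. It is not the code from the PR: the class `DeltaBulkSketch` and both method names are hypothetical, and it assumes the deltas for a batch have already been unpacked into a `long[]`. The first method turns stored deltas into absolute values in place; the second builds an unsigned `BigInteger` from a reused 8-byte buffer instead of going through `Long.toUnsignedString`.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;

// Hypothetical sketch class; not the actual Spark reader.
final class DeltaBulkSketch {
  // Reused for every value, so no per-value scratch buffer is allocated.
  private final ByteBuffer scratch = ByteBuffer.allocate(Long.BYTES);

  /**
   * Converts the first n stored deltas into absolute values in place.
   * Each stored delta is relative to minDelta, as in the DELTA_BINARY_PACKED layout.
   */
  static void prefixSumInPlace(long[] deltas, int n, long firstValue, long minDelta) {
    long running = firstValue;
    for (int i = 0; i < n; i++) {
      running += minDelta + deltas[i];
      deltas[i] = running; // overwrite the delta with the decoded value
    }
  }

  /** Interprets v as an unsigned 64-bit integer without building a String. */
  BigInteger toUnsigned(long v) {
    scratch.putLong(0, v); // absolute put; ByteBuffer is big-endian by default
    return new BigInteger(1, scratch.array()); // signum 1: magnitude bytes are unsigned
  }
}
```

After the prefix sum, the whole array can be handed to a single bulk write (such as `putLongs`) rather than one call per element.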

---

### 2. Dictionary decoding hasNull fast path + per-class updater overrides
**PR:** #55920 ([SPARK-56893](https://issues.apache.org/jira/browse/SPARK-56893))

Adds a `hasNull()` fast path that skips per-element null checks when the column has no nulls (the common case). Per-class `decodeDictionaryIds` overrides give the C2 JIT compiler monomorphic call sites, enabling full inlining of type-specific decode expressions.

| Scenario | Speedup |
|----------|---------|
| No nulls (avg across 6 updaters) | 1.24x |
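
A rough sketch of the fast-path idea, assuming plain arrays in place of Spark's column vectors. The class `DictionaryDecodeSketch`, its `decode` method, and the `boolean[] isNull` representation are illustrative assumptions, not the PR's API; the per-class override aspect is not shown.

```java
// Hypothetical sketch class; arrays stand in for column vectors.
final class DictionaryDecodeSketch {
  /** Resolves dictionary ids into out, checking nulls only when the batch has any. */
  static void decode(int[] ids, boolean[] isNull, boolean hasNull, long[] dictionary, long[] out) {
    if (!hasNull) {
      // Fast path: a tight loop with no per-element branch on nullness.
      for (int i = 0; i < ids.length; i++) {
        out[i] = dictionary[ids[i]];
      }
    } else {
      // Slow path: null slots are left untouched.
      for (int i = 0; i < ids.length; i++) {
        if (!isNull[i]) {
          out[i] = dictionary[ids[i]];
        }
      }
    }
  }
}
```

The single check on `hasNull` is hoisted out of the loop, so the common no-null case runs a loop body with no data-dependent branch.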

---

### 3. Vectorized BYTE_STREAM_SPLIT reader
**PR:** #55921 ([SPARK-56894](https://issues.apache.org/jira/browse/SPARK-56894))

Adds a new `VectorizedByteStreamSplitValuesReader` that decodes BSS-encoded pages (FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY) using batch byte-gathering instead of falling back to parquet-mr per-value reads.

| Type | Speedup vs parquet-mr |
|------|-----------------------|
| INT32 | 4.5x |
| INT64 | 2.8x |
| FLOAT | 4.2x |
| DOUBLE | 2.8x |
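
For readers unfamiliar with the encoding: BYTE_STREAM_SPLIT stores byte 0 of every value first, then byte 1 of every value, and so on. The sketch below shows the gather step for FLOAT under that layout. The class and method names are hypothetical, and the real reader works on `ByteBuffer` pages and column vectors rather than plain arrays.

```java
// Hypothetical sketch class; decodes a FLOAT page held in a byte[].
final class ByteStreamSplitSketch {
  /** Gathers one byte from each of the 4 streams per value (little-endian order). */
  static float[] decodeFloats(byte[] page, int count) {
    float[] out = new float[count];
    for (int i = 0; i < count; i++) {
      int bits = (page[i] & 0xFF)
          | ((page[count + i] & 0xFF) << 8)
          | ((page[2 * count + i] & 0xFF) << 16)
          | ((page[3 * count + i] & 0xFF) << 24);
      out[i] = Float.intBitsToFloat(bits);
    }
    return out;
  }
}
```

Decoding a whole batch in one loop like this avoids a virtual call per value, which is the cost the per-value fallback pays.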

---

### 4. Batch ByteBuffer slice in RLE PACKED decode
**PR:** #55922 ([SPARK-56895](https://issues.apache.org/jira/browse/SPARK-56895))

Replaces per-group `in.slice(bitWidth)` (one `ByteBuffer` allocation per 8 values) with a single bulk slice for the entire PACKED run. Eliminates ~128K short-lived ByteBuffer allocations per 1M-value page.

| bitWidth | Speedup (readIntegers) |
|----------|------------------------|
| 4 | 2.1x |
| 8 | 2.4x |
| 12 | 1.6x |
| 20 | 1.4x |
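
The allocation pattern can be sketched as follows. This is an illustration under assumptions, not the PR's code: `PackedRunSketch` and `readPackedRun` are hypothetical names, and the bit-by-bit unpack loop is deliberately naive (the real reader uses specialized unpackers). The point is the single `slice()` covering all groups of the run.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch class; shows one slice per PACKED run instead of one per group.
final class PackedRunSketch {
  /** Reads groups * 8 values of bitWidth bits each, packed LSB-first. */
  static int[] readPackedRun(ByteBuffer in, int groups, int bitWidth) {
    int byteLen = groups * bitWidth; // 8 values * bitWidth bits = bitWidth bytes per group
    ByteBuffer run = in.slice();     // the only ByteBuffer allocated for the whole run
    run.limit(byteLen);
    in.position(in.position() + byteLen); // consume the run from the input

    int[] out = new int[groups * 8];
    int bitPos = 0;
    for (int i = 0; i < out.length; i++) {
      int value = 0;
      for (int b = 0; b < bitWidth; b++, bitPos++) {
        int bit = (run.get(bitPos >>> 3) >>> (bitPos & 7)) & 1;
        value |= bit << b;
      }
      out[i] = value;
    }
    return out;
  }
}
```

With a per-group slice, a run of `groups` groups would allocate `groups` buffers; here it allocates one regardless of run length.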

---

### 5. Bulk read paths for timestamp/date Parquet vector updaters
**PR:** #55923 ([SPARK-56896](https://issues.apache.org/jira/browse/SPARK-56896))

Replaces per-element `readValue` loops with two-pass bulk read + in-place conversion for five updaters (`LongAsMicrosUpdater`, `LongAsNanosUpdater`, `LongAsMicrosRebaseUpdater`, `DateToTimestampNTZUpdater`, `DateToTimestampNTZWithRebaseUpdater`). Avoids per-element virtual dispatch through `VectorizedValuesReader`.

| Updater | Speedup |
|---------|---------|
| LongAsMicrosUpdater | 2.9x |
| DateToTimestampNTZUpdater | 1.2x |
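
A minimal sketch of the two-pass shape, assuming a PLAIN-encoded little-endian INT64 page and a `long[]` standing in for the column vector. `TwoPassUpdaterSketch` and `readMillisAsMicros` are hypothetical names, and the millisecond-to-microsecond conversion is used only as an example of an in-place transform.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch class; a long[] stands in for the column vector.
final class TwoPassUpdaterSketch {
  /** Bulk-copies total longs into column, then converts millis to micros in place. */
  static void readMillisAsMicros(ByteBuffer page, int total, long[] column, int offset) {
    // Pass 1: one bulk copy instead of one reader call per element.
    page.duplicate().order(ByteOrder.LITTLE_ENDIAN).asLongBuffer().get(column, offset, total);
    page.position(page.position() + total * Long.BYTES);

    // Pass 2: tight in-place conversion loop over the copied values.
    for (int i = offset; i < offset + total; i++) {
      column[i] = Math.multiplyExact(column[i], 1000L);
    }
  }
}
```

Both passes are simple loops over primitive arrays with no per-element dispatch through a reader interface.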

---

### 6. Reduce per-value allocations in DELTA_BYTE_ARRAY decoder
**PR:** #55924 ([SPARK-56897](https://issues.apache.org/jira/browse/SPARK-56897))

Replaces `ByteBuffer`-based state tracking with a reusable `byte[]` buffer, eliminating 2 ByteBuffer allocations per decoded value (~8K objects per 4096-value page). Also rewrites `skipBinary` to avoid column vector reset/swap overhead.

| Operation | Speedup |
|-----------|---------|
| readBinary | 1.1-1.3x |
| skipBinary | 1.5-1.9x |
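
DELTA_BYTE_ARRAY stores each value as a prefix length (shared with the previous value) plus a suffix. The sketch below shows how a single reusable `byte[]` can hold the previous value so the shared prefix never needs copying. The class `DeltaByteArraySketch` and its methods are hypothetical and only approximate the approach described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch class; keeps the previous value in one growable byte[].
final class DeltaByteArraySketch {
  private byte[] previous = new byte[16];
  private int previousLength = 0;

  /** Rebuilds the next value in the reusable buffer and returns its length. */
  int next(int prefixLength, byte[] suffix, int suffixOffset, int suffixLength) {
    if (prefixLength > previousLength) {
      throw new IllegalArgumentException("prefix longer than previous value");
    }
    int length = prefixLength + suffixLength;
    if (length > previous.length) {
      // Grow geometrically; existing prefix bytes are preserved by copyOf.
      previous = Arrays.copyOf(previous, Math.max(length, previous.length * 2));
    }
    // The first prefixLength bytes are already in place; append only the suffix.
    System.arraycopy(suffix, suffixOffset, previous, prefixLength, suffixLength);
    previousLength = length;
    return length;
  }

  /** Demo helper only: materializes the current value as a String (this allocates). */
  String current() {
    return new String(previous, 0, previousLength, StandardCharsets.UTF_8);
  }
}
```

A skip operation can call `next` without ever materializing the value, since only the buffer contents matter for decoding later entries.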

---

### 7. Reduce per-value allocation in DELTA_LENGTH_BYTE_ARRAY decoder
**PR:** #55932 ([SPARK-56907](https://issues.apache.org/jira/browse/SPARK-56907))

Replaces per-value `in.slice(length)` with a single bulk slice for the entire batch, and replaces the per-value skip loop with a single bulk skip.

| Operation | Speedup |
|-----------|---------|
| readBinary (small payloads) | 1.2x |
| skipBinary | 1.4x |
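
Because DELTA_LENGTH_BYTE_ARRAY stores all lengths up front, the byte range for a batch is known before any value is read. The following sketch, with hypothetical names (`DeltaLengthSketch`, `sliceBatch`, `skipBatch`) and an `int[]` of already-decoded lengths as an assumption, shows one slice and one position update per batch.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch class; lengths are assumed to be decoded already.
final class DeltaLengthSketch {
  private static int totalLength(int[] lengths, int start, int count) {
    int total = 0;
    for (int i = start; i < start + count; i++) {
      total += lengths[i];
    }
    return total;
  }

  /** Returns one slice covering count values and advances the input past them. */
  static ByteBuffer sliceBatch(ByteBuffer in, int[] lengths, int start, int count) {
    int total = totalLength(lengths, start, count);
    ByteBuffer batch = in.slice(); // single allocation for the whole batch
    batch.limit(total);
    in.position(in.position() + total);
    return batch;
  }

  /** Skips count values with a single position update. */
  static void skipBatch(ByteBuffer in, int[] lengths, int start, int count) {
    in.position(in.position() + totalLength(lengths, start, count));
  }
}
```

Individual values are then addressed by running offsets into the batch slice, so no further `ByteBuffer` objects are needed per value.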

---

## Common Themes

- **Allocation reduction**: Replace per-value `ByteBuffer.slice()` / `ByteBuffer.wrap()` with bulk reads into reusable buffers
- **Bulk vectorized reads**: Replace per-element virtual dispatch with single batch calls backed by `System.arraycopy`
- **JIT-friendly patterns**: Per-class method overrides for monomorphic call sites; avoiding megamorphic profile pollution from shared helpers

## Benchmarking

All benchmarks were run on an AMD EPYC 9V45 with OpenJDK 17 and 25, comparing upstream `master` against the patched version on the same machine with identical JVM flags.

