GH-3530: Optimize PLAIN encoding and decoding with direct ByteBuffer I/O #3565

Open: iemejia wants to merge 1 commit into apache:master from iemejia:parquet-perf-v2-par1-plain

@iemejia iemejia commented May 17, 2026

Part of #3530 — Apache Parquet Java Performance Improvements

Summary

Replace ByteBufferInputStream and LittleEndianDataInputStream wrappers with direct ByteBuffer access for all PLAIN value readers and writers.

Readers (PlainValuesReader, BooleanPlainValuesReader, BinaryPlainValuesReader, FixedLenByteArrayPlainValuesReader): hold a little-endian ByteBuffer from initFromPage() and call getInt/getLong/getFloat/getDouble directly, eliminating per-value stream overhead.
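
To illustrate the reader-side idea, here is a minimal sketch of decoding PLAIN INT32 values straight from a little-endian ByteBuffer. The class `LittleEndianIntReaderSketch` and its methods are hypothetical names for illustration only; this is not the PR's PlainValuesReader code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; not the actual PlainValuesReader from this PR.
final class LittleEndianIntReaderSketch {
  private final ByteBuffer buffer;

  // Slice the page so the caller's position and byte order are left untouched.
  LittleEndianIntReaderSketch(ByteBuffer page) {
    this.buffer = page.slice().order(ByteOrder.LITTLE_ENDIAN);
  }

  // One relative read per value: no stream wrapper and no temporary byte[].
  int readInteger() {
    return buffer.getInt();
  }

  // Skipping is a position bump instead of a stream skip loop.
  void skip(int n) {
    buffer.position(buffer.position() + Math.multiplyExact(n, Integer.BYTES));
  }

  // Decodes the first and third value of a three-value page.
  static int[] demo() {
    byte[] page = {1, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0};
    LittleEndianIntReaderSketch reader =
        new LittleEndianIntReaderSketch(ByteBuffer.wrap(page));
    int first = reader.readInteger();
    reader.skip(1);
    int third = reader.readInteger();
    return new int[] {first, third};
  }
}
```

The same pattern would apply to getLong, getFloat, and getDouble, with the skip width changed to the matching value size.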

Writers (PlainValuesWriter, BooleanPlainValuesWriter, FixedLenByteArrayPlainValuesWriter): write through CapacityByteArrayOutputStream's new writeInt/writeLong methods which put values directly into the NIO slab buffer in little-endian order, avoiding temporary byte-array allocation.
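
The writer-side idea can be sketched in a similar way. The class below is a hypothetical, fixed-capacity stand-in for a single slab; the real CapacityByteArrayOutputStream grows by adding slabs, which this sketch deliberately leaves out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical single-slab illustration; not the PR's CapacityByteArrayOutputStream.
final class LittleEndianSlabSketch {
  private final ByteBuffer slab;

  LittleEndianSlabSketch(int capacity) {
    // Setting the order once lets putInt/putLong emit little-endian bytes directly.
    this.slab = ByteBuffer.allocate(capacity).order(ByteOrder.LITTLE_ENDIAN);
  }

  // Each write goes straight into the slab; no intermediate byte[] is created.
  // A full slab throws BufferOverflowException here (no growth in this sketch).
  void writeInt(int v) {
    slab.putInt(v);
  }

  void writeLong(long v) {
    slab.putLong(v);
  }

  // Copies the written bytes out through a duplicate so the write position is kept.
  byte[] toByteArray() {
    ByteBuffer view = slab.duplicate();
    view.flip();
    byte[] out = new byte[view.remaining()];
    view.get(out);
    return out;
  }

  static byte[] demo() {
    LittleEndianSlabSketch s = new LittleEndianSlabSketch(16);
    s.writeInt(1);
    s.writeLong(0x0102030405060708L);
    return s.toByteArray();
  }
}
```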

Supporting changes:

  • CapacityByteArrayOutputStream: allocate slabs with ByteOrder.LITTLE_ENDIAN, add writeInt(int) and writeLong(long) for single-value NIO writes.
  • BytesInput: add zero-copy writeTo(ByteBuffer) and toByteArray() using bulk ByteBuffer.get() instead of stream copy.
  • LittleEndianDataOutputStream: batch single-byte writes into single write(buf, 0, N) calls for writeShort/writeInt.
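
As a rough illustration of the last bullet, the sketch below assembles the little-endian bytes in a reusable scratch array and issues one write(buf, 0, N) call per value. The class name and the scratch-buffer layout are assumptions made for this example, not the PR's LittleEndianDataOutputStream implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration of batching single-byte writes into one call.
final class BatchedLittleEndianWriterSketch {
  private final OutputStream out;
  private final byte[] scratch = new byte[4]; // reused, so no per-call allocation

  BatchedLittleEndianWriterSketch(OutputStream out) {
    this.out = out;
  }

  // One write(buf, 0, 2) instead of two write(int) calls.
  void writeShort(int v) throws IOException {
    scratch[0] = (byte) v;
    scratch[1] = (byte) (v >>> 8);
    out.write(scratch, 0, 2);
  }

  // One write(buf, 0, 4) instead of four write(int) calls.
  void writeInt(int v) throws IOException {
    scratch[0] = (byte) v;
    scratch[1] = (byte) (v >>> 8);
    scratch[2] = (byte) (v >>> 16);
    scratch[3] = (byte) (v >>> 24);
    out.write(scratch, 0, 4);
  }

  static byte[] demo() {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    BatchedLittleEndianWriterSketch w = new BatchedLittleEndianWriterSketch(sink);
    try {
      w.writeInt(0x01020304);
      w.writeShort(0x0506);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sink.toByteArray();
  }
}
```

Because the scratch array is shared state, a writer built this way is not safe for concurrent use without external synchronization.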

Includes JMH benchmarks (PlainEncodingBenchmark, PlainDecodingBenchmark) covering all 7 primitive types for both encoding and decoding.

Benchmark results

Environment: JDK 25.0.3 (Temurin), OpenJDK 64-Bit Server VM, JMH 1.37, Linux x86_64.

Decoding (100K values/iteration, 3 forks x 5 iterations, throughput mode):

| Benchmark | Master (M ops/s) | Branch (M ops/s) | Speedup |
|---|---|---|---|
| decodeInt | 425 | 5,427 | 12.8x |
| decodeFloat | 416 | 5,440 | 13.1x |
| decodeLong | 119 | 4,720 | 39.5x (*) |
| decodeDouble | 116 | 6,026 | 51.8x (*) |
| decodeBoolean | 639 | 1,642 | 2.6x |
| decodeFlba (len=2,12,16) | 188 | 680 | 3.6x |
| decodeBinary (len=10,100,1000) | 142 | 225-230 | 1.6x |

Encoding:

| Benchmark | Master (M ops/s) | Branch (M ops/s) | Speedup |
|---|---|---|---|
| encodeInt | 148 | 559 | 3.8x |
| encodeFloat | 150 | 532 | 3.5x |
| encodeLong | 193 | 478 | 2.5x |
| encodeDouble | 179 | 439 | 2.4x |
| encodeBoolean | 850 | 1,692 | 2.0x |
| encodeBinary (len=10) | 76 | 150 | 2.0x |
| encodeFlba (len=2-16) | 156-184 | 178-224 | 1.1-1.2x |

(*) decodeLong and decodeDouble show JIT variance across forks (error bars >20%); the true steady-state speedup is likely ~13x, consistent with INT32/FLOAT.

int length = BytesUtils.readIntLittleEndian(in);
return Binary.fromConstantByteBuffer(in.slice(length));
} catch (IOException | RuntimeException e) {
throw new ParquetDecodingException("could not read bytes at offset " + in.position(), e);

Should we keep the ParquetDecodingException? Otherwise we're throwing the raw {IOException,RuntimeException} which is a behavioral change.
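
One way to read this suggestion, sketched under assumptions: with direct ByteBuffer access the failure modes become unchecked exceptions such as BufferUnderflowException, so a catch of RuntimeException can preserve the old wrapping behavior. `DecodingException` below is a hypothetical stand-in for ParquetDecodingException so that the example compiles on its own.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class WrappedDecodeSketch {
  // Hypothetical stand-in for ParquetDecodingException.
  static final class DecodingException extends RuntimeException {
    DecodingException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Reads a 4-byte little-endian length followed by that many bytes.
  static byte[] readLengthPrefixed(ByteBuffer buffer) {
    int start = buffer.position();
    try {
      int length = buffer.getInt(); // BufferUnderflowException if < 4 bytes remain
      if (length < 0 || length > buffer.remaining()) {
        throw new IllegalArgumentException("invalid length " + length);
      }
      byte[] value = new byte[length];
      buffer.get(value);
      return value;
    } catch (RuntimeException e) {
      // Callers keep seeing a single decoding exception type, as before.
      throw new DecodingException("could not read bytes at offset " + start, e);
    }
  }

  // Convenience entry point that applies the little-endian order.
  static byte[] decode(byte[] page) {
    return readLengthPrefixed(ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN));
  }
}
```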

if (available > 0) {
this.buffer = stream.slice(available).order(ByteOrder.LITTLE_ENDIAN);
} else {
this.buffer = ByteBuffer.allocate(0).order(ByteOrder.LITTLE_ENDIAN);

Should we create a constant for the ByteBuffer.allocate(0).order(ByteOrder.LITTLE_ENDIAN);?
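
A possible shape for such a constant, as a sketch only: the names `EMPTY_LITTLE_ENDIAN` and `sliceOrEmpty` are hypothetical. One caveat worth stating is that `order(...)` mutates the buffer it is called on, so a shared instance stays little-endian only as long as no caller changes its order; position and limit cannot drift because the capacity is zero.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class EmptyBufferSketch {
  // Shared zero-capacity buffer. Position and limit are pinned at 0, but the
  // byte order is still mutable state, so callers must not call order() on it.
  private static final ByteBuffer EMPTY_LITTLE_ENDIAN =
      ByteBuffer.allocate(0).order(ByteOrder.LITTLE_ENDIAN);

  // Returns a little-endian view of the next `available` bytes, or the shared
  // empty buffer when nothing is available. Assumes available <= page.remaining().
  static ByteBuffer sliceOrEmpty(ByteBuffer page, int available) {
    if (available > 0) {
      ByteBuffer view = page.slice().order(ByteOrder.LITTLE_ENDIAN);
      view.limit(available);
      return view;
    }
    return EMPTY_LITTLE_ENDIAN;
  }
}
```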

try {
return Binary.fromConstantByteBuffer(in.slice(length));
} catch (IOException | RuntimeException e) {
throw new ParquetDecodingException("could not read bytes at offset " + in.position(), e);

Same as above, should we keep the wrapped ParquetDecodingException?

try {
skipBytesFully(n * 8);
} catch (IOException e) {
throw new ParquetDecodingException("could not skip " + n + " double values", e);

Same here, do we want to keep the ParquetDecodingException?

public abstract class PlainValuesReader extends ValuesReader {
private static final Logger LOG = LoggerFactory.getLogger(PlainValuesReader.class);

protected LittleEndianDataInputStream in;

We should go through the deprecation cycle here, but is anything using this outside of the project itself?

* mutable {@code BAOS.getBuf()}.
*/
@Override
public byte[] toByteArray() {

This overrides a deprecated API; as a follow-up, we should probably move the internal calls to the new API:

@deprecated Use {@link #toByteBuffer(ByteBufferAllocator, Consumer)}

public int getNextOffset() {
return in.getNextOffset();
public void skip(int n) {
bitIndex += n;

Should we check for bounds, and throw a ParquetDecodingException in case of out of bounds?
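
A bounds-checked skip for a bit-indexed reader might look like the sketch below. The class, its fields, and `DecodingException` (a stand-in for ParquetDecodingException) are hypothetical. The check is only an upper bound, because the last byte of a boolean page may contain padding bits.

```java
final class BooleanSkipSketch {
  // Hypothetical stand-in for ParquetDecodingException.
  static final class DecodingException extends RuntimeException {
    DecodingException(String message) {
      super(message);
    }
  }

  private final long totalBits; // long, so pageBytes * 8 cannot overflow
  private long bitIndex;

  BooleanSkipSketch(int pageBytes) {
    this.totalBits = (long) pageBytes * 8;
  }

  // Rejects negative counts and skips that would run past the page.
  void skip(int n) {
    if (n < 0 || n > totalBits - bitIndex) {
      throw new DecodingException(
          "could not skip " + n + " boolean values at bit " + bitIndex);
    }
    bitIndex += n;
  }

  long position() {
    return bitIndex;
  }
}
```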

} catch (IOException e) {
throw new ParquetDecodingException("could not skip " + n + " double values", e);
}
buffer.position(buffer.position() + n * 8);

Should we use Math.multiplyExact here and below?
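
To make the overflow concern concrete: in int arithmetic, `n * 8` wraps for n above 268,435,455, and for n = 536,870,912 it wraps to exactly 0, which would turn the skip into a silent no-op. The helper below is a hypothetical sketch of computing the target position with the exact-arithmetic methods instead.

```java
final class CheckedSkipSketch {
  // Computes the new position after skipping n 8-byte values.
  // Throws ArithmeticException on int overflow instead of wrapping silently.
  static int positionAfterSkip(int position, int n, int limit) {
    int bytes = Math.multiplyExact(n, Long.BYTES);
    int target = Math.addExact(position, bytes);
    if (n < 0 || target > limit) {
      throw new IllegalArgumentException(
          "cannot skip " + n + " values from position " + position);
    }
    return target;
  }

  // Shows what the unchecked multiplication yields for a large count.
  static int uncheckedBytes(int n) {
    return n * 8; // wraps on overflow
  }
}
```

In a reader, the resulting exception would presumably be wrapped in the project's decoding exception, in line with the earlier comments in this review.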
