
[C++] Lz4HadoopCodec::Compress writes single oversized block incompatible with Hadoop Lz4Decompressor #49641

@clee704

Description


Describe the bug

Lz4HadoopCodec::Compress writes the entire input as a single Hadoop-framed LZ4 block regardless of size. Hadoop's Lz4Decompressor allocates a fixed 256 KiB output buffer per block (IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE_DEFAULT = 256 * 1024), so any block whose decompressed size exceeds 256 KiB causes an LZ4Exception in JVM readers (parquet-mr + Hadoop BlockDecompressorStream).
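
To illustrate why the oversized block fails, here is a minimal Python sketch (hypothetical helper, not Arrow or Hadoop code) of the Hadoop block header: each block starts with a 4-byte big-endian uncompressed length, and a decoder with a fixed 256 KiB output buffer has nowhere to put anything larger:

```python
import struct

# Mirrors Hadoop's IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE_DEFAULT.
BUFFER_SIZE = 256 * 1024

def check_block_header(framed: bytes) -> int:
    """Read the 4-byte big-endian uncompressed length that opens a
    Hadoop-framed block, and enforce the fixed-buffer limit the way a
    fixed-buffer decoder effectively does. Returns the declared size."""
    (uncompressed_len,) = struct.unpack(">I", framed[:4])
    if uncompressed_len > BUFFER_SIZE:
        # Hadoop's Lz4Decompressor cannot grow its buffer, so JNI
        # decompression fails partway through with an LZ4Exception.
        raise ValueError(
            f"block declares {uncompressed_len} uncompressed bytes, "
            f"exceeding the fixed {BUFFER_SIZE}-byte buffer")
    return uncompressed_len

# A 320 KiB dictionary page written as one block, as the current
# Lz4HadoopCodec::Compress does (payload elided):
oversized = struct.pack(">I", 320 * 1024) + b"..."
```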

PARQUET-1878 added Lz4HadoopCodec, but the codec writes each page as a single block. ARROW-11301 fixed the reader to handle multi-block Hadoop data, but the writer was never updated to split large inputs the way Hadoop's BlockCompressorStream does.
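
For illustration, a rough sketch of the splitting the writer would need, mirroring what Hadoop's BlockCompressorStream emits for inputs that fit one compressor call per block: chunk the uncompressed input into pieces of at most 256 KiB and frame each piece separately. Stdlib zlib stands in for LZ4 here, since the framing, not the codec, is the point:

```python
import struct
import zlib  # stand-in for LZ4; only the block framing matters here

BLOCK_SIZE = 256 * 1024  # Hadoop's default per-block buffer size

def compress_hadoop_framed(data: bytes) -> bytes:
    """Split input into <=256 KiB blocks, each framed as
    [4-byte BE uncompressed length][4-byte BE compressed length][payload],
    so a fixed-buffer decoder can handle every block."""
    out = bytearray()
    for off in range(0, len(data), BLOCK_SIZE):
        chunk = data[off:off + BLOCK_SIZE]
        compressed = zlib.compress(chunk)
        out += struct.pack(">I", len(chunk))       # uncompressed size
        out += struct.pack(">I", len(compressed))  # compressed size
        out += compressed
    return bytes(out)
```

Reading such output block by block keeps every decompression within the fixed 256 KiB buffer, which is why data written this way is compatible with the multi-block reader from ARROW-11301.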

Steps to reproduce

Write a Parquet file with LZ4_HADOOP compression containing a dictionary page >256 KiB (e.g. 40K unique INT64 values = 320 KB), then read it with a JVM-based Parquet reader (parquet-mr + Hadoop).

Expected behavior

The file should be readable by JVM-based Parquet readers.

Actual behavior

net.jpountz.lz4.LZ4Exception: Error decoding offset 131193 of input buffer
  at net.jpountz.lz4.LZ4JNISafeDecompressor.decompress(LZ4JNISafeDecompressor.java:71)
  at org.apache.hadoop.io.compress.lz4.Lz4Decompressor.decompressDirectBuf(Lz4Decompressor.java:278)
  at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
  ...

Severity

Read failure, not data corruption. The bytes on disk are valid LZ4 — Arrow's own C++ reader handles them fine. The JVM reader throws a hard exception; it does not return wrong data.

Component(s)

C++

Related issues

ARROW-9177, PARQUET-1878, ARROW-11301
