[C++] Lz4HadoopCodec::Compress writes single oversized block incompatible with Hadoop Lz4Decompressor #49641
Description
Describe the bug
Lz4HadoopCodec::Compress writes the entire input as a single Hadoop-framed LZ4 block regardless of size. Hadoop's Lz4Decompressor allocates a fixed 256 KiB output buffer per block (IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE_DEFAULT = 256 * 1024), so any block whose decompressed size exceeds 256 KiB causes an LZ4Exception in JVM readers (parquet-mr + Hadoop BlockDecompressorStream).
PARQUET-1878 added Lz4HadoopCodec, but it writes each page as a single block. ARROW-11301 fixed the reader to handle multi-block Hadoop data, but the writer was never updated to split large inputs into multiple blocks the way Hadoop's BlockCompressorStream does.
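For reference, a Hadoop-compatible writer splits the input into chunks no larger than the 256 KiB buffer and frames each chunk with two big-endian uint32 length prefixes (decompressed size, then compressed size), which is the multi-block layout the ARROW-11301 reader already accepts. The sketch below shows only the splitting and framing arithmetic; `MakeHadoopLz4Frames` is a hypothetical helper, not the Arrow API, and it copies the payload verbatim where a real implementation would run LZ4 compression on each chunk.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hadoop Lz4Decompressor's per-block output buffer size
// (IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE_DEFAULT = 256 * 1024).
constexpr std::size_t kHadoopLz4BlockSize = 256 * 1024;

// Append a 32-bit value in big-endian order, matching the length
// prefixes written by Hadoop's BlockCompressorStream.
void PutUint32BE(std::vector<uint8_t>* out, uint32_t v) {
  out->push_back(static_cast<uint8_t>(v >> 24));
  out->push_back(static_cast<uint8_t>(v >> 16));
  out->push_back(static_cast<uint8_t>(v >> 8));
  out->push_back(static_cast<uint8_t>(v));
}

// Hypothetical multi-block framer (sketch, not the Arrow API).
// Each chunk <= kHadoopLz4BlockSize is framed as
//   [uint32 BE decompressed size][uint32 BE compressed size][payload].
// A real implementation would compress each chunk with LZ4; here the
// payload is copied as-is so the sketch stays self-contained.
std::vector<uint8_t> MakeHadoopLz4Frames(const uint8_t* input, std::size_t len) {
  std::vector<uint8_t> out;
  for (std::size_t pos = 0; pos < len; pos += kHadoopLz4BlockSize) {
    std::size_t chunk = std::min(kHadoopLz4BlockSize, len - pos);
    PutUint32BE(&out, static_cast<uint32_t>(chunk));  // decompressed size
    PutUint32BE(&out, static_cast<uint32_t>(chunk));  // "compressed" size
    out.insert(out.end(), input + pos, input + pos + chunk);
  }
  return out;
}
```

With a 320 KiB input this produces two blocks (256 KiB + 64 KiB), each independently decodable into Hadoop's fixed-size buffer; the current Lz4HadoopCodec::Compress instead emits one 320 KiB block, which overflows that buffer.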
Steps to reproduce
Write a Parquet file with LZ4_HADOOP compression containing a dictionary page >256 KiB (e.g. 40K unique INT64 values = 320 KiB), then read it with a JVM-based Parquet reader (parquet-mr + Hadoop).
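The size arithmetic behind the example can be checked directly (assuming "40K" means 40 * 1024 values): 40 * 1024 INT64 values at 8 bytes each is exactly 320 KiB, which exceeds the 256 KiB buffer. `OverflowsHadoopBuffer` below is a hypothetical helper for illustration only.

```cpp
#include <cstddef>

// Hadoop Lz4Decompressor's fixed per-block output buffer
// (IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE_DEFAULT).
constexpr std::size_t kHadoopLz4BufferSize = 256 * 1024;

// Hypothetical helper: would a page written as a single block of this
// decompressed size overflow Hadoop's per-block buffer?
constexpr bool OverflowsHadoopBuffer(std::size_t decompressed_size) {
  return decompressed_size > kHadoopLz4BufferSize;
}

// 40K unique INT64 values: 40 * 1024 values, 8 bytes each = 320 KiB.
constexpr std::size_t kDictionaryPageBytes = 40 * 1024 * 8;
static_assert(kDictionaryPageBytes == 320 * 1024,
              "dictionary page is 320 KiB");
static_assert(OverflowsHadoopBuffer(kDictionaryPageBytes),
              "exceeds 256 KiB, so a single-block page fails on JVM readers");
```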
Expected behavior
The file should be readable by JVM-based Parquet readers.
Actual behavior
net.jpountz.lz4.LZ4Exception: Error decoding offset 131193 of input buffer
at net.jpountz.lz4.LZ4JNISafeDecompressor.decompress(LZ4JNISafeDecompressor.java:71)
at org.apache.hadoop.io.compress.lz4.Lz4Decompressor.decompressDirectBuf(Lz4Decompressor.java:278)
at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
...
Severity
Read failure, not data corruption. The bytes on disk are valid LZ4 — Arrow's own C++ reader handles them fine. The JVM reader throws a hard exception; it does not return wrong data.
Component(s)
C++
Related issues
ARROW-9177, PARQUET-1878, ARROW-11301