GH-49641: [C++] Fix Lz4HadoopCodec to split large blocks for Hadoop compatibility#49642

Open
clee704 wants to merge 1 commit into apache:main from clee704:fix-lz4-hadoop-block-size

Conversation

Contributor

@clee704 clee704 commented Apr 2, 2026

Rationale

Lz4HadoopCodec::Compress writes the entire input as a single Hadoop-framed LZ4 block. Hadoop's Lz4Decompressor allocates a fixed 256 KiB output buffer per block (IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE_DEFAULT), so any block whose decompressed size exceeds 256 KiB causes an LZ4Exception in JVM readers (parquet-mr + Hadoop BlockDecompressorStream).

This was found when writing Parquet dictionary pages >256 KiB with LZ4 compression (e.g. 40K unique INT64 values = 320 KiB). The file was written successfully but the JVM reader could not decompress the dictionary page.

This is a read failure, not data corruption — the compressed bytes on disk are valid LZ4. Arrow's own C++ reader decompresses them fine. The JVM reader throws a hard exception; it does not silently return wrong data.

PARQUET-1878 added the Hadoop-compatible codec but only writes one block per page. ARROW-11301 fixed the reader for multi-block Hadoop data, but the writer was never updated.

What changes are included in this PR?

Split input into blocks of ≤ 256 KiB in Lz4HadoopCodec::Compress and update MaxCompressedLen for per-block prefix overhead. Arrow's reader (TryDecompressHadoop) already handles multiple blocks. No behavioral change for data ≤ 256 KiB (still produces a single block, identical output to before).

Are these changes tested?

Yes:

  • MultiBlockRoundtrip: compress→decompress round-trip for sizes 0 to 1 MiB
  • BlockSizeLimit: parses compressed output and asserts every block's decompressed_size ≤ 256 KiB. Fails without the fix, passes with it.

Are there any user-facing changes?

Parquet files written with LZ4_HADOOP compression containing pages >256 KiB will now be readable by JVM-based Parquet readers (parquet-mr + Hadoop). Previously, these files caused an LZ4Exception in the JVM reader. No change for files with pages ≤ 256 KiB.

AI-generated code disclosure

This fix was developed with the assistance of an AI coding assistant (GitHub Copilot). The author has reviewed and verified all changes, including validating the fix with standalone tests that confirm the old code fails and the new code passes. The root cause analysis was done by tracing through Arrow's Lz4HadoopCodec, Hadoop's BlockDecompressorStream / Lz4Decompressor source code, and the Parquet page write path.

@github-actions

github-actions bot commented Apr 2, 2026

⚠️ GitHub issue #49641 has been automatically assigned in GitHub to PR creator.

@github-actions

github-actions bot commented Apr 2, 2026

⚠️ GitHub issue #49641 has no components, please add labels for components.


clee704 force-pushed the fix-lz4-hadoop-block-size branch 4 times, most recently from e3502b2 to 74d6b41, on April 2, 2026 04:44
clee704 marked this pull request as draft on April 2, 2026 04:46
clee704 force-pushed the fix-lz4-hadoop-block-size branch from 74d6b41 to d8b71e9 on April 2, 2026 05:01
clee704 marked this pull request as ready for review on April 2, 2026 05:01
@pitrou
Member

pitrou commented Apr 2, 2026

Thanks for submitting this PR. Just for the record: I would not consider this a critical fix, as this is just working around a bug/limitation in another Parquet implementation.


Member

@pitrou pitrou left a comment


Thanks! LGTM overall, a suggestion below.

CheckCodecRoundtrip(c1, c2, data, /*check_reverse=*/false);
}

TEST(TestCodecLZ4Hadoop, MultiBlockRoundtrip) {


Looks like the 3 additional tests have essentially the same structure and could be folded into one for conciseness, could you do that?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 2, 2026
@pitrou
Member

pitrou commented Apr 2, 2026

By the way @clee704, since this is about generating new files, why not use the newer LZ4_RAW which completely solves the Hadoop compatibility problem?

…doop compatibility

Arrow's Lz4HadoopCodec::Compress writes the entire input as a single
Hadoop-framed LZ4 block.  Hadoop's Lz4Decompressor allocates a fixed
256 KiB output buffer per block (IO_COMPRESSION_CODEC_LZ4_BUFFERSIZE),
so any block whose decompressed size exceeds 256 KiB causes an
LZ4Exception on the JVM reader.

This is a read failure, not data corruption -- the compressed bytes
are valid, but Hadoop-based JVM readers cannot decompress them.

Fix: split input into blocks of at most 256 KiB uncompressed, each
with its own [decompressed_size, compressed_size] big-endian prefix,
matching Hadoop's BlockCompressorStream behavior.  Arrow's reader
(TryDecompressHadoop) already handles multiple blocks.

Closes apache#49641.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
clee704 force-pushed the fix-lz4-hadoop-block-size branch from d8b71e9 to 97385c8 on April 2, 2026 17:49