Support disabling automatic decompression of gzip files in GCS connector #1060

blackvvine opened this issue Oct 6, 2023

Summary

Hadoop's default behaviour is to automatically decompress files with a .gz extension: CompressionCodecFactory selects a codec based purely on the file-name suffix, and input formats such as TextInputFormat wrap the stream with that codec.
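A minimal sketch of this extension-based lookup (assuming hadoop-common on the classpath; the bucket and object names are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
  public static void main(String[] args) {
    // CompressionCodecFactory matches codecs by file-name suffix alone.
    CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
    CompressionCodec codec = factory.getCodec(new Path("gs://my-bucket/data.txt.gz"));
    // Prints org.apache.hadoop.io.compress.GzipCodec, chosen purely
    // because the name ends in ".gz".
    System.out.println(codec == null ? "no codec" : codec.getClass().getName());
  }
}
```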

When gzip encoding support is enabled (fs.gs.inputstream.support.gzip.encoding.enable=true) and a gzip-encoded object whose name ends in .gz is read from GCS, both the GCS connector and Hadoop's compression layer attempt to decompress the stream, leading to errors like:

Caused by: java.io.IOException: incorrect header check
	at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
	at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:227)
	at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
[...]
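A minimal repro sketch of the double decompression (assuming the GCS connector is on the classpath and an object uploaded with Content-Encoding: gzip under a .gz name; the path is a placeholder):

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DoubleGunzipRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.gs.inputstream.support.gzip.encoding.enable", true);

    // Placeholder object: stored with Content-Encoding: gzip and a .gz name.
    Path path = new Path("gs://my-bucket/data.txt.gz");
    FileSystem fs = path.getFileSystem(conf);

    // The connector already returns decompressed bytes here...
    InputStream fromConnector = fs.open(path);

    // ...and the extension-based codec (which TextInputFormat applies
    // internally) then tries to gunzip the plain text a second time.
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
    InputStream doublyDecoded = codec.createInputStream(fromConnector);
    doublyDecoded.read(); // throws java.io.IOException: incorrect header check
  }
}
```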

Expected Behaviour

Since disabling the gzip decompression behaviour in Hadoop is not possible without modifying the hadoop-core library, it would be helpful if the GCS connector could automatically skip its own decompression when the file extension is .gz, or at least expose a configuration property for disabling the automatic decompression.
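For illustration only, the requested knob might look something like the following; the property name here is hypothetical and does not exist in the connector:

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Hypothetical property, illustrating the shape of the request only;
// the connector does not currently implement it.
conf.setBoolean("fs.gs.inputstream.gzip.decompression.disable", true);
```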

Current Workarounds

Either unset the Content-Encoding: gzip metadata field on the GCS object (so the connector does not decompress it), or remove the .gz extension from the object name (so Hadoop's codec layer does not).
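A sketch of the first workaround using the google-cloud-storage Java client (assuming default credentials; bucket and object names are placeholders). The same metadata change can also be made with gsutil setmeta or the Cloud Console:

```java
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class ClearContentEncoding {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();

    // Placeholder bucket/object names.
    Blob blob = storage.get("my-bucket", "path/to/data.txt.gz");
    if (blob != null && "gzip".equals(blob.getContentEncoding())) {
      // With no Content-Encoding: gzip on the object, the connector serves
      // the raw bytes and only Hadoop's codec layer decompresses them.
      blob.toBuilder().setContentEncoding(null).build().update();
    }
  }
}
```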
