[Data Liberation] Re-entrant gzip decoder #2002

Open
adamziel opened this issue Nov 18, 2024 · 1 comment

Comments

@adamziel
Collaborator

Let's build a custom GZip decoder to explore re-entrant decompression.

There are two ways to do re-entrant stream decoding of large zip files:

  1. Seek to the relevant file and decode its gzipped data from the beginning until a specific offset is reached. For remote files, this requires downloading potentially hundreds of gigabytes of data.
  2. Seek to the last decoded gzip block and start directly there. This has very little overhead.
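
For illustration, here is a minimal sketch of the first approach. It uses Python's `zlib` module instead of PHP purely to keep the example short. The function name `decode_from_start` and its parameters are made up for this sketch. Every resume feeds the whole compressed prefix through the decoder again and throws away the output that precedes the target offset:

```python
import gzip
import zlib

def decode_from_start(compressed, target_offset, chunk_size=4096):
    """Decode a gzip stream from byte 0 and return the output from target_offset on."""
    decomp = zlib.decompressobj(wbits=31)  # 16 + 15: expect a gzip header and trailer
    produced = 0  # uncompressed bytes seen so far
    kept = []
    for start in range(0, len(compressed), chunk_size):
        out = decomp.decompress(compressed[start:start + chunk_size])
        # Drop the part of `out` that lies before the target offset.
        skip = max(0, target_offset - produced)
        kept.append(out[skip:])
        produced += len(out)
    return b"".join(kept)

# Usage: resuming at offset 100_000 still decodes all earlier bytes.
data = bytes(range(256)) * 1000
tail = decode_from_start(gzip.compress(data), 100_000)
```

The cost grows with the offset: the closer the resume point is to the end of the stream, the more data is downloaded and decoded only to be discarded.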

PHP has built-in functions for working with gzip compression, but they don't expose the internal decoder state, such as the last block start or the current Huffman code dictionary. Decompression can only continue from an arbitrary offset if that state is available. The first method listed above recomputes it by decoding again from the start. The second method saves that work by restoring the state directly.

Therefore, to support resuming zip / gzip processing, we need a custom-built GZip decoder capable of exposing and restoring its internal state.
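
To make the second approach concrete, here is a rough sketch in Python (again not PHP, and not the proposed decoder). It relies on a simplifying assumption: the compressor forces a byte-aligned block boundary with `Z_SYNC_FLUSH` after each chunk. Under that assumption, the only state needed to resume is the compressed offset plus the last 32 KiB of output, which zlib accepts as a preset dictionary for raw DEFLATE streams. The names `compress_with_checkpoints` and `resume_from` and the checkpoint layout are hypothetical:

```python
import zlib

WINDOW_SIZE = 32 * 1024  # DEFLATE back-references reach at most 32 KiB back

def compress_with_checkpoints(chunks):
    """Raw-DEFLATE compress `chunks`, recording a checkpoint after each one."""
    comp = zlib.compressobj(wbits=-15)  # negative wbits: raw DEFLATE, no header
    compressed, plain, checkpoints = b"", b"", []
    for chunk in chunks:
        # Z_SYNC_FLUSH ends the current block on a byte boundary.
        compressed += comp.compress(chunk) + comp.flush(zlib.Z_SYNC_FLUSH)
        plain += chunk
        checkpoints.append({
            "compressed_offset": len(compressed),
            "uncompressed_offset": len(plain),
            "window": plain[-WINDOW_SIZE:],  # history later blocks may refer to
        })
    compressed += comp.flush()  # Z_FINISH
    return compressed, checkpoints

def resume_from(compressed, checkpoint):
    """Decode everything after a checkpoint without touching earlier input."""
    window = checkpoint["window"]
    if window:
        # Restore the sliding window by passing it as a preset dictionary.
        decomp = zlib.decompressobj(wbits=-15, zdict=window)
    else:
        decomp = zlib.decompressobj(wbits=-15)
    return decomp.decompress(compressed[checkpoint["compressed_offset"]:])
```

Arbitrary zip and gzip files don't come with such flush points. Their blocks can start in the middle of a byte, so a general decoder would also have to save the bit offset, and the Huffman tables if it stops partway through a block. That is the part the built-in functions don't expose.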

A part of #1894

@adamziel
Collaborator Author

Here's a buggy GZip decoder generated by AI. It decodes the first block correctly but then it gets confused:

https://gist.github.com/adamziel/9067b209c43fce126bdbdc2106b2c210
