gunzip truncates concatenated gzip streams #56

Open · lpsmith opened this issue Jun 10, 2016 · 13 comments

lpsmith commented Jun 10, 2016

I ran across this issue when working on my ftp-monitor project.

Anyway, here's a fairly minimal reimplementation of zcat, which is sufficient to demonstrate the problem:

module Main (main) where

import qualified System.IO as IO
import qualified System.IO.Streams as Streams
import           System.Environment (getArgs)

main :: IO ()
main = do
    (filename:_) <- getArgs
    IO.withFile filename IO.ReadMode $ \h -> do
        inS <- Streams.handleToInputStream h >>= Streams.gunzip
        Streams.connect inS Streams.stdout

Now, if I run this on a big file, the output is truncated. (Curiously, on this and several other files, the output happens to be cut off at a newline.)

lpsmith@church:~/src/ftp-monitor/zcat
$ wc <(cat ~/logs/cas/d5941d7e89c0d6306fc7eaa003c6389ad9e1dd7ba8b429fd0a9038344443e346.gz)
  28291  161049 7303701 /dev/fd/63
$ wc <(dist/build/zcat/zcat ~/logs/cas/d5941d7e89c0d6306fc7eaa003c6389ad9e1dd7ba8b429fd0a9038344443e346.gz)
  14345  313055 4402405 /dev/fd/63
$ wc <(zcat ~/logs/cas/d5941d7e89c0d6306fc7eaa003c6389ad9e1dd7ba8b429fd0a9038344443e346.gz)
 454378 9861255 139101003 /dev/fd/63
$ sha256sum <(dist/build/zcat/zcat ~/logs/cas/d5941d7e89c0d6306fc7eaa003c6389ad9e1dd7ba8b429fd0a9038344443e346.gz)
ed347ba497088e0aaeb7a69d5bd694218c93a361b5683edbda73c22ff45dedd8  /dev/fd/63
$ sha256sum <(zcat ~/logs/cas/d5941d7e89c0d6306fc7eaa003c6389ad9e1dd7ba8b429fd0a9038344443e346.gz)
d5941d7e89c0d6306fc7eaa003c6389ad9e1dd7ba8b429fd0a9038344443e346  /dev/fd/63

I'll investigate this more deeply and see if I can track down the problem, but it would appear that the bug is in io-streams.

lpsmith commented Jun 10, 2016

Actually, I'm wondering if io-streams' zlib integration is incomplete, because the output also appears to be getting truncated at the first day of the month.

lpsmith commented Jun 10, 2016

Ok, it turns out that it's entirely legitimate to concatenate gzip streams, and gunzip and zcat will unzip each stream and concatenate the results.

So for example:

$ echo Hello | gzip >> foo.gz
$ echo World | gzip >> foo.gz
$ ./dist/build/zcat/zcat foo.gz 
Hello
$ zcat foo.gz 
Hello
World
$ gunzip foo.gz
$ cat foo
Hello
World

So my "minimal reimplementation" of zcat isn't. One could argue that this isn't a bug in io-streams; however, the semantics of Streams.gunzip are rather different from those of unix gunzip, which is horrible UX.

At the very least, this needs to be mentioned in the documentation, and we probably need to implement a proper gunzip analog. Ideally, I think Streams.gunzip should be renamed (perhaps to gunzipOne?) and the proper analog should take the Streams.gunzip name.

lpsmith changed the title from "Minimal re-implementation of zcat results in truncated output on big files" to "Semantics of io-streams gunzip doesn't match the semantics of unix gunzip" on Jun 10, 2016

lpsmith commented Jun 10, 2016

And, as it turns out, this doesn't work:

import           Data.ByteString (ByteString)
import           Data.IORef (newIORef, readIORef, writeIORef)
import           System.IO.Streams (InputStream)
import qualified System.IO.Streams as Streams

gunzips :: InputStream ByteString -> IO (InputStream ByteString)
gunzips = multiTransformInputStream Streams.gunzip

multiTransformInputStream :: (InputStream a -> IO (InputStream b))
                          -> (InputStream a -> IO (InputStream b))
multiTransformInputStream trans input = do
    -- hold the current transformed stream; replace it with a fresh one
    -- each time it runs dry but the underlying input has more to give
    ref <- newIORef =<< trans input
    Streams.makeInputStream (readS ref)
  where
    readS ref = do
      mb <- Streams.read =<< readIORef ref
      case mb of
        (Just _b) -> return mb
        Nothing   -> do
          isAtEOF <- Streams.atEOF input
          if isAtEOF
            then return Nothing
            else do
              writeIORef ref =<< trans input
              readS ref

The issue is that the first time multiTransformInputStream reaches the end of a stream decompressed by Streams.gunzip, Streams.atEOF returns True on the compressed stream. In other words, Streams.gunzip consumed the whole compressed input (7 MB in my first example) but only produced the first decompressed stream (4 of 140 MB in my first example). This means I can't simply work around the issue; I have to bypass io-streams' zlib integration altogether. I no longer think there's an argument that Streams.gunzip is correct.
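
A minimal sketch making that consumption behaviour concrete (it assumes the two-member foo.gz built in the earlier comment): draining the gunzipped stream leaves the underlying compressed stream at EOF, even though only the first member was ever produced.

import qualified System.IO as IO
import qualified System.IO.Streams as Streams

-- Assumes foo.gz contains two concatenated gzip members ("Hello", "World").
demo :: IO ()
demo = IO.withFile "foo.gz" IO.ReadMode $ \h -> do
    raw <- Streams.handleToInputStream h
    unz <- Streams.gunzip raw
    Streams.skipToEof unz         -- drains only the first member ("Hello")
    print =<< Streams.atEOF raw   -- True: the compressed bytes for "World"
                                  -- were consumed but never decompressed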

lpsmith changed the title from "Semantics of io-streams gunzip doesn't match the semantics of unix gunzip" to "gunzip truncates multi-gzip streams" on Jun 10, 2016
lpsmith changed the title from "gunzip truncates multi-gzip streams" to "gunzip truncates concatenated gzip streams" on Jun 10, 2016

lpsmith commented Jun 11, 2016

Alright, I inspected zlib-bindings, and it is clear that the interface it currently exposes fundamentally does not support this use case. So I jaunted over to the world of conduit to see what the story is there; here are the relevant issues: fpco/streaming-commons#20 and snoyberg/conduit#254

So at the very least, we are going to have to port the patches from streaming-commons to zlib-bindings, or possibly switch to streaming-commons.

lpsmith commented Jun 11, 2016

Ok, as far as streaming-commons's dependencies are concerned, eyeballing it, it looks like the only transitive dependencies it would add to io-streams (that aren't already packaged with GHC) are async, blaze-builder, random, and stm. These are all smallish libraries with minimal dependencies of their own.

So, I'd be willing to make the changes necessary to zlib-bindings, but I wanted to hear your opinion on moving to streaming-commons.

hvr commented Jun 11, 2016

What would we need to add to zlib to keep io-streams free of streaming-commons?

lpsmith commented Jun 11, 2016

zlib-bindings needs the ability to signal the end of a gzip stream, and to return any leftover input that was fed to it but isn't part of that stream.

This appears to be the relevant patch to streaming-commons: fpco/streaming-commons@b466686
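
As an illustration (the names below are hypothetical, not actual zlib-bindings or streaming-commons identifiers), the inflater's result type needs a terminal case that hands back the unconsumed input, so the caller can start a fresh inflater on the remaining bytes:

import Data.ByteString (ByteString)

-- Hypothetical sketch of the result type an incremental inflater needs.
-- The crucial case is the last one: at the end of a gzip member, any
-- input fed in beyond that member's end must be returned to the caller.
data InflateResult
    = InflateChunk ByteString   -- a chunk of decompressed output
    | InflateNeedsInput         -- feed more compressed input
    | InflateDone ByteString    -- member finished; leftover input returned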

hvr commented Jun 11, 2016

@lpsmith btw, I really meant zlib; I seem to recall that there were some issues/limitations that kept io-streams from switching from zlib-bindings (which iirc was declared deprecated) to zlib.

Not too long ago I imported zlib into GitHub at https://github.com/haskell/zlib in the hope of getting zlib to the point where it can pick up former users of zlib-bindings.

lpsmith commented Jun 12, 2016

@hvr, ahh, thank you for the clarification.

Eyeballing it, it would appear that the DecompressStream type from the Codec.Compression.Zlib.Internal module, available from zlib-0.6 onwards, should be capable of covering this use case. However, at first glance, I don't see anything in zlib versions < 0.6 that would work.

So it looks like, as of last year, there is nothing preventing io-streams from depending on zlib alone. However, how stable is that interface going to be?
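
To sketch how that might look (untested, based on my reading of the zlib-0.6 incremental API; gunzipAll is a hypothetical name, not an existing io-streams function): the DecompressStreamEnd case carries the unconsumed input, which is exactly what a unix-style gunzip needs to restart decompression on the next member.

import           Codec.Compression.Zlib.Internal
                   (DecompressStream (..), decompressIO,
                    defaultDecompressParams, gzipFormat)
import           Control.Monad (unless)
import qualified Data.ByteString as B
import           Data.IORef (newIORef, readIORef, writeIORef)
import           Data.Maybe (fromMaybe)
import           System.IO.Streams (InputStream)
import qualified System.IO.Streams as Streams

-- Decompress a sequence of concatenated gzip members, matching the
-- behaviour of unix gunzip/zcat.
gunzipAll :: InputStream B.ByteString -> IO (InputStream B.ByteString)
gunzipAll input = do
    ref <- newIORef (decompressIO gzipFormat defaultDecompressParams)
    Streams.makeInputStream (step ref)
  where
    step ref = do
      s <- readIORef ref
      case s of
        DecompressInputRequired supply -> do
          -- An empty chunk tells the decoder there is no more input.
          chunk <- fromMaybe B.empty <$> Streams.read input
          supply chunk >>= writeIORef ref
          step ref
        DecompressOutputAvailable out next -> do
          next >>= writeIORef ref
          return (Just out)
        DecompressStreamEnd leftover -> do
          -- Push back the bytes past the end of this member, then
          -- start a fresh decoder if any input remains.
          unless (B.null leftover) $ Streams.unRead leftover input
          done <- Streams.atEOF input
          if done
            then return Nothing
            else do
              writeIORef ref (decompressIO gzipFormat defaultDecompressParams)
              step ref
        DecompressStreamError err ->
          ioError (userError (show err))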

hvr commented Jun 12, 2016

@lpsmith ah, now I think I remember what the missing thing was: flushing support in the compression stream. That's where the API I imitated for lzma diverges from zlib's state-machine data type:

http://hackage.haskell.org/package/lzma-0.0.0.2/docs/Codec-Compression-Lzma.html#t:CompressStream

Note the additional field in CompressInputRequired; I had to add this to implement lzma-streams properly.
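
Roughly, the divergence looks like this (the shape below is illustrative only, not the exact fields of either package): the input-required state gains a second continuation that flushes pending output without supplying more input.

import Data.ByteString (ByteString)

-- Illustrative shape only, not the exact API of zlib or lzma.
data CompressStream m
  = CompressInputRequired
      { supplyInput  :: ByteString -> m (CompressStream m)
        -- the extra field: emit everything buffered so far,
        -- without ending the stream
      , flushPending :: m (CompressStream m)
      }
  | CompressOutputAvailable ByteString (m (CompressStream m))
  | CompressStreamEnd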

I'll talk to @dcoutts, but I think after adding support for flushing (haskell/zlib#6) we should promote the incremental API into a non-.Internal module.

lpsmith commented Jun 12, 2016

Well, I'm looking at just decompression at the moment, and I don't anticipate needing compression anytime soon for my current project. I should probably code up a proof of concept, though.

lpsmith commented Jun 12, 2016

Hmm, actually, your lzma package looks very nice. I might adopt it in some of my projects; the last time I looked at lzma bindings on Hackage, I decided to stick with piping my data through xz subprocesses instead.

lpsmith commented Jun 12, 2016

Ok, I have a proof of concept done for the decompression side. Take a look at lpsmith/io-streams@f39178b

lpsmith referenced this issue in lpsmith/io-streams Aug 17, 2017
Needs more testing, review, and work:

1.  Update the test suite
2.  Research the issue of concatenating zlib streams
3.  Review various semantic considerations, especially in
    relation to the unix command "gunzip"

But, it works for me