Amending annex D to promote zstd instead of gzip #53
Comments
Speaking personally, given WARC's role as an archival standard, there should be a high bar to changing the recommendation from gzip to another compression format. For many IIPC members I suspect ubiquity, longevity and format stability are more important criteria than performance. While zstd is currently one of the more promising new-generation compression formats, it's not clear, given its relatively short life so far, whether it will become anywhere near as widespread and long-lasting as gzip. The ISO standard is very slow moving and there seems to be a general consensus that only changes with mature implementations should be included. That probably goes double for a change like this one, which would be incompatible with existing tools.
That said, there may very well be an audience today for better-performing WARC compression, particularly for research and analysis use cases. Consequently, I suggest the path forward would instead be to first create a separate document specifying how to use WARC with zstd. I wouldn't expect it to be very long and, like annex D does for gzip, it would specify details like per-record compression, the file naming convention, maybe the use of a dictionary, etc. This can be published on the warc-specifications website (via pull request).
Down the track, if there's broad implementation and adoption and it's clear that zstd has staying power, then we could consider incorporating a new recommendation into annex D.
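For readers less familiar with what per-record compression buys you with gzip today, here is a minimal Python sketch using only the standard library. It is an illustration, not text from annex D: the helper names `write_records` and `read_record_at` are hypothetical, and the record bytes are placeholders rather than valid WARC records.

```python
import gzip
import zlib


def write_records(records):
    """Compress each record as its own gzip member and concatenate them.

    Returns the combined bytes plus the byte offset at which each member starts.
    """
    blob = bytearray()
    offsets = []
    for record in records:
        offsets.append(len(blob))
        blob += gzip.compress(record)
    return bytes(blob), offsets


def read_record_at(blob, offset):
    """Decompress only the gzip member that starts at `offset`."""
    # wbits=31 (16 + 15) tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(wbits=31)
    # Decoding stops at the end of the first member; the remaining bytes
    # are left untouched in decoder.unused_data.
    return decoder.decompress(blob[offset:])


# Placeholder payloads standing in for WARC records (assumption).
records = [b"record one\r\n\r\n", b"record two\r\n\r\n", b"record three\r\n\r\n"]
blob, offsets = write_records(records)

# Random access: jump straight to the second record using its offset.
second = read_record_at(blob, offsets[1])

# The whole file is still a valid multi-member gzip stream.
everything = gzip.decompress(blob)
```

The point of the design is that an index (such as a CDX file) only needs to store a byte offset per record, and a reader can decompress one record without touching the rest of the file. Any zstd document would presumably need to preserve that property.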
Well said @ato. Repeating for emphasis: "there seems to be a general consensus only changes with mature implementations should be included"
It should be noted that Archive Team is already using Zstandard for WARC files, perhaps irresponsibly. They're using it with dictionaries, which are prepended to each
Please also note the release notes of Wget-AT with ZSTD support at https://github.com/ArchiveTeam/wget-lua/releases/tag/v1.20.3-at.20200401.01.
I haven't looked in some time at my code referenced in the original post, but I would be happy to revisit the code if anyone finds it useful. I would also be happy to write C, C++, and PHP draft implementations.
Would definitely like to support ZStd in warcio, at least reading existing ZStd WARCs at first. Created an issue here to track: webrecorder/warcio#118
Happy to contribute a Java implementation (both reader and writer).
Does anyone feel like writing up a specification of how it works that we could add as a document here? Sounds like there are some important details to get right.
I've been meaning to write one based on our implementation for a while but haven't gotten around to it yet. I'll try to find some time for it soon.
I documented Archive Team's dictionary-in-stream format and I'm submitting it to the zstd project: facebook/zstd#2349. I figure other projects might want to use it as well.
As far as defining the ZStd WARC format, there are several questions that would need to be resolved first.
Can a file have multiple dictionaries?
I'd say no for simplicity, which means
Should external dictionaries be allowed?
I'd say no because it's too easy to end up with a missing dictionary. The compressed dictionaries are <1MB, so including them won't noticeably increase the size of most WARC files.
What window sizes should be allowed?
This determines the RAM needed for decompression. The Zstandard format itself allows an unlimited size, and recommends that decompressors support at least N=22 (8MB).
Should the optional Zstandard features be required?
I'd like to require a checksum and a dictionary ID (if applicable) for every frame, so Zstandard bugs can be detected. I don't see any reason to require the content size field.
Lovely, thank you!
Can a file have multiple dictionaries?
@JustAnotherArchivist Basically, yes, the only state the zstd decompressor maintains between frames is the current dictionary. Using
If we allow multiple dictionaries, the question is how to enable random access. One option is to put all the dictionaries at the start of the file and use dictionary IDs to distinguish between them, but that means you have to load an arbitrary number of dictionaries to access one record. Another option is to leave the dictionaries wherever and add a field to the CDX file that tells you where to find the dictionary for each record. (While we're at it, it would be nice if people added non-response records to their CDX files...) You'd need to do something special when combining a dictionaryless
What window sizes should be allowed?
I agree about not being restrictive. We could say "decoders must support window sizes up to the standard 8 MiB, but support for larger window sizes is optional." Note that huge windows are only useful for huge records.
Should the content size field be required?
The problem here is that if the record is too large to fit in RAM, the WARC writer will want to start saving it to disk before it knows the total size. There's no easy way with
Can one record have multiple Zstandard frames?
I say yes, to make extremely large records more manageable. Multiple frames would also enable parallelism for compression and decompression, even for a single record. (They would also help if someone wanted to experiment with seeking within a record.)
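To make the window-size rule concrete, here is a small Python sketch of how a reader could check a frame against the 8 MiB limit quoted above. The decoding formula is the Window_Descriptor rule from the Zstandard format description (RFC 8878); the function names and the `MAX_REQUIRED_WINDOW` constant are hypothetical and only illustrate the proposed wording, they are not part of any agreed specification.

```python
# The 8 MiB limit from the proposed wording above (assumed, not yet specified).
MAX_REQUIRED_WINDOW = 8 * 1024 * 1024


def window_size(descriptor: int) -> int:
    """Return Window_Size in bytes for a Zstandard Window_Descriptor byte.

    The top 5 bits are the exponent, the low 3 bits the mantissa.
    Frames with Single_Segment_flag set have no Window_Descriptor; for
    those the window size equals the frame content size instead.
    """
    exponent = descriptor >> 3
    mantissa = descriptor & 0x07
    window_base = 1 << (10 + exponent)
    window_add = (window_base // 8) * mantissa
    return window_base + window_add


def decoder_must_support(descriptor: int) -> bool:
    """True if a conforming decoder would be required to handle this frame."""
    return window_size(descriptor) <= MAX_REQUIRED_WINDOW


smallest = window_size(0)            # exponent 0, mantissa 0
eight_mib = window_size(13 << 3)     # exponent 13, mantissa 0
nine_mib = window_size((13 << 3) | 1)  # exponent 13, mantissa 1
```

A reader that enforces the limit can reject or warn on oversized frames before allocating the window, which is what bounds the RAM needed for decompression.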
Seems like the two of us, at least, are converging. Good point about the
I wrote a draft specification at #69. Let's move discussion there.
Another good alternative to GZip would be LZ4. In fact, we have already implemented it in our WARC reader library FastWARC: https://resiliparse.chatnoir.eu/en/stable/man/fastwarc.html#benchmarks
The downside of LZ4 is slightly worse compression ratios (files are about 30% larger than with GZip), but it is at least 5x as fast to decompress (quite a bit faster than even zstd). The frame format spec is stable and works just like GZip, so multiple LZ4 WARCs can be concatenated.
Hello! Not sure if this is the correct forum, so please feel free to redirect me if needed.
At present, the WARC 1.1 specification includes an informative annex D that recommends using gzip to compress individual captures if so desired.
Would the IIPC be open to changing this recommendation to zstd? zstd is an open-source, non-patent encumbered algorithm released by Facebook. It is technically superior to gzip along many axes:
It is comparable to gzip along other important axes, namely being open source and having bindings for all major languages.
Ben Wills has done some analysis on the impact of zstd vs gzip for the Common Crawl. You can read his analysis at https://github.com/benwills/proposal-warc-to-zstandard, or some discussion on the Common Crawl mailing list at https://groups.google.com/forum/?hl=en#!topic/common-crawl/bO6B6xQJnEE. For the portion of the Common Crawl that was analyzed, it results in an ~18% decrease in storage size and ~3x throughput for readers.
Additionally, zstd provides the zlibWrapper, which transparently supports decompressing zlib or zstd streams. This should help make the migration path easier for people who have some collection of archives already stored in zlib format.
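As a rough illustration of what transparent support can look like on the reader side, here is a Python sketch that picks a decoder by looking at the first bytes of a file. This is not zlibWrapper itself (that is a C library shipped with zstd); it is a hypothetical helper, `sniff_format`, built only on the published magic numbers for gzip and Zstandard.

```python
import struct

GZIP_MAGIC = b"\x1f\x8b"
ZSTD_FRAME_MAGIC = 0xFD2FB528          # stored little-endian on disk
ZSTD_SKIPPABLE_MIN = 0x184D2A50        # skippable frames use 0x184D2A50..0x184D2A5F
ZSTD_SKIPPABLE_MAX = 0x184D2A5F


def sniff_format(head: bytes) -> str:
    """Guess the container format from the first few bytes of a WARC file."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(b"WARC/"):
        return "uncompressed"
    if len(head) >= 4:
        (magic,) = struct.unpack("<I", head[:4])
        if magic == ZSTD_FRAME_MAGIC:
            return "zstd"
        if ZSTD_SKIPPABLE_MIN <= magic <= ZSTD_SKIPPABLE_MAX:
            # A skippable frame at the start of the file; a reader would
            # have to inspect it (for example, for embedded metadata)
            # before decoding the frames that follow.
            return "zstd-skippable"
    return "unknown"


kind_gzip = sniff_format(b"\x1f\x8b\x08\x00")
kind_zstd = sniff_format(bytes.fromhex("28b52ffd"))
kind_plain = sniff_format(b"WARC/1.1\r\n")
```

A reader built this way can keep serving existing gzip archives unchanged while accepting zstd files as they appear, which is the kind of migration path described above.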