
Amending annex D to promote zstd instead of gzip #53

Closed
cldellow opened this issue Jun 20, 2019 · 16 comments · Fixed by #69

Comments

@cldellow

Hello! Not sure if this is the correct forum, so please feel free to redirect me if needed.

At present, the WARC 1.1 specification includes an informative annex D that recommends using gzip to compress individual captures if so desired.

Would the IIPC be open to changing this recommendation to zstd? zstd is an open-source, non-patent encumbered algorithm released by Facebook. It is technically superior to gzip along many axes:

  • compression ratio - for a given amount of CPU time, zstd will produce a smaller output than gzip
  • compression time - for a given level of compression, zstd compresses 3-4x faster
  • decompression time - regardless of compression level, zstd is ~3x faster to decompress
  • CPU cost vs storage size tradeoffs - zstd supports a much wider range of compression speed/ratio choices than zlib, allowing people to tune for CPU cost vs long-term storage cost

It is comparable to gzip along other important axes, namely being open source and having bindings for all major languages.

Ben Wills has done some analysis on the impact of zstd vs gzip for the Common Crawl. You can read his analysis at https://github.com/benwills/proposal-warc-to-zstandard, or some discussion on the Common Crawl mailing list at https://groups.google.com/forum/?hl=en#!topic/common-crawl/bO6B6xQJnEE. For the portion of the Common Crawl that was analyzed, it results in an ~18% decrease in storage size and ~3x throughput for readers.

Additionally, zstd provides the zlibWrapper, which transparently supports decompressing zlib or zstd streams - this should help make the migration path easier for people who have some collection of archives already stored in zlib format.
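For readers, telling the two schemes apart is a four-byte check at the start of each stream, which is essentially what zlibWrapper automates in C. A minimal stdlib Python sketch (magic values taken from the gzip and Zstandard format specs):

```python
import gzip

# Magic numbers: gzip members start with 1f 8b; zstd frames start with the
# little-endian encoding of 0xFD2FB528; zstd skippable frames use magics
# 0x184D2A50 through 0x184D2A5F.
GZIP_MAGIC = b"\x1f\x8b"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"

def sniff_compression(prefix: bytes) -> str:
    """Classify a compressed WARC stream by its leading bytes."""
    if prefix.startswith(GZIP_MAGIC):
        return "gzip"
    if prefix.startswith(ZSTD_MAGIC):
        return "zstd"
    if len(prefix) >= 4 and prefix[1:4] == b"\x2a\x4d\x18" and 0x50 <= prefix[0] <= 0x5F:
        return "zstd-skippable"  # e.g. a prepended dictionary frame
    return "unknown"
```

A migrating tool could dispatch on this result and fall back to its existing gzip path unchanged.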

@ato
Member

ato commented Jun 22, 2019

Speaking personally, given WARC's role as an archival standard there should be a high bar to changing the recommendation from gzip to another compression format. For many IIPC members I suspect ubiquity, longevity and format stability are more important criteria than performance. While zstd is currently one of the more promising new generation compression formats it's not clear given its relatively short life so far if it will become anywhere near as widespread and long lasting as gzip.

The ISO standard is very slow moving and there seems to be a general consensus only changes with mature implementations should be included. That probably goes double for a tools-incompatible change like this one. That said, there may very well be an audience today for better performing WARC compression, particularly for research and analysis use cases.

Consequently I suggest a path forward for this would instead be to first create a separate document specifying how to use WARC with zstd. I wouldn't expect it to be very long and like annex D does for gzip it would specify details like per-record compression, the file naming convention, maybe the use of a dictionary etc. This can be published on the warc-specifications website (via pull request). Down the track if there's broad implementation and adoption and it's clear that zstd has staying power then we could consider incorporating a new recommendation into annex D.

@nlevitt
Member

nlevitt commented Jun 24, 2019

Well said @ato. Repeating for emphasis: "there seems to be a general consensus only changes with mature implementations should be included"

@yotann2

yotann2 commented Oct 6, 2020

It should be noted that Archive Team is already using Zstandard for WARC files, perhaps irresponsibly. They're using it with dictionaries, which are prepended to each .warc.zst file using an undocumented header. Relevant source code in megawarc, wget-lua, and zstd-dictionary-trainer.

@Arkiver2
Contributor

Arkiver2 commented Oct 6, 2020

Please also note the release notes of Wget-AT with ZSTD support at https://github.com/ArchiveTeam/wget-lua/releases/tag/v1.20.3-at.20200401.01.

@benwills

benwills commented Oct 6, 2020

I haven't looked in some time at my code referenced in the original post, but I would be happy to revisit the code if anyone finds it useful. I would also be happy to write C, C++, and PHP draft implementations.

@ikreymer
Member

ikreymer commented Oct 6, 2020

Would definitely like to support ZStd in warcio, at least reading existing ZStd WARCs at first. Created an issue here to track: webrecorder/warcio#118

@sebastian-nagel

Happy to contribute a Java implementation (both reader and writer).

@ato
Member

ato commented Oct 7, 2020

Does anyone feel like writing up a specification of how it works that we could add as a document here? Sounds like there are some important details to get right.

@JustAnotherArchivist

I've been meaning to write one based on our implementation for a while but didn't get around to it yet. I'll try to find some time for it soon.

@yotann2

yotann2 commented Oct 11, 2020

I documented Archive Team's dictionary-in-stream format and I'm submitting it to the zstd project: facebook/zstd#2349. I figure other projects might want to use it as well.
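The layout is just a zstd skippable frame prepended to the stream, so standard decoders skip it automatically. A rough stdlib sketch of writing and reading it; the specific magic value 0x184D2A5D is my reading of the facebook/zstd#2349 proposal, so treat it as an assumption:

```python
import struct

# Magic assumed from the facebook/zstd#2349 proposal: a skippable frame
# with this magic carries the (possibly itself zstd-compressed) dictionary.
DICT_FRAME_MAGIC = 0x184D2A5D

def wrap_dictionary(dict_bytes: bytes) -> bytes:
    """Wrap a dictionary in a skippable frame: magic + LE32 length + payload."""
    return struct.pack("<II", DICT_FRAME_MAGIC, len(dict_bytes)) + dict_bytes

def read_dictionary(stream_prefix: bytes):
    """Return (dictionary, offset of first real frame), or (None, 0) if absent."""
    if len(stream_prefix) < 8:
        return None, 0
    magic, size = struct.unpack_from("<II", stream_prefix)
    if magic != DICT_FRAME_MAGIC:
        return None, 0
    return stream_prefix[8:8 + size], 8 + size
```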

As far as defining the ZStd WARC format, there are several questions that would need to be resolved first.

Can a file have multiple dictionaries? I'd say no for simplicity, which means .warc.zst files can't be concatenated.

Should external dictionaries be allowed? I'd say no because it's too easy to end up with a missing dictionary. The compressed dictionaries are <1MB, so including them won't noticeably increase the size of most WARC files.

What window sizes should be allowed? This determines the RAM needed for decompression. The Zstandard format itself allows effectively unlimited window sizes, but recommends that decompressors support windows of at least 8 MB (window log 23). libzstd supports window logs up to 31 if you use the --ultra or --long options.

Should the optional Zstandard features be required? I'd like to require a checksum and a dictionary ID (if applicable) for every frame, so Zstandard bugs can be detected. I don't see any reason to require the content size field.
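All three of these flags live in the frame header, so a validator can check them without decompressing anything. A sketch following my reading of the frame-header layout in RFC 8878 (descriptor bits for single-segment, checksum, and dictionary-ID size, then the window descriptor):

```python
import struct

ZSTD_MAGIC = 0xFD2FB528

def parse_frame_header(frame: bytes) -> dict:
    """Decode window size, checksum flag, and dictionary ID from a zstd
    frame header, per the layout in RFC 8878 (sketch, not exhaustive)."""
    magic, = struct.unpack_from("<I", frame, 0)
    assert magic == ZSTD_MAGIC, "not a zstd frame"
    fhd = frame[4]                        # Frame_Header_Descriptor byte
    single_segment = bool(fhd & 0x20)
    has_checksum = bool(fhd & 0x04)
    dict_id_size = (0, 1, 2, 4)[fhd & 0x03]
    pos = 5
    window_size = None
    if not single_segment:
        wd = frame[pos]; pos += 1         # Window_Descriptor byte
        exponent, mantissa = wd >> 3, wd & 0x07
        window_base = 1 << (10 + exponent)
        window_size = window_base + (window_base >> 3) * mantissa
    dict_id = int.from_bytes(frame[pos:pos + dict_id_size], "little") if dict_id_size else 0
    return {"window_size": window_size, "has_checksum": has_checksum, "dict_id": dict_id}
```

So "require a checksum and a dictionary ID" reduces to asserting on two fields of this result per frame.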

@JustAnotherArchivist

JustAnotherArchivist commented Oct 11, 2020

Lovely, thank you!

  • Can a file have multiple dictionaries? – Loading the dictionary completely resets the state of the zstd decompressor, right? If so, I don't see a reason to prohibit multiple dictionaries in one file. Just like a WARC can have multiple warcinfo records, each of which is used until the next one appears and replaces it (in absence of WARC-Warcinfo-ID), the same could hold for dictionaries: a dictionary would be used until the next one in a skippable frame replaces it, or a frame can explicitly refer to a dictionary by its ID. One potential use case of the latter would be having different dictionaries for text or text-like (e.g. HTML/CSS/JS), image, and video records, each tailored to that type of data; however, I haven't tested whether that actually makes sense in the real world, so this is admittedly somewhat academic.
  • Should external dictionaries be allowed? – I agree, the dictionary should be mandatory. (We actually don't include it internally because it does add up when millions of separate WARCs share the same dictionary and are combined into bigger files anyway, but that should be considered an implementation detail in our system.)
  • What window sizes should be allowed? – I have no strong opinion on this. In general, without a (very) good reason, I think such things should not be unnecessarily restrictive. You know, '640K ought to be enough for anyone'...
  • Should the optional Zstandard features be required? – I'd like to require the (XXH64) checksum, content size, and dict ID fields. For reference, gzip always includes a CRC32 and the uncompressed size (mod 2^32). The uncompressed size is actually very useful to have if you need to decompress a file onto disk as you're able to check ahead of time whether you have sufficient disk space to write out the decompressed record.

@yotann2

yotann2 commented Oct 12, 2020

Can a file have multiple dictionaries?

@JustAnotherArchivist Basically, yes, the only state the zstd decompressor maintains between frames is the current dictionary. Using libzstd you can load as many dictionaries as you want (into ZSTD_DDict structures) and it's up to you to switch to the correct dictionary for each frame. But a lot of tools (like the zstd program) will assume you only have one dictionary per file.

If we allow multiple dictionaries, the question is how to enable random access. One option is to put all the dictionaries at the start of the file and use dictionary IDs to distinguish between them, but that means you have to load an arbitrary number of dictionaries to access one record.

Another option is to leave the dictionaries wherever and add a field to the CDX file that tells you where to find the dictionary for each record. (While we're at it, it would be nice if people added non-response records to their CDX files...)

You'd need to do something special when combining a dictionaryless .warc.zst with one that has a dictionary; there's no way for a Zstandard frame to explicitly disable the dictionary, and if you use a dictionary with certain settings (non-standard recent offsets) to decompress a frame that isn't supposed to use a dictionary, you'll get incorrect output. Options are:

  • Define that dictionary ID 0 in a frame header means "no dictionary". (Zstandard defines it to mean "no dictionary or unknown dictionary".) You can easily handle this in C by calling ZSTD_getDictID_fromFrame, but I'm not sure about other languages.
  • Add a special dictionary that resets the state just as if there were no dictionary at all.

@yotann2

yotann2 commented Oct 12, 2020

What window sizes should be allowed?

I agree about not being restrictive. We could say "decoders must support window sizes up to the standard 8 MiB, but support for larger window sizes is optional." Note that huge windows are only useful for huge records.

Should the content size field be required?

The problem here is that if the record is too large to fit in RAM, the WARC writer will want to start saving it to disk before it knows the total size. There's no easy way with libzstd to go back and patch the content size later on. This could be solved by splitting the record into smaller parts, using either continuation records or multiple Zstandard frames.

Can one record have multiple Zstandard frames?

I say yes, to make extremely large records more manageable. Multiple frames would also enable parallelism for compression and decompression, even for a single record. (They would also help if someone wanted to experiment with seeking within a record.)

@JustAnotherArchivist

  • Can a file have multiple dictionaries? – Very good points, and yeah, I think I agree now that only permitting (at most) one dictionary per file is probably best.
  • What window sizes should be allowed? – Yes, that sounds good.
  • Should the content size field be required? – True, but unfortunately, the WARC writer already has to know the total size/buffer the data for the Content-Length WARC header and digests anyway. Support for continuation/segmented records is basically zero in existing tooling to my knowledge.
  • Can one record have multiple Zstandard frames? – I can't think of any significant downsides to allowing this. On the decompression side, it'd be as simple as a while readSize < contentLength: read_frame() loop. But this reminds me of a related thing...
  • Can one frame contain multiple records? – I'd say no, i.e. every record must begin with a new frame. The GZIP annex only recommends but doesn't require this. In practice though, a number of tools don't support WARCs compressed with multiple records in one GZIP member (or worse, an entire WARC as one member), and rightfully so since it destroys random access.

@yotann2

yotann2 commented Oct 13, 2020

Seems like the two of us, at least, are converging. Good point about the Content-Length header.

I wrote a draft specification at #69. Let's move discussion there.

@phoerious

Another good alternative to GZip would be LZ4. In fact, we have already implemented it in our WARC reader library FastWARC: https://resiliparse.chatnoir.eu/en/stable/man/fastwarc.html#benchmarks

The downside of LZ4 is slightly worse compression ratios (files about 30% larger than GZip), but it is at least 5x as fast to decompress (quite a bit faster than zstd, even), and the frame format spec is stable and works just like GZip's, so multiple LZ4 WARCs can be concatenated.
