GISAID: Decompress after download #245
Conversation
The GenBank download was being compared against a non-existent file, s3://nextstrain-data/files/ncov/open/gisaid.ndjson.xz, which caused it to always skip notifications.
As the message, params, and run blocks are all distinct between fetch_from_database True vs. False, it seems more readable to me to define parallel rule definitions within a top-level conditional.
@tsibley Is there a possibility to incorporate parallel unbzipping also?
@ivan-aksamentov Yes, certainly, if we do decompression after download. If we do streaming decompression concurrent with download using an on-disk spool, then maybe, but not with the |
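For instance (a sketch of my own with assumed tool names and a placeholder URL, not code from this PR), decompress-after-download with a parallel bzip2 decompressor might look roughly like:

```bash
# Hypothetical sketch: download to disk first, then decompress in parallel.
# Assumes lbzip2 (or pbzip2) is installed; GISAID_URL is a placeholder.
curl -fsSL "$GISAID_URL" --output gisaid.ndjson.bz2
lbzip2 -d -n "$(nproc)" gisaid.ndjson.bz2   # replaces the .bz2 with gisaid.ndjson
```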
So I think the practical questions for a decision are: how much longer than downloading straight to disk does single-threaded decompression from disk take? If longer, how does it compare to parallel decompression from disk? My guess is that concurrent decompression might win out, even though single-threaded, but I'm not sure. Should be easy enough to benchmark.
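A rough way to benchmark that (my sketch; assumes lbzip2 is installed and a downloaded gisaid.ndjson.bz2 is already on disk):

```bash
# Compare single-threaded vs. parallel decompression from disk.
# -k keeps the input, -c writes to stdout so the source file is left alone.
time bunzip2 -k -c gisaid.ndjson.bz2 > gisaid.single.ndjson
time lbzip2 -d -k -c gisaid.ndjson.bz2 > gisaid.parallel.ndjson
```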
At this point we don't have ingest results, so builds are pretty much cancelled. That probably makes the fastest-to-implement and most reliable solution (the one that won't fail on the first run) the best option in my view. We could also try running both approaches on separate branches and see which completes first. Then, if at least one of them succeeds, we could hopefully kick-start the preprocessing and builds manually if scientists find that they need them. The good news is that it's the first stage in the pipeline, so it fails fast.
I also wonder if there's a third-party tool, or if you could adjust your tool, so that it does the same spool trick but allows piping the output to parallel bzip (or potentially to xz when it's available later), i.e. to decouple the trick itself from the compression algorithm. Then we could take advantage of both the concurrent download and the parallel decompression.
Well, none of these are definitely more reliable. The issue appears to be wholly on the network/upstream server, not within our decompression. The guess this is all predicated on is that reducing network transfer time will help avoid the issue.
You could do this with normal |
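For illustration only (not necessarily the tools meant above, and with assumed names and a placeholder URL), one decoupled version of the spool trick with stock GNU tools could look like:

```bash
# Sketch only: download into an on-disk spool while a separate, swappable
# decompressor tails it. Assumes GNU tail and lbzip2.
curl -fsSL "$GISAID_URL" --output spool.bz2 &
curl_pid=$!

# --pid makes tail exit once curl finishes; reliably detecting "the writer is
# done" is the fiddly part, so verify the output is complete afterwards.
tail -c +1 --pid="$curl_pid" -f spool.bz2 | lbzip2 -d > gisaid.ndjson

wait "$curl_pid"   # surface any download failure
```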
I'm writing this and will open another PR for it.
That was a little more involved than I expected, but try #256.
This PR drafts a change we discussed in Slack and summarized in #242. Drafted now so it's ready in case we end up needing it sooner rather than later. If we never need it, it's ok to trash this PR.
Decompress gisaid.ndjson.bz2 after download instead of during
Streaming decompression during the network fetch leads to decreased network transfer rates¹ and thus a longer network connection lifetime. As we sometimes see transient network disruptions that cause the transfer to fail, my hypothesis is that reducing connection lifetime by decompressing after download will meaningfully reduce our exposure to transient disruptions.
This may come at the cost of potentially increasing overall runtime by some unknown amount, but we could try to address that in other ways, such as switching to xz (with GISAID's cooperation) for faster decompression and/or performing decompression concurrently with downloading but using a disk spool in between (assuming disk I/O isn't a bottleneck).
[Update 10 Dec 2021] The concurrent-decompression-with-disk-spool approach could be implemented using a program like this, dubbed bunzip2-tailf, to do the decompression. As it'd be hard to reliably coordinate the concurrent processes (curl and bunzip2-tailf) with separate Snakemake rules/jobs, we'd want to use shell job handling instead. For example, updating fetch-from-gisaid to read something like the sketch below.
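A minimal sketch of what that might look like, assuming a hypothetical bunzip2-tailf on PATH that keeps reading the growing spool and exits once it has decoded a complete bzip2 stream; the file names, curl flags, and GISAID_URL variable are illustrative, not the actual script:

```bash
#!/bin/bash
set -euo pipefail

# Create the spool up front so the decompressor has something to open.
: > gisaid.ndjson.bz2

# Decompress the spool concurrently, in the background (hypothetical helper).
bunzip2-tailf gisaid.ndjson.bz2 > gisaid.ndjson &
decompress_pid=$!

# Download into the on-disk spool; curl writes ahead while bunzip2-tailf reads behind.
curl -fsSL "$GISAID_URL" --output gisaid.ndjson.bz2

# Shell job handling: fail the rule if the decompressor failed.
wait "$decompress_pid"
```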
Footnotes
¹ About 75% slower in a single test case I ran on my laptop over the Hutch's wired network.