
GISAID: Decompress after download #245

Closed
tsibley wants to merge 3 commits from gisaid/decompress-after-download

Conversation

@tsibley (Member) commented Dec 4, 2021

This PR drafts a change we discussed in Slack and summarized in #242. Drafted now so it's ready in case we end up needing it sooner rather than later. If we never need it, it's ok to trash this PR.


Decompress gisaid.ndjson.bz2 after download instead of during

Streaming decompression during the network fetch leads to decreased network transfer rates¹ and thus a longer network connection lifetime. As we sometimes see transient network disruptions that cause the transfer to fail, my hypothesis is that reducing connection lifetime by decompressing after download will meaningfully reduce our exposure to transient disruptions.

This may come at the cost of potentially increasing overall runtime by some unknown amount, but we could try to address that other ways such as switching to xz (with GISAID's cooperation) for faster decompression and/or performing decompression concurrently with downloading but using a disk spool in between (assuming disk io isn't a bottleneck).
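
For concreteness, the decompress-after-download variant amounts to something like the sketch below (illustrative only; the curl invocation is elided as in the later example, and the filenames follow this PR):

# Download the compressed snapshot straight to disk, so the network transfer
# isn't throttled by decompression speed.
curl … > gisaid.ndjson.bz2

# Decompress only once the download has finished and the connection is closed.
bunzip2 --keep gisaid.ndjson.bz2   # produces gisaid.ndjson alongside the .bz2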


[Update 10 Dec 2021] The concurrent-decompression-with-disk-spool approach could be implemented using a program like this one, dubbed bunzip2-tailf, to do the decompression. As it'd be hard to reliably coordinate the concurrent processes (curl and bunzip2-tailf) with separate Snakemake rules/jobs, we'd want to use shell job handling instead. For example, updating fetch-from-gisaid to read something like:

# Download compressed file direct to disk to avoid slowing down the network transfer waiting on decompression.
curl … > gisaid.ndjson.bz2 &

# Decompress from disk to stdout on the fly; will wait for more data to appear in the file if it hits EOF (up to a time limit).
bunzip2-tailf < gisaid.ndjson.bz2 &

# Wait for both background jobs above to complete.
wait
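
One caveat with shell job handling like the above: in bash, a bare wait returns zero even if a background job failed. A more defensive variant (a sketch only; redirecting the decompressor's stdout to a file is an assumption about how the rule would consume it) captures each PID and waits on it individually:

curl … > gisaid.ndjson.bz2 &
curl_pid=$!

bunzip2-tailf < gisaid.ndjson.bz2 > gisaid.ndjson &
decompress_pid=$!

# Waiting on explicit PIDs makes each job's exit status visible to the rule.
wait "$curl_pid"
wait "$decompress_pid"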

Footnotes

  1. About 75% slower in a single test case I ran on my laptop over the Hutch's wired network.

The GenBank download was being compared against a non-existent file:

    s3://nextstrain-data/files/ncov/open/gisaid.ndjson.xz

which caused it to always skip notifications.

As the message, params, and run blocks are all distinct between
fetch_from_database True vs. False, it seems more readable to me to
define parallel rule definitions within a top-level conditional.

@ivan-aksamentov (Member) commented Dec 16, 2021

@tsibley Is there any possibility of also incorporating parallel bunzip2?
feat: use parallel version of bzip2 to decompress gisaid snapshot #247
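
For reference, with a drop-in parallel decompressor such as lbzip2 (an assumption here; #247 may use a different tool), decompressing an already-downloaded snapshot would look roughly like:

# lbzip2 decompresses ordinary single-stream .bz2 files using multiple threads.
# -d decompress, -k keep the .bz2, -c write to stdout.
lbzip2 -d -k -c gisaid.ndjson.bz2 > gisaid.ndjson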

@tsibley (Member, Author) commented Dec 16, 2021

@ivan-aksamentov Yes, certainly, if we do decompression after download. If we do streaming decompression concurrent with download using an on-disk spool, then maybe, but not with the bunzip2-tailf command I wrote.

@tsibley (Member, Author) commented Dec 16, 2021

So I think the practical questions for a decision are: how much longer than downloading straight to disk does single-threaded decompression from disk take? And if it's longer, how does it compare to parallel decompression from disk?

My guess is that concurrent decompression might win out even though it's single-threaded, but I'm not sure. Should be easy enough to benchmark.
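
A rough way to benchmark that on an already-downloaded snapshot (illustrative; lbzip2 stands in for whichever parallel bzip2 implementation #247 settles on):

# Single-threaded decompression from disk.
time bunzip2 --stdout gisaid.ndjson.bz2 > /dev/null

# Parallel decompression from disk.
time lbzip2 -d -c gisaid.ndjson.bz2 > /dev/null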

@ivan-aksamentov (Member) commented Dec 16, 2021

At this point we don't have ingest results, so builds are pretty much cancelled. That probably makes the fastest-to-implement and most reliable solution (the one that won't fail on the first run) the best, in my view.

We could also try running both on separate branches and see which completes first. Then, if at least one of them succeeds, we could hopefully kick-start the preprocessing and builds manually if scientists find they need them.

The good news is that it's the first stage in the pipeline, so it fails fast.

@ivan-aksamentov (Member) commented Dec 16, 2021

I also wonder if there's a third-party tool, or if you could adjust your tool, so that it does the same spool trick but allows piping the output to a parallel bzip2 (or potentially to xz when it's available later). I.e., decouple the trick itself from the compression algorithm. Then we could take advantage of both the concurrent download and parallel decompression.

@tsibley (Member, Author) commented Dec 16, 2021

> At this point we don't have ingest results, so builds are pretty much cancelled. That probably makes the fastest-to-implement and most reliable solution (the one that won't fail on the first run) the best, in my view.

Well, none of these are definitely more reliable. The issue appears to be wholly on the network/upstream server, not within our decompression. The guess this is all predicated on is that reducing network transfer time will help avoid the issue.

> I.e., decouple the trick itself from the compression algorithm. Then we could take advantage of both the concurrent download and parallel decompression.

You could do this with normal tail, I think. I combined the spool + decompression into a single process to avoid the downside of more processes having to read/write the whole data stream, but maybe that doesn't matter.
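
A sketch of the normal-tail idea (illustrative only, using GNU coreutils tail; not necessarily how the follow-up PR does it):

curl … > gisaid.ndjson.bz2 &
curl_pid=$!

# Stream the spool file from its first byte, keep following while curl appends,
# and exit once the curl process is gone; pipe into any decompressor, parallel
# or not.
tail --bytes=+1 --follow --pid="$curl_pid" gisaid.ndjson.bz2 | bunzip2 > gisaid.ndjson &

# Wait for both background jobs (exit-status handling omitted for brevity).
wait

The awkward part is coordinating the shutdown: tail never sees a definitive end of file on its own, so it has to be told when the download is finished (here via --pid), and that coordination is the part that needs care.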

@tsibley (Member, Author) commented Dec 16, 2021

> You could do this with normal tail, I think.

I'm writing this and will open another PR for it.

@tsibley (Member, Author) commented Dec 17, 2021

That was a little more involved than I expected, but try #256.

@tsibley closed this Dec 17, 2021
@victorlin deleted the gisaid/decompress-after-download branch on February 20, 2024.