
GISAID: Decompress after download #245

Closed
tsibley wants to merge 3 commits from gisaid/decompress-after-download

Conversation

@tsibley (Member) commented Dec 4, 2021

This PR drafts a change we discussed in Slack and summarized in #242. Drafted now so it's ready in case we end up needing it sooner rather than later. If we never need it, it's ok to trash this PR.


Decompress gisaid.ndjson.bz2 after download instead of during

Streaming decompression during the network fetch leads to decreased network transfer rates¹ and thus a longer network connection lifetime. As we sometimes see transient network disruptions that cause the transfer to fail, my hypothesis is that reducing connection lifetime by decompressing after download will meaningfully reduce our exposure to transient disruptions.

This may come at the cost of potentially increasing overall runtime by some unknown amount, but we could try to address that other ways such as switching to xz (with GISAID's cooperation) for faster decompression and/or performing decompression concurrently with downloading but using a disk spool in between (assuming disk io isn't a bottleneck).
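
For concreteness, the decompress-after-download variant amounts to something like the sketch below (illustrative only; the curl invocation is elided as in the later example, and the filenames follow this PR):

# Download the compressed snapshot straight to disk, so the network transfer
# isn't throttled by decompression speed.
curl … > gisaid.ndjson.bz2

# Decompress only once the download has finished and the connection is closed.
bunzip2 --keep gisaid.ndjson.bz2   # produces gisaid.ndjson alongside the .bz2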


[Update 10 Dec 2021] The concurrent-decompression-with-disk-spool approach could be implemented using a program like this one, dubbed bunzip2-tailf, to do the decompression. As it'd be hard to reliably coordinate the concurrent processes (curl and bunzip2-tailf) with separate Snakemake rules/jobs, we'd want to use shell job handling instead. For example, updating fetch-from-gisaid to read something like:

# Download compressed file direct to disk to avoid slowing down the network transfer waiting on decompression.
curl … > gisaid.ndjson.bz2 &

# Decompress from disk to stdout on the fly; will wait for more data to appear in the file if it hits EOF (up to a time limit).
bunzip2-tailf < gisaid.ndjson.bz2 &

# Wait for both background jobs above to complete.
wait
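
One caveat with shell job handling like the above: in bash, a bare wait returns zero even if a background job failed. A more defensive variant (a sketch only; redirecting the decompressor's stdout to a file is an assumption about how the rule would consume it) captures each PID and waits on it individually:

curl … > gisaid.ndjson.bz2 &
curl_pid=$!

bunzip2-tailf < gisaid.ndjson.bz2 > gisaid.ndjson &
decompress_pid=$!

# Waiting on explicit PIDs makes each job's exit status visible to the rule.
wait "$curl_pid"
wait "$decompress_pid"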

Footnotes

  1. About 75% slower in a single test case I ran on my laptop over the Hutch's wired network.

The GenBank download was being compared against a non-existent file:

    s3://nextstrain-data/files/ncov/open/gisaid.ndjson.xz

which caused it to always skip notifications.

As the message, params, and run blocks are all distinct between
fetch_from_database True vs. False, it seems more readable to me to
define parallel rule definitions within a top-level conditional.

@ivan-aksamentov (Member) commented Dec 16, 2021

@tsibley Is there any possibility of also incorporating parallel bunzip2?
feat: use parallel version of bzip2 to decompress gisaid snapshot #247
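
For reference, with a drop-in parallel decompressor such as lbzip2 (an assumption here; #247 may use a different tool), decompressing an already-downloaded snapshot would look roughly like:

# lbzip2 decompresses ordinary single-stream .bz2 files using multiple threads.
# -d decompress, -k keep the .bz2, -c write to stdout.
lbzip2 -d -k -c gisaid.ndjson.bz2 > gisaid.ndjson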

@tsibley (Member, Author) commented Dec 16, 2021

@ivan-aksamentov Yes, certainly, if we do decompression after download. If we do streaming decompression concurrent with download using an on-disk spool, then maybe, but not with the bunzip2-tailf command I wrote.

@tsibley (Member, Author) commented Dec 16, 2021

So I think the practical questions for a decision are: how much longer than downloading straight to disk does single-threaded decompression from disk take? And if it's longer, how does it compare to parallel decompression from disk?

My guess is that concurrent decompression might win out even though it's single-threaded, but I'm not sure. Should be easy enough to benchmark.
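
A rough way to benchmark that on an already-downloaded snapshot (illustrative; lbzip2 stands in for whichever parallel bzip2 implementation #247 settles on):

# Single-threaded decompression from disk.
time bunzip2 --stdout gisaid.ndjson.bz2 > /dev/null

# Parallel decompression from disk.
time lbzip2 -d -c gisaid.ndjson.bz2 > /dev/null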

@ivan-aksamentov (Member) commented Dec 16, 2021

At this point we don't have ingest results, so builds are pretty much cancelled. That probably makes the fastest-to-implement and most reliable solution (the one that won't fail on the first run) the best, in my view.

We could also try running both on separate branches and see which completes first. Then, if at least one of them succeeds, we could hopefully kick-start the preprocessing and builds manually if scientists find they need them.

The good news is that it's the first stage in the pipeline, so it fails fast.

@ivan-aksamentov (Member) commented Dec 16, 2021

I also wonder if there's a third-party tool, or if you could adjust your tool, so that it does the same spool trick but allows piping the output to a parallel bzip2 (or potentially to xz when it's available later). I.e., decouple the trick itself from the compression algorithm. Then we could take advantage of both the concurrent download and parallel decompression.

@tsibley (Member, Author) commented Dec 16, 2021

> At this point we don't have ingest results, so builds are pretty much cancelled. That probably makes the fastest-to-implement and most reliable solution (the one that won't fail on the first run) the best, in my view.

Well, none of these are definitely more reliable. The issue appears to be wholly on the network/upstream server, not within our decompression. The guess this is all predicated on is that reducing network transfer time will help avoid the issue.

> I.e., decouple the trick itself from the compression algorithm. Then we could take advantage of both the concurrent download and parallel decompression.

You could do this with normal tail, I think. I combined the spool + decompression into a single process to avoid the downside of more processes having to read/write the whole data stream, but maybe that doesn't matter.
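
A sketch of the normal-tail idea (illustrative only, using GNU coreutils tail; not necessarily how the follow-up PR does it):

curl … > gisaid.ndjson.bz2 &
curl_pid=$!

# Stream the spool file from its first byte, keep following while curl appends,
# and exit once the curl process is gone; pipe into any decompressor, parallel
# or not.
tail --bytes=+1 --follow --pid="$curl_pid" gisaid.ndjson.bz2 | bunzip2 > gisaid.ndjson &

# Wait for both background jobs (exit-status handling omitted for brevity).
wait

The awkward part is coordinating the shutdown: tail never sees a definitive end of file on its own, so it has to be told when the download is finished (here via --pid), and that coordination is the part that needs care.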

@tsibley (Member, Author) commented Dec 16, 2021

> You could do this with normal tail, I think.

I'm writing this and will open another PR for it.

@tsibley (Member, Author) commented Dec 17, 2021

That was a little more involved than I expected, but try #256.

@tsibley closed this Dec 17, 2021
@victorlin deleted the gisaid/decompress-after-download branch on February 20, 2024.