Commit 5b19e00

Merge branch 'master' into clean-metadata
2 parents 668041a + cfc5009 commit 5b19e00

9 files changed: +8645 -1577 lines changed

.github/workflows/ingest-gisaid-branch.yml renamed to .github/workflows/fetch-and-ingest-gisaid-branch.yml

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-name: '[branch] Ingest 2019-nCov/SARS-CoV-2 data from GISAID for nextstrain.org/ncov'
+name: '[branch] Fetch & Ingest 2019-nCov/SARS-CoV-2 data from GISAID for nextstrain.org/ncov'

 on:
   push:
@@ -18,7 +18,7 @@ jobs:
         python3 -m pip install --upgrade pip setuptools
         python3 -m pip install pipenv
         pipenv sync
-        pipenv run ./bin/ingest-gisaid
+        pipenv run ./bin/ingest-gisaid --fetch
       env:
         AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
         AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+name: 'Fetch & Ingest 2019-nCov/SARS-CoV-2 data from GISAID for nextstrain.org/ncov'
+
+on:
+  # Manually triggered using `./bin/trigger fetch-and-ingest`
+  repository_dispatch:
+    types: fetch-and-ingest
+
+jobs:
+  ingest:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v1
+    - name: ingest
+      run: |
+        PATH="$HOME/.local/bin:$PATH"
+        python3 -m pip install --upgrade pip setuptools
+        python3 -m pip install pipenv
+        pipenv sync
+        pipenv run ./bin/ingest-gisaid --fetch
+      env:
+        AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
+        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+        GISAID_API_ENDPOINT: ${{ secrets.GISAID_API_ENDPOINT }}
+        GISAID_USERNAME_AND_PASSWORD: ${{ secrets.GISAID_USERNAME_AND_PASSWORD }}
+        SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
+        SLACK_CHANNELS: ncov-gisaid-updates

.github/workflows/ingest-gisaid-master.yml

Lines changed: 0 additions & 2 deletions
@@ -29,7 +29,5 @@ jobs:
         AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
         AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
         AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
-        GISAID_API_ENDPOINT: ${{ secrets.GISAID_API_ENDPOINT }}
-        GISAID_USERNAME_AND_PASSWORD: ${{ secrets.GISAID_USERNAME_AND_PASSWORD }}
         SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
         SLACK_CHANNELS: ncov-gisaid-updates

README.md

Lines changed: 12 additions & 4 deletions
@@ -11,15 +11,23 @@ If you're using Pipenv (see below), then run commands from `./bin/…` inside a

 ## Running automatically
 The ingest pipeline exists as the GitHub workflows `.github/workflows/ingest-master-*.yml` and `…/ingest-branch-*.yml`.
-It is run on pushes to `master` that modify `source-data/annotations.tsv` and on pushes to other branches.
+It is run on pushes to `master` that modify `source-data/*-annotations.tsv` and on pushes to other branches.
 Pushes to branches other than `master` upload files to branch-specific paths in the S3 bucket, don't send notifications, and don't trigger Nextstrain rebuilds, so that they don't interfere with the production data.

 AWS credentials are stored in this repository's secrets and are associated with the `nextstrain-ncov-ingest-uploader` IAM user in the Bedford Lab AWS account, which is locked down to reading and publishing only the `gisaid.ndjson`, `metadata.tsv`, and `sequences.fasta` files and their zipped equivalents in the `nextstrain-ncov-private` S3 bucket.

 ## Manually triggering the automation
-You can manually trigger the full automation by running `./bin/trigger ingest --user <your-github-username>`.
-If you want to only trigger a rebuild of [nextstrain/ncov](https://github.com/nextstrain/ncov) without re-ingesting data from GISAID first, run `./bin/trigger rebuild --user <your-github-username>`.
-See the output of `./bin/trigger ingest` or `./bin/trigger rebuild` for more information about authentication with GitHub.
+A full run is now done in 3 steps via manual triggers:
+1. Fetch new sequences and ingest them by running `./bin/trigger fetch-and-ingest --user <your-github-username>`.
+2. Add manual annotations, update location hierarchy as needed, and run ingest without fetching new sequences.
+    * Pushes of `source-data/*-annotations.tsv` to the master branch will automatically trigger a run of ingest.
+    * You can also run ingest manually by running `./bin/trigger ingest --user <your-github-username>`.
+3. Once all manual fixes are complete, trigger a rebuild of [nextstrain/ncov](https://github.com/nextstrain/ncov) by running `./bin/trigger rebuild --user <your-github-username>`.
+
+See the output of `./bin/trigger fetch-and-ingest --user <your-github-username>`, `./bin/trigger ingest` or `./bin/trigger rebuild` for more information about authentication with GitHub.
+
+Note: running `./bin/trigger` posts a GitHub `repository_dispatch`.
+Regardless of which branch you are on, it will trigger the specified action on the master branch.

 ## Updating manual annotations
 Manual annotations should be added to `source-data/gisaid_annotations.tsv`.
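
The README note in this hunk says that `./bin/trigger` posts a GitHub `repository_dispatch`. As a rough sketch of what such a request might look like, assuming GitHub's REST endpoint `POST /repos/{owner}/{repo}/dispatches` with an `event_type` field in the body: the helper names, the `OWNER/REPO` placeholder, and the `GITHUB_TOKEN` variable are illustrative assumptions and are not taken from `bin/trigger`, whose contents are not shown in this commit.

```shell
# Hypothetical helpers; not taken from bin/trigger.

# Build the JSON body for a repository_dispatch event.
dispatch_payload() {
    printf '{"event_type":"%s"}' "$1"
}

# Build the REST API URL for a repository given as OWNER/REPO.
dispatch_url() {
    printf 'https://api.github.com/repos/%s/dispatches' "$1"
}

# Dry run: print the request instead of sending it.
echo "POST $(dispatch_url OWNER/REPO)"
echo "$(dispatch_payload fetch-and-ingest)"

# To actually send it (assumes a token in GITHUB_TOKEN with access to the repo):
#   curl -fsS -X POST \
#     -H "Accept: application/vnd.github+json" \
#     -H "Authorization: Bearer $GITHUB_TOKEN" \
#     --data "$(dispatch_payload fetch-and-ingest)" \
#     "$(dispatch_url OWNER/REPO)"
```

The `event_type` value is what a workflow's `repository_dispatch: types:` filter matches against, which is how `fetch-and-ingest`, `ingest`, and `rebuild` can select different workflows.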

bin/ingest-gisaid

Lines changed: 102 additions & 54 deletions
@@ -1,69 +1,117 @@
 #!/bin/bash
+# usage: ingest-gisaid [--fetch]
+#        ingest-gisaid --help
+#
+# Ingest SARS-CoV-2 metadata and sequences from GISAID.
+#
+# If the --fetch flag is given, new records are fetched from GISAID. Otherwise,
+# ingest from the existing GISAID NDJSON file on S3.
+#
 set -euo pipefail

 : "${S3_SRC:=s3://nextstrain-ncov-private}"
 : "${S3_DST:=$S3_SRC}"

-# Determine where to save data files based on if we're running as a result of a
-# push to master or to another branch (or locally, outside of the GitHub
-# workflow). Files are always compared to the default/primary paths in the
-# source S3 bucket.
-#
-silent=
-branch=
+main() {
+    local fetch=0
+
+    for arg; do
+        case "$arg" in
+            -h|--help)
+                print-help
+                exit
+                ;;
+            --fetch)
+                fetch=1
+                shift
+                break
+                ;;
+        esac
+    done
+
+    # Determine where to save data files based on if we're running as a result of a
+    # push to master or to another branch (or locally, outside of the GitHub
+    # workflow). Files are always compared to the default/primary paths in the
+    # source S3 bucket.
+    #
+    local silent=
+    local branch=
+
+    case "${GITHUB_REF:-}" in
+        refs/heads/master)
+            # Do nothing different; defaults above are good.
+            branch=master
+            ;;
+        refs/heads/*)
+            # Save data files under a per-branch prefix
+            silent=yes
+            branch="${GITHUB_REF##refs/heads/}"
+            S3_DST="$S3_DST/branch/$branch"
+            ;;
+        "")
+            # Save data files under a tmp prefix
+            silent=yes
+            S3_DST="$S3_DST/tmp"
+            ;;
+        *)
+            echo "Skipping ingest for ref $GITHUB_REF"
+            exit 0
+            ;;
+    esac
+
+    echo "S3_SRC is $S3_SRC"
+    echo "S3_DST is $S3_DST"

-case "${GITHUB_REF:-}" in
-    refs/heads/master)
-        # Do nothing different; defaults above are good.
-        branch=master
-        ;;
-    refs/heads/*)
-        # Save data files under a per-branch prefix
-        silent=yes
-        branch="${GITHUB_REF##refs/heads/}"
-        S3_DST="$S3_DST/branch/$branch"
-        ;;
-    "")
-        # Save data files under a tmp prefix
-        silent=yes
-        S3_DST="$S3_DST/tmp"
-        ;;
-    *)
-        echo "Skipping ingest for ref $GITHUB_REF"
-        exit 0
-        ;;
-esac
+    cd "$(dirname "$0")/.."

-echo "S3_SRC is $S3_SRC"
-echo "S3_DST is $S3_DST"
+    set -x

-cd "$(dirname "$0")/.."
+    if [[ "$fetch" == 1 ]]; then
+        ./bin/fetch-from-gisaid > data/gisaid.ndjson
+        if [[ "$branch" == master ]]; then
+            ./bin/notify-on-record-change data/gisaid.ndjson "$S3_SRC/gisaid.ndjson.gz" "GISAID"
+        fi
+        ./bin/upload-to-s3 --quiet data/gisaid.ndjson "$S3_DST/gisaid.ndjson.gz"
+    else
+        aws s3 cp --no-progress "$S3_DST/gisaid.ndjson.gz" - | gunzip -cfq > data/gisaid.ndjson
+    fi

-set -x
+    ./bin/transform-gisaid data/gisaid.ndjson \
+        --output-metadata data/gisaid/metadata.tsv \
+        --output-fasta data/gisaid/sequences.fasta

-./bin/fetch-from-gisaid > data/gisaid.ndjson
-if [[ "$branch" == master ]]; then
-    ./bin/notify-on-record-change data/gisaid.ndjson "$S3_SRC/gisaid.ndjson.gz" "GISAID"
-fi
-./bin/upload-to-s3 --quiet data/gisaid.ndjson "$S3_DST/gisaid.ndjson.gz"
+    ./bin/flag-metadata data/gisaid/metadata.tsv > data/gisaid/flagged_metadata.txt
+    ./bin/check-locations data/gisaid/metadata.tsv \
+        data/gisaid/location_hierarchy.tsv \
+        gisaid_epi_isl

-./bin/transform-gisaid data/gisaid.ndjson \
-    --output-metadata data/gisaid/metadata.tsv \
-    --output-fasta data/gisaid/sequences.fasta
+    if [[ "$branch" == master ]]; then
+        ./bin/notify-on-metadata-change data/gisaid/metadata.tsv "$S3_SRC/metadata.tsv.gz" gisaid_epi_isl
+        ./bin/notify-on-additional-info-change data/gisaid/additional_info.tsv "$S3_SRC/additional_info.tsv.gz"
+        ./bin/notify-on-flagged-metadata-change data/gisaid/flagged_metadata.txt "$S3_SRC/flagged_metadata.txt.gz"
+        ./bin/notify-on-location-hierarchy-addition data/gisaid/location_hierarchy.tsv source-data/location_hierarchy.tsv
+    fi

-./bin/flag-metadata data/gisaid/metadata.tsv > data/gisaid/flagged_metadata.txt
-./bin/check-locations data/gisaid/metadata.tsv \
-    data/gisaid/location_hierarchy.tsv \
-    gisaid_epi_isl
+    ./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/metadata.tsv "$S3_DST/metadata.tsv.gz"
+    ./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/additional_info.tsv "$S3_DST/additional_info.tsv.gz"
+    ./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/flagged_metadata.txt "$S3_DST/flagged_metadata.txt.gz"
+    ./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/sequences.fasta "$S3_DST/sequences.fasta.gz"
+}

-if [[ "$branch" == master ]]; then
-    ./bin/notify-on-metadata-change data/gisaid/metadata.tsv "$S3_SRC/metadata.tsv.gz" gisaid_epi_isl
-    ./bin/notify-on-additional-info-change data/gisaid/additional_info.tsv "$S3_SRC/additional_info.tsv.gz"
-    ./bin/notify-on-flagged-metadata-change data/gisaid/flagged_metadata.txt "$S3_SRC/flagged_metadata.txt.gz"
-    ./bin/notify-on-location-hierarchy-addition data/gisaid/location_hierarchy.tsv source-data/location_hierarchy.tsv
-fi
+print-help() {
+    # Print the help comments at the top of this file ($0)
+    local line
+    while read -r line; do
+        if [[ $line =~ ^#! ]]; then
+            continue
+        elif [[ $line =~ ^# ]]; then
+            line="${line/##/}"
+            line="${line/# /}"
+            echo "$line"
+        else
+            break
+        fi
+    done < "$0"
+}

-./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/metadata.tsv "$S3_DST/metadata.tsv.gz"
-./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/additional_info.tsv "$S3_DST/additional_info.tsv.gz"
-./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/flagged_metadata.txt "$S3_DST/flagged_metadata.txt.gz"
-./bin/upload-to-s3 ${silent:+--quiet} data/gisaid/sequences.fasta "$S3_DST/sequences.fasta.gz"
+main "$@"
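
The `case "${GITHUB_REF:-}"` block in `main()` decides where files are uploaded. To make that mapping easier to reason about in isolation, here is a minimal standalone sketch of the same rule. The function name `s3_dst_for_ref` and the `s3://example-bucket` value are hypothetical and are not part of this commit; the script itself mutates `S3_DST` in place rather than calling a helper.

```shell
# Hypothetical standalone version of the ref-to-destination mapping in main().
# Prints the destination prefix; returns 1 for refs the script would skip.
s3_dst_for_ref() {
    local ref="$1" base="$2"
    case "$ref" in
        refs/heads/master) echo "$base" ;;                               # production paths
        refs/heads/*)      echo "$base/branch/${ref##refs/heads/}" ;;    # per-branch prefix
        "")                echo "$base/tmp" ;;                           # local run, no GITHUB_REF
        *)                 return 1 ;;                                   # e.g. tags: skipped
    esac
}

s3_dst_for_ref refs/heads/clean-metadata s3://example-bucket
```

Note that without `--fetch` the script reads `gisaid.ndjson.gz` from this same destination prefix, so a branch or local run presumably needs an earlier `--fetch` run to have populated it.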

bin/trigger

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 set -euo pipefail

 bin="$(dirname "$0")"
-event_type="${1:?An event type ("ingest" or "rebuild") is required as the first argument.}"
+event_type="${1:?An event type ("fetch-and-ingest", "ingest" or "rebuild") is required as the first argument.}"
 shift

 if [[ $# -eq 0 ]]; then
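
The changed line relies on the `${1:?message}` parameter expansion, which aborts a non-interactive shell with the given message when the first argument is unset or empty. A small sketch of that behavior, using a hypothetical wrapper function `require_event_type` and a shortened message (neither appears in `bin/trigger`):

```shell
# Hypothetical wrapper illustrating the ${1:?...} expansion used in bin/trigger.
require_event_type() {
    local event_type="${1:?An event type is required as the first argument.}"
    echo "$event_type"
}

# With an argument, the value is passed through.
require_event_type fetch-and-ingest

# Without one, the expansion fails; the subshell keeps the failure from
# terminating the calling shell.
(require_event_type) 2>/dev/null || echo "no event type given"
```

The expansion only checks that something was supplied; it does not verify that the value is one of the three event types named in the message.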
