Batch ETL pipeline to mirror ClinVar releases first into the Terra Data Repository (TDR), then into ClinGen's Kafka processing pipeline.
The schema used for this dataset was (largely) designed by the ClinGen team. JSON definitions of the resulting tables can be found under `schema/`. Some pieces of the design are worth calling out here:
- Every table includes a `release_date` column
- Most tables contain a `content` column to store unmodeled data
- Some tables are used to track processing provenance, instead of primary data
Every table in the ClinVar dataset includes a `release_date` column, used as (part of) the primary key. This means that the base TDR dataset shows the latest data for every processed release, not just the latest processed release. We do this to enable:
- Parallel processing & ingest of raw archives
- Targeted reprocessing of archives
The downsides of this scheme are:
- A significant amount of data is duplicated from release to release
- Creating a snapshot of a single release is more complicated
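Because every release lives side-by-side in the same tables, pulling a point-in-time view means filtering each table on `release_date`. A minimal sketch of that lookup using the BigQuery client library; the project, dataset, and table names here are hypothetical:

```python
from google.cloud import bigquery

# Hypothetical project / dataset names; the real values belong to the
# TDR-backed BigQuery dataset created for this pipeline.
PROJECT = "example-tdr-project"
DATASET = "clinvar"


def fetch_release(table: str, release_date: str) -> list[dict]:
    """Return all rows of `table` as of a single processed release."""
    client = bigquery.Client(project=PROJECT)
    query = f"""
        SELECT *
        FROM `{PROJECT}.{DATASET}.{table}`
        WHERE release_date = @release_date
    """
    job = client.query(
        query,
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("release_date", "DATE", release_date)
            ]
        ),
    )
    return [dict(row) for row in job.result()]
```

This is also why single-release snapshots are more complicated: every table has to be filtered by the same `release_date` before (or while) the snapshot is cut.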
Most tables in the ClinVar dataset include a `content` column. We use these columns to store unmodeled data as string-encoded JSON. This allows us to fully ETL the pieces of the data we're most interested in, without throwing any information away. It also insulates us against future additions in the NCBI source schema.
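The split between modeled columns and the `content` blob is easiest to see in miniature. A sketch of the idea, using an illustrative (not real) set of modeled columns:

```python
import json

# Columns explicitly modeled for a hypothetical table; everything else in the
# raw record is packed into the `content` column as string-encoded JSON.
MODELED_COLUMNS = {"id", "release_date", "variation_type", "name"}


def split_record(raw: dict) -> dict:
    """Split a parsed record into modeled columns plus a catch-all `content` blob."""
    row = {key: raw[key] for key in MODELED_COLUMNS if key in raw}
    leftovers = {key: value for key, value in raw.items() if key not in MODELED_COLUMNS}
    # Store unmodeled fields as JSON text so nothing is thrown away, and so new
    # fields added upstream land here instead of breaking the pipeline.
    row["content"] = json.dumps(leftovers) if leftovers else None
    return row
```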
The `xml_archive` and `processing_history` tables both track provenance-related information in the dataset:

- `xml_archive` tracks references to raw XML payloads per release
- `processing_history` tracks the pipeline version and run date of processing per release
The ingest pipeline uses information in these tables to short-circuit when appropriate.
- If a release date is already present in `xml_archive`, the corresponding raw XML is not re-ingested
- If a release date / pipeline version pair is already present in `processing_history`, the corresponding raw XML is not re-processed using that pipeline version
The ingest pipeline is orchestrated through Argo, with most data-processing logic delegated to Dataflow and BigQuery. The high-level flow looks like:
As shown, end-to-end ingest is broken into 3 distinct sub-workflows, to mitigate failures and reduce the work needed to run a partial reprocessing. The 3 phases are:
- Raw data ingest
- Data processing and re-ingest
- Export to ClinGen
Stage 1 of ingest prioritizes stashing the raw XML payloads from NCBI in the TDR. We package this as a focused end-to-end workflow because NCBI weekly releases are only available during the same month as their release. If we tightly coupled ingesting these raw payloads with their downstream processing, and that processing suddenly broke, it could take us "too long" to fix the logic and the weekly release could disappear. Separating the XML ingest into its own step minimizes the chances of this scenario.
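A rough sketch of what stashing one weekly archive might involve; the NCBI URL pattern, bucket name, and object layout here are all illustrative assumptions:

```python
import requests
from google.cloud import storage

# Both the NCBI URL pattern and the GCS layout below are assumptions for
# illustration; the real pipeline records the stashed location in xml_archive.
NCBI_WEEKLY = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/weekly_release"
STAGING_BUCKET = "example-clinvar-staging"


def stash_weekly_release(release_date: str, archive_name: str) -> str:
    """Download one weekly XML archive from NCBI and copy it into GCS."""
    url = f"{NCBI_WEEKLY}/{archive_name}"
    blob_path = f"raw/{release_date}/{archive_name}"

    bucket = storage.Client().bucket(STAGING_BUCKET)
    blob = bucket.blob(blob_path)

    with requests.get(url, stream=True, timeout=300) as response:
        response.raise_for_status()
        # Stream straight into GCS so the archive never has to fit in memory.
        blob.upload_from_file(response.raw)

    return f"gs://{STAGING_BUCKET}/{blob_path}"
```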
Stage 2 of ingest runs the bulk of our business logic. For a target release date, it:
- Downloads a raw XML release from the TDR to a local volume in k8s
- Converts the XML to JSON-list, and uploads it back to GCS
- Runs a Dataflow pipeline to process the JSON into our target schema
- Per-table:
  - Uses BigQuery to diff staged data against the current TDR state (see the sketch after this list)
  - Uses the TDR APIs to ingest/soft-delete data so that the TDR state matches staged data
- Cuts a TDR snapshot for the release
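The per-table BigQuery diff mentioned above boils down to two set differences: staged rows missing from the TDR must be ingested, and TDR rows for that release missing from the staged data must be soft-deleted. A sketch under the assumption that both tables expose identical columns (the real comparison logic may differ):

```python
from google.cloud import bigquery


def diff_table(
    client: bigquery.Client,
    staged_table: str,
    tdr_table: str,
    release_date: str,
) -> tuple[list[dict], list[dict]]:
    """Return (rows_to_ingest, rows_to_soft_delete) for one table and release."""
    config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("d", "DATE", release_date)]
    )

    # Staged rows not already present (byte-for-byte) in the TDR for this release.
    ingest_sql = f"""
        SELECT s.*
        FROM `{staged_table}` AS s
        WHERE TO_JSON_STRING(s) NOT IN (
            SELECT TO_JSON_STRING(t) FROM `{tdr_table}` AS t WHERE t.release_date = @d
        )
    """
    # TDR rows for this release that no longer appear in the staged data.
    delete_sql = f"""
        SELECT t.*
        FROM `{tdr_table}` AS t
        WHERE t.release_date = @d
          AND TO_JSON_STRING(t) NOT IN (SELECT TO_JSON_STRING(s) FROM `{staged_table}` AS s)
    """

    to_ingest = [dict(row) for row in client.query(ingest_sql, job_config=config).result()]
    to_delete = [dict(row) for row in client.query(delete_sql, job_config=config).result()]
    return to_ingest, to_delete
```

Treating an updated row as a soft-delete of the old version plus an ingest of the new one keeps the TDR state exactly in sync with the staged output.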
Stage 3 of ingest exports processed data to ClinGen. For a pair of release dates, it:
- Per-table:
  - Compares data across the two releases to get a diff of created/updated/deleted rows
  - Exports the three "slices" of data from the diff into GCS
- Produces a Kafka message to ClinGen, pointing at the exported data
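A minimal sketch of that final notification step using kafka-python; the topic name and message shape are illustrative assumptions, not ClinGen's actual contract:

```python
import json

from kafka import KafkaProducer

# Illustrative topic name; the real contract is whatever ClinGen's consumer expects.
TOPIC = "clinvar-releases"


def notify_clingen(
    bootstrap_servers: list[str],
    table: str,
    old_release: str,
    new_release: str,
    gcs_prefix: str,
) -> None:
    """Tell ClinGen where to find the exported created/updated/deleted slices for one table."""
    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda payload: json.dumps(payload).encode("utf-8"),
    )
    message = {
        "table": table,
        "previous_release": old_release,
        "current_release": new_release,
        "created": f"{gcs_prefix}/{table}/created/*",
        "updated": f"{gcs_prefix}/{table}/updated/*",
        "deleted": f"{gcs_prefix}/{table}/deleted/*",
    }
    producer.send(TOPIC, value=message)
    producer.flush()
```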