
N3C Data Lifecycle


Overview

Welcome to the N3C Data Ingestion Wiki! This is a general overview of how N3C data is ingested.

1. Data Partner Registration

When an N3C data partner submits data to the N3C sFTP folder, a single row is added to the data partner registration dataset containing basic information about the data partner, including the data partner id used to set up the site's data pipeline. The data partner id is a unique number between 0 and 1023, because the process of generating N3C primary keys reserves 10 bits for indicating a record's originating site. A dataset with a single row containing the data partner id, source CDM, sFTP folder name, and the site's long and short names is generated and provided as an input to the template that sets up the data ingestion pipeline.
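The sketch below illustrates why the data partner id is capped at 1023: with 10 bits reserved for the site, a record's originating site can be packed into (and recovered from) its global primary key. The exact bit layout, key width, and function names here are assumptions for illustration, not the N3C implementation.

```python
# Illustrative sketch: embed a 10-bit data partner id alongside a
# site-local record id to form a global primary key.
# The bit layout and key width are assumptions, not the N3C code.

SITE_BITS = 10    # 2**10 = 1024 possible sites (ids 0-1023)
LOCAL_BITS = 53   # assumed width reserved for the site-local record id


def make_global_id(data_partner_id: int, local_record_id: int) -> int:
    """Pack the originating site's id into the high bits of the key."""
    if not 0 <= data_partner_id < 2 ** SITE_BITS:
        raise ValueError("data partner id must be between 0 and 1023")
    if not 0 <= local_record_id < 2 ** LOCAL_BITS:
        raise ValueError("local record id does not fit in the reserved bits")
    return (data_partner_id << LOCAL_BITS) | local_record_id


def site_of(global_id: int) -> int:
    """Recover the originating site from a global key."""
    return global_id >> LOCAL_BITS


assert site_of(make_global_id(42, 123456789)) == 42
```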

2. Setup DataSync to load the zip file

After a new site is registered, a job is created in Foundry's Data Connection app to check the site's incoming folder every hour and pull in newly submitted data. The sFTP folder is set up using the following convention: <site_abbrev_name>/incoming/. The following regex is used to ensure the correct file is ingested: incoming/.*<site_name>_<cdm_name>.*.zip. This matters because a site may also submit other datasets, such as viral variant summaries. Already-synced files are excluded from processing by filtering on the file's size and last-modified date, and the update transaction mode ensures that only newly submitted files are synced. The submitted site's zip file is placed in the DI&H/raw_data/zip dataset/ folder for processing, using the following naming convention: site_<data_partner_id>__raw_zips.
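A minimal sketch of the filename filter described above, assuming the regex convention shown (the site and CDM names below are placeholders; the real pattern is configured per site in the Data Connection sync):

```python
# Sketch of matching incoming files against the per-site pattern.
# "example_site" and "omop" are placeholders, not real N3C values.
import re

site_name = "example_site"
cdm_name = "omop"

pattern = re.compile(rf"incoming/.*{site_name}_{cdm_name}.*\.zip$", re.IGNORECASE)

candidates = [
    "incoming/example_site_omop_2022_09_01.zip",
    "incoming/example_site_variant_summary.csv",  # other submissions are ignored
]

to_sync = [path for path in candidates if pattern.search(path)]
print(to_sync)  # only the CDM zip is picked up for the sync
```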

3. Deploy the CDM specific template

4. Template checks can be configured to error or warn on site-specific processing issues (optional)

5. Data health check groups are set up for notification

6. Save Monocle graph for pipeline

7. Pipeline Build schedule

The pipeline build schedule is set up when the new site's template is deployed. The build schedule is run manually for the first ingestion, and it is also configured to run whenever a new raw zip file from the site is synced. This kicks off a build of all the datasets in the pipeline. This step can take anywhere from 20 minutes to an hour, depending on the size of the submitted zip file.

8. LDS merge based on data quality checks

After each site's pipeline build runs successfully, with all datasets built without failures and all data quality checks passing, the site's OMOP instance is unioned into the Limited Data Set (LDS) and Safe Harbor data repositories for research. Data quality checks are performed via the Data Quality Portal (DQP). The DQP is a separate code repository containing a collection of queries, based on the Kahn framework, that measure data quality metrics at the conformance, completeness, and plausibility levels. For example, N3C data has a two-year lookback from the COVID-19 pandemic year, and the DQP checks whether measurement data fall within the proper temporal window and within proper threshold values for each measurement concept.
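The sketch below shows the kind of plausibility check described above: a measurement must fall inside the study window (a two-year lookback from the pandemic year) and inside a per-concept plausible range. The anchor date, concept ids, and threshold values here are illustrative assumptions, not the DQP's actual reference values.

```python
# Hedged sketch of a temporal + threshold plausibility check.
# Dates, concept ids, and ranges are illustrative, not DQP values.
from datetime import date

STUDY_START = date(2018, 1, 1)  # assumed two-year lookback from 2020

# Assumed per-concept plausible value ranges (low, high).
PLAUSIBLE_RANGE = {
    3004249: (40.0, 300.0),   # e.g. systolic blood pressure, mmHg (illustrative)
    3027018: (20.0, 250.0),   # e.g. heart rate, bpm (illustrative)
}


def measurement_passes(concept_id: int, value: float, measurement_date: date) -> bool:
    """Return True if the record is temporally and numerically plausible."""
    if measurement_date < STUDY_START or measurement_date > date.today():
        return False
    low, high = PLAUSIBLE_RANGE.get(concept_id, (float("-inf"), float("inf")))
    return low <= value <= high


print(measurement_passes(3004249, 120.0, date(2021, 6, 15)))  # True
print(measurement_passes(3004249, 900.0, date(2021, 6, 15)))  # False: out of range
```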
