N3C Data Lifecycle
Welcome to the N3C Data Ingestion Wiki! This is a general overview of how N3C data is ingested.
When an N3C data partner submits data to the N3C sFTP folder, a single row is added to the data partner registration dataset. This row contains basic information about the data partner, including the data partner ID that is used to set up the site's data pipeline. The data partner ID is a unique number between 0 and 1023, because the process of generating N3C primary keys reserves 10 bits for indicating a record's originating site. A dataset with a single row containing the data partner ID, source CDM, sFTP folder name, and the site's long and short names is generated and provided as input to the template that sets up the data ingestion pipeline.
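To illustrate why the data partner ID must fit in the 0-1023 range, here is a minimal sketch of a key scheme that reserves 10 bits of a primary key for the originating site. The function names and the choice of low-order bits are assumptions for illustration; the source only states that 10 bits are reserved.

```python
SITE_BITS = 10
SITE_MASK = (1 << SITE_BITS) - 1  # 0b1111111111, i.e. 0-1023

def make_global_id(local_record_id: int, data_partner_id: int) -> int:
    """Combine a site-local record id with a 10-bit data partner id."""
    if not 0 <= data_partner_id <= SITE_MASK:
        raise ValueError("data partner id must be between 0 and 1023")
    return (local_record_id << SITE_BITS) | data_partner_id

def partner_id_from_global(global_id: int) -> int:
    """Recover the originating site from a global primary key."""
    return global_id & SITE_MASK

assert partner_id_from_global(make_global_id(42, 507)) == 507
```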
After a new site is registered, an hourly job is created through Foundry's Data Connection app to check the site's incoming folder for new dataset submissions and pull in new data. The sFTP folder is set up using the following convention: `<site_abbrev_name>/incoming/`. The following regex is used to ensure the correct file is ingested, since a site may also submit other datasets such as viral variant summaries: `incoming/.*<site_name>_<cdm_name>.*\.zip`. Using the file size and last-modified date of the file as filters, already-synced files are excluded from processing, and using the update transaction mode, only newly submitted files are synced. The site's submitted zip file is placed in the `DI&H/raw_data/zip dataset/` folder for processing, using the naming convention `site_<data_partner_id>__raw_zips`.
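A minimal sketch of the filtering logic described above, assuming plain Python and an in-memory record of previously synced files. The placeholder values and helper names are illustrative; in practice this is configured in the Data Connection app rather than hand-written, and the sketch only makes the filter concrete.

```python
import re

# Pattern from the sFTP convention above; <site_name> and <cdm_name> are
# per-site placeholders (values here are hypothetical).
site_name, cdm_name = "example_site", "omop"
pattern = re.compile(rf"incoming/.*{site_name}_{cdm_name}.*\.zip$")

# Files synced before, keyed by (path, size, last_modified); tracking size
# and last-modified date is how unchanged files are skipped.
already_synced = {("incoming/example_site_omop_2023.zip", 1024, "2023-01-01")}

def files_to_sync(listing):
    """Yield only new, matching zip files from an sFTP directory listing."""
    for path, size, last_modified in listing:
        if not pattern.match(path):
            continue  # skip non-CDM uploads, e.g. viral variant summaries
        if (path, size, last_modified) in already_synced:
            continue  # unchanged file, already processed
        yield path
```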
Optionally, template checks can be configured to error or warn on site-specific processing issues.
The pipeline build schedule is set up when the new site's template is deployed. The schedule is run manually for the first ingestion, and it is also set up to run whenever a new raw zip file from the site is synced. This kicks off a build for all of the datasets in the pipeline, which can take anywhere from 20 minutes to an hour depending on the size of the submitted zip file.
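Conceptually, the trigger behaves like the sketch below. The dataset names and the `build` callable are hypothetical stand-ins; Foundry's scheduling is configured rather than coded, so this is only a model of the behavior.

```python
# Conceptual sketch: when a new raw zip for a site is synced, rebuild every
# dataset in that site's pipeline in dependency order.
PIPELINE_DATASETS = [  # hypothetical dataset names, ordered by dependency
    "unzipped_raw", "parsed_domain_tables", "omop_mapped", "dq_checked",
]

def on_raw_zip_synced(data_partner_id: int, build) -> None:
    """Kick off a full pipeline build for one site's new submission."""
    for dataset in PIPELINE_DATASETS:
        build(f"site_{data_partner_id}__{dataset}")
```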
After each site's pipeline build runs successfully, all datasets are built without failures, and the data quality checks pass, each site's OMOP instance is unioned into the Limited Data Set and Safe Harbor data repositories for research. Data quality checks are performed via the Data Quality Portal (DQP). The DQP is a separate code repository containing a collection of queries, based on the Kahn framework, that measure data quality metrics at the conformance, completeness, and plausibility levels. For example, N3C data has a two-year lookback from the COVID-19 pandemic year, and the checks verify that measurement data fall within the proper temporal window and within proper threshold values for each measurement concept.
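A minimal sketch of a plausibility check of the kind described above, using standard OMOP measurement-table columns. The window start date and the per-concept thresholds here are illustrative assumptions, not the DQP's actual values.

```python
from datetime import date

# Illustrative lookback boundary: two years before the 2020 pandemic year.
# The exact window used by the DQP is not specified here.
WINDOW_START = date(2018, 1, 1)

# Hypothetical thresholds: measurement_concept_id -> (low, high).
THRESHOLDS = {3013682: (0.0, 300.0)}  # example values only

def check_measurement(row) -> list[str]:
    """Return a list of data-quality failures for one measurement row."""
    failures = []
    if row["measurement_date"] < WINDOW_START:
        failures.append("measurement_date outside temporal window")
    bounds = THRESHOLDS.get(row["measurement_concept_id"])
    value = row["value_as_number"]
    if bounds and value is not None and not bounds[0] <= value <= bounds[1]:
        failures.append("value_as_number outside plausible threshold")
    return failures
```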
- The Enclave API now supports an `authority` property, which is used to indicate the authority source of a concept set.
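As a purely hypothetical illustration of where such a property might appear in a concept set payload (the field names below are assumptions, not the documented Enclave API schema):

```python
# Hypothetical payload sketch only; every field name except "authority"
# is an assumption, not the documented Enclave API schema.
concept_set = {
    "conceptSetName": "example-concept-set",
    "authority": "OMOP Vocabulary",  # indicates the authority source
}
```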
DataSet/Manifest Contents and Structures
- CDM minimum required tables
- OMOP Vocabulary Updates
- Process Overview
- DataSet Submission Format
- N3CID Global Domain ID
- Terminology Value Set Mapping Table Structure Explained
- PCORnet terms cross walk table
- ACT term cross walk table
- ACT: Additional Mapping Notes
- TrinetX: Mapping Notes
- Qualitative Results VS Mapping
- Corrected LOINC when COVID test LOINC is missing
- DEMOGRAPHIC VALUESETS
- COVID-19 LOINC corrected
- Safe Harbor Requirement
- type_concept_id default values
- Immunization value-set mappings
- Lab confirmed positive cohort definition
- Long COVID Specialty clinic visit
- Categorical Answer Concept Maps
- BulkImportConceptSetCreation
- Viral Variant Mapping Information
- CMS Data Integration Guidelines
- N3C Data Ingestion
- N3C Data Transformation
- Value Set Mapping table structure
- measurement unit harmonization
- Data Quality Public Worksheet
- PPRL N3C - CMS Release Notes
- N3C Custom Concept Extensions
- Mitigate Concept ID Misalignment in the OMOP CDM Pipeline and Resolve Terminology Drift
Includes Overview, PreRelease Variable Punchlist, and References for creating phenotypes.