Create study cohort
6.2 Overall Training Dataset
- 6.2.0 Deidentify the Data
- 6.2.1 Overview of Cohort Creation
- 6.2.2 Connect to Postgres Database
- 6.2.3 Convert Data to CSV
- 6.2.4 Create Patients Table
- 6.2.5 Create Medevid Table
- 6.2.6 Join Patients to Medevid
- 6.2.7 Create Transplant Waitlist Features
- 6.2.8 Create Partition Data
- 6.2.9 Join patients_medevid_waitlist Table to the Partition Index
- 6.2.10 Get Pre-ESRD Claims Data
- 6.2.11 Create Claims Tables
- 6.2.12 Map Diagnosis Codes (drg_cd) to Primary Diagnosis Codes (pdgns_cd)
- 6.2.13 Get pre-2011 pre-ESRD Claims Data
- 6.2.14 Diagnosis Groupings
- 6.2.15 Aggregate pre-ESRD Claims Data
- 6.2.16 Join the preesrdfeatures Tables to the Partition Index
- 6.2.17 Map ICD-9 to ICD-10
- 6.2.18 Prepare Data for Modeling
- 6.2.19 Impute Missing Values
- 6.2.20 Utility Files
  - dx_mappings_ucsf.csv
  - 2017_I9gem_map.txt
  - icd10_ccs_codes.R
  - icd10_dx_codes.txt
  - icd9_ccs_codes.R
  - icd9_dx_2014.txt
  - imputation_rules.xlsx
  - pre_esrd_ip_claim_variables.R
  - pre_esrd_hh_claim_variables.R
  - pre_esrd_hs_claim_variables.R
  - pre_esrd_op_claim_variables.R
  - pre_esrd_sn_claim_variables.R
  - pre_esrd_pre2011_claim_variables.R
  - setfieldtypes.R
- 6.2.21 Documentation of the Training Dataset
The source data for building the overall training dataset was obtained from the United States Renal Data System (USRDS), the national data registry developed from resources initiated by the Centers for Medicare & Medicaid Services (CMS) and its funded end-stage kidney disease (ESKD) networks and subsequently maintained by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). USRDS stores and distributes data on the outcomes and treatments of the chronic kidney disease (CKD) and ESKD populations in the U.S. (Note: to be consistent with USRDS terminology for data tables, this document uses the term end-stage renal disease, ESRD, instead of ESKD.) To better understand the data, data profiling was performed on the demographic variables and on the outcome variable of interest (mortality in the first 90 days of dialysis). Information on constructing the outcome variable can be found in Section 6.2.4 Create Patients Table.
Section 6.2 details the methodology used to create the overall training dataset. A high-level overview of the tables used for the training dataset can be found in Section 6.2.1 Overview of Cohort Creation. The process results in a final dataset with 1,150,195 observations and 188 variables. The final dataset used for modeling is stored in two PostgreSQL (Postgres) tables: medxpreesrd for the non-imputed variables and micecomplete_pmm for the imputed variables (5 sets of imputations were generated; more information on imputations can be found in Section 6.2.19 Impute Missing Values).
The construction of medxpreesrd involves more than 20 USRDS data tables, as well as publicly available data for mapping diagnosis codes to groupings.
All scripts are located in the DataSet/ directory on GitHub.
Two types of files are involved in constructing medxpreesrd:
- Sequential scripts - these have the prefix "S0-", "S1-", etc. to indicate the sequence in which they are run
- Utility scripts - these create the data used by the sequential scripts
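The prefix convention above can be illustrated with a small sketch. The repository's scripts are written in R; the Python below is only an illustration of how the "S<number>-" prefix determines run order, and the file names used (such as S10-example_step.R) are hypothetical placeholders, not actual files in DataSet/. It sorts by the numeric part of the prefix, so that a script numbered 10 would run after one numbered 2, and it skips utility files, which carry no such prefix.

```python
import re

# Matches the sequential-script prefix described above: "S0-", "S1-", ...
PREFIX = re.compile(r"^S(\d+)-")


def run_order(filenames):
    """Return the sequential scripts sorted by the number in their prefix.

    Files without an "S<number>-" prefix (the utility files) are skipped.
    """
    numbered = []
    for name in filenames:
        match = PREFIX.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    # Sort numerically on the prefix, not alphabetically on the file name.
    return [name for _, name in sorted(numbered)]


# Hypothetical file names, for illustration only.
example = [
    "S10-example_step.R",
    "S1-example_step.R",
    "setfieldtypes.R",
    "S0-example_step.R",
]
ordered = run_order(example)
print(ordered)
```

A plain alphabetical sort would place "S10-" before "S2-", which is why the sketch compares the prefix as an integer.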
Other resources that could be helpful to users include:
The Office of the National Coordinator for Health Information Technology (ONC)