
Create study cohort

Summer K. Rankin edited this page Sep 13, 2021 · 1 revision

The source data for building the overall training dataset was obtained from the United States Renal Data System (USRDS), the national data registry developed from resources initiated by the Centers for Medicare & Medicaid Services (CMS) and its funded end-stage kidney disease (ESKD) networks and subsequently maintained by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). USRDS stores and distributes data on the outcomes and treatments of the chronic kidney disease (CKD) and ESKD populations in the U.S. (Note: to be consistent with USRDS terminology for data tables, this document uses end-stage renal disease (ESRD) instead of ESKD.) To better understand the data, data profiling was performed on the demographic variables and on the outcome variable of interest (mortality in the first 90 days of dialysis). Information on constructing the outcome variable can be found in Section 6.2.4 Create Patients Table.

6.2 Overall Training Dataset

Section 6.2 details the methodology used to create the overall training dataset. A high-level overview of the tables used for the training dataset can be found in Section 6.2.1 Overview of Cohort Creation. The process results in a final dataset with 1,150,195 observations and 188 variables. The final dataset used for modeling is stored in two PostgreSQL (Postgres) tables: medxpreesrd for the non-imputed variables and micecomplete_pmm for the imputed variables (5 sets of imputations were generated; more information on imputations can be found in Section 6.2.19 Impute Missing Values).
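As a quick sanity check after the build, the row and column counts of a stored table can be compared by eye against the figures quoted above. The following is a minimal sketch and not part of the project's scripts: it assumes a DB-API cursor such as one from psycopg2 (whose %s placeholder style is used), and the table_shape helper name and the "public" schema default are hypothetical. Whether the 188 variables correspond one-to-one to the columns of medxpreesrd is not stated here, so the function only reports the counts and asserts nothing.

```python
def table_shape(cursor, table_name, schema="public"):
    """Return (row_count, column_count) for a Postgres table.

    `cursor` is any DB-API cursor using %s placeholders (e.g. psycopg2).
    The "public" schema default is an assumption, not taken from the project.
    """
    # Identifiers cannot be passed as query parameters, so validate them
    # before interpolating them into the SQL text.
    for name in (schema, table_name):
        if not name.isidentifier():
            raise ValueError(f"unexpected identifier: {name!r}")

    # Number of observations (rows).
    cursor.execute(f'SELECT COUNT(*) FROM "{schema}"."{table_name}"')
    n_rows = cursor.fetchone()[0]

    # Number of variables (columns), read from the standard catalog view.
    cursor.execute(
        "SELECT COUNT(*) FROM information_schema.columns "
        "WHERE table_schema = %s AND table_name = %s",
        (schema, table_name),
    )
    n_cols = cursor.fetchone()[0]
    return n_rows, n_cols
```

Given an open connection, calling table_shape(connection.cursor(), "medxpreesrd") would return the pair of counts for the non-imputed table; the same call with "micecomplete_pmm" covers the imputed one.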

Constructing medxpreesrd draws on more than 20 USRDS data tables, along with publicly available data used to map diagnosis codes to groupings.

All scripts are located in the DataSet/ directory on GitHub.

Two types of files are involved in constructing medxpreesrd:

  1. Sequential scripts - these have the prefix "S0-", "S1-", etc. to indicate the sequence in which they are run
  2. Utility scripts - these create the data used by the sequential scripts
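The naming convention above can be used to separate the two kinds of files and to recover the run order. The sketch below is illustrative only: the split_scripts helper and the example file names are hypothetical, and it assumes every sequential script starts with "S", an integer, and a hyphen. Sorting on the integer rather than on the raw name keeps a script such as S10- after S2-.

```python
import re

# Matches the sequential-script prefix described above: "S0-", "S1-", ...
# Assumption: the sequence number is a plain integer with no letter suffix.
SEQ_PATTERN = re.compile(r"^S(\d+)-")


def split_scripts(filenames):
    """Split file names into (sequential scripts in run order, utility scripts)."""
    sequential = []
    utility = []
    for name in filenames:
        match = SEQ_PATTERN.match(name)
        if match:
            # Keep the numeric prefix so the sort is numeric, not alphabetical.
            sequential.append((int(match.group(1)), name))
        else:
            utility.append(name)
    sequential.sort()
    return [name for _, name in sequential], sorted(utility)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    ordered, helpers = split_scripts(
        ["S10-final.py", "lookup_codes.py", "S0-setup.py", "S2-merge.py"]
    )
    print(ordered)
    print(helpers)
```

Because utility scripts create data that the sequential scripts consume, a runner built on this would need to make sure the relevant utility outputs exist before starting the ordered list.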

Other resources that could be helpful to users include: