Ben Sabath June 22, 2021
The code in this directory cleans and merges exposure, covariate, and health data to produce combined data sets covering the period from 1999-2016 that can be used to estimate the effects of air quality exposures on health outcomes.
The merge takes the following files as input, each produced by another workflow in this repository:
- `Denominator_1999_2016.csv`: The Medicare beneficiary enrollment file from 1999-2016. The code producing this file is in the `HealthOutcomes` directory. We are unable to share this file due to the private nature of the data.
- `census_interpolated_zips.csv`: The interpolated census confounder data covering the period from 1999-2016. The code and source data producing this file are available in the `Confounders/census` directory. The file is too large to be shared on GitHub in the form we merge, but it can be recreated by running the provided code on the available source data.
- `brfss_interpolated.csv`: The interpolated BRFSS (county level smoking rate and mean BMI) data from 1999-2016. The workflow creating this file is in `Confounders/brfss`. The source data is too large to share on GitHub; however, we provide instructions for downloading it, the code we use to create the final data set, and the final data set.
- `all_years.csv`: Estimates of annual PM2.5 exposure for each zip code covering the period from 2000-2016. The file can be found in `Exposures/processed_data`. Please see the readme in the `Exposures` directory for a description of how the file is created and for a link to the source grid point estimates.
- `temperature_seasonal_zipcode_combined.csv`: Summer and winter temperature and humidity covering the period from 2000 until the end of winter 2020. The workflow producing this data is available in `Confounders/earth_engine`. The source data is again too large to share, but we share the code we run on Google Earth Engine to produce the data, as well as the rest of the code we use, and the final products.

Software versions used:

- Base R: 3.5.1, Intel MKL Kernel
- data.table R package: 1.11.4
- fst R package: 0.8.8
The basic workflow is illustrated in the figure below.
The process is as follows. First, in `1_prep_health_data_to_fst.R`, the
CSV file containing the health data is split into multiple `.fst` files,
one for each year.

Next, we deal with duplicate individuals in the data set. Because the
data we receive from Medicare is administrative in nature and not
prepared for research, we observe some instances of the same individual
appearing multiple times in a single year, which should not happen in
the beneficiary summary data. `2_check_qid_dups.R` quantifies the scale
of the duplicates, then `3_remove_dups and missing.R` removes them.
Within each set of duplicates, the observations with more missing
information than the others are removed first. If duplicates with
equivalent levels of missingness remain, one observation is randomly
selected to be kept.

Next, we prepare the confounder and exposure data sets for merging with
the health data in `4_merge_coviariates.R`, where all of the confounders
are merged on zip code and year so that only a single join needs to be
performed with the larger health data set. We additionally calculate the
individual-level variables needed in later analysis (the first year that
an individual appears in the cohort and the age at entry into the
cohort) in `4_score_participant_variables.R`.

The next step (`5_merge_health.R`) is the large merge, where we join the
participant-level variables with the person-year data in the beneficiary
summary file on individual ID, and join the zip code level exposure and
confounders with the beneficiary file on zip code of residence and year.
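
The duplicate-removal rule and the "merge covariates first, then join once" strategy described above can be sketched as follows. This is an illustration only, not the code in this directory: the actual scripts are written in R with `data.table`, while the sketch uses plain Python so it runs without any packages. The column names (`qid`, `year`, `zip`, `sex`, `race`, `pm25`, `poverty`), the function names, and the toy values are all hypothetical. Keeping person-years that have no matching covariates (a left join) is also an assumption; the text above does not say how unmatched rows are handled.

```python
import random


def drop_duplicates(rows, id_key="qid", seed=0):
    """Keep one row per (individual, year).

    Rows with the fewest missing (None) fields are preferred; ties are
    broken by a seeded random choice.
    """
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault((row[id_key], row["year"]), []).append(row)

    kept = []
    for candidates in groups.values():
        # Step 1: keep only the observations with the least missing information.
        n_missing = [sum(v is None for v in r.values()) for r in candidates]
        fewest = min(n_missing)
        tied = [r for r, n in zip(candidates, n_missing) if n == fewest]
        # Step 2: among equally complete observations, pick one at random.
        kept.append(rng.choice(tied))
    return kept


def merge_covariates(person_years, *zip_year_tables):
    """Combine all zip/year tables first, then join once onto person-years."""
    combined = {}
    for table in zip_year_tables:
        for row in table:
            combined.setdefault((row["zip"], row["year"]), {}).update(row)
    # Assumed left join: person-years without covariates are kept as they are.
    return [{**combined.get((p["zip"], p["year"]), {}), **p} for p in person_years]


# Toy data (made up for illustration).
rows = [
    {"qid": "A", "year": 2000, "zip": "02138", "sex": 1, "race": None},
    {"qid": "A", "year": 2000, "zip": "02138", "sex": 1, "race": 2},
    {"qid": "B", "year": 2000, "zip": "02139", "sex": 2, "race": 1},
]
pm25 = [{"zip": "02138", "year": 2000, "pm25": 10.5}]
census = [{"zip": "02138", "year": 2000, "poverty": 0.12}]

deduped = drop_duplicates(rows)
merged = merge_covariates(deduped, pm25, census)
```

Combining the small zip/year tables before touching the person-year data means the large table is joined only once, which is the motivation given above for preparing the covariates separately.
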
After this step, and after initially using the data in our analysis, we
observed that some individuals in the cohort had multiple dates of death
on record. The code in `6_remove_varying_deaths.R` excludes these
individuals entirely from our data set, as we cannot be confident about
which is the true date of death. Finally, we realized that seasonal
temperature and humidity data would also improve our analysis, so
`7_merge_seasonal_temperature.R` adds those variables. The files
produced by this script are the input for the statistical analysis.
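
The exclusion of individuals with conflicting death dates, described above, can be sketched as follows. As before, this is a plain Python illustration rather than the R code in `6_remove_varying_deaths.R`. The column names `qid` and `death_date`, the function name, and the toy records are hypothetical.

```python
def remove_varying_deaths(rows, id_key="qid", death_key="death_date"):
    """Drop every row of any individual with more than one distinct death date."""
    death_dates = {}
    for row in rows:
        if row[death_key] is not None:
            death_dates.setdefault(row[id_key], set()).add(row[death_key])
    # An individual is conflicted if two or more different dates are recorded.
    conflicted = {pid for pid, dates in death_dates.items() if len(dates) > 1}
    return [row for row in rows if row[id_key] not in conflicted]


# Toy person-year records (made up for illustration).
records = [
    {"qid": "A", "year": 2001, "death_date": None},
    {"qid": "A", "year": 2002, "death_date": "2002-05-01"},
    {"qid": "B", "year": 2001, "death_date": "2001-03-10"},
    {"qid": "B", "year": 2002, "death_date": "2002-07-04"},
]
cleaned = remove_varying_deaths(records)
```

Individual `B` has two different death dates and is removed entirely, while `A` is kept because only one date is ever recorded.
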