PM2.5 Data Pipeline

Ben Sabath June 22, 2021

PM2.5 Input Data

We receive a series of .RDS files from Dr. Joel Schwartz’s team containing annual estimates of mean PM2.5 for each ZIP Code. We have not been given permission to share these data, but the directory raw_data/annual_pm25 is created as a placeholder. Yaguang Wei, a member of Dr. Schwartz’s team who developed the aggregation method, wrote the following description of how the estimates were created:

The daily and annual estimations of ambient PM2.5 at ZIP Codes across the continental US, 2000-2016, were aggregated from the estimations at 1 km×1 km grid cells. Below is a summary of how these estimations were created:

The daily ambient PM2.5 at 1 km×1 km grid cells across the contiguous US, 2000-2016, were estimated by Qian Di using well-validated ensemble models. Briefly, for each pollutant, they estimated daily concentrations at each grid cell by combining predictions from three machine learning algorithms (i.e., random forest, gradient boosting, and neural network) in a geographically weighted regression. Multiple sources of predictors were fused, including ground monitoring data, satellite-derived measurements of aerosol optical depth, meteorological conditions (e.g., daily air temperature, relative humidity, wind speed, and height of planetary boundary layer), chemical transport model simulations, and land-use variables (e.g., distance to major roads, emission, and land use pattern), etc.

Using the daily predictions at 1 km×1 km grids, Yaguang Wei estimated daily concentrations at ZIP Codes across the contiguous US, 2000-2016. There are two major types of ZIP Codes: standard ZIP Code which is an area surrounding a post office, and PO Box which is used only for a particular facility, such as a large office building, university, bank, etc. For a standard ZIP Code, we estimated its daily concentrations by averaging the estimations at grid cells whose centroids fall within the boundary of that ZIP Code (there are no official boundaries defined by standard ZIP Codes, so we used the ZIP Code polygon data generated by Environmental Systems Research Institute a.k.a. “Esri”); for a PO Box, we estimated the daily concentrations by linking it to the nearest grid.
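The two aggregation rules described above can be illustrated with a minimal sketch. This is not the code in code/yaguang_pm25_code (which is the authoritative implementation); it is a pure-Python illustration that assumes projected planar coordinates, a single simple polygon per ZIP Code, and hypothetical function names and made-up values.

```python
import math

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon,
    given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def standard_zip_mean(polygon, grid_cells):
    """Standard ZIP Code: average PM2.5 over grid cells whose centroids
    fall inside the polygon. grid_cells is a list of (x, y, pm25).
    Returns None when no centroid falls inside."""
    values = [pm for x, y, pm in grid_cells if point_in_polygon(x, y, polygon)]
    return sum(values) / len(values) if values else None

def po_box_value(location, grid_cells):
    """PO Box: take the PM2.5 of the grid cell with the nearest centroid
    (planar distance, an assumption of this sketch)."""
    px, py = location
    nearest = min(grid_cells, key=lambda c: math.hypot(c[0] - px, c[1] - py))
    return nearest[2]

# Made-up grid cell centroids and PM2.5 values, for illustration only.
cells = [(0.5, 0.5, 8.0), (1.5, 0.5, 10.0), (2.5, 0.5, 12.0), (0.5, 1.5, 9.0)]
square = [(0, 0), (2, 0), (2, 1), (0, 1)]  # hypothetical ZIP Code polygon

print(standard_zip_mean(square, cells))   # 9.0 (mean of the two cells inside)
print(po_box_value((2.4, 0.6), cells))    # 12.0 (nearest centroid is (2.5, 0.5))
```

A production version would normally rely on a GIS library for the point-in-polygon test and for geodesic distances; the sketch only shows the logic of the two cases.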

Please note that we are unable to provide the code and workflow for the initial creation of the PM2.5 estimates, as our collaborators were unable to provide their entire process. As a result, this data flow cannot be fully reproduced from the original source. However, the PM2.5 exposure data we used is available for download here.

Code

The directory code/yaguang_pm25_code contains the code that Dr. Joel Schwartz’s team provided to us, which they use to create the ZIP Code-level estimates from their grid-level data.

After receiving the data from Dr. Schwartz, we convert the provided RDS files to CSVs and combine them into a single file to prepare them for joining with our other data sources (code/combine_years.R).
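The combining step can be pictured as stacking one table per year and tagging each row with its year. The actual script is code/combine_years.R, written in R; the Python below is only an illustrative sketch. It assumes each yearly table has already been read into rows with ZIP and pm25 fields, and it zero-pads ZIP Codes to five characters, which is an assumption of this sketch rather than documented behavior. All values are made up.

```python
import csv
import io

def combine_years(tables_by_year):
    """Stack per-year tables into one list of rows with columns ZIP, year, pm25."""
    combined = []
    for year in sorted(tables_by_year):
        for row in tables_by_year[year]:
            combined.append({
                # Assumption: store ZIP as a 5-character string so leading zeros survive.
                "ZIP": str(row["ZIP"]).zfill(5),
                "year": year,
                "pm25": float(row["pm25"]),
            })
    return combined

def write_combined(rows, handle):
    """Write the combined rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=["ZIP", "year", "pm25"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical per-year tables with made-up values.
tables = {
    2001: [{"ZIP": "02138", "pm25": 10.5}],
    2000: [{"ZIP": 2138, "pm25": 11.0}, {"ZIP": "90210", "pm25": 15.2}],
}
rows = combine_years(tables)

buffer = io.StringIO()  # stands in for an output file in this sketch
write_combined(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the year as an explicit column, rather than encoding it in a file name, is what lets the single output file be joined to other sources on ZIP and year.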

Output

We do not yet have permission to share the PM2.5 data. However, the output of our processing is a file named processed_data/all_years.csv. The columns in this file are:

  • ZIP: ZIP Code
  • year: year
  • pm25: PM2.5 estimate, in micrograms per cubic meter of air
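As a hedged illustration of how a file with these three columns might be consumed, the sketch below indexes rows by ZIP Code and year so they can be looked up when joining with other data. The function name and the sample values are hypothetical, and whether ZIP Codes in the real file are zero-padded is not something this README states.

```python
import csv
import io

def load_pm25(handle):
    """Index rows by (ZIP, year). ZIP is kept as a string so that any
    leading zeros present in the file are preserved."""
    index = {}
    for row in csv.DictReader(handle):
        index[(row["ZIP"], int(row["year"]))] = float(row["pm25"])
    return index

# Made-up sample in the documented column layout. With the real file, pass
# open("processed_data/all_years.csv", newline="") instead of this buffer.
sample = io.StringIO("ZIP,year,pm25\n02138,2000,11.0\n02138,2001,10.5\n")
pm25 = load_pm25(sample)

print(pm25[("02138", 2001)])  # 10.5
```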

References:

  • Di Q, Amini H, Shi L, et al. An ensemble-based model of PM2.5 concentration across the contiguous United States with high spatiotemporal resolution. Environ Int. 2019; 130: 104909.
  • Di Q, Amini H, Shi L, et al. Assessing NO2 Concentration and Model Uncertainty with High Spatiotemporal Resolution across the Contiguous United States Using Ensemble Model Averaging. Environ Sci Technol. 2019.
  • Environmental Systems Research Institute (ESRI). Esri Data & Maps 10: An Esri White Paper. 2010. Available from: http://downloads.esri.com/support/whitepapers/ao_/Esri_Data_and_Maps_10.pdf [Accessed 28 Sep. 2019].