v1 ready for public consumption

Joe-Wasserman · Oct 4, 2021 · 3385538 · 3385538
1 parent 8d4e736
commit 3385538
Show file tree

Hide file tree

Showing 8 changed files with 13,298 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -1,2 +1,164 @@
-# covid-19-united-states-county-quarterly-excess-deaths
-Code and data used to produce estimates for United States excess mortality by that's probably during 2020.
+COVID-19 United States Excess Deaths by county and quarter
+================
+
+This repository contains code and data to estimate **expected deaths**
+and **excess deaths** in the United States in 2020 by **county** and
+**quarter**.
+
+Estimated excess deaths are available at
+[`/blob/main/results/united_states_county_quarterly_excess_deaths_estimates.csv`](https://github.com/mymil/covid-19-united-states-county-quarterly-excess-deaths/blob/main/results/united_states_county_quarterly_excess_deaths_estimates.csv).
+
+# Data Sources
+
+-   CDC WONDER all-cause, all-ages deaths by county and month
+
+    <https://wonder.cdc.gov/ucd-icd10.html>
+
+    These public-use historical mortality data were obtained for years
+    2015-2019 and used to estimate the expected deaths model. Because
+    these data have an unusual and specific license, they are not
+    included as part of the repository, but can be freely downloaded
+    from <https://wonder.cdc.gov/ucd-icd10.html>.
+
+    Note: County-months wither fewer than 10 deaths are censored in the
+    source data.
+
+-   NCHS Provisional COVID-19 Deaths by Quarter, County and Age for 2020
+
+    <https://data.cdc.gov/NCHS/AH-Provisional-COVID-19-Deaths-by-Quarter-County-a/ypxr-mz8e>
+
+    These public domain all-cause and COVID-19 mortality data were
+    obtained for 2020 and compared to estimated expected deaths to
+    calculate excess deaths. The age groups in these data are not
+    mutually exclusive. To minimize the prevalence of suppression in
+    these data, values from the two widest age groups, under 65 and 65
+    and older, were summed.
+
+    Note: Values for county-quarter-age group with fewer than 9 deaths
+    are suppressed.
+
+-   US Census Population Estimates
+
+    Population estimates for 2015-2019 were retrieved from the [US
+    Census Bureau Population Estimates
+    API](https://www.census.gov/data/developers/data-sets/popest-popproj.html)
+    via the [tidycensus](https://github.com/walkerke/tidycensus) R
+    package. Population estimates for 2019 were filled forward for 2020.
+
+    Note: This product uses the Census Bureau Data API but is not
+    endorsed or certified by the Census Bureau.
+
+# Excess Deaths Model
+
+The excess deaths model used began as an adaptation of *The Economist*’s
+[excess mortality
+model](https://github.com/TheEconomist/covid-19-excess-deaths-tracker),
+but has since diverged.
+
+Given the large number of counties in the United States (over 3000), a
+linear mixed model with county as a random grouping factor was used to
+make estimation tractable. This random grouping factor enables every
+county to have its own intercept in the final model.
+
+More precisely, the model regresses **total deaths per day** on:
+
+-   county population (z-scored)
+
+-   years since 2015
+
+-   quarter of the year (fixed grouping factor)
+
+-   county (random grouping factor)
+
+This model can be expressed by the equation:
+
+# Model Performance
+
+Because the COVID-19 pandemic only began partway through March, 2020, we
+can evaluate model performance by examining concordance of predicted and
+observed deaths in Q1 2020.
+
+Observed and expected deaths in Q1 2020 are highly correlated, r =
+0.998. As can be seen from the following scatterplot, total deaths
+(x-axis) tended to exceed expected deaths (y-axis). Unsurprisingly, some
+of the counties that were hardest hit early in the pandemic are among
+those with total deaths that most diverge from expected deaths.
+
+![](C:/Users/jwasserman/Documents/covid-19-united-states-county-quarterly-excess-deaths/README_files/figure-gfm/plot%20comparison-1.png)<!-- -->
+
+# Repository Organization
+
+-   Excess deaths estimates and fitted LMM are available in `/results/`.
+
+-   Code to fit the model and estimate excess deaths are available in
+    `/code/`.
+
+-   Data used in modeling are available in `/data/`, when possible.
+
+# License
+
+The code contained in this repository are available under the [MIT
+License](https://opensource.org/licenses/MIT), and the data generated by
+this code are licensed under the [Creative Commons Attribution 4.0
+International License](https://creativecommons.org/licenses/by/4.0/).
+
+    ## R version 4.0.4 (2021-02-15)
+    ## Platform: x86_64-w64-mingw32/x64 (64-bit)
+    ## Running under: Windows 10 x64 (build 19042)
+    ## 
+    ## Matrix products: default
+    ## 
+    ## locale:
+    ## [1] LC_COLLATE=English_United States.1252 
+    ## [2] LC_CTYPE=English_United States.1252   
+    ## [3] LC_MONETARY=English_United States.1252
+    ## [4] LC_NUMERIC=C                          
+    ## [5] LC_TIME=English_United States.1252    
+    ## 
+    ## attached base packages:
+    ## [1] stats     graphics  grDevices utils     datasets  methods   base     
+    ## 
+    ## other attached packages:
+    ##  [1] tinytex_0.34       knitr_1.31         equatiomatic_0.3.0 lme4_1.1-26       
+    ##  [5] Matrix_1.3-2       aweek_1.0.2        lubridate_1.7.10   data.table_1.14.0 
+    ##  [9] tidycensus_0.11.4  forcats_0.5.1      stringr_1.4.0      dplyr_1.0.5       
+    ## [13] purrr_0.3.4        readr_2.0.2        tidyr_1.1.3        tibble_3.1.0      
+    ## [17] ggplot2_3.3.5      tidyverse_1.3.1   
+    ## 
+    ## loaded via a namespace (and not attached):
+    ##   [1] rdocsyntax_0.4.1.9000 minqa_1.2.4           colorspace_2.0-0     
+    ##   [4] ellipsis_0.3.1        class_7.3-18          rgdal_1.5-23         
+    ##   [7] rprojroot_2.0.2       base64enc_0.1-3       fs_1.5.0             
+    ##  [10] rstudioapi_0.13       proxy_0.4-25          farver_2.1.0         
+    ##  [13] texPreview_1.5        ggrepel_0.9.1         fansi_0.4.2          
+    ##  [16] xml2_1.3.2            splines_4.0.4         cachem_1.0.4         
+    ##  [19] jsonlite_1.7.2        nloptr_1.2.2.2        broom_0.7.6          
+    ##  [22] dbplyr_2.1.1          png_0.1-7             broom.mixed_0.2.6    
+    ##  [25] shiny_1.6.0           clipr_0.7.1           compiler_4.0.4       
+    ##  [28] httr_1.4.2            backports_1.2.1       assertthat_0.2.1     
+    ##  [31] fastmap_1.1.0         svgPanZoom_0.3.4      cli_2.5.0            
+    ##  [34] later_1.2.0           htmltools_0.5.1.1     tools_4.0.4          
+    ##  [37] coda_0.19-4           gtable_0.3.0          glue_1.4.2           
+    ##  [40] reshape2_1.4.4        rappdirs_0.3.3        V8_3.4.0             
+    ##  [43] Rcpp_1.0.7            cellranger_1.1.0      pdftools_2.3.1       
+    ##  [46] vctrs_0.3.7           nlme_3.1-152          tigris_1.5           
+    ##  [49] xfun_0.26             rvest_1.0.0           mime_0.10            
+    ##  [52] lifecycle_1.0.0       statmod_1.4.35        MASS_7.3-53          
+    ##  [55] scales_1.1.1          hms_1.0.0             promises_1.2.0.1     
+    ##  [58] TMB_1.7.20            rematch2_2.1.2        yaml_2.2.1           
+    ##  [61] curl_4.3              memoise_2.0.0         stringi_1.5.3        
+    ##  [64] highr_0.8             desc_1.3.0            maptools_1.1-1       
+    ##  [67] e1071_1.7-6           boot_1.3-26           rlang_0.4.10         
+    ##  [70] pkgconfig_2.0.3       evaluate_0.14         lattice_0.20-41      
+    ##  [73] sf_1.0-2              labeling_0.4.2        tidyselect_1.1.0     
+    ##  [76] here_1.0.1            plyr_1.8.6            magrittr_2.0.1       
+    ##  [79] R6_2.5.0              magick_2.7.1          generics_0.1.0       
+    ##  [82] DBI_1.1.1             mgcv_1.8-33           whisker_0.4          
+    ##  [85] pillar_1.6.0          haven_2.3.1           foreign_0.8-81       
+    ##  [88] withr_2.4.2           units_0.7-1           sp_1.4-5             
+    ##  [91] modelr_0.1.8          crayon_1.4.1          uuid_0.1-4           
+    ##  [94] KernSmooth_2.23-18    utf8_1.2.1            tzdb_0.1.2           
+    ##  [97] rmarkdown_2.7         grid_4.0.4            readxl_1.3.1         
+    ## [100] qpdf_1.1              reprex_2.0.0          digest_0.6.27        
+    ## [103] classInt_0.4-3        xtable_1.8-4          httpuv_1.6.0         
+    ## [106] details_0.2.1         munsell_0.5.0         askpass_1.1
diff --git a/README_files/figure-gfm/plot comparison-1.png b/README_files/figure-gfm/plot comparison-1.png
diff --git a/code/clean_data_and_model.R b/code/clean_data_and_model.R
@@ -0,0 +1,219 @@
+# This script conforms historical and recent US mortality data
+# Estimates 2020 expected and excess mortality by county and quarter
+
+# Load libraries and functions ####
+library(tidyverse)
+library(tidycensus)
+library(data.table)
+library(lubridate)
+library(aweek)
+library(lme4)
+options(scipen=999)
+
+source(
+  file.path(here::here(), "code/estimate_excess_deaths.R")
+)
+
+# Import data ####
+
+# NOTE: Obtaining US census data via tidycensus requires a Census API key
+# Obtain a key here http://api.census.gov/data/key_signup.html
+# Store key in .Renviron with: census_api_key("YOUR KEY", install = TRUE)
+
+# Import historical county-monthly data downloaded from CDC WONDER
+# https://wonder.cdc.gov/ucd-icd10.html
+# NOTE: County-quarters with < 10 deaths are censored in these data
+united_states_county_monthly_total_deaths <- list.files(
+  file.path(here::here(), "data"),
+  pattern = "*.txt",
+  full.names = TRUE
+) %>%
+  map_dfr(
+    ~ data.table::fread(
+      .x,
+      na.strings = c("Missing", "Suppressed", "Not Applicable"),
+      keepLeadingZeros = TRUE,
+      colClasses = c("character")
+    )
+  )
+
+# Import NCHS Provisional COVID-19 Deaths by Quarter, County and Age for 2020 data
+# https://data.cdc.gov/NCHS/AH-Provisional-COVID-19-Deaths-by-Quarter-County-a/ypxr-mz8e
+# NOTE: County-quarter-agebands with < 10 deaths are censored in these data
+united_states_county_quarterly_covid_deaths <- data.table::fread(
+  "https://data.cdc.gov/api/views/ypxr-mz8e/rows.csv",
+  keepLeadingZeros = TRUE
+)
+
+# If the above CDC data no longer exists, retrieve the local version instead
+if(!exists("united_states_county_quarterly_covid_deaths")){
+  united_states_county_quarterly_covid_deaths <- data.table::fread(
+    file.path(
+      here::here(),
+      "data/united_states_county_quarterly_covid_deaths.csv"
+    ),
+    keepLeadingZeros = TRUE
+  )
+}
+
+# Import yearly county population estimates from US census
+# https://www.census.gov/programs-surveys/popest.html
+united_states_county_yearly_population <- tidycensus::get_estimates(
+  geography = "county",
+  product = "population",
+  year = 2019,
+  time_series = TRUE,
+  key = Sys.getenv("CENSUS_API_KEY")
+)
+
+# Import FIPS code details table
+data(fips_codes)
+
+states_50 <- (fips_codes %>% distinct(state, state_code))[1:51,]
+
+county_names <- fips_codes %>%
+  transmute(
+    region_code = paste0(state_code, county_code),
+    region = paste(county, state, sep = ", "),
+    state
+  )
+
+# Format and conform data ####
+
+# Format yearly county population data
+# Conform county FIPS codes to match other datasets
+# For details on county FIPS code changes, see:
+# https://www.ddorn.net/data/FIPS_County_Code_Changes.pdf
+# For details on mapping DATE to years see:
+# https://www.census.gov/data/developers/data-sets/popest-popproj/popest/popest-vars/2019.html
+united_states_county_yearly_population_formatted <- united_states_county_yearly_population %>%
+  filter(
+    variable == "POP",
+    DATE >= 3
+  ) %>%
+  transmute(
+    country = "United States",
+    region_code = case_when(
+      GEOID == "02158" ~ "02270",
+      GEOID == "46102" ~ "46113",
+      TRUE ~ GEOID),
+    year = as.integer(DATE) + 2007L,
+    population = as.integer(value)
+  )
+
+# pretend 2020 pop is the same as 2019
+united_states_county_yearly_population_complete <- united_states_county_yearly_population_formatted %>%
+  filter(year == 2019L) %>%
+  mutate(year = 2020L) %>%
+  bind_rows(united_states_county_yearly_population_formatted)
+
+# Summarize historical county-month deaths by county-quarter
+# NOTE: Missing and censored values are treated as 0s
+united_states_county_quarterly_total_deaths <- united_states_county_monthly_total_deaths %>%
+  transmute(
+    country = "United States",
+    region_code = `County Code`,
+    year = as.integer(Year),
+    quarter = quarter(ym(`Month Code`)),
+    start_date = yq(paste(year, quarter, sep = "-")),
+    end_date = ceiling_date(start_date + months(2), unit="month") - 1,
+    days = as.integer(end_date - start_date + 1),
+    total_deaths = as.integer(Deaths),
+    total_deaths = replace_na(total_deaths, 0L)
+  ) %>%
+  group_by(
+    country, region_code, year, quarter, start_date, end_date, days
+  ) %>%
+  summarise(
+    total_deaths = sum(total_deaths, na.rm = TRUE),
+    .groups = "drop"
+  ) %>%
+  dplyr::select(country, region_code, start_date, end_date, days, year,
+                quarter, total_deaths)
+
+# Summarize 2020 county-quarter-ageband total and COVID deaths by county-quarter
+# NOTE: Missing and censored values are treated as 0s
+united_states_county_quarterly_covid_deaths_clean <- united_states_county_quarterly_covid_deaths %>%
+  filter(`Age Group` %in% c("65 years and over", "<65 years")) %>%
+  transmute(
+    country = "United States",
+    region_code = `FIPS Code`,
+    year = Year,
+    quarter = Quarter,
+    start_date = mdy(`Start Date`),
+    end_date = mdy(`End Date`),
+    days = as.integer(end_date - start_date + 1L),
+    total_deaths = as.integer(`Total Deaths`),
+    covid_deaths = as.integer(`COVID-19 Deaths`),
+    across(c(total_deaths, covid_deaths), replace_na, 0L)
+  ) %>%
+  group_by(country, region_code, year, quarter, start_date, end_date,
+           days) %>%
+  summarise(
+    across(c(total_deaths, covid_deaths), sum, na.rm = TRUE),
+    .groups = "drop"
+    )
+
+# Union historical county-quarterly total deaths and
+# 2020 county-quarterly total and covid deaths
+# Limits data to 50 states + Washington DC
+# Consider omitting counties that had zero (or completely censored) deaths in all quarters
+united_states_county_quarterly_deaths <- united_states_county_quarterly_total_deaths %>%
+  bind_rows(united_states_county_quarterly_covid_deaths_clean) %>%
+  mutate(state_code = str_sub(region_code, end = 2)) %>%
+  semi_join(states_50, by = "state_code") %>%
+  left_join(
+    united_states_county_yearly_population_complete,
+    by = c("country", "region_code", "year")
+  ) %>%
+  left_join(
+    county_names,
+    by = "region_code"
+  ) %>%
+  mutate(
+    expected_deaths = "TBC", # To be calculated
+    quarter = as_factor(quarter),
+    population_z = (population - mean(population, na.rm = TRUE)) /
+      sd(population, na.rm = (TRUE))
+  ) %>%
+  # group_by(region) %>%
+  # mutate(total_deaths_max = max(total_deaths)) %>%
+  # ungroup() %>%
+  # filter(total_deaths_max != 0) %>%
+  # ungroup() %>%
+  dplyr::select(country, state, region, region_code, start_date, end_date, days,
+                year, quarter, population, population_z, total_deaths,
+                covid_deaths, expected_deaths)
+
+# Estimate expected and excess deaths for 2020 ####
+
+united_states_county_quarterly_results <- estimate_excess_deaths(
+  df = united_states_county_quarterly_deaths,
+  period = "quarter",
+  calculate = TRUE,
+  train_model = TRUE
+)
+
+saveRDS(
+  united_states_county_quarterly_results[[1]],
+  file.path(
+    here::here(),
+    "results/united_states_county_quarterly_model.RDS"
+  )
+)
+
+data.table::fwrite(
+  united_states_county_quarterly_results[[2]],
+  file.path(
+    here::here(),
+    "results/united_states_county_quarterly_excess_deaths_estimates.csv"
+  )
+)
+
+data.table::fwrite(
+  united_states_county_quarterly_covid_deaths,
+  file.path(
+    here::here(),
+    "data/united_states_county_quarterly_covid_deaths.csv"
+  )
+)