Skip to content

Commit

Permalink
v1 ready for public consumption
Browse files Browse the repository at this point in the history
  • Loading branch information
Joe-Wasserman committed Oct 4, 2021
1 parent 8d4e736 commit 3385538
Show file tree
Hide file tree
Showing 8 changed files with 13,298 additions and 2 deletions.
166 changes: 164 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,2 +1,164 @@
# covid-19-united-states-county-quarterly-excess-deaths
Code and data used to produce estimates for United States excess mortality by that's probably during 2020.
COVID-19 United States Excess Deaths by county and quarter
================

This repository contains code and data to estimate **expected deaths**
and **excess deaths** in the United States in 2020 by **county** and
**quarter**.

Estimated excess deaths are available at
[`/blob/main/results/united_states_county_quarterly_excess_deaths_estimates.csv`](https://github.com/mymil/covid-19-united-states-county-quarterly-excess-deaths/blob/main/results/united_states_county_quarterly_excess_deaths_estimates.csv).

# Data Sources

- CDC WONDER all-cause, all-ages deaths by county and month

<https://wonder.cdc.gov/ucd-icd10.html>

These public-use historical mortality data were obtained for years
2015-2019 and used to estimate the expected deaths model. Because
these data have an unusual and specific license, they are not
included as part of the repository, but can be freely downloaded
from <https://wonder.cdc.gov/ucd-icd10.html>.

Note: County-months wither fewer than 10 deaths are censored in the
source data.

- NCHS Provisional COVID-19 Deaths by Quarter, County and Age for 2020

<https://data.cdc.gov/NCHS/AH-Provisional-COVID-19-Deaths-by-Quarter-County-a/ypxr-mz8e>

These public domain all-cause and COVID-19 mortality data were
obtained for 2020 and compared to estimated expected deaths to
calculate excess deaths. The age groups in these data are not
mutually exclusive. To minimize the prevalence of suppression in
these data, values from the two widest age groups, under 65 and 65
and older, were summed.

Note: Values for county-quarter-age group with fewer than 9 deaths
are suppressed.

- US Census Population Estimates

Population estimates for 2015-2019 were retrieved from the [US
Census Bureau Population Estimates
API](https://www.census.gov/data/developers/data-sets/popest-popproj.html)
via the [tidycensus](https://github.com/walkerke/tidycensus) R
package. Population estimates for 2019 were filled forward for 2020.

Note: This product uses the Census Bureau Data API but is not
endorsed or certified by the Census Bureau.

# Excess Deaths Model

The excess deaths model used began as an adaptation of *The Economist*’s
[excess mortality
model](https://github.com/TheEconomist/covid-19-excess-deaths-tracker),
but has since diverged.

Given the large number of counties in the United States (over 3000), a
linear mixed model with county as a random grouping factor was used to
make estimation tractable. This random grouping factor enables every
county to have its own intercept in the final model.

More precisely, the model regresses **total deaths per day** on:

- county population (z-scored)

- years since 2015

- quarter of the year (fixed grouping factor)

- county (random grouping factor)

This model can be expressed by the equation:

# Model Performance

Because the COVID-19 pandemic only began partway through March, 2020, we
can evaluate model performance by examining concordance of predicted and
observed deaths in Q1 2020.

Observed and expected deaths in Q1 2020 are highly correlated, r =
0.998. As can be seen from the following scatterplot, total deaths
(x-axis) tended to exceed expected deaths (y-axis). Unsurprisingly, some
of the counties that were hardest hit early in the pandemic are among
those with total deaths that most diverge from expected deaths.

![](C:/Users/jwasserman/Documents/covid-19-united-states-county-quarterly-excess-deaths/README_files/figure-gfm/plot%20comparison-1.png)<!-- -->

# Repository Organization

- Excess deaths estimates and fitted LMM are available in `/results/`.

- Code to fit the model and estimate excess deaths are available in
`/code/`.

- Data used in modeling are available in `/data/`, when possible.

# License

The code contained in this repository are available under the [MIT
License](https://opensource.org/licenses/MIT), and the data generated by
this code are licensed under the [Creative Commons Attribution 4.0
International License](https://creativecommons.org/licenses/by/4.0/).

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tinytex_0.34 knitr_1.31 equatiomatic_0.3.0 lme4_1.1-26
## [5] Matrix_1.3-2 aweek_1.0.2 lubridate_1.7.10 data.table_1.14.0
## [9] tidycensus_0.11.4 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.5
## [13] purrr_0.3.4 readr_2.0.2 tidyr_1.1.3 tibble_3.1.0
## [17] ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] rdocsyntax_0.4.1.9000 minqa_1.2.4 colorspace_2.0-0
## [4] ellipsis_0.3.1 class_7.3-18 rgdal_1.5-23
## [7] rprojroot_2.0.2 base64enc_0.1-3 fs_1.5.0
## [10] rstudioapi_0.13 proxy_0.4-25 farver_2.1.0
## [13] texPreview_1.5 ggrepel_0.9.1 fansi_0.4.2
## [16] xml2_1.3.2 splines_4.0.4 cachem_1.0.4
## [19] jsonlite_1.7.2 nloptr_1.2.2.2 broom_0.7.6
## [22] dbplyr_2.1.1 png_0.1-7 broom.mixed_0.2.6
## [25] shiny_1.6.0 clipr_0.7.1 compiler_4.0.4
## [28] httr_1.4.2 backports_1.2.1 assertthat_0.2.1
## [31] fastmap_1.1.0 svgPanZoom_0.3.4 cli_2.5.0
## [34] later_1.2.0 htmltools_0.5.1.1 tools_4.0.4
## [37] coda_0.19-4 gtable_0.3.0 glue_1.4.2
## [40] reshape2_1.4.4 rappdirs_0.3.3 V8_3.4.0
## [43] Rcpp_1.0.7 cellranger_1.1.0 pdftools_2.3.1
## [46] vctrs_0.3.7 nlme_3.1-152 tigris_1.5
## [49] xfun_0.26 rvest_1.0.0 mime_0.10
## [52] lifecycle_1.0.0 statmod_1.4.35 MASS_7.3-53
## [55] scales_1.1.1 hms_1.0.0 promises_1.2.0.1
## [58] TMB_1.7.20 rematch2_2.1.2 yaml_2.2.1
## [61] curl_4.3 memoise_2.0.0 stringi_1.5.3
## [64] highr_0.8 desc_1.3.0 maptools_1.1-1
## [67] e1071_1.7-6 boot_1.3-26 rlang_0.4.10
## [70] pkgconfig_2.0.3 evaluate_0.14 lattice_0.20-41
## [73] sf_1.0-2 labeling_0.4.2 tidyselect_1.1.0
## [76] here_1.0.1 plyr_1.8.6 magrittr_2.0.1
## [79] R6_2.5.0 magick_2.7.1 generics_0.1.0
## [82] DBI_1.1.1 mgcv_1.8-33 whisker_0.4
## [85] pillar_1.6.0 haven_2.3.1 foreign_0.8-81
## [88] withr_2.4.2 units_0.7-1 sp_1.4-5
## [91] modelr_0.1.8 crayon_1.4.1 uuid_0.1-4
## [94] KernSmooth_2.23-18 utf8_1.2.1 tzdb_0.1.2
## [97] rmarkdown_2.7 grid_4.0.4 readxl_1.3.1
## [100] qpdf_1.1 reprex_2.0.0 digest_0.6.27
## [103] classInt_0.4-3 xtable_1.8-4 httpuv_1.6.0
## [106] details_0.2.1 munsell_0.5.0 askpass_1.1
Binary file added README_files/figure-gfm/plot comparison-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
219 changes: 219 additions & 0 deletions code/clean_data_and_model.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,219 @@
# This script conforms historical and recent US mortality data
# Estimates 2020 expected and excess mortality by county and quarter

# Load libraries and functions ####
library(tidyverse)
library(tidycensus)
library(data.table)
library(lubridate)
library(aweek)
library(lme4)
options(scipen=999)

source(
file.path(here::here(), "code/estimate_excess_deaths.R")
)

# Import data ####

# NOTE: Obtaining US census data via tidycensus requires a Census API key
# Obtain a key here http://api.census.gov/data/key_signup.html
# Store key in .Renviron with: census_api_key("YOUR KEY", install = TRUE)

# Import historical county-monthly data downloaded from CDC WONDER
# https://wonder.cdc.gov/ucd-icd10.html
# NOTE: County-quarters with < 10 deaths are censored in these data
united_states_county_monthly_total_deaths <- list.files(
file.path(here::here(), "data"),
pattern = "*.txt",
full.names = TRUE
) %>%
map_dfr(
~ data.table::fread(
.x,
na.strings = c("Missing", "Suppressed", "Not Applicable"),
keepLeadingZeros = TRUE,
colClasses = c("character")
)
)

# Import NCHS Provisional COVID-19 Deaths by Quarter, County and Age for 2020 data
# https://data.cdc.gov/NCHS/AH-Provisional-COVID-19-Deaths-by-Quarter-County-a/ypxr-mz8e
# NOTE: County-quarter-agebands with < 10 deaths are censored in these data
united_states_county_quarterly_covid_deaths <- data.table::fread(
"https://data.cdc.gov/api/views/ypxr-mz8e/rows.csv",
keepLeadingZeros = TRUE
)

# If the above CDC data no longer exists, retrieve the local version instead
if(!exists("united_states_county_quarterly_covid_deaths")){
united_states_county_quarterly_covid_deaths <- data.table::fread(
file.path(
here::here(),
"data/united_states_county_quarterly_covid_deaths.csv"
),
keepLeadingZeros = TRUE
)
}

# Import yearly county population estimates from US census
# https://www.census.gov/programs-surveys/popest.html
united_states_county_yearly_population <- tidycensus::get_estimates(
geography = "county",
product = "population",
year = 2019,
time_series = TRUE,
key = Sys.getenv("CENSUS_API_KEY")
)

# Import FIPS code details table
data(fips_codes)

states_50 <- (fips_codes %>% distinct(state, state_code))[1:51,]

county_names <- fips_codes %>%
transmute(
region_code = paste0(state_code, county_code),
region = paste(county, state, sep = ", "),
state
)

# Format and conform data ####

# Format yearly county population data
# Conform county FIPS codes to match other datasets
# For details on county FIPS code changes, see:
# https://www.ddorn.net/data/FIPS_County_Code_Changes.pdf
# For details on mapping DATE to years see:
# https://www.census.gov/data/developers/data-sets/popest-popproj/popest/popest-vars/2019.html
united_states_county_yearly_population_formatted <- united_states_county_yearly_population %>%
filter(
variable == "POP",
DATE >= 3
) %>%
transmute(
country = "United States",
region_code = case_when(
GEOID == "02158" ~ "02270",
GEOID == "46102" ~ "46113",
TRUE ~ GEOID),
year = as.integer(DATE) + 2007L,
population = as.integer(value)
)

# pretend 2020 pop is the same as 2019
united_states_county_yearly_population_complete <- united_states_county_yearly_population_formatted %>%
filter(year == 2019L) %>%
mutate(year = 2020L) %>%
bind_rows(united_states_county_yearly_population_formatted)

# Summarize historical county-month deaths by county-quarter
# NOTE: Missing and censored values are treated as 0s
united_states_county_quarterly_total_deaths <- united_states_county_monthly_total_deaths %>%
transmute(
country = "United States",
region_code = `County Code`,
year = as.integer(Year),
quarter = quarter(ym(`Month Code`)),
start_date = yq(paste(year, quarter, sep = "-")),
end_date = ceiling_date(start_date + months(2), unit="month") - 1,
days = as.integer(end_date - start_date + 1),
total_deaths = as.integer(Deaths),
total_deaths = replace_na(total_deaths, 0L)
) %>%
group_by(
country, region_code, year, quarter, start_date, end_date, days
) %>%
summarise(
total_deaths = sum(total_deaths, na.rm = TRUE),
.groups = "drop"
) %>%
dplyr::select(country, region_code, start_date, end_date, days, year,
quarter, total_deaths)

# Summarize 2020 county-quarter-ageband total and COVID deaths by county-quarter
# NOTE: Missing and censored values are treated as 0s
united_states_county_quarterly_covid_deaths_clean <- united_states_county_quarterly_covid_deaths %>%
filter(`Age Group` %in% c("65 years and over", "<65 years")) %>%
transmute(
country = "United States",
region_code = `FIPS Code`,
year = Year,
quarter = Quarter,
start_date = mdy(`Start Date`),
end_date = mdy(`End Date`),
days = as.integer(end_date - start_date + 1L),
total_deaths = as.integer(`Total Deaths`),
covid_deaths = as.integer(`COVID-19 Deaths`),
across(c(total_deaths, covid_deaths), replace_na, 0L)
) %>%
group_by(country, region_code, year, quarter, start_date, end_date,
days) %>%
summarise(
across(c(total_deaths, covid_deaths), sum, na.rm = TRUE),
.groups = "drop"
)

# Union historical county-quarterly total deaths and
# 2020 county-quarterly total and covid deaths
# Limits data to 50 states + Washington DC
# Consider omitting counties that had zero (or completely censored) deaths in all quarters
united_states_county_quarterly_deaths <- united_states_county_quarterly_total_deaths %>%
bind_rows(united_states_county_quarterly_covid_deaths_clean) %>%
mutate(state_code = str_sub(region_code, end = 2)) %>%
semi_join(states_50, by = "state_code") %>%
left_join(
united_states_county_yearly_population_complete,
by = c("country", "region_code", "year")
) %>%
left_join(
county_names,
by = "region_code"
) %>%
mutate(
expected_deaths = "TBC", # To be calculated
quarter = as_factor(quarter),
population_z = (population - mean(population, na.rm = TRUE)) /
sd(population, na.rm = (TRUE))
) %>%
# group_by(region) %>%
# mutate(total_deaths_max = max(total_deaths)) %>%
# ungroup() %>%
# filter(total_deaths_max != 0) %>%
# ungroup() %>%
dplyr::select(country, state, region, region_code, start_date, end_date, days,
year, quarter, population, population_z, total_deaths,
covid_deaths, expected_deaths)

# Estimate expected and excess deaths for 2020 ####

united_states_county_quarterly_results <- estimate_excess_deaths(
df = united_states_county_quarterly_deaths,
period = "quarter",
calculate = TRUE,
train_model = TRUE
)

saveRDS(
united_states_county_quarterly_results[[1]],
file.path(
here::here(),
"results/united_states_county_quarterly_model.RDS"
)
)

data.table::fwrite(
united_states_county_quarterly_results[[2]],
file.path(
here::here(),
"results/united_states_county_quarterly_excess_deaths_estimates.csv"
)
)

data.table::fwrite(
united_states_county_quarterly_covid_deaths,
file.path(
here::here(),
"data/united_states_county_quarterly_covid_deaths.csv"
)
)
Loading

0 comments on commit 3385538

Please sign in to comment.