-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
8d4e736
commit 3385538
Showing
8 changed files
with
13,298 additions
and
2 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,164 @@ | ||
# covid-19-united-states-county-quarterly-excess-deaths | ||
Code and data used to produce estimates for United States excess mortality by that's probably during 2020. | ||
COVID-19 United States Excess Deaths by county and quarter | ||
================ | ||
|
||
This repository contains code and data to estimate **expected deaths** | ||
and **excess deaths** in the United States in 2020 by **county** and | ||
**quarter**. | ||
|
||
Estimated excess deaths are available at | ||
[`/blob/main/results/united_states_county_quarterly_excess_deaths_estimates.csv`](https://github.com/mymil/covid-19-united-states-county-quarterly-excess-deaths/blob/main/results/united_states_county_quarterly_excess_deaths_estimates.csv). | ||
|
||
# Data Sources | ||
|
||
- CDC WONDER all-cause, all-ages deaths by county and month | ||
|
||
<https://wonder.cdc.gov/ucd-icd10.html> | ||
|
||
These public-use historical mortality data were obtained for years | ||
2015-2019 and used to estimate the expected deaths model. Because | ||
these data have an unusual and specific license, they are not | ||
included as part of the repository, but can be freely downloaded | ||
from <https://wonder.cdc.gov/ucd-icd10.html>. | ||
|
||
Note: County-months wither fewer than 10 deaths are censored in the | ||
source data. | ||
|
||
- NCHS Provisional COVID-19 Deaths by Quarter, County and Age for 2020 | ||
|
||
<https://data.cdc.gov/NCHS/AH-Provisional-COVID-19-Deaths-by-Quarter-County-a/ypxr-mz8e> | ||
|
||
These public domain all-cause and COVID-19 mortality data were | ||
obtained for 2020 and compared to estimated expected deaths to | ||
calculate excess deaths. The age groups in these data are not | ||
mutually exclusive. To minimize the prevalence of suppression in | ||
these data, values from the two widest age groups, under 65 and 65 | ||
and older, were summed. | ||
|
||
Note: Values for county-quarter-age group with fewer than 9 deaths | ||
are suppressed. | ||
|
||
- US Census Population Estimates | ||
|
||
Population estimates for 2015-2019 were retrieved from the [US | ||
Census Bureau Population Estimates | ||
API](https://www.census.gov/data/developers/data-sets/popest-popproj.html) | ||
via the [tidycensus](https://github.com/walkerke/tidycensus) R | ||
package. Population estimates for 2019 were filled forward for 2020. | ||
|
||
Note: This product uses the Census Bureau Data API but is not | ||
endorsed or certified by the Census Bureau. | ||
|
||
# Excess Deaths Model | ||
|
||
The excess deaths model used began as an adaptation of *The Economist*’s | ||
[excess mortality | ||
model](https://github.com/TheEconomist/covid-19-excess-deaths-tracker), | ||
but has since diverged. | ||
|
||
Given the large number of counties in the United States (over 3000), a | ||
linear mixed model with county as a random grouping factor was used to | ||
make estimation tractable. This random grouping factor enables every | ||
county to have its own intercept in the final model. | ||
|
||
More precisely, the model regresses **total deaths per day** on: | ||
|
||
- county population (z-scored) | ||
|
||
- years since 2015 | ||
|
||
- quarter of the year (fixed grouping factor) | ||
|
||
- county (random grouping factor) | ||
|
||
This model can be expressed by the equation: | ||
|
||
# Model Performance | ||
|
||
Because the COVID-19 pandemic only began partway through March, 2020, we | ||
can evaluate model performance by examining concordance of predicted and | ||
observed deaths in Q1 2020. | ||
|
||
Observed and expected deaths in Q1 2020 are highly correlated, r = | ||
0.998. As can be seen from the following scatterplot, total deaths | ||
(x-axis) tended to exceed expected deaths (y-axis). Unsurprisingly, some | ||
of the counties that were hardest hit early in the pandemic are among | ||
those with total deaths that most diverge from expected deaths. | ||
|
||
<!-- --> | ||
|
||
# Repository Organization | ||
|
||
- Excess deaths estimates and fitted LMM are available in `/results/`. | ||
|
||
- Code to fit the model and estimate excess deaths are available in | ||
`/code/`. | ||
|
||
- Data used in modeling are available in `/data/`, when possible. | ||
|
||
# License | ||
|
||
The code contained in this repository are available under the [MIT | ||
License](https://opensource.org/licenses/MIT), and the data generated by | ||
this code are licensed under the [Creative Commons Attribution 4.0 | ||
International License](https://creativecommons.org/licenses/by/4.0/). | ||
|
||
## R version 4.0.4 (2021-02-15) | ||
## Platform: x86_64-w64-mingw32/x64 (64-bit) | ||
## Running under: Windows 10 x64 (build 19042) | ||
## | ||
## Matrix products: default | ||
## | ||
## locale: | ||
## [1] LC_COLLATE=English_United States.1252 | ||
## [2] LC_CTYPE=English_United States.1252 | ||
## [3] LC_MONETARY=English_United States.1252 | ||
## [4] LC_NUMERIC=C | ||
## [5] LC_TIME=English_United States.1252 | ||
## | ||
## attached base packages: | ||
## [1] stats graphics grDevices utils datasets methods base | ||
## | ||
## other attached packages: | ||
## [1] tinytex_0.34 knitr_1.31 equatiomatic_0.3.0 lme4_1.1-26 | ||
## [5] Matrix_1.3-2 aweek_1.0.2 lubridate_1.7.10 data.table_1.14.0 | ||
## [9] tidycensus_0.11.4 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.5 | ||
## [13] purrr_0.3.4 readr_2.0.2 tidyr_1.1.3 tibble_3.1.0 | ||
## [17] ggplot2_3.3.5 tidyverse_1.3.1 | ||
## | ||
## loaded via a namespace (and not attached): | ||
## [1] rdocsyntax_0.4.1.9000 minqa_1.2.4 colorspace_2.0-0 | ||
## [4] ellipsis_0.3.1 class_7.3-18 rgdal_1.5-23 | ||
## [7] rprojroot_2.0.2 base64enc_0.1-3 fs_1.5.0 | ||
## [10] rstudioapi_0.13 proxy_0.4-25 farver_2.1.0 | ||
## [13] texPreview_1.5 ggrepel_0.9.1 fansi_0.4.2 | ||
## [16] xml2_1.3.2 splines_4.0.4 cachem_1.0.4 | ||
## [19] jsonlite_1.7.2 nloptr_1.2.2.2 broom_0.7.6 | ||
## [22] dbplyr_2.1.1 png_0.1-7 broom.mixed_0.2.6 | ||
## [25] shiny_1.6.0 clipr_0.7.1 compiler_4.0.4 | ||
## [28] httr_1.4.2 backports_1.2.1 assertthat_0.2.1 | ||
## [31] fastmap_1.1.0 svgPanZoom_0.3.4 cli_2.5.0 | ||
## [34] later_1.2.0 htmltools_0.5.1.1 tools_4.0.4 | ||
## [37] coda_0.19-4 gtable_0.3.0 glue_1.4.2 | ||
## [40] reshape2_1.4.4 rappdirs_0.3.3 V8_3.4.0 | ||
## [43] Rcpp_1.0.7 cellranger_1.1.0 pdftools_2.3.1 | ||
## [46] vctrs_0.3.7 nlme_3.1-152 tigris_1.5 | ||
## [49] xfun_0.26 rvest_1.0.0 mime_0.10 | ||
## [52] lifecycle_1.0.0 statmod_1.4.35 MASS_7.3-53 | ||
## [55] scales_1.1.1 hms_1.0.0 promises_1.2.0.1 | ||
## [58] TMB_1.7.20 rematch2_2.1.2 yaml_2.2.1 | ||
## [61] curl_4.3 memoise_2.0.0 stringi_1.5.3 | ||
## [64] highr_0.8 desc_1.3.0 maptools_1.1-1 | ||
## [67] e1071_1.7-6 boot_1.3-26 rlang_0.4.10 | ||
## [70] pkgconfig_2.0.3 evaluate_0.14 lattice_0.20-41 | ||
## [73] sf_1.0-2 labeling_0.4.2 tidyselect_1.1.0 | ||
## [76] here_1.0.1 plyr_1.8.6 magrittr_2.0.1 | ||
## [79] R6_2.5.0 magick_2.7.1 generics_0.1.0 | ||
## [82] DBI_1.1.1 mgcv_1.8-33 whisker_0.4 | ||
## [85] pillar_1.6.0 haven_2.3.1 foreign_0.8-81 | ||
## [88] withr_2.4.2 units_0.7-1 sp_1.4-5 | ||
## [91] modelr_0.1.8 crayon_1.4.1 uuid_0.1-4 | ||
## [94] KernSmooth_2.23-18 utf8_1.2.1 tzdb_0.1.2 | ||
## [97] rmarkdown_2.7 grid_4.0.4 readxl_1.3.1 | ||
## [100] qpdf_1.1 reprex_2.0.0 digest_0.6.27 | ||
## [103] classInt_0.4-3 xtable_1.8-4 httpuv_1.6.0 | ||
## [106] details_0.2.1 munsell_0.5.0 askpass_1.1 |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,219 @@ | ||
# This script conforms historical and recent US mortality data | ||
# Estimates 2020 expected and excess mortality by county and quarter | ||
|
||
# Load libraries and functions #### | ||
library(tidyverse) | ||
library(tidycensus) | ||
library(data.table) | ||
library(lubridate) | ||
library(aweek) | ||
library(lme4) | ||
options(scipen=999) | ||
|
||
source( | ||
file.path(here::here(), "code/estimate_excess_deaths.R") | ||
) | ||
|
||
# Import data #### | ||
|
||
# NOTE: Obtaining US census data via tidycensus requires a Census API key | ||
# Obtain a key here http://api.census.gov/data/key_signup.html | ||
# Store key in .Renviron with: census_api_key("YOUR KEY", install = TRUE) | ||
|
||
# Import historical county-monthly data downloaded from CDC WONDER | ||
# https://wonder.cdc.gov/ucd-icd10.html | ||
# NOTE: County-quarters with < 10 deaths are censored in these data | ||
united_states_county_monthly_total_deaths <- list.files( | ||
file.path(here::here(), "data"), | ||
pattern = "*.txt", | ||
full.names = TRUE | ||
) %>% | ||
map_dfr( | ||
~ data.table::fread( | ||
.x, | ||
na.strings = c("Missing", "Suppressed", "Not Applicable"), | ||
keepLeadingZeros = TRUE, | ||
colClasses = c("character") | ||
) | ||
) | ||
|
||
# Import NCHS Provisional COVID-19 Deaths by Quarter, County and Age for 2020 data | ||
# https://data.cdc.gov/NCHS/AH-Provisional-COVID-19-Deaths-by-Quarter-County-a/ypxr-mz8e | ||
# NOTE: County-quarter-agebands with < 10 deaths are censored in these data | ||
united_states_county_quarterly_covid_deaths <- data.table::fread( | ||
"https://data.cdc.gov/api/views/ypxr-mz8e/rows.csv", | ||
keepLeadingZeros = TRUE | ||
) | ||
|
||
# If the above CDC data no longer exists, retrieve the local version instead | ||
if(!exists("united_states_county_quarterly_covid_deaths")){ | ||
united_states_county_quarterly_covid_deaths <- data.table::fread( | ||
file.path( | ||
here::here(), | ||
"data/united_states_county_quarterly_covid_deaths.csv" | ||
), | ||
keepLeadingZeros = TRUE | ||
) | ||
} | ||
|
||
# Import yearly county population estimates from US census | ||
# https://www.census.gov/programs-surveys/popest.html | ||
united_states_county_yearly_population <- tidycensus::get_estimates( | ||
geography = "county", | ||
product = "population", | ||
year = 2019, | ||
time_series = TRUE, | ||
key = Sys.getenv("CENSUS_API_KEY") | ||
) | ||
|
||
# Import FIPS code details table | ||
data(fips_codes) | ||
|
||
states_50 <- (fips_codes %>% distinct(state, state_code))[1:51,] | ||
|
||
county_names <- fips_codes %>% | ||
transmute( | ||
region_code = paste0(state_code, county_code), | ||
region = paste(county, state, sep = ", "), | ||
state | ||
) | ||
|
||
# Format and conform data #### | ||
|
||
# Format yearly county population data | ||
# Conform county FIPS codes to match other datasets | ||
# For details on county FIPS code changes, see: | ||
# https://www.ddorn.net/data/FIPS_County_Code_Changes.pdf | ||
# For details on mapping DATE to years see: | ||
# https://www.census.gov/data/developers/data-sets/popest-popproj/popest/popest-vars/2019.html | ||
united_states_county_yearly_population_formatted <- united_states_county_yearly_population %>% | ||
filter( | ||
variable == "POP", | ||
DATE >= 3 | ||
) %>% | ||
transmute( | ||
country = "United States", | ||
region_code = case_when( | ||
GEOID == "02158" ~ "02270", | ||
GEOID == "46102" ~ "46113", | ||
TRUE ~ GEOID), | ||
year = as.integer(DATE) + 2007L, | ||
population = as.integer(value) | ||
) | ||
|
||
# pretend 2020 pop is the same as 2019 | ||
united_states_county_yearly_population_complete <- united_states_county_yearly_population_formatted %>% | ||
filter(year == 2019L) %>% | ||
mutate(year = 2020L) %>% | ||
bind_rows(united_states_county_yearly_population_formatted) | ||
|
||
# Summarize historical county-month deaths by county-quarter | ||
# NOTE: Missing and censored values are treated as 0s | ||
united_states_county_quarterly_total_deaths <- united_states_county_monthly_total_deaths %>% | ||
transmute( | ||
country = "United States", | ||
region_code = `County Code`, | ||
year = as.integer(Year), | ||
quarter = quarter(ym(`Month Code`)), | ||
start_date = yq(paste(year, quarter, sep = "-")), | ||
end_date = ceiling_date(start_date + months(2), unit="month") - 1, | ||
days = as.integer(end_date - start_date + 1), | ||
total_deaths = as.integer(Deaths), | ||
total_deaths = replace_na(total_deaths, 0L) | ||
) %>% | ||
group_by( | ||
country, region_code, year, quarter, start_date, end_date, days | ||
) %>% | ||
summarise( | ||
total_deaths = sum(total_deaths, na.rm = TRUE), | ||
.groups = "drop" | ||
) %>% | ||
dplyr::select(country, region_code, start_date, end_date, days, year, | ||
quarter, total_deaths) | ||
|
||
# Summarize 2020 county-quarter-ageband total and COVID deaths by county-quarter | ||
# NOTE: Missing and censored values are treated as 0s | ||
united_states_county_quarterly_covid_deaths_clean <- united_states_county_quarterly_covid_deaths %>% | ||
filter(`Age Group` %in% c("65 years and over", "<65 years")) %>% | ||
transmute( | ||
country = "United States", | ||
region_code = `FIPS Code`, | ||
year = Year, | ||
quarter = Quarter, | ||
start_date = mdy(`Start Date`), | ||
end_date = mdy(`End Date`), | ||
days = as.integer(end_date - start_date + 1L), | ||
total_deaths = as.integer(`Total Deaths`), | ||
covid_deaths = as.integer(`COVID-19 Deaths`), | ||
across(c(total_deaths, covid_deaths), replace_na, 0L) | ||
) %>% | ||
group_by(country, region_code, year, quarter, start_date, end_date, | ||
days) %>% | ||
summarise( | ||
across(c(total_deaths, covid_deaths), sum, na.rm = TRUE), | ||
.groups = "drop" | ||
) | ||
|
||
# Union historical county-quarterly total deaths and | ||
# 2020 county-quarterly total and covid deaths | ||
# Limits data to 50 states + Washington DC | ||
# Consider omitting counties that had zero (or completely censored) deaths in all quarters | ||
united_states_county_quarterly_deaths <- united_states_county_quarterly_total_deaths %>% | ||
bind_rows(united_states_county_quarterly_covid_deaths_clean) %>% | ||
mutate(state_code = str_sub(region_code, end = 2)) %>% | ||
semi_join(states_50, by = "state_code") %>% | ||
left_join( | ||
united_states_county_yearly_population_complete, | ||
by = c("country", "region_code", "year") | ||
) %>% | ||
left_join( | ||
county_names, | ||
by = "region_code" | ||
) %>% | ||
mutate( | ||
expected_deaths = "TBC", # To be calculated | ||
quarter = as_factor(quarter), | ||
population_z = (population - mean(population, na.rm = TRUE)) / | ||
sd(population, na.rm = (TRUE)) | ||
) %>% | ||
# group_by(region) %>% | ||
# mutate(total_deaths_max = max(total_deaths)) %>% | ||
# ungroup() %>% | ||
# filter(total_deaths_max != 0) %>% | ||
# ungroup() %>% | ||
dplyr::select(country, state, region, region_code, start_date, end_date, days, | ||
year, quarter, population, population_z, total_deaths, | ||
covid_deaths, expected_deaths) | ||
|
||
# Estimate expected and excess deaths for 2020 #### | ||
|
||
united_states_county_quarterly_results <- estimate_excess_deaths( | ||
df = united_states_county_quarterly_deaths, | ||
period = "quarter", | ||
calculate = TRUE, | ||
train_model = TRUE | ||
) | ||
|
||
saveRDS( | ||
united_states_county_quarterly_results[[1]], | ||
file.path( | ||
here::here(), | ||
"results/united_states_county_quarterly_model.RDS" | ||
) | ||
) | ||
|
||
data.table::fwrite( | ||
united_states_county_quarterly_results[[2]], | ||
file.path( | ||
here::here(), | ||
"results/united_states_county_quarterly_excess_deaths_estimates.csv" | ||
) | ||
) | ||
|
||
data.table::fwrite( | ||
united_states_county_quarterly_covid_deaths, | ||
file.path( | ||
here::here(), | ||
"data/united_states_county_quarterly_covid_deaths.csv" | ||
) | ||
) |
Oops, something went wrong.