Accurately Estimating Total COVID-19 Infections using Information Theory

This code shows that the parameterization identified by our MDL framework MDLINFER (MDLPARAM) is superior to the baseline parameterization (BASEPARAM) in (A) estimating total infections, (B) projecting future reported infections, and (C) predicting COVID-19 symptomatic rate trends.

We replicate our methods on multiple epidemiological models (SAPHIRE model and SEIR+HD model), multiple regions (Minneapolis, South Florida, Philadelphia, and San Francisco), and multiple periods (spring to summer 2020, and fall 2020) based on the severity of the outbreak and the availability of serological studies and symptomatic surveillance data. To account for the stochasticity in the calibration and simulation, we run the SAPHIRE model with 10000 MCMC runs and report the average. For the SEIR+HD model, we run with 1200 particles and 600 simulations and report the average.

In each region, we divide the timeline into two periods: (i) the observed period, when only the number of reported infections is available, and BASEINFER and MDLINFER are used to learn the baseline parameterization (BASEPARAM) and the MDL parameterization (MDLPARAM) respectively, and (ii) the forecast period, in which we evaluate the forecasts generated by the parameterizations learned in the observed period. To handle time-varying reported rates, we divide the observed period into multiple sub-periods and learn a separate reported rate for each sub-period.
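The timeline split described above can be sketched in a few lines of Python. This is only an illustration: the function names, the dates, the equal-length sub-periods, and the example reported rates are all assumptions made for the sketch, not values or logic taken from this repository.

```python
from datetime import date, timedelta


def split_timeline(start, observed_end, forecast_end, n_subperiods):
    """Split [start, observed_end] into sub-periods and return the forecast period.

    Sub-periods are made as equal in length as possible (an assumption of this
    sketch). Every period is an inclusive (first_day, last_day) pair.
    """
    total_days = (observed_end - start).days + 1
    if n_subperiods < 1 or n_subperiods > total_days:
        raise ValueError("n_subperiods must be between 1 and the number of observed days")
    base, extra = divmod(total_days, n_subperiods)
    sub_periods = []
    first = start
    for i in range(n_subperiods):
        length = base + (1 if i < extra else 0)
        last = first + timedelta(days=length - 1)
        sub_periods.append((first, last))
        first = last + timedelta(days=1)
    forecast_period = (observed_end + timedelta(days=1), forecast_end)
    return sub_periods, forecast_period


def reported_rate_on(day, sub_periods, rates):
    """Look up the reported rate learned for the sub-period containing `day`."""
    for (first, last), rate in zip(sub_periods, rates):
        if first <= day <= last:
            return rate
    raise ValueError("day is outside the observed period")


# Hypothetical dates and rates, for illustration only.
subs, forecast = split_timeline(date(2020, 3, 8), date(2020, 5, 20), date(2020, 6, 20), 3)
rates = [0.10, 0.20, 0.25]
mid_april_rate = reported_rate_on(date(2020, 4, 15), subs, rates)
```

Keeping one rate per sub-period in a plain list makes it easy to swap in whatever sub-period boundaries a given region actually uses.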

This code showcases MDLINFER using two different ODE-based epidemiological models:

- SAPHIRE model: Hao, X. et al. Reconstruction of the full transmission dynamics of COVID-19 in Wuhan. Nature 584, 420–424 (2020). GitHub link:
- SEIR+HD model: Kain, M. P., Childs, M. L., Becker, A. D. & Mordecai, E. A. Chopping the tail: How preventing superspreading can help to maintain COVID-19 control. Epidemics 34, 100430 (2021). GitHub link:


The code runs on Python 3.8.6 and R 4.0.3. We use the following packages in the code. These packages can be installed via pip (for Python) or the install.packages() command (for R). Installing them takes around 1 hour.



Directory structure

- Figure2 -> This folder allows you to reproduce Figure 2 in the main article.
	- Minneapolis-Nature -> Saved SAPHIRE Model Results for Minneapolis.
	- Minneapolis-Mordecai -> Saved SEIR+HD Model Results for Minneapolis.
	- Florida-Nature -> Saved SAPHIRE Model Results for South Florida.
	- Florida-Mordecai -> Saved SEIR+HD Model Results for South Florida.
	- -> Running this code directly will reproduce Figure 2.
- Figure3 -> This folder allows you to reproduce Figure 3 in the main article.
	- -> Running this code directly will reproduce Figure 3.
- Figure4 -> This folder allows you to reproduce Figure 4 in the main article.
	- -> Running this code directly will reproduce Figure 4.
- Figure5 -> This folder allows you to reproduce Figure 5 in the main article.
	- -> Running this code directly will reproduce Figure 5.
- Minnesota
	- Period1
		- Step 1 -> MDLINFER Step 1 code
		- Step 2 -> MDLINFER Step 2 code
	- Period2
	- ...
- South Florida (same layout as Minnesota)
- Demo -> The demo code
- Pseudocode.pdf -> The pseudo-code for MDLINFER


The datasets used in this article are:

New York Times reported infections

This dataset consists of the daily time series of reported COVID-19 infections and deaths (both as cumulative values) for each county in the US from January 21, 2020 to the present. As of Feb 21, 2022, the number of reported infections is 78,434,184 and the number of deaths is 934,659. More details can be found in:
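Because the counts are cumulative, a common preprocessing step is to difference them into daily new counts. The sketch below shows one way to do that; the function name and the made-up numbers are assumptions, and clipping negative differences to zero is a choice made for this sketch rather than something this repository documents.

```python
def cumulative_to_daily(cumulative):
    """Convert a cumulative count series into daily new counts.

    Negative differences (which can appear after retroactive data corrections)
    are clipped to 0. That clipping is an assumption of this sketch.
    """
    daily = []
    previous = 0
    for total in cumulative:
        daily.append(max(total - previous, 0))
        previous = total
    return daily


# Made-up cumulative counts for one county, including one downward correction.
cumulative_cases = [0, 2, 5, 5, 4, 9]
daily_cases = cumulative_to_daily(cumulative_cases)
```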

Serological studies

This dataset consists of point estimates and 95% confidence intervals of the prevalence of antibodies to SARS-CoV-2 in 10 US locations, collected every 3–4 weeks from March to July 2020. For each location, the CDC works with commercial laboratories to collect blood specimens from the population and tests about 1800 specimens every 3–4 weeks. More details can be found in:
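To compare a model's estimate of total infections against a serological study, the prevalence estimates can be scaled by the population of the location. The following is a minimal sketch of that arithmetic; the function names, the population, and the prevalence values are hypothetical, and it ignores effects such as the delay between infection and detectable antibodies.

```python
def infections_from_seroprevalence(population, point, lower, upper):
    """Scale prevalence fractions (between 0 and 1) to infection counts."""
    for value in (point, lower, upper):
        if not 0.0 <= value <= 1.0:
            raise ValueError("prevalence must be a fraction between 0 and 1")
    return {
        "point": round(population * point),
        "lower": round(population * lower),
        "upper": round(population * upper),
    }


def within_interval(estimate, interval):
    """Check whether a total-infections estimate falls inside the scaled 95% CI."""
    return interval["lower"] <= estimate <= interval["upper"]


# Hypothetical location: 1,000,000 residents, 2.4% prevalence (95% CI 1.0% to 4.5%).
sero = infections_from_seroprevalence(1_000_000, 0.024, 0.010, 0.045)
model_estimate_ok = within_interval(30_000, sero)
```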

Symptomatic surveillance

This dataset consists of the point estimate (and standard error) of the COVID-related symptomatic rate for each county in the US from April 6, 2020 to date. The survey asks randomly sampled social media (Facebook) users a series of questions to estimate the percentage of people who, on a given day, have COVID-like symptoms such as fever along with cough, shortness of breath, or difficulty breathing. As of November 2021, the survey receives about 40,000 responses per day on average, and it comprises over 25 million responses overall. More details can be found in:

We integrate the datasets into each testbed, so a copy of the relevant data is included in each folder.

Running MDLINFER code

Step 1 of MDLINFER: Finding the reported rate alpha*

Step 1 of MDLINFER finds a good reported rate alpha*. You can run the step 1 algorithm in the step1 folder. The steps are as follows:

(1) For the SAPHIRE model, set code_root in scripts_main/Run_SEIR_main_analysis.R to the step1 folder.
(2) Run '' to do a linear search for a good reported rate alpha*.
(3) The reported rate alpha* is saved in alpha.txt.
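The linear search in step (2) can be pictured as evaluating a cost for each candidate reported rate and keeping the cheapest one. The sketch below illustrates only that search pattern. The candidate grid, the function names, and in particular `toy_cost` are assumptions: `toy_cost` is a placeholder and not the MDL cost, which in the real pipeline comes from calibrating the epidemiological model for each candidate.

```python
def linear_search_alpha(candidates, cost_fn):
    """Evaluate cost_fn on every candidate rate and return the best one.

    Returns (best_alpha, costs), where costs maps each candidate to its cost.
    """
    best_alpha, best_cost = None, float("inf")
    costs = {}
    for alpha in candidates:
        cost = cost_fn(alpha)
        costs[alpha] = cost
        if cost < best_cost:
            best_alpha, best_cost = alpha, cost
    return best_alpha, costs


def toy_cost(alpha):
    # Placeholder cost with its minimum at 0.3. NOT the MDL cost.
    return (alpha - 0.3) ** 2


# Hypothetical grid of reported rates: 0.05, 0.10, ..., 1.00.
candidates = [i / 20 for i in range(1, 21)]
alpha_star, costs = linear_search_alpha(candidates, toy_cost)
```

Returning the full `costs` mapping alongside the winner makes it possible to inspect how sharply the cost varies around the chosen rate.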

Step 2 of MDLINFER: Finding the total infections D*

Step 2 of MDLINFER finds the total infections D*. You can run the step 2 algorithm in the step2 folder. The steps are as follows:

(1) For the SAPHIRE model, similarly set code_root in scripts_main/Run_SEIR_main_analysis.R to the step2 folder.
(2) Copy the result.csv corresponding to alpha* into the step2 folder to serve as a warm start.
(3) Set the alpha_star as the alpha* found in step 1 in
(4) Run '' to use the Nelder-Mead method to find the D* that minimizes the MDL cost subject to the reported rate constraints.
(5) The total infections D* are saved in D_star.txt.
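To give a feel for step (4), here is a small dependency-free Nelder-Mead search applied to a toy problem. Everything in it is an assumption made for illustration: `toy_cost` is a placeholder rather than the MDL cost, the reported rate constraint is modeled as a simple penalty, and the numbers are made up. The repository's own scripts are the reference for the actual procedure.

```python
def nelder_mead(f, x0, step=10.0, tol=1e-10, max_iter=2000):
    """Minimize f from x0 with a basic Nelder-Mead simplex search."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)
    values = [f(v) for v in simplex]

    for _ in range(max_iter):
        # Sort vertices from best (lowest cost) to worst.
        order = sorted(range(n + 1), key=values.__getitem__)
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if values[-1] - values[0] < tol:
            break
        worst = simplex[-1]
        centroid = [sum(v[j] for v in simplex[:-1]) / n for j in range(n)]

        def along(t):
            # Point on the line through the centroid, away from the worst vertex.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        f_reflected = f(reflected)
        if f_reflected < values[0]:
            expanded = along(2.0)
            f_expanded = f(expanded)
            if f_expanded < f_reflected:
                simplex[-1], values[-1] = expanded, f_expanded
            else:
                simplex[-1], values[-1] = reflected, f_reflected
        elif f_reflected < values[-2]:
            simplex[-1], values[-1] = reflected, f_reflected
        else:
            contracted = along(-0.5)
            f_contracted = f(contracted)
            if f_contracted < values[-1]:
                simplex[-1], values[-1] = contracted, f_contracted
            else:
                # Shrink every vertex halfway toward the best one.
                best = simplex[0]
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, v)] for v in simplex[1:]
                ]
                values = [values[0]] + [f(v) for v in simplex[1:]]

    best_index = min(range(n + 1), key=values.__getitem__)
    return simplex[best_index], values[best_index]


# Toy setup (all values hypothetical): three days of reported infections.
alpha_star = 0.3
reported = [30.0, 60.0, 90.0]


def toy_cost(total_infections):
    """Placeholder cost, NOT the MDL cost, plus a reported rate penalty."""
    total = sum(total_infections)
    if total <= 0:
        return float("inf")
    fit = sum((alpha_star * d - r) ** 2 for d, r in zip(total_infections, reported))
    # Penalize an implied reported rate more than 0.05 away from alpha_star.
    gap = max(0.0, abs(sum(reported) / total - alpha_star) - 0.05)
    return fit + 1e6 * gap ** 2


warm_start = [80.0, 250.0, 280.0]  # stands in for the result.csv warm start
D_star, cost_star = nelder_mead(toy_cost, warm_start)
```

Nelder-Mead needs only cost evaluations, not gradients, which suits a cost that is computed by running a simulation. In practice a library routine such as `scipy.optimize.minimize` with `method="Nelder-Mead"` would be the more usual choice than a hand-written loop.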

Demo Code of MDLINFER

We also provide demo code to run MDLINFER. The steps are as follows:

(1) Set code_root in scripts_main/Run_SEIR_main_analysis.R in both the step1 folder and the step2 folder.
(2) Run the
(3) The total infections D* are saved in D_star.txt.

This demo code is based on the SAPHIRE model. Here, we save the step 1 calibration results obtained for each candidate reported rate alpha. The demo code usually takes 1-2 hours to run.

