Missing Data, Data Imputation

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation (Wiki)

Overview
- A free book on data imputation by the author of mice package:
  Flexible Imputation of Missing Data (2018) Stef van Buuren
- Missing-data imputation (Ch. 25) of Data Analysis Using Regression and Multilevel/Hierarchical Models (2006) Andrew Gelman, Jennifer Hill
- Missing Data: Our View of the State of the Art (2002) Joseph L. Schafer, John W. Graham
- A Review of Methods for Missing Data (2001) Therese D. Pigott

Types of missing data (Wiki)

Missing completely at random (MCAR)
Missing at random (MAR)
Missingness that depends on unobserved variables
Missingness that depends on the missing value itself

Discarding data

Listwise deletion Complete-case analysis
- Samples (rows) are removed from a dataset if they have missing values. Probably the simplest and most popular approach. Often done automatically by many ML packages
- When a large number of variables have missing values, the number of samples left after deletion can be too small
- May lead to biased estimates. The smaller sample size also increases standard errors
Available-case analysis Complete-variables analysis
- Excluding variables from data if their missing-values rate is higher than some threshold
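
The two discarding strategies above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy `rows` data, the function names, and the 0.5 threshold are all assumptions made for the example, and `None` stands in for a missing value.

```python
# Toy dataset (hypothetical): each row is a sample, None marks a missing value.
rows = [
    {"age": 25, "income": 50000, "score": 0.7},
    {"age": None, "income": 62000, "score": None},
    {"age": 41, "income": None, "score": None},
    {"age": 33, "income": 48000, "score": None},
]

def listwise_delete(rows):
    """Keep only rows in which every variable is observed."""
    return [r for r in rows if all(v is not None for v in r.values())]

def missing_rate(rows, column):
    """Fraction of rows in which `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

def drop_sparse_columns(rows, threshold):
    """Drop every column whose missing-value rate exceeds `threshold`."""
    keep = [c for c in rows[0] if missing_rate(rows, c) <= threshold]
    return [{c: r[c] for c in keep} for r in rows]

complete_cases = listwise_delete(rows)               # only the first row survives
reduced = drop_sparse_columns(rows, threshold=0.5)   # "score" is 75% missing and is dropped
```

Note how listwise deletion keeps a single row out of four here, which is the sample-size problem described above.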

Imputation (Wiki)

Whenever a single imputation strategy is used, the standard errors of estimates tend to be too low. The intuition here is that we have substantial uncertainty about the missing values, but by choosing a single imputation we in essence pretend that we know the true value with certainty (Data Analysis Using Regression and Multilevel/Hierarchical Models)

Mean/Mode Replacement
- Replaces missing values with the variable's (column's) mean, mode or median
- Distorts the probability distribution of the imputed variable
- Distorts relationships between variables
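
A small sketch of mean replacement, assuming `None` marks a missing value and using made-up `heights` data. It also compares the variance before and after imputation to illustrate the distortion mentioned above: every imputed point sits exactly at the mean, so the spread shrinks.

```python
from statistics import mean, pvariance

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

heights = [150.0, None, 170.0, 180.0, None, 160.0]  # hypothetical data
imputed = impute_mean(heights)

# Variance of the observed values versus variance after imputation.
observed_var = pvariance([v for v in heights if v is not None])
imputed_var = pvariance(imputed)
```

Here `imputed_var` is smaller than `observed_var`, even though no new information was added.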
LOCF Last Observation Carried Forward imputation (Wiki)
- In time-series data the last observed value before a missing one is "carried forward" to fill in the blank points
- Does analysis using “last observation carried forward” introduce bias in dementia research? Frank J. Molnar, Brian Hutton, Dean Fergusson
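
LOCF can be sketched as a single pass over the series. The function name and the `readings` data are illustrative assumptions; leading gaps are left as `None` because there is no earlier observation to carry forward.

```python
def locf(series):
    """Fill each None with the most recent observed value.

    Leading gaps stay None because there is nothing to carry forward.
    """
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

readings = [None, 3.1, None, None, 2.8, None]  # hypothetical time series
filled = locf(readings)
```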
Indicator variables
- Extra category that indicates missingness of a variable
- Extra binary indicator variable used together with a variable that includes missing data (works with continuous data too)
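
Both variants above can be sketched as follows. The function names, the `fill_value` default, and the sample data are assumptions for illustration; the placeholder used to fill a continuous variable is arbitrary, since the indicator lets a downstream model tell filled values apart from observed ones.

```python
def add_missing_indicator(values, fill_value=0.0):
    """Return (filled, indicator); indicator[i] is 1 where values[i] was missing."""
    indicator = [int(v is None) for v in values]
    filled = [fill_value if v is None else v for v in values]
    return filled, indicator

def add_missing_category(labels, name="missing"):
    """Treat missingness of a categorical variable as its own category."""
    return [name if v is None else v for v in labels]

income = [52.0, None, 61.5, None]  # hypothetical continuous variable
income_filled, income_missing = add_missing_indicator(income)

colour = add_missing_category(["red", None, "blue"])  # hypothetical categorical variable
```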
Regression Imputation
- A regression model is fitted to predict the variable with missing values from the other variables, then used to fill in the missing points
- Deterministic regression imputation uses the original prediction of the regression model to impute missing values
- Stochastic regression imputation adds random error
- Article about the regression imputation method with examples:
  Regression Imputation (Stochastic vs. Deterministic & R Example) Statistics Globe
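
The difference between the deterministic and stochastic variants can be sketched with a one-predictor least-squares fit. This is a simplified illustration under stated assumptions: a single fully observed predictor `xs`, Gaussian noise with the residual standard deviation, and hypothetical function names and data.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b, residual_sd)."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sd = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def regression_impute(xs, ys, stochastic=False, rng=None):
    """Impute None entries of ys from xs using a line fitted on the complete pairs."""
    if rng is None:
        rng = random.Random(0)
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    out = []
    for x, y in zip(xs, ys):
        if y is None:
            y = a + b * x                # deterministic prediction
            if stochastic:
                y += rng.gauss(0.0, sd)  # add random error around the prediction
        out.append(y)
    return out

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]        # hypothetical predictor, fully observed
ys = [2.1, 3.9, None, 8.2, None, 12.1]     # hypothetical target with gaps
deterministic = regression_impute(xs, ys)
stochastic = regression_impute(xs, ys, stochastic=True, rng=random.Random(42))
```

The deterministic values lie exactly on the fitted line, which understates the variability of the imputed variable; the stochastic version scatters them around the line.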
Iterative Regression Imputation
- When multiple variables have missing values, IRI imputes them iteratively: using non-missing variables first to impute the first missing variable, then using the imputed variable together with non-missing predictors to predict missing values of the second one, etc
SRMI Sequential Regression Multivariate Imputation
- A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models (2001) Trivellore E. Raghunathan, James M. Lepkowski, John Van Hoewyk, Peter Solenberger
Cold-deck Imputation
- Impute missing data using previously collected datasets
Hot-deck Imputation
- For each sample with a missing value, find a similar complete-case sample in the same dataset and use it for imputation
- Uses a scoring function to measure similarity
- A Review of Hot Deck Imputation for Survey Non-response (2010) Rebecca R. Andridge, Roderick J. A. Little
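
A minimal hot-deck sketch, assuming the donor is simply the row that minimises a user-supplied distance. The `score` callable, `age_distance`, and the `survey` data are hypothetical; practical hot-deck procedures often pick a donor at random from a pool of similar rows instead of always taking the closest one.

```python
def hot_deck_impute(rows, target, score):
    """Fill missing `target` values from the most similar donor row.

    `score(a, b)` is a hypothetical distance function: lower means more similar.
    """
    donors = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        if row[target] is None:
            donor = min(donors, key=lambda d: score(row, d))
            row = {**row, target: donor[target]}  # copy, so the input is not mutated
        result.append(row)
    return result

def age_distance(a, b):
    """Toy scoring function: absolute difference in age."""
    return abs(a["age"] - b["age"])

survey = [
    {"age": 23, "income": 31000},
    {"age": 45, "income": 58000},
    {"age": 47, "income": None},
    {"age": 25, "income": None},
]
completed = hot_deck_impute(survey, "income", age_distance)
```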
KNN k-Nearest Neighbors
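
As a rough sketch of KNN imputation, the missing value can be replaced by the average of the target over the k closest rows in which it is observed. The function name, the Euclidean distance over fully observed `features`, and the toy `data` are assumptions for this example; in practice features are usually scaled first so that no single column dominates the distance.

```python
from statistics import mean

def knn_impute(rows, target, features, k=2):
    """Impute missing `target` with the mean target of the k nearest donor rows."""
    donors = [r for r in rows if r[target] is not None]

    def distance(a, b):
        # Euclidean distance over the (fully observed) feature columns.
        return sum((a[f] - b[f]) ** 2 for f in features) ** 0.5

    result = []
    for row in rows:
        if row[target] is None:
            nearest = sorted(donors, key=lambda d: distance(row, d))[:k]
            row = {**row, target: mean(d[target] for d in nearest)}
        result.append(row)
    return result

data = [
    {"x1": 1.0, "x2": 1.0, "y": 10.0},
    {"x1": 1.2, "x2": 0.9, "y": 12.0},
    {"x1": 8.0, "x2": 9.0, "y": 50.0},
    {"x1": 1.1, "x2": 1.0, "y": None},
]
filled = knn_impute(data, "y", ["x1", "x2"], k=2)
```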
Model-based Imputation
- When something is known about why missing data exist, it's possible to directly model the missingness
Multiple Imputation (Wiki)
- Imputed values are drawn multiple times from some distribution, producing several completed datasets. Each completed dataset is analysed separately, and the results are pooled so that the final estimates reflect the uncertainty about the missing values
- Multiple Imputation for Nonresponse in Surveys (1987) Donald B. Rubin
- Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation (2001) Gary King, James Honaker, Anne Joseph, Kenneth Scheve
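
The draw, analyse, and pool cycle can be sketched for the simplest possible analysis, estimating a mean. This is a deliberately simplified illustration: imputations are drawn from a normal distribution fitted to the observed values, and the fitted parameters are treated as fixed, which a proper multiple-imputation procedure would not do. The pooling step uses Rubin's rules (total variance = average within-imputation variance + (1 + 1/m) × between-imputation variance). Function and variable names and the data are assumptions.

```python
import random
from statistics import mean, stdev, variance

def multiply_impute_mean(values, m=20, seed=0):
    """Estimate a mean from data with gaps using m imputed datasets.

    Returns (pooled_estimate, total_variance, average_within_variance).
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), stdev(observed)
    n = len(values)

    estimates, within = [], []
    for _ in range(m):
        # One realization of the completed dataset.
        completed = [rng.gauss(mu, sd) if v is None else v for v in values]
        estimates.append(mean(completed))
        within.append(variance(completed) / n)  # squared standard error of the mean

    pooled = mean(estimates)
    u_bar = mean(within)        # average within-imputation variance
    b = variance(estimates)     # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b
    return pooled, total_var, u_bar

values = [4.2, None, 5.1, 6.3, None, 5.8, 4.9, None, 5.5, 6.0]  # hypothetical data
pooled, total_var, u_bar = multiply_impute_mean(values)
```

The between-imputation term makes `total_var` larger than the within-imputation variance alone, which is how multiple imputation avoids the overly small standard errors of single imputation.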
MICE Multivariate Imputation by Chained Equations (Homepage, Code, CRAN)
- A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models (2001) Trivellore E. Raghunathan, James M. Lepkowski, John Van Hoewyk, Peter Solenberger
- Multiple imputation of discrete and continuous data by fully conditional specification (2007) Stef van Buuren
- Multiple imputation by chained equations: what is it and how does it work? (2011) Melissa J. Azur, Elizabeth A. Stuart, Constantine Frangakis, Philip J. Leaf
- mice: Multivariate Imputation by Chained Equations in R (2011) Stef van Buuren, Karin Groothuis-Oudshoorn
MissForest (Code, CRAN)
- MissForest - nonparametric missing value imputation for mixed-type data (2011) Daniel J. Stekhoven, Peter Bühlmann
Optimal Transport (Code)
- Missing Data Imputation using Optimal Transport (2020) Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi
Autoencoders
- Missing data imputation in the electronic health record using deeply learned autoencoders Brett K. Beaulieu-Jones, Jason H. Moore
- Multiple Imputation for Biomedical Data using Monte Carlo Dropout Autoencoders (2020) Kristian Miok, Dong Nguyen-Doan, Marko Robnik-Šikonja, Daniela Zaharie
GAIN Missing data imputation with GANs
- GAIN: Missing Data Imputation using Generative Adversarial Nets (2018) Jinsung Yoon, James Jordon, Mihaela van der Schaar

Timeseries imputation

RNNs
- Modeling Missing Data in Clinical Time Series with RNNs (2016) Zachary C. Lipton, David C. Kale, Randall Wetzel
- Estimating Missing Data in Temporal Data Streams Using Multi-directional Recurrent Neural Networks (2017) Jinsung Yoon, William R. Zame, Mihaela van der Schaar
- BRITS: Bidirectional Recurrent Imputation for Time Series (2018) Wei Cao, Dong Wang, Jian Li, Hao Zhou, Yitan Li, Lei Li
GPs
- GP-VAE: Deep Probabilistic Time Series Imputation (2019) Vincent Fortuin, Dmitry Baranchuk, Gunnar Rätsch, Stephan Mandt

Other methods, packages

MIDAS Multiple Imputation with Denoising Autoencoders (Code, Paper)
Impute.jl (Code)
