missForest

missForest is a nonparametric imputation method for mixed-type tabular data in R. It handles numeric and categorical variables simultaneously by iteratively training random forests to predict missing entries from the observed ones. No explicit modeling assumptions, no matrix factorizations—just strong predictive baselines that work well out of the box.

The package also includes utilities to measure imputation error, generate missingness for experiments, and inspect variable types.


Installation

# CRAN (recommended)
install.packages("missForest")

# Development version (from GitHub)
# install.packages("remotes")
remotes::install_github("stekhoven/missForest")

Quick start

library(missForest)

# Example data
data(iris)

# Introduce ~20% MCAR missingness
set.seed(81)
iris_mis <- prodNA(iris, noNA = 0.20)

# Impute with default backend (ranger)
imp <- missForest(iris_mis, xtrue = iris, verbose = TRUE)

# Imputed data
head(imp$ximp)

# Estimated OOB errors (NRMSE for numeric, PFC for factors)
imp$OOBerror

# True error if xtrue was provided (for benchmarking only)
imp$error

Choosing a backend

# Legacy behavior using randomForest
imp_rf <- missForest(iris_mis, backend = "randomForest")

# Explicitly use ranger with limited threads
imp_rg <- missForest(iris_mis, backend = "ranger", num.threads = 2)

Parallelization

Two modes are available via parallelize:

# Not run:
# library(doParallel)
# registerDoParallel(2)
# imp_vars <- missForest(iris_mis, parallelize = "variables", verbose = TRUE)
# imp_fors <- missForest(iris_mis, parallelize = "forests",  verbose = TRUE, num.threads = 2)

API overview

missForest(xmis, ...)

Core imputation function.

Key arguments:

Some argument mappings for backend = "ranger":

Utilities


Tips & best practices


Citation

If you use missForest, please cite:

You can also cite the package:

citation("missForest")

Contributing

Issues and pull requests are welcome. Please include a minimal reproducible example when reporting bugs. For performance discussions, share small benchmarks and session info.


License

GPL (≥ 2)


Contact

Daniel J. Stekhoven — stekhoven@stat.math.ethz.ch