missForest is a nonparametric imputation method for mixed-type tabular data in R. It handles numeric and categorical variables simultaneously by iteratively training random forests to predict missing entries from the observed ones. It makes no explicit distributional assumptions and uses no matrix factorization; the random forests serve as strong predictive baselines that usually work well with default settings.
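The iterative idea described above can be sketched in a few lines of base R. This is only an illustration under loud assumptions: `iterative_impute` is a hypothetical helper, `lm()` stands in for the random forests so the sketch needs no extra packages, it handles numeric columns only, and the convergence check is a simplification that is not necessarily the rule the package uses.

```r
# Simplified illustration of iterative, model-based imputation.
# Assumption: lm() is a stand-in for the random forests that missForest trains.
iterative_impute <- function(x, maxiter = 10, tol = 1e-6) {
  stopifnot(is.data.frame(x), all(vapply(x, is.numeric, logical(1))))
  miss <- is.na(x)                      # remember which cells were missing
  ximp <- x

  # Initial guess: fill each column with the mean of its observed values
  for (j in seq_along(ximp)) {
    ximp[[j]][miss[, j]] <- mean(x[[j]], na.rm = TRUE)
  }

  # Visit columns in order of increasing missingness
  ord <- order(colSums(miss))

  for (iter in seq_len(maxiter)) {
    xold <- ximp
    for (j in ord) {
      if (!any(miss[, j])) next
      obs <- !miss[, j]
      # Fit on rows where column j was observed, using all other columns
      f <- reformulate(names(ximp)[-j], response = names(ximp)[j])
      fit <- lm(f, data = ximp[obs, , drop = FALSE])
      # Predict the cells that were missing in column j
      ximp[[j]][!obs] <- predict(fit, newdata = ximp[!obs, , drop = FALSE])
    }
    # Simplified stopping rule: relative change between successive imputations
    change <- sum((as.matrix(ximp) - as.matrix(xold))^2) / sum(as.matrix(ximp)^2)
    if (change < tol) break
  }
  ximp
}

# Usage on the numeric part of iris with hand-made missing values
set.seed(1)
x_mis <- iris[, 1:4]
x_mis[sample(nrow(x_mis), 20), "Sepal.Length"] <- NA
x_mis[sample(nrow(x_mis), 20), "Petal.Width"] <- NA
x_imp <- iterative_impute(x_mis)
summary(x_imp)
```

Only the cells that were missing are ever overwritten; observed values pass through unchanged, which is the property that matters for any imputation of this kind.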
Two backends are supported: ranger (default) and randomForest (legacy/compat).

The package also includes utilities to measure imputation error, generate missingness for experiments, and inspect variable types.
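To make the error and missingness utilities concrete, here is a base-R sketch that injects MCAR missingness, fills the gaps with a deliberately naive mean/mode imputation, and computes the two error measures by hand. The formulas are an assumption based on the usual definitions: NRMSE (normalized root mean squared error) divides the mean squared error by the variance of the true values, and PFC is the proportion of falsely classified categorical entries, both evaluated only on the cells that were removed. All variable names are illustrative.

```r
set.seed(42)
x_true <- iris
n <- nrow(x_true)
p <- ncol(x_true)

# Inject 20% MCAR missingness: every cell is equally likely to be removed
mask <- matrix(FALSE, n, p)
mask[sample(n * p, size = floor(0.2 * n * p))] <- TRUE
x_mis <- x_true
x_mis[mask] <- NA

# Naive baseline imputation: column mean (numeric) or most frequent level (factor)
x_imp <- x_mis
for (j in seq_len(p)) {
  na_j <- is.na(x_mis[[j]])
  if (is.numeric(x_mis[[j]])) {
    x_imp[[j]][na_j] <- mean(x_mis[[j]], na.rm = TRUE)
  } else {
    x_imp[[j]][na_j] <- names(which.max(table(x_mis[[j]])))
  }
}

# Error measures, restricted to the cells that were artificially removed
num <- vapply(x_true, is.numeric, logical(1))

num_true <- as.matrix(x_true[, num, drop = FALSE])[mask[, num, drop = FALSE]]
num_imp  <- as.matrix(x_imp[, num, drop = FALSE])[mask[, num, drop = FALSE]]
nrmse_hand <- sqrt(mean((num_imp - num_true)^2) / var(num_true))

fac_true <- as.matrix(x_true[, !num, drop = FALSE])[mask[, !num, drop = FALSE]]
fac_imp  <- as.matrix(x_imp[, !num, drop = FALSE])[mask[, !num, drop = FALSE]]
pfc_hand <- mean(fac_true != fac_imp)

c(NRMSE = nrmse_hand, PFC = pfc_hand)
```

A baseline like this gives a reference point: an imputation method is only worth its runtime if it clearly beats the mean/mode numbers on the same mask. With the package installed, the hand-computed values can be compared against `mixError(x_imp, x_mis, x_true)`.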
# CRAN (recommended)
install.packages("missForest")
# Development version (from GitHub)
# install.packages("remotes")
remotes::install_github("stekhoven/missForest")

library(missForest)
# Example data
data(iris)
# Introduce ~20% MCAR missingness
set.seed(81)
iris_mis <- prodNA(iris, noNA = 0.20)
# Impute with default backend (ranger)
imp <- missForest(iris_mis, xtrue = iris, verbose = TRUE)
# Imputed data
head(imp$ximp)
# Estimated OOB errors (NRMSE for numeric, PFC for factors)
imp$OOBerror
# True error if xtrue was provided (for benchmarking only)
imp$error

# Legacy behavior using randomForest
imp_rf <- missForest(iris_mis, backend = "randomForest")
# Explicitly use ranger with limited threads
imp_rg <- missForest(iris_mis, backend = "ranger", num.threads = 2)

Two modes are available via parallelize:
"variables": build forests for different variables in parallel (register a foreach backend).
"forests": parallelize within a single variable’s forest (ranger threads; or foreach sub-forests for randomForest).

# Not run:
# library(doParallel)
# registerDoParallel(2)
# imp_vars <- missForest(iris_mis, parallelize = "variables", verbose = TRUE)
# imp_fors <- missForest(iris_mis, parallelize = "forests", verbose = TRUE, num.threads = 2)

missForest(xmis, ...)
Core imputation function.
Key arguments:
xmis: data frame/matrix with missing values (columns must be numeric or factor).
maxiter: maximum iterations (default 10).
ntree: trees per forest (default 100).
mtry: variables tried at each split (default sqrt(p)).
nodesize: length-2 numeric giving the minimum node size for c(numeric, factor). Default c(5, 1).
variablewise: return per-variable OOB error if TRUE.
parallelize: "no", "variables", or "forests".
num.threads: threads for ranger (ignored by randomForest).
backend: "ranger" (default) or "randomForest".
xtrue: optional complete data for benchmarking (adds $error).

Some argument mappings for backend = "ranger":
ntree → num.trees
nodesize → min.bucket (separately for regression/classification; default c(5,1))
sampsize (counts) → sample.fraction (fractions; overall or per-class)
classwt → class.weights
cutoff: handled by fitting probability forests and post-thresholding

mixError(ximp, xmis, xtrue): computes NRMSE (numeric) and PFC (factor) over the truly missing entries.
nrmse(ximp, xmis, xtrue): NRMSE for continuous-only data.
prodNA(x, noNA = 0.1): injects MCAR missingness into a data frame.
varClass(x): returns "numeric"/"factor" per column.

Convert character columns to factors before calling missForest.
For wide data, consider parallelize = "variables".
For deep/expensive trees, consider parallelize = "forests".
Set a seed for quasi-reproducible results:
set.seed(123); imp <- missForest(x)
You can lower ntree during prototyping to speed up iteration.
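The tip above about converting character columns to factors can be handled in base R before the data reaches missForest. A minimal sketch, using a made-up data frame `df` whose column names are purely illustrative:

```r
# Toy data with one numeric and one character column, each with a missing value
df <- data.frame(
  size   = c(1.5, NA, 3.2, 4.1),
  colour = c("red", NA, "blue", "red"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

str(df)
# df now contains only numeric and factor columns, as missForest expects
```

Missing values survive the conversion as NA, so the result can be passed on for imputation without further preparation.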
If you use missForest, please cite:
You can also cite the package:
citation("missForest")

Issues and pull requests are welcome. Please include a minimal reproducible example when reporting bugs. For performance discussions, share small benchmarks and session info.
GPL (≥ 2)
Daniel J. Stekhoven — stekhoven@stat.math.ethz.ch