Uses the "ranger" package (Wright & Ziegler) to do fast missing value imputation by
chained random forests; see Stekhoven & Buehlmann and Van Buuren & Groothuis-Oudshoorn.
Between the iterative model fits, it offers the option of predictive mean matching.
First, this avoids imputing values that are not present in the original data
(like a value 0.3334 in a 0-1 coded variable).
Second, predictive mean matching tries to raise the variance in the resulting
conditional distributions to a realistic level. This makes it possible to do multiple
imputation by repeating the call to missRanger().
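The multiple-imputation idea can be sketched as follows. This is only an illustration, not a prescribed workflow: the number of imputations (5), the seeds, and the choice of `num.trees = 50` are arbitrary assumptions, and `generateNA()` is the helper from the same package that the Examples section uses to create missing values.

```r
library(missRanger)

# Toy data: iris with values randomly set to NA (generateNA default proportion)
irisWithNA <- generateNA(iris, seed = 34)

# One call per imputed data set, each with its own seed.
# pmm.k > 0 switches on predictive mean matching.
seeds <- 1:5
imputed <- lapply(
  seeds,
  function(s) missRanger(irisWithNA, pmm.k = 3, num.trees = 50, seed = s, verbose = 0)
)

# Each list element is a completed data.frame
length(imputed)
sapply(imputed, anyNA)
```

The resulting list can then be analysed data set by data set and the results pooled with whatever multiple-imputation tooling is in use.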
Usage
missRanger(
  data,
  formula = . ~ .,
  pmm.k = 0L,
  maxiter = 10L,
  seed = NULL,
  verbose = 1,
  returnOOB = FALSE,
  case.weights = NULL,
  data_only = TRUE,
  keep_forests = FALSE,
  ...
)
Arguments
- data
A data.frame with missing values to impute.
- formula
A two-sided formula specifying variables to be imputed (left hand side) and variables used to impute (right hand side). Defaults to . ~ ., i.e., use all variables to impute all variables. For instance, if all variables (with missings) should be imputed by all variables except variable "ID", use . ~ . - ID. Note that a "." is evaluated separately for each side of the formula. Further note that variables with missings must appear in the left hand side if they should be used on the right hand side.
- pmm.k
Number of candidate non-missing values to sample from in the predictive mean matching steps. 0 to avoid this step.
- maxiter
Maximum number of chaining iterations.
- seed
Integer seed to initialize the random generator.
- verbose
Controls how much info is printed to screen. 0 to print nothing, 1 (default) to print a progress bar per iteration, 2 to print the OOB prediction error per iteration and variable (1 minus R-squared for regression). Furthermore, if verbose is positive, the variables used for imputation are listed as well as the variables to be imputed (in the imputation order). This is useful for detecting whether some variables are unexpectedly skipped.
- returnOOB
Logical flag. If TRUE, the final average out-of-bag prediction errors per variable are added to the resulting data as attribute "oob". Only relevant when data_only = TRUE (and when forests are grown).
- case.weights
Vector with non-negative case weights.
- data_only
If TRUE (default), only the imputed data is returned. Otherwise, a "missRanger" object with additional information is returned.
- keep_forests
Should the random forests of the final imputations be returned? The default is FALSE. Setting this option will use a lot of memory. Only relevant when data_only = FALSE (and when forests are grown), because the forests are returned as an element of the "missRanger" object.
- ...
Arguments passed to ranger::ranger(). If the data set is large, it is better to use fewer trees (e.g. num.trees = 20) and/or a low value of sample.fraction. The following arguments are incompatible, amongst others: write.forest, probability, split.select.weights, dependent.variable.name, and classification.
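To make the formula interface more concrete, here is a small sketch on the iris data. The data set, the seed, and `num.trees = 50` are assumptions chosen for illustration; only the formula syntax itself comes from the argument description above.

```r
library(missRanger)

irisWithNA <- generateNA(iris, seed = 34)

# Impute all variables, but do not use Species as a predictor
imp_no_species <- missRanger(
  irisWithNA, formula = . ~ . - Species, num.trees = 50, verbose = 0
)

# Impute only the two sepal measurements; other columns keep their NAs
imp_sepal <- missRanger(
  irisWithNA, formula = Sepal.Length + Sepal.Width ~ ., num.trees = 50, verbose = 0
)

# Number of remaining missing values per column
colSums(is.na(imp_no_species))
colSums(is.na(imp_sepal))
```

Running with a positive verbose value instead of 0 prints which variables end up on each side, which is a quick way to check that a formula does what was intended.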
Value
If data_only is TRUE, an imputed data.frame. Otherwise, a "missRanger" object with
the following elements that can be extracted via $:
- data: The imputed data.
- forests: When keep_forests = TRUE, a list of "ranger" models used to generate the imputed data. NULL otherwise.
- visit_seq: Variables to be imputed (in this order).
- impute_by: Variables used for imputation.
- best_iter: Best iteration.
- pred_errors: Per-iteration OOB prediction errors (1 - R^2 for regression, classification error otherwise).
- mean_pred_errors: Per-iteration averages of OOB prediction errors.
Details
The iterative chaining stops as soon as maxiter is reached or when the average
out-of-bag (OOB) prediction error stops decreasing.
In the latter case, except for the first iteration, the second-to-last (= best)
imputed data is returned.
OOB prediction errors are quantified as 1 - R^2 for numeric variables, and as classification error otherwise. If a variable has been imputed only univariately, the value is 1.
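For orientation, the two kinds of OOB error can be reproduced with ranger directly. This is a sketch on complete data, assuming that the values missRanger() reports correspond to the `r.squared` and `prediction.error` fields of a fitted ranger model; this page does not spell out that mapping.

```r
library(ranger)

set.seed(1)

# Numeric response: ranger stores the OOB R-squared, so the error is 1 - R^2
fit_num <- ranger(Sepal.Length ~ ., data = iris, num.trees = 100)
err_num <- 1 - fit_num$r.squared

# Factor response: prediction.error is the OOB misclassification rate
fit_cat <- ranger(Species ~ ., data = iris, num.trees = 100)
err_cat <- fit_cat$prediction.error

c(numeric = err_num, factor = err_cat)
```

In both cases lower is better, and a value of 1 for a numeric variable means the forest does no better than predicting the mean.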
A note on mtry: Be careful when passing a non-default mtry to ranger::ranger()
because the number of available covariates might be growing during
the first iteration, depending on the missing pattern.
Values NULL (default) and 1 are safe choices.
Additionally, recent versions of ranger::ranger() allow mtry to be a
single-argument function of the number of available covariates,
e.g., mtry = function(m) max(1, m %/% 3).
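A function-valued mtry can be passed through the ... argument like any other ranger option. The call below is a sketch that assumes an installed ranger version recent enough to accept a function for mtry; the data and `num.trees = 50` are illustrative choices.

```r
library(missRanger)

irisWithNA <- generateNA(iris, seed = 34)

# mtry adapts to the number m of covariates available for each forest
imp_mtry <- missRanger(
  irisWithNA,
  mtry = function(m) max(1, m %/% 3),
  num.trees = 50,
  verbose = 0
)

head(imp_mtry)
```

Because the function is evaluated for each forest, it remains valid even if the set of usable covariates changes during the first iteration.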
References
Wright, M. N. & Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, in press. https://arxiv.org/abs/1508.04409.
Stekhoven, D. J. & Buehlmann, P. (2012). MissForest - nonparametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. https://doi.org/10.1093/bioinformatics/btr597.
Van Buuren, S. & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. http://www.jstatsoft.org/v45/i03/.
Examples
irisWithNA <- generateNA(iris, seed = 34)
irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100)
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>
#> iter 1
#> iter 2
#> iter 3
#> iter 4
head(irisImputed)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
head(irisWithNA)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2    <NA>
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
if (FALSE) {
  # Extended output
  imp <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100, data_only = FALSE)
  head(imp$data)
  imp$pred_errors

  # To also keep the random forests of the best iteration
  imp <- missRanger(
    irisWithNA, pmm.k = 3, num.trees = 100, data_only = FALSE, keep_forests = TRUE
  )
  imp$forests$Sepal.Width
  imp$pred_errors[imp$best_iter, "Sepal.Width"]  # 1 - R-squared
}