Uses the "ranger" package (Wright & Ziegler) to do fast missing value imputation by
chained random forests, see Stekhoven & Buehlmann and Van Buuren & Groothuis-Oudshoorn.
Between the iterative model fits, it optionally applies predictive mean matching (PMM).
First, this avoids imputing values that are not present in the original data
(like a value 0.3334 in a 0-1 coded variable).
Second, predictive mean matching tries to raise the variance in the resulting
conditional distributions to a realistic level. This makes it possible to do multiple
imputation by repeating the call to missRanger().
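The idea behind predictive mean matching can be illustrated with a minimal base R sketch. The helper `pmm_sketch()` and its arguments are hypothetical and are not the implementation used inside missRanger(). It assumes a numeric variable: for each prediction of a missing value, it looks up the k observed cases with the closest predictions and draws one of their observed values.

```r
# Hypothetical helper illustrating predictive mean matching (PMM);
# not the implementation used inside missRanger().
pmm_sketch <- function(pred_obs, y_obs, pred_mis, k = 3L, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  k <- min(k, length(y_obs))
  vapply(pred_mis, function(p) {
    # Indices of the k observed cases whose predictions are closest to p
    nearest <- order(abs(pred_obs - p))[seq_len(k)]
    # Draw one donor at random and use its observed value
    y_obs[nearest[sample.int(k, 1L)]]
  }, FUN.VALUE = y_obs[1L])
}

# A 0-1 coded variable: raw predictions are fractional, imputed values are not
y_obs    <- c(0, 0, 1, 1, 1)
pred_obs <- c(0.1, 0.3, 0.6, 0.8, 0.9)
pred_mis <- c(0.3334, 0.7)
imputed  <- pmm_sketch(pred_obs, y_obs, pred_mis, k = 3L, seed = 1)
print(imputed)
```

Because every imputed value is copied from an observed donor, the result can only contain values that already occur in the data, and the random draw among the k donors is what adds variability between repeated calls.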
Usage
missRanger(
  data,
  formula = . ~ .,
  pmm.k = 0L,
  num.trees = 500,
  mtry = NULL,
  min.node.size = NULL,
  min.bucket = NULL,
  max.depth = NULL,
  replace = TRUE,
  sample.fraction = if (replace) 1 else 0.632,
  case.weights = NULL,
  num.threads = NULL,
  save.memory = FALSE,
  maxiter = 10L,
  seed = NULL,
  verbose = 1,
  returnOOB = FALSE,
  data_only = !keep_forests,
  keep_forests = FALSE,
  ...
)
Arguments
- data
A data.frame with missing values to impute.
- formula
A two-sided formula specifying variables to be imputed (left hand side) and variables used to impute (right hand side). Defaults to . ~ ., i.e., use all variables to impute all variables. For instance, if all variables (with missing values) should be imputed by all variables except variable "ID", use . ~ . - ID. Note that a "." is evaluated separately for each side of the formula. Further note that variables with missing values must appear on the left hand side if they should be used on the right hand side.
- pmm.k
Number of candidate non-missing values to sample from in the predictive mean matching steps. 0 to avoid this step.
- num.trees
Number of trees passed to ranger::ranger().
- mtry
Number of covariates considered per split. The default NULL equals the rounded down root of the number of features. Can be a function, e.g., function(p) trunc(p/3). Passed to ranger::ranger(). Note that during the first iteration, the number of features is growing. Thus, a fixed value can lead to an error. Using a function like function(p) min(p, 2) avoids this problem.
- min.node.size
Minimal node size passed to ranger::ranger(). By default 1 for classification and 5 for regression.
- min.bucket
Minimal terminal node size passed to ranger::ranger(). The default NULL means 1.
- max.depth
Maximal tree depth passed to ranger::ranger(). NULL means unlimited depth. 1 means single split trees.
- replace
Whether to sample with replacement. Passed to ranger::ranger().
- sample.fraction
Fraction of rows per tree passed to ranger::ranger(). The default: use all rows when replace = TRUE and 0.632 otherwise.
- case.weights
Optional case weights passed to ranger::ranger().
- num.threads
Number of threads passed to ranger::ranger(). The default NULL uses all threads.
- save.memory
Slow but memory saving mode of ranger::ranger().
- maxiter
Maximum number of iterations.
- seed
Integer seed.
- verbose
A value in 0, 1, 2 controlling the verbosity.
- returnOOB
Should the final average OOB prediction errors be added as data attribute "oob"? Only relevant when data_only = TRUE.
- data_only
If TRUE (default), only the imputed data is returned. Otherwise, a "missRanger" object with additional information is returned.
- keep_forests
Should the random forests of the last relevant iteration be returned? The default is FALSE. Setting this option will use a lot of memory. Only relevant when data_only = FALSE.
- ...
Additional arguments passed to ranger::ranger(). Not all make sense.
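The note on mtry can be made concrete with a small base R sketch. The name `mtry_fun` is illustrative only; the function itself is the one suggested in the argument description. It shows how a function-valued mtry adapts to the number of available features p, whereas a fixed mtry = 2 would exceed p when only one feature is available.

```r
# Cap mtry at the number of currently available features p
mtry_fun <- function(p) min(p, 2)

# During the first iteration, p grows as more variables become available
p_seq <- 1:5
mtry_values <- sapply(p_seq, mtry_fun)
print(mtry_values)

# The function would then be passed on as missRanger(data, mtry = mtry_fun)
```

With this choice, mtry never exceeds the number of features, which is the situation in which a fixed value can lead to an error.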
Value
If data_only = TRUE, an imputed data.frame. Otherwise, a "missRanger" object with the following elements:
- data: The imputed data.
- data_raw: The original data provided.
- forests: When keep_forests = TRUE, a list of "ranger" models used to generate the imputed data. NULL otherwise.
- to_impute: Variables to be imputed (in this order).
- impute_by: Variables used for imputation.
- best_iter: Best iteration.
- pred_errors: Per-iteration OOB prediction errors (1 - R^2 for regression, classification error otherwise).
- mean_pred_errors: Per-iteration averages of OOB prediction errors.
- pmm.k: Same as input pmm.k.
Details
The iterative chaining stops as soon as maxiter is reached or the average
out-of-bag (OOB) prediction error stops decreasing.
In the latter case, except for the first iteration, the imputed data of the
second-to-last (= best) iteration is returned.
OOB prediction errors are quantified as 1 - R^2 for numeric variables, and as classification error otherwise. If a variable has been imputed only univariately, the value is 1.
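The stopping rule can be sketched in base R as follows. The helper `best_iteration()` is hypothetical and not the package's own code; it assumes that "stops decreasing" means the average error of an iteration is not strictly smaller than that of the previous one. The error values are the mean OOB errors per iteration shown in the Examples section.

```r
# Hypothetical helper: index of the iteration whose imputed data would be
# returned under the stopping rule described above.
best_iteration <- function(mean_errors, maxiter = 10L) {
  n <- min(length(mean_errors), maxiter)
  for (i in seq_len(n)) {
    # Stop once the average error no longer decreases (assumed: >=)
    if (i > 1L && mean_errors[i] >= mean_errors[i - 1L]) {
      return(i - 1L)
    }
  }
  n  # maxiter reached without an increase: keep the last iteration
}

# Mean OOB prediction errors per iteration from the Examples section
errs <- c(0.5104626, 0.1747220, 0.1663529, 0.1616989, 0.1527770, 0.1584268)
best <- best_iteration(errs)
print(best)
```

For these values the sixth iteration is the first whose average error does not improve, so the fifth iteration is selected, in line with the "best iteration: 5" reported by summary() in the Examples.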
References
Wright, M. N. & Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, in press. https://arxiv.org/abs/1508.04409
Stekhoven, D. J. & Buehlmann, P. (2012). MissForest - nonparametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. https://doi.org/10.1093/bioinformatics/btr597
Van Buuren, S. & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. http://www.jstatsoft.org/v45/i03/
Examples
iris2 <- generateNA(iris, seed = 1)
imp1 <- missRanger(iris2, pmm.k = 5, num.trees = 50, seed = 1)
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>
#> iter 1
#> iter 2
#> iter 3
#> iter 4
#> iter 5
#> iter 6
head(imp1)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
# Extended output
imp2 <- missRanger(iris2, pmm.k = 5, num.trees = 50, data_only = FALSE, seed = 1)
#> Missing value imputation by random forests
#>
#> Variables to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> Variables used to impute: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>
#> iter 1
#> iter 2
#> iter 3
#> iter 4
#> iter 5
#> iter 6
summary(imp2)
#> missRanger object. Extract imputed data via $data
#> - best iteration: 5
#> - best average OOB imputation error: 0.152777
#>
#> Sequence of OOB prediction errors:
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> [1,] 1.0000000 0.9627440 0.34944235 0.17346001 0.06666667
#> [2,] 0.2422792 0.4863735 0.02825732 0.07225574 0.04444444
#> [3,] 0.2014726 0.4972930 0.02236181 0.05878516 0.05185185
#> [4,] 0.2137259 0.4760602 0.02246406 0.05920740 0.03703704
#> [5,] 0.1685071 0.4627045 0.02408824 0.05673333 0.05185185
#> [6,] 0.1938871 0.4644700 0.02947855 0.05985398 0.04444444
#>
#> Mean performance per iteration:
#> [1] 0.5104626 0.1747220 0.1663529 0.1616989 0.1527770 0.1584268
#>
#> First rows of imputed data:
#>
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
all.equal(imp1, imp2$data)
#> [1] TRUE
# Formula interface: Univariate imputation of Species and Sepal.Width
imp3 <- missRanger(iris2, Species + Sepal.Width ~ 1)
#> Missing value imputation by random forests
#>
#> Variables to impute: Species, Sepal.Width
#> Variables used to impute:
#>
#> iter 1
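As a further sketch, multiple imputation by repeating the call to missRanger() might look as follows. The object names (`iris_na`, `imps`, `fits`, `coefs`) and the choice of three imputations and a linear model are illustrative assumptions; only arguments listed under Usage are passed.

```r
library(missRanger)

# Data with missing values, generated as in the examples above
iris_na <- generateNA(iris, p = 0.1, seed = 1)

# Repeat the imputation with different seeds and pmm.k > 0
imps <- lapply(1:3, function(s) {
  missRanger(iris_na, pmm.k = 5, num.trees = 50, seed = s, verbose = 0)
})

# Fit the same model on each completed data set
fits <- lapply(imps, function(d) lm(Sepal.Length ~ ., data = d))

# One column of coefficients per imputed data set
coefs <- sapply(fits, coef)
print(dim(coefs))
```

The spread of the coefficients across the columns of `coefs` gives a first impression of the uncertainty due to imputation; a formal analysis would pool the fits with a dedicated multiple imputation tool.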