Impute missing values in predictor data using proximity from randomForest.

# S3 method for default
rfImpute(x, y, iter=5, ntree=300, ...)
# S3 method for formula
rfImpute(x, data, ..., subset)

Arguments

x

A data frame or matrix of predictors, some containing NAs, or a formula.

y

Response vector (NAs not allowed).

data

A data frame containing the predictors and response.

iter

Number of iterations to run the imputation.

ntree

Number of trees to grow in each iteration of randomForest.

...

Other arguments to be passed to randomForest.

subset

A logical vector indicating which observations to use.

Value

A data frame or matrix containing the completed data matrix, where NAs are imputed using proximity from randomForest. The first column contains the response.
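
As a rough illustration of the default method and the shape of its return value, the sketch below imputes one artificially thinned column of iris. The object names (x, y, imp), the seed, and the particular iter and ntree values are assumptions chosen for the example, and the code only runs when the randomForest package is installed.

```r
# Sketch only: guarded so it is skipped when randomForest is not installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  data(iris)
  set.seed(1)

  x <- iris[, 1:4]              # predictors
  x[sample(150, 15), 1] <- NA   # artificially drop 15 values in column 1
  y <- iris$Species             # response, must not contain NAs

  # Default method: predictors and response are passed separately.
  imp <- rfImpute(x, y, iter = 3, ntree = 100)

  str(imp)      # response in the first column, then the four predictors
  anyNA(imp)    # check that no missing values remain
}
```

The formula method, rfImpute(Species ~ ., data), takes the same data as a single data frame instead.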

Details

The algorithm starts by imputing NAs using na.roughfix. Then randomForest is called with the completed data. The proximity matrix from the randomForest is used to update the imputation of the NAs. For continuous predictors, the imputed value is the weighted average of the non-missing observations, where the weights are the proximities. For categorical predictors, the imputed value is the category with the largest average proximity. This process is iterated iter times.
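
The update step described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's own code: the helper names impute_numeric and impute_factor, the hand-made 3 x 3 proximity matrix, and the toy columns are all hypothetical.

```r
# Hypothetical helpers illustrating one proximity-based update step.
# v / f : a predictor column; miss : logical vector marking the entries
# that were originally NA; prox : n x n proximity matrix.

impute_numeric <- function(v, miss, prox) {
  for (i in which(miss)) {
    w <- prox[i, !miss]                  # proximities to the observed cases
    v[i] <- sum(w * v[!miss]) / sum(w)   # proximity-weighted average
  }
  v
}

impute_factor <- function(f, miss, prox) {
  for (i in which(miss)) {
    # average proximity of case i to the observed cases in each category
    avg <- tapply(prox[i, !miss], f[!miss], mean)
    f[i] <- names(avg)[which.max(avg)]   # category with the largest average
  }
  f
}

# Toy data: case 1 is missing and is much closer to case 2 than to case 3.
prox <- matrix(c(1.0, 0.8, 0.1,
                 0.8, 1.0, 0.2,
                 0.1, 0.2, 1.0), nrow = 3, byrow = TRUE)
miss  <- c(TRUE, FALSE, FALSE)
x_num <- c(NA, 2, 10)
x_cat <- factor(c(NA, "a", "b"))

num_filled <- impute_numeric(x_num, miss, prox)
cat_filled <- impute_factor(x_cat, miss, prox)
```

Here the numeric fill for case 1 is (0.8 * 2 + 0.1 * 10) / (0.8 + 0.1), and the categorical fill is "a", the category of its nearest neighbour by proximity. In the full procedure this step would be repeated after each new forest is grown, with the proximities recomputed every time.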

Note: Imputation has not (yet) been implemented for the unsupervised case. Also, Breiman (2003) notes that the OOB estimate of error from randomForest tends to be optimistic when run on the data matrix with imputed values.

References

Leo Breiman (2003). Manual for Setting Up, Using, and Understanding Random Forest V4.0. https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf

See also

na.roughfix

Examples

data(iris)
iris.na <- iris
set.seed(111)
## artificially drop some data values.
for (i in 1:4) iris.na[sample(150, sample(20)), i] <- NA
set.seed(222)
iris.imputed <- rfImpute(Species ~ ., iris.na)
#> ntree      OOB      1      2      3
#>   300:   6.00%  0.00%  8.00% 10.00%
#> ntree      OOB      1      2      3
#>   300:   4.67%  0.00%  6.00%  8.00%
#> ntree      OOB      1      2      3
#>   300:   5.33%  0.00%  6.00% 10.00%
#> ntree      OOB      1      2      3
#>   300:   5.33%  0.00%  6.00% 10.00%
#> ntree      OOB      1      2      3
#>   300:   6.00%  0.00%  8.00% 10.00%
set.seed(333)
iris.rf <- randomForest(Species ~ ., iris.imputed)
print(iris.rf)
#>
#> Call:
#>  randomForest(formula = Species ~ ., data = iris.imputed)
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 2
#>
#>         OOB estimate of  error rate: 5.33%
#> Confusion matrix:
#>            setosa versicolor virginica class.error
#> setosa         50          0         0        0.00
#> versicolor      0         47         3        0.06
#> virginica       0          5        45        0.10