Random Forest Cross-Validation for feature selection

This function shows the cross-validated prediction performance of models with a sequentially reduced number of predictors (ranked by variable importance) via a nested cross-validation procedure.

rfcv(trainx, trainy, cv.fold=5, scale="log", step=0.5,
     mtry=function(p) max(1, floor(sqrt(p))), recursive=FALSE, ...)

Arguments

trainx	matrix or data frame containing columns of predictor variables
trainy	vector of response, must have length equal to the number of rows in `trainx`
cv.fold	number of folds in the cross-validation
scale	if `"log"`, reduce a fixed proportion (`step`) of variables at each step, otherwise reduce `step` variables at a time
step	if `scale="log"`, the fraction of variables to remove at each step; otherwise, the number of variables to remove at a time
mtry	a function of number of remaining predictor variables to use as the `mtry` parameter in the `randomForest` call
recursive	whether variable importance is (re-)assessed at each step of variable reduction
...	other arguments passed on to `randomForest`
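
The interplay of `scale`, `step` and `mtry` can be sketched as follows. This is only an illustration based on the argument descriptions above, not the package's code: `reduction_schedule` is a hypothetical helper, it assumes an integer `step` when `scale` is not `"log"`, and the rounding and stopping rules that `rfcv` actually uses may differ.

```r
# Hypothetical helper (not part of randomForest): build a decreasing
# sequence of predictor counts from p down to 1, following the
# descriptions of `scale` and `step` given above.
reduction_schedule <- function(p, scale = "log", step = 0.5) {
  n_var <- p
  current <- p
  while (current > 1) {
    if (scale == "log") {
      # remove a fixed proportion `step` of the remaining variables
      nxt <- round(current * (1 - step))
    } else {
      # remove a fixed number `step` of variables (assumed integer)
      nxt <- current - step
    }
    # always drop at least one variable and never go below one
    nxt <- max(1, min(nxt, current - 1))
    n_var <- c(n_var, nxt)
    current <- nxt
  }
  n_var
}

# Default `mtry` function from the signature above, evaluated at each
# number of remaining predictors.
mtry_default <- function(p) max(1, floor(sqrt(p)))

schedule <- reduction_schedule(100)
mtry_values <- sapply(schedule, mtry_default)
print(data.frame(n.var = schedule, mtry = mtry_values))
```

Because `mtry` is given as a function rather than a fixed number, it shrinks along with the number of remaining predictors and never exceeds it.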

Value

A list with the following components:

list(n.var=n.var, error.cv=error.cv, predicted=cv.pred)

n.var

vector of number of variables used at each step

error.cv

corresponding vector of error rates or MSEs at each step

predicted

list of n.var components, each containing the predicted values from the cross-validation
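
One way to use the returned components is to pick the smallest number of variables whose cross-validated error is close to the best one. The sketch below assumes only the list structure described above. `pick_n_var` and its `tol` argument are hypothetical, and the values in `toy` are made-up placeholders, not real `rfcv` output.

```r
# Hypothetical helper: smallest number of variables whose CV error is
# within `tol` of the minimum CV error.
pick_n_var <- function(result, tol = 0) {
  ok <- result$error.cv <= min(result$error.cv) + tol
  min(result$n.var[ok])
}

# Made-up placeholder values with the documented structure
# (n.var and error.cv); these are NOT results from a real run.
toy <- list(
  n.var    = c(100, 50, 25, 12, 6, 3, 1),
  error.cv = c(0.06, 0.05, 0.05, 0.04, 0.04, 0.05, 0.30)
)

best_exact    <- pick_n_var(toy)              # error must equal the minimum
best_tolerant <- pick_n_var(toy, tol = 0.015) # allow slightly worse error
print(c(exact = best_exact, tolerant = best_tolerant))
```

A nonzero `tol` trades a small increase in error for a smaller model. A sensible tolerance depends on how much the error varies across repeated runs of the cross-validation.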

References

Svetnik, V., Liaw, A., Tong, C. and Wang, T., "Application of Breiman's Random Forest to Modeling Structure-Activity Relationships of Pharmaceutical Molecules", MCS 2004, Roli, F. and Windeatt, T. (Eds.) pp. 334-343.

Examples

library(randomForest)
set.seed(647)
myiris <- cbind(iris[1:4], matrix(runif(96 * nrow(iris)), nrow(iris), 96))
result <- rfcv(myiris, iris$Species, cv.fold=3)
with(result, plot(n.var, error.cv, log="x", type="o", lwd=2))

## The following can take a while to run, so if you really want to try
## it, copy and paste the code into R.

if (FALSE) {
result <- replicate(5, rfcv(myiris, iris$Species), simplify=FALSE)
error.cv <- sapply(result, "[[", "error.cv")
matplot(result[[1]]$n.var, cbind(rowMeans(error.cv), error.cv), type="l",
        lwd=c(2, rep(1, ncol(error.cv))), col=1, lty=1, log="x",
        xlab="Number of variables", ylab="CV Error")
}
