Control Object for Selection By Filtering (SBF)

Controls the execution of models with simple filters for feature selection

sbfControl(functions = NULL, method = "boot", saveDetails = FALSE,
  number = ifelse(method %in% c("cv", "repeatedcv"), 10, 25),
  repeats = ifelse(method %in% c("cv", "repeatedcv"), 1, number),
  verbose = FALSE, returnResamp = "final", p = 0.75, index = NULL,
  indexOut = NULL, timingSamps = 0, seeds = NA,
  allowParallel = TRUE, multivariate = FALSE)

Arguments

functions	a list of functions for model fitting, prediction and variable filtering (see Details below)
method	The external resampling method: `boot`, `cv`, `repeatedcv`, `LOOCV` or `LGOCV` (for repeated training/test splits)
saveDetails	a logical to save the predictions and variable importances from the selection process
number	Either the number of folds or number of resampling iterations
repeats	For repeated k-fold cross-validation only: the number of complete sets of folds to compute
verbose	a logical to print a log for each external resampling iteration
returnResamp	A character string indicating how much of the resampled summary metrics should be saved. Values can be `final` or `none`
p	For leave-group out cross-validation: the training percentage
index	a list with elements for each external resampling iteration. Each list element is the sample rows used for training at that iteration.
indexOut	a list (the same length as `index`) that dictates which samples are held out for each resample. If `NULL`, then the unique set of samples not contained in `index` is used.
timingSamps	the number of training set samples that will be used to measure the time for predicting samples (zero indicates that the prediction time should not be estimated).
seeds	an optional set of integers that will be used to set the seed at each resampling iteration. This is useful when the models are run in parallel. A value of `NA` will stop the seed from being set within the worker processes while a value of `NULL` will set the seeds using a random set of integers. Alternatively, a vector of integers can be used. The vector should have `B+1` elements where `B` is the number of resamples. See the Examples section below.
allowParallel	if a parallel backend is loaded and available, should the function use it?
multivariate	a logical; should all the columns of `x` be exposed to the `score` function at once?
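
As an illustration of the `index`, `indexOut` and `seeds` arguments, the following is a minimal sketch that builds them by hand in base R. The sample size, the number of folds and the object names are assumptions made for the example, and the control object is only created if caret is installed.

```r
# Assumed setup: 20 samples and 5 external folds (both numbers are illustrative)
n_samples <- 20
n_folds <- 5

set.seed(42)
# Randomly assign each row to one of the folds
fold_id <- sample(rep(seq_len(n_folds), length.out = n_samples))

# index: rows used for training at each resampling iteration
index <- lapply(seq_len(n_folds), function(k) which(fold_id != k))
# indexOut: rows held out at each iteration (same length as index)
index_out <- lapply(seq_len(n_folds), function(k) which(fold_id == k))
names(index) <- names(index_out) <- sprintf("Fold%d", seq_len(n_folds))

# seeds: B + 1 integers, where B is the number of resamples
seeds <- sample.int(100000, n_folds + 1)

# Build the control object only if caret is installed
if (requireNamespace("caret", quietly = TRUE)) {
  ctrl <- caret::sbfControl(method = "cv", number = n_folds,
                            index = index, indexOut = index_out,
                            seeds = seeds)
}
```

Supplying the seeds explicitly is mainly useful for reproducibility when the resamples are run in parallel.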

Value

a list that echoes the specified arguments

Details

More details on this function can be found at http://topepo.github.io/caret/feature-selection-using-univariate-filters.html.

Simple filter-based feature selection requires functions to be specified for several operations.

The fit function builds the model based on the current data set. The arguments for the function must be:

x the current training set of predictor data with the appropriate subset of variables (i.e. after filtering)
y the current outcome data (either a numeric or factor vector)
... optional arguments to pass to the fit function in the call to sbf

The function should return a model object that can be used to generate predictions.
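
A fit function with this signature might look like the sketch below. The name `lm_fit` and the use of ordinary least squares on `mtcars` are assumptions for illustration and are not the package's own implementation.

```r
# Hypothetical fit function: ordinary least squares on the filtered predictors
lm_fit <- function(x, y, ...) {
  dat <- as.data.frame(x)
  dat$.outcome <- y
  # Regress the outcome on every retained predictor
  lm(.outcome ~ ., data = dat, ...)
}

# Usage on a built-in data set, pretending that wt and hp survived the filter
fit <- lm_fit(mtcars[, c("wt", "hp")], mtcars$mpg)
class(fit)
```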

The pred function returns a vector of predictions (numeric or factors) from the current model. The arguments are:

object the model generated by the fit function
x the current set of predictors for the held-back samples
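
A matching pred function could be sketched as follows. The name `lm_pred`, the `lm` model and the train/held-back split of `mtcars` are illustrative assumptions.

```r
# Hypothetical pred function: numeric predictions for the held-back samples
lm_pred <- function(object, x) {
  unname(predict(object, newdata = as.data.frame(x)))
}

# Fit on the first 24 rows, then predict the remaining 8
model <- lm(mpg ~ wt + hp, data = mtcars[1:24, ])
held_back <- mtcars[25:32, c("wt", "hp")]
preds <- lm_pred(model, held_back)
length(preds)
```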

The score function is used to return scores with names for each predictor (such as a p-value). Inputs are:

x the predictors for the training samples. If sbfControl()$multivariate is TRUE, this will be the full predictor matrix. Otherwise it is a vector for a specific predictor.
y the current training outcomes

When sbfControl()$multivariate is TRUE, the score function should return a named vector where length(scores) == ncol(x). Otherwise, the function's output should be a single value. Univariate examples are given by anovaScores for classification, gamScores for regression, and the Examples section below.
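
The two calling conventions can be sketched in base R. Both function names are hypothetical, and the correlation-test p-value is only one possible choice of score for a numeric outcome.

```r
# Hypothetical univariate score: p-value of a correlation test for one predictor
cor_score <- function(x, y) {
  cor.test(x, y)$p.value
}

# Hypothetical multivariate score: x is the full predictor matrix, and the
# result is a named vector with one entry per column
cor_score_mv <- function(x, y) {
  apply(as.matrix(x), 2, cor_score, y = y)
}

predictors <- mtcars[, c("wt", "hp", "qsec")]
p_wt <- cor_score(predictors$wt, mtcars$mpg)      # a single value
scores <- cor_score_mv(predictors, mtcars$mpg)    # one value per column
names(scores)
```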

The filter function is used to return a logical vector with names for each predictor (TRUE indicates that the predictor should be retained). Inputs are:

score the output of the score function
x the predictors for the training samples
y the current training outcomes

The function should return a named logical vector.
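
For example, a filter that keeps predictors whose p-value falls at or below a cutoff could be sketched as follows. The name `pval_filter`, the 0.05 threshold and the score values are assumptions chosen for illustration.

```r
# Hypothetical filter: retain predictors whose score (a p-value) is <= 0.05
pval_filter <- function(score, x, y) {
  score <= 0.05
}

# Made-up scores for three predictors; names carry through the comparison
scores <- c(wt = 1e-10, qsec = 0.017, noise = 0.6)
keep <- pval_filter(scores, x = NULL, y = NULL)
keep
```

Because the comparison is vectorized, the names on `score` are preserved in the returned logical vector.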

Examples of these functions are included in the package: caretSBF, lmSBF, rfSBF, treebagSBF, ldaSBF and nbSBF.

The web page http://topepo.github.io/caret/ has more details and examples related to this function.

Examples


if (FALSE) {
data(BloodBrain)

## Use a GAM as the filter, then fit a random forest model
set.seed(1)
RFwithGAM <- sbf(bbbDescr, logBBB,
                 sbfControl = sbfControl(functions = rfSBF,
                                         verbose = FALSE,
                                         seeds = sample.int(100000, 11),
                                         method = "cv"))
RFwithGAM


## A simple example for multivariate scoring
rfSBF2 <- rfSBF
rfSBF2$score <- function(x, y) apply(x, 2, rfSBF$score, y = y)

set.seed(1)
RFwithGAM2 <- sbf(bbbDescr, logBBB,
                  sbfControl = sbfControl(functions = rfSBF2,
                                          verbose = FALSE,
                                          seeds = sample.int(100000, 11),
                                          method = "cv",
                                          multivariate = TRUE))
RFwithGAM2


}
