The cross validation function of xgboost
xgb.cv(params = list(), data, nrounds, nfold, label = NULL, missing = NA,
  prediction = FALSE, showsd = TRUE, metrics = list(), obj = NULL,
  feval = NULL, stratified = TRUE, folds = NULL, verbose = TRUE,
  print_every_n = 1L, early_stopping_rounds = NULL, maximize = NULL,
  callbacks = list(), ...)
params | the list of parameters. Commonly used ones are:
  - objective: objective function; common ones are reg:squarederror and binary:logistic
  - eta: step size of each boosting step
  - max_depth: maximum depth of the tree
  - nthread: number of threads used in training; if not set, all threads are used
  See xgb.train for further details.
data | takes an xgb.DMatrix, matrix, or dgCMatrix as the input.
nrounds | the max number of boosting iterations.
nfold | the original dataset is randomly partitioned into nfold equal-size subsamples.
label | vector of response values. Should be provided only when data is an R matrix.
missing | only used when input is a dense matrix. By default set to NA, which means that NA values are treated as 'missing' by the algorithm. Sometimes, 0 or another extreme value might be used to represent missing values.
prediction | a logical value indicating whether to return the test fold predictions from each CV model. This parameter engages the cb.cv.predict callback.
showsd | boolean, whether to show the standard deviation of the cross validation results.
metrics | list of evaluation metrics to be used in cross validation; when not specified, the evaluation metric is chosen according to the objective function. Possible options are:
  - error: binary classification error rate
  - rmse: root mean square error
  - logloss: negative log-likelihood
  - auc: area under curve
obj | customized objective function. Returns gradient and second order gradient with given prediction and dtrain (a sketch follows this table).
feval | customized evaluation function. Returns list(metric = 'metric-name', value = 'metric-value') with given prediction and dtrain.
stratified | a boolean indicating whether sampling of folds should be stratified by the values of outcome labels.
folds | list providing a possibility to use a list of pre-defined CV folds (each element must be a vector of test fold indices). When folds are supplied, the nfold and stratified parameters are ignored.
verbose | boolean, print the statistics during the process.
print_every_n | print evaluation messages for each n-th iteration when verbose > 0. Default is 1, which means all messages are printed. This parameter is passed to the cb.print.evaluation callback.
early_stopping_rounds | if NULL, the early stopping function is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. Setting this parameter engages the cb.early.stop callback.
maximize | if feval and early_stopping_rounds are set, this parameter must be set as well. When it is TRUE, the larger the evaluation score the better. This parameter is passed to the cb.early.stop callback.
callbacks | a list of callback functions to perform various tasks during boosting. See callbacks. Some of the callbacks are automatically created depending on the parameters' values. The user can provide either existing or their own callback methods in order to customize the training process.
... | other parameters to pass to params.
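As a companion to the obj and feval arguments above, here is a minimal sketch of a custom logistic objective and error metric; the names logregobj and evalerror are illustrative, not part of xgboost.

library(xgboost)
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# Custom objective: map raw scores to probabilities, then return the
# first- and second-order gradients of the logistic loss.
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))
  list(grad = preds - labels, hess = preds * (1 - preds))
}

# Custom evaluation metric: classification error on the raw scores.
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  list(metric = "error", value = mean(as.numeric(preds > 0) != labels))
}

cv <- xgb.cv(params = list(max_depth = 2, eta = 1, nthread = 2),
             data = dtrain, nrounds = 2, nfold = 5,
             obj = logregobj, feval = evalerror)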
An object of class xgb.cv.synchronous with the following elements:
call
a function call.
params
parameters that were passed to the xgboost library. Note that it does not
capture parameters changed by the cb.reset.parameters
callback.
callbacks
callback functions that were either automatically assigned or
explicitly passed.
evaluation_log
evaluation history stored as a data.table
with the
first column corresponding to iteration number and the rest corresponding to the
CV-based evaluation means and standard deviations for the training and test CV-sets.
It is created by the cb.evaluation.log
callback.
niter
number of boosting iterations.
nfeatures
number of features in training data.
folds
the list of CV folds' indices - either those passed through the folds
parameter or randomly generated.
best_iteration
iteration number with the best evaluation metric value
(only available with early stopping).
best_ntreelimit
the ntreelimit
value corresponding to the best iteration,
which could further be used in predict
method
(only available with early stopping).
pred
CV prediction values available when prediction
is set.
It is either a vector or a matrix (see cb.cv.predict).
models
a list of the CV folds' models. It is only available with the explicit
setting of the cb.cv.predict(save_models = TRUE)
callback.
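For orientation, a short sketch of pulling these elements out of a fitted result, using the same toy data as the examples below; only the accessors documented above are assumed.

library(xgboost)
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv <- xgb.cv(data = dtrain, nrounds = 3, nfold = 5, nthread = 2,
             max_depth = 3, eta = 1, objective = "binary:logistic")
cv$niter            # number of boosting iterations (3 here)
cv$nfeatures        # number of features in the training data
cv$evaluation_log   # per-iteration CV means and standard deviations
str(cv$folds)       # the 5 randomly generated test-fold index vectors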
The original sample is randomly partitioned into nfold equal-size subsamples. Of the nfold subsamples, a single subsample is retained as the validation data for testing the model, and the remaining nfold - 1 subsamples are used as training data. This process is repeated nfold times, with each of the nfold subsamples used exactly once as the validation data, and each model is trained for nrounds boosting iterations. All observations are used for both training and validation.
Adapted from http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#k-fold_cross-validation
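To make the partitioning concrete, here is a sketch of building a 5-way split by hand and passing it through the folds argument; the my_folds name is illustrative, and with folds supplied the nfold and stratified parameters are ignored.

library(xgboost)
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
n <- nrow(agaricus.train$data)
# Randomly assign each observation to one of 5 test folds.
my_folds <- split(sample(seq_len(n)), rep(1:5, length.out = n))
cv <- xgb.cv(data = dtrain, nrounds = 3, folds = my_folds,
             max_depth = 3, eta = 1, objective = "binary:logistic")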
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5,
             metrics = list("rmse", "auc"),
             max_depth = 3, eta = 1, objective = "binary:logistic")
#> [1] train-rmse:0.162410+0.001690 train-auc:0.987113+0.000597 test-rmse:0.162529+0.005671 test-auc:0.987127+0.002355
#> [2] train-rmse:0.077843+0.001069 train-auc:0.999913+0.000044 test-rmse:0.078115+0.004131 test-auc:0.999904+0.000078
#> [3] train-rmse:0.044191+0.004786 train-auc:0.999961+0.000028 test-rmse:0.048266+0.007876 test-auc:0.999945+0.000055

print(cv)
#> ##### xgb.cv 5-folds
#>  iter train_rmse_mean train_rmse_std train_auc_mean train_auc_std
#>     1       0.1624096    0.001689528      0.9871128  5.969219e-04
#>     2       0.0778430    0.001069471      0.9999132  4.370538e-05
#>     3       0.0441912    0.004785576      0.9999614  2.755068e-05
#>  test_rmse_mean test_rmse_std test_auc_mean test_auc_std
#>       0.1625292   0.005670582     0.9871266 2.355142e-03
#>       0.0781154   0.004131304     0.9999042 7.821355e-05
#>       0.0482658   0.007876465     0.9999446 5.451825e-05

print(cv, verbose = TRUE)
#> ##### xgb.cv 5-folds
#> call:
#>   xgb.cv(data = dtrain, nrounds = 3, nfold = 5, metrics = list("rmse",
#>     "auc"), nthread = 2, max_depth = 3, eta = 1, objective = "binary:logistic")
#> params (as set within xgb.cv):
#>   nthread = "2", max_depth = "3", eta = "1", objective = "binary:logistic", eval_metric = "rmse", eval_metric = "auc", silent = "1"
#> callbacks:
#>   cb.print.evaluation(period = print_every_n, showsd = showsd)
#>   cb.evaluation.log()
#> niter: 3
#> evaluation_log:
#>  iter train_rmse_mean train_rmse_std train_auc_mean train_auc_std
#>     1       0.1624096    0.001689528      0.9871128  5.969219e-04
#>     2       0.0778430    0.001069471      0.9999132  4.370538e-05
#>     3       0.0441912    0.004785576      0.9999614  2.755068e-05
#>  test_rmse_mean test_rmse_std test_auc_mean test_auc_std
#>       0.1625292   0.005670582     0.9871266 2.355142e-03
#>       0.0781154   0.004131304     0.9999042 7.821355e-05
#>       0.0482658   0.007876465     0.9999446 5.451825e-05
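Building on the example above (dtrain as defined there), a sketch combining early stopping with prediction = TRUE; the exact stopping round depends on the data, so the accessors are the point, not any particular numbers.

cv2 <- xgb.cv(data = dtrain, nrounds = 50, nfold = 5, nthread = 2,
              metrics = list("auc"), max_depth = 3, eta = 1,
              objective = "binary:logistic",
              early_stopping_rounds = 5, maximize = TRUE,
              prediction = TRUE)
cv2$best_iteration    # round with the best mean test AUC
cv2$best_ntreelimit   # matching ntreelimit value for predict()
head(cv2$pred)        # per-observation test-fold predictions (cb.cv.predict)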