xgb.train is an advanced interface for training an xgboost model. The xgboost function is a simpler wrapper for xgb.train.
xgb.train(params = list(), data, nrounds, watchlist = list(),
  obj = NULL, feval = NULL, verbose = 1, print_every_n = 1L,
  early_stopping_rounds = NULL, maximize = NULL, save_period = NULL,
  save_name = "xgboost.model", xgb_model = NULL, callbacks = list(), ...)

xgboost(data = NULL, label = NULL, missing = NA, weight = NULL,
  params = list(), nrounds, verbose = 1, print_every_n = 1L,
  early_stopping_rounds = NULL, maximize = NULL, save_period = NULL,
  save_name = "xgboost.model", xgb_model = NULL, callbacks = list(), ...)
Argument | Description
---|---
params | the list of parameters. The complete list of parameters is available at http://xgboost.readthedocs.io/en/latest/parameter.html. It is organized into: 1. General Parameters; 2. Booster Parameters (2.1. Parameters for Tree Booster, 2.2. Parameters for Linear Booster); 3. Task Parameters.
data | training dataset. xgb.train accepts only an xgb.DMatrix as the input. xgboost, in addition, also accepts matrix, dgCMatrix, or the name of a local data file.
nrounds | max number of boosting iterations.
watchlist | named list of xgb.DMatrix datasets to use for evaluating model performance. Metrics specified in either eval_metric or feval will be computed for each of these datasets during each boosting iteration and stored in the evaluation_log field of the resulting object.
obj | customized objective function. Returns gradient and second order gradient with given prediction and dtrain.
feval | customized evaluation function. Returns list(metric = 'metric-name', value = 'metric-value') with given prediction and dtrain.
verbose | If 0, xgboost will stay silent. If 1, it will print information about performance. If 2, some additional information will be printed out. Note that setting verbose > 0 automatically engages the cb.print.evaluation callback.
print_every_n | Print each n-th iteration evaluation messages when verbose > 0. Default is 1, which means all messages are printed. This parameter is passed to the cb.print.evaluation callback.
early_stopping_rounds | If NULL, the early stopping function is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. Setting this parameter engages the cb.early.stop callback.
maximize | If feval and early_stopping_rounds are set, then this parameter must be set as well. When it is TRUE, the larger the evaluation score the better. This parameter is passed to the cb.early.stop callback.
save_period | when it is non-NULL, the model is saved to disk after every save_period rounds; 0 means save at the end. The saving is handled by the cb.save.model callback.
save_name | the name or path for the periodically saved model file.
xgb_model | a previously built model to continue the training from. Could be either an object of class xgb.Booster, its raw data, or the name of a file with a previously saved model (a short continuation sketch follows this table).
callbacks | a list of callback functions to perform various tasks during boosting. Some of the callbacks are automatically created depending on the parameters' values; the user can also provide existing or custom callback methods to customize the training process.
... | other parameters to pass to params.
label | vector of response values. Should not be provided when data is a local data file name or an xgb.DMatrix object.
missing | by default set to NA, which means that NA values should be considered 'missing' by the algorithm. Sometimes, 0 or another extreme value might be used to represent missing values. This parameter is only used when the input is a dense matrix.
weight | a vector indicating the weight for each row of the input.
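For illustration, the xgb_model argument allows boosting to resume from an existing booster. A minimal sketch, assuming param, dtrain, watchlist and a previously trained bst as defined in the examples below:

# continue training an existing booster for two more rounds (illustrative sketch)
bst_continued <- xgb.train(param, dtrain, nrounds = 2, watchlist, xgb_model = bst)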
An object of class xgb.Booster with the following elements:

handle: a handle (pointer) to the xgboost model in memory.
raw: a cached memory dump of the xgboost model saved as R's raw type.
niter: number of boosting iterations.
evaluation_log: evaluation history stored as a data.table with the first column corresponding to the iteration number and the rest corresponding to evaluation metrics' values. It is created by the cb.evaluation.log callback.
call: a function call.
params: parameters that were passed to the xgboost library. Note that it does not capture parameters changed by the cb.reset.parameters callback.
callbacks: callback functions that were either automatically assigned or explicitly passed.
best_iteration: iteration number with the best evaluation metric value (only available with early stopping).
best_ntreelimit: the ntreelimit value corresponding to the best iteration, which could further be used in the predict method (only available with early stopping).
best_score: the best evaluation metric value during early stopping (only available with early stopping).
feature_names: names of the training dataset features (only when column names were defined in training data).
nfeatures: number of features in training data.
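As a brief sketch of how these elements can be used (assuming bst was trained with a watchlist and early_stopping_rounds, and dtest is defined, as in the examples below):

# inspect selected elements of the returned xgb.Booster object
bst$niter                  # number of boosting iterations
head(bst$evaluation_log)   # per-iteration metric values as a data.table
bst$best_iteration         # iteration with the best evaluation metric
# use the stored ntreelimit of the best iteration when predicting
pred <- predict(bst, dtest, ntreelimit = bst$best_ntreelimit)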
These are the training functions for xgboost. The xgb.train interface supports advanced features such as watchlist, customized objective and evaluation metric functions, and is therefore more flexible than the xgboost interface.
Parallelization is automatically enabled if OpenMP is present. The number of threads can also be specified manually via the nthread parameter.
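For example (an illustrative sketch; the thread count is arbitrary and dtrain is taken from the examples below):

# explicitly request two threads through the nthread parameter
bst_nthread <- xgb.train(params = list(objective = "binary:logistic", nthread = 2),
                         data = dtrain, nrounds = 2)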
The evaluation metric is chosen automatically by Xgboost (according to the objective) when the eval_metric parameter is not provided. The user may set one or several eval_metric parameters (see the sketch after the list of built-in metrics below). Note that when using a customized metric, only this single metric can be used.
The following is the list of built-in metrics for which Xgboost provides optimized implementations:

rmse: root mean square error. http://en.wikipedia.org/wiki/Root_mean_square_error
logloss: negative log-likelihood. http://en.wikipedia.org/wiki/Log-likelihood
mlogloss: multiclass logloss. http://wiki.fast.ai/index.php/Log_Loss
error: binary classification error rate. It is calculated as (# wrong cases) / (# all cases). By default, it uses the 0.5 threshold for predicted values to define negative and positive instances. A different threshold (e.g., 0.) could be specified as "error@0.".
merror: multiclass classification error rate. It is calculated as (# wrong cases) / (# all cases).
auc: area under the curve. http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve for ranking evaluation.
aucpr: area under the PR curve. https://en.wikipedia.org/wiki/Precision_and_recall for ranking evaluation.
ndcg: Normalized Discounted Cumulative Gain (for ranking task). http://en.wikipedia.org/wiki/NDCG
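For instance, several built-in metrics can be tracked at once by repeating the eval_metric entry in params. A minimal sketch, reusing dtrain and watchlist from the examples below:

# track both AUC and logloss on the watchlist datasets
param <- list(max_depth = 2, eta = 1, nthread = 2,
              objective = "binary:logistic",
              eval_metric = "auc", eval_metric = "logloss")
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)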
The following callbacks are automatically created when certain parameters are set:

cb.print.evaluation is turned on when verbose > 0, and the print_every_n parameter is passed to it.
cb.evaluation.log is on when watchlist is present.
cb.early.stop: when early_stopping_rounds is set.
cb.save.model: when save_period > 0 is set.
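The same callbacks can also be passed explicitly. An illustrative sketch, assuming param, dtrain and watchlist as in the examples below:

# print evaluation results only every 5th round, even with verbose = 0,
# by supplying cb.print.evaluation explicitly
bst <- xgb.train(param, dtrain, nrounds = 25, watchlist, verbose = 0,
                 callbacks = list(cb.print.evaluation(period = 5)))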
Tianqi Chen and Carlos Guestrin, "XGBoost: A Scalable Tree Boosting System", 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016, https://arxiv.org/abs/1603.02754
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)

## A simple xgb.train example:
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
              objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
#> [1]  train-auc:0.958228  eval-auc:0.960373
#> [2]  train-auc:0.981413  eval-auc:0.979930

## An xgb.train example where custom objective and evaluation metric are used:
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))
  grad <- preds - labels
  hess <- preds * (1 - preds)
  return(list(grad = grad, hess = hess))
}
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- as.numeric(sum(labels != (preds > 0))) / length(labels)
  return(list(metric = "error", value = err))
}

# These functions could be used by passing them either:
# as 'objective' and 'eval_metric' parameters in the params list:
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
              objective = logregobj, eval_metric = evalerror)
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
#> [1]  train-error:0.046522  eval-error:0.042831
#> [2]  train-error:0.022263  eval-error:0.021726

# or through the ... arguments:
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2)
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
                 objective = logregobj, eval_metric = evalerror)
#> [1]  train-error:0.046522  eval-error:0.042831
#> [2]  train-error:0.022263  eval-error:0.021726

# or as dedicated 'obj' and 'feval' parameters of xgb.train:
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
                 obj = logregobj, feval = evalerror)
#> [1]  train-error:0.046522  eval-error:0.042831
#> [2]  train-error:0.022263  eval-error:0.021726

## An xgb.train example of using variable learning rates at each iteration:
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
              objective = "binary:logistic", eval_metric = "auc")
my_etas <- list(eta = c(0.5, 0.1))
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
                 callbacks = list(cb.reset.parameters(my_etas)))
#> [1]  train-auc:0.958228  eval-auc:0.960373
#> [2]  train-auc:0.992347  eval-auc:0.994009

## Early stopping:
bst <- xgb.train(param, dtrain, nrounds = 25, watchlist, early_stopping_rounds = 3)
#> [1]  train-auc:0.958228  eval-auc:0.960373
#> Multiple eval metrics are present. Will use eval_auc for early stopping.
#> Will train until eval_auc hasn't improved in 3 rounds.
#>
#> [2]  train-auc:0.981413  eval-auc:0.979930
#> [3]  train-auc:0.997070  eval-auc:0.998518
#> [4]  train-auc:0.998757  eval-auc:0.998943
#> [5]  train-auc:0.999298  eval-auc:0.999830
#> [6]  train-auc:0.999585  eval-auc:1.000000
#> [7]  train-auc:0.999585  eval-auc:1.000000
#> [8]  train-auc:0.999916  eval-auc:1.000000
#> [9]  train-auc:0.999916  eval-auc:1.000000
#> Stopping. Best iteration:
#> [6]  train-auc:0.999585  eval-auc:1.000000
#>

## An 'xgboost' interface example:
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max_depth = 2, eta = 1, nthread = 2, nrounds = 2,
               objective = "binary:logistic")
#> [1]  train-error:0.046522
#> [2]  train-error:0.022263