Parse a boosted tree model text dump into a data.table structure.

xgb.model.dt.tree(feature_names = NULL, model = NULL, text = NULL,
  trees = NULL, use_int_id = FALSE, ...)

Arguments

feature_names

character vector of feature names. If the model already contains feature names, they are used when feature_names = NULL (the default). A non-NULL feature_names overrides the names stored in the model.

model

object of class xgb.Booster

text

character vector previously generated by the xgb.dump function (where parameter with_stats = TRUE should have been set). text takes precedence over model.

trees

an integer vector of tree indices that should be parsed. If set to NULL, all trees of the model are parsed. This can be useful, e.g., in multiclass classification to get only the trees of one particular class. IMPORTANT: the tree index in xgboost models is zero-based (e.g., use trees = 0:4 for the first 5 trees).
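
As a minimal sketch of the multiclass case mentioned above: assuming the booster stores one tree per class in each boosting round, ordered round by round (so that the zero-based tree index equals round * num_class + class), the indices for one class can be computed as follows. The helper name class_tree_indices and the values num_class = 3 and nrounds = 4 are illustrative and not part of the xgboost API.

```r
# Hypothetical helper: zero-based tree indices belonging to one class.
# Assumption: trees are stored round by round, one tree per class per round.
class_tree_indices <- function(class_id, num_class, nrounds) {
  stopifnot(class_id >= 0, class_id < num_class)
  as.integer(seq(from = class_id, by = num_class, length.out = nrounds))
}

# Trees of class 1 in a 3-class model trained for 4 rounds
idx <- class_tree_indices(class_id = 1, num_class = 3, nrounds = 4)
print(idx)  # 1 4 7 10
```

The result could then be passed on as trees = idx, e.g. xgb.model.dt.tree(model = bst, trees = idx) for a multiclass booster bst.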

use_int_id

a logical flag indicating whether nodes in columns "Yes", "No", "Missing" should be represented as integers (when TRUE) or as "Tree-Node" character strings (when FALSE).

...

currently not used.

Value

A data.table with detailed information about model trees' nodes.

The columns of the data.table are:

  • Tree: integer ID of a tree in a model (zero-based index)

  • Node: integer ID of a node in a tree (zero-based index)

  • ID: character identifier of a node in a model (only when use_int_id=FALSE)

  • Feature: for a branch node, it's a feature id or name (when available); for a leaf node, it simply labels it as 'Leaf'

  • Split: location of the split for a branch node (split condition is always "less than")

  • Yes: ID of the next node when the split condition is met

  • No: ID of the next node when the split condition is not met

  • Missing: ID of the next node when branch value is missing

  • Quality: either the split gain (change in loss) or the leaf value

  • Cover: metric related to the number of observations either seen by a split or collected by a leaf during training.
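
To illustrate how the Quality column might be used, the sketch below sums the split gain per feature over the branch nodes. It is not an actual xgb.model.dt.tree result: it uses a plain base R data.frame whose rows are copied from the example output in the Examples section, so it runs without a trained model. The object names nodes, branches, and gain are illustrative.

```r
# Toy stand-in for a parsed model: five branch rows and one leaf row,
# with values copied from the example output of this page.
nodes <- data.frame(
  Tree    = c(0, 0, 0, 0, 1, 1),
  Node    = c(0, 1, 2, 3, 0, 1),
  Feature = c("odor=none", "stalk-root=club", "spore-print-color=green",
              "Leaf", "stalk-root=rooted", "odor=none"),
  Quality = c(4000.53101, 1158.21204, 198.173828,
              1.7121772, 832.545044, 569.725098),
  stringsAsFactors = FALSE
)

# For branch nodes Quality is the split gain; for leaves it is the leaf
# value, so leaf rows are dropped before aggregating.
branches <- nodes[nodes$Feature != "Leaf", ]
gain <- aggregate(Quality ~ Feature, data = branches, FUN = sum)
gain <- gain[order(-gain$Quality), ]
print(gain)
```

Keeping branch and leaf rows apart matters here because the same column carries two different quantities.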

When use_int_id=FALSE, columns "Yes", "No", and "Missing" point to model-wide node identifiers in the "ID" column. When use_int_id=TRUE, those columns point to node identifiers from the corresponding trees in the "Node" column.
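
The relationship between the two addressing modes can be sketched with a toy table. The example below uses a base R data.frame with values copied from the first tree of the example output, not an actual xgb.model.dt.tree result. The assumption that "ID" is Tree and Node joined by "-" is read off that output, and the column name Yes_int is purely illustrative.

```r
# Branch nodes of tree 0 from the example output
tree0 <- data.frame(
  Tree    = 0L,
  Node    = 0:2,
  Feature = c("odor=none", "stalk-root=club", "spore-print-color=green"),
  Yes_int = c(1L, 3L, 5L),  # what "Yes" would hold when use_int_id = TRUE
  stringsAsFactors = FALSE
)

# Model-wide identifiers, as in the "ID" and "Yes" columns
# when use_int_id = FALSE
tree0$ID  <- paste(tree0$Tree, tree0$Node, sep = "-")
tree0$Yes <- paste(tree0$Tree, tree0$Yes_int, sep = "-")

# With integer ids a child is located by the pair (Tree, Node);
# with character ids the single "ID" key is enough.
child_by_int <- tree0$Feature[tree0$Tree == 0L & tree0$Node == tree0$Yes_int[1]]
child_by_id  <- tree0$Feature[match(tree0$Yes[1], tree0$ID)]
print(c(child_by_int, child_by_id))
```

Both lookups resolve the 'Yes' child of the root to the same row; the integer form needs the tree index as part of the key, while the character form is unique across the whole model.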

Examples

# Basic use:

data(agaricus.train, package='xgboost')

bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
               eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
#> [1] train-error:0.046522
#> [2] train-error:0.022263

(dt <- xgb.model.dt.tree(colnames(agaricus.train$data), bst))
#>     Tree Node  ID                 Feature         Split  Yes   No Missing
#>  1:    0    0 0-0               odor=none -9.536743e-07  0-1  0-2     0-1
#>  2:    0    1 0-1         stalk-root=club -9.536743e-07  0-3  0-4     0-3
#>  3:    0    2 0-2 spore-print-color=green -9.536743e-07  0-5  0-6     0-5
#>  4:    0    3 0-3                    Leaf            NA <NA> <NA>    <NA>
#>  5:    0    4 0-4                    Leaf            NA <NA> <NA>    <NA>
#>  6:    0    5 0-5                    Leaf            NA <NA> <NA>    <NA>
#>  7:    0    6 0-6                    Leaf            NA <NA> <NA>    <NA>
#>  8:    1    0 1-0       stalk-root=rooted -9.536743e-07  1-1  1-2     1-1
#>  9:    1    1 1-1               odor=none -9.536743e-07  1-3  1-4     1-3
#> 10:    1    2 1-2                    Leaf            NA <NA> <NA>    <NA>
#> 11:    1    3 1-3                    Leaf            NA <NA> <NA>    <NA>
#> 12:    1    4 1-4                    Leaf            NA <NA> <NA>    <NA>
#>          Quality      Cover
#>  1: 4000.5310100 1628.25000
#>  2: 1158.2120400  924.50000
#>  3:  198.1738280  703.75000
#>  4:    1.7121772  812.00000
#>  5:   -1.7004405  112.50000
#>  6:   -1.9407086  690.50000
#>  7:    1.8596492   13.25000
#>  8:  832.5450440  788.85205
#>  9:  569.7250980  768.38971
#> 10:   -6.2362447   20.46239
#> 11:    0.7847176  458.93686
#> 12:   -0.9685304  309.45282

# This bst model already has feature_names stored with it, so those would be used when
# feature_names is not set:
(dt <- xgb.model.dt.tree(model = bst))
#>     Tree Node  ID                 Feature         Split  Yes   No Missing
#>  1:    0    0 0-0               odor=none -9.536743e-07  0-1  0-2     0-1
#>  2:    0    1 0-1         stalk-root=club -9.536743e-07  0-3  0-4     0-3
#>  3:    0    2 0-2 spore-print-color=green -9.536743e-07  0-5  0-6     0-5
#>  4:    0    3 0-3                    Leaf            NA <NA> <NA>    <NA>
#>  5:    0    4 0-4                    Leaf            NA <NA> <NA>    <NA>
#>  6:    0    5 0-5                    Leaf            NA <NA> <NA>    <NA>
#>  7:    0    6 0-6                    Leaf            NA <NA> <NA>    <NA>
#>  8:    1    0 1-0       stalk-root=rooted -9.536743e-07  1-1  1-2     1-1
#>  9:    1    1 1-1               odor=none -9.536743e-07  1-3  1-4     1-3
#> 10:    1    2 1-2                    Leaf            NA <NA> <NA>    <NA>
#> 11:    1    3 1-3                    Leaf            NA <NA> <NA>    <NA>
#> 12:    1    4 1-4                    Leaf            NA <NA> <NA>    <NA>
#>          Quality      Cover
#>  1: 4000.5310100 1628.25000
#>  2: 1158.2120400  924.50000
#>  3:  198.1738280  703.75000
#>  4:    1.7121772  812.00000
#>  5:   -1.7004405  112.50000
#>  6:   -1.9407086  690.50000
#>  7:    1.8596492   13.25000
#>  8:  832.5450440  788.85205
#>  9:  569.7250980  768.38971
#> 10:   -6.2362447   20.46239
#> 11:    0.7847176  458.93686
#> 12:   -0.9685304  309.45282

# How to match feature names of splits that are following a current 'Yes' branch:

merge(dt, dt[, .(ID, Y.Feature=Feature)], by.x='Yes', by.y='ID', all.x=TRUE)[order(Tree,Node)]
#>      Yes Tree Node  ID                 Feature         Split   No Missing
#>  1:  0-1    0    0 0-0               odor=none -9.536743e-07  0-2     0-1
#>  2:  0-3    0    1 0-1         stalk-root=club -9.536743e-07  0-4     0-3
#>  3:  0-5    0    2 0-2 spore-print-color=green -9.536743e-07  0-6     0-5
#>  4: <NA>    0    3 0-3                    Leaf            NA <NA>    <NA>
#>  5: <NA>    0    4 0-4                    Leaf            NA <NA>    <NA>
#>  6: <NA>    0    5 0-5                    Leaf            NA <NA>    <NA>
#>  7: <NA>    0    6 0-6                    Leaf            NA <NA>    <NA>
#>  8:  1-1    1    0 1-0       stalk-root=rooted -9.536743e-07  1-2     1-1
#>  9:  1-3    1    1 1-1               odor=none -9.536743e-07  1-4     1-3
#> 10: <NA>    1    2 1-2                    Leaf            NA <NA>    <NA>
#> 11: <NA>    1    3 1-3                    Leaf            NA <NA>    <NA>
#> 12: <NA>    1    4 1-4                    Leaf            NA <NA>    <NA>
#>          Quality      Cover       Y.Feature
#>  1: 4000.5310100 1628.25000 stalk-root=club
#>  2: 1158.2120400  924.50000            Leaf
#>  3:  198.1738280  703.75000            Leaf
#>  4:    1.7121772  812.00000            <NA>
#>  5:   -1.7004405  112.50000            <NA>
#>  6:   -1.9407086  690.50000            <NA>
#>  7:    1.8596492   13.25000            <NA>
#>  8:  832.5450440  788.85205       odor=none
#>  9:  569.7250980  768.38971            Leaf
#> 10:   -6.2362447   20.46239            <NA>
#> 11:    0.7847176  458.93686            <NA>
#> 12:   -0.9685304  309.45282            <NA>