Parse a boosted tree model text dump into a data.table structure.
xgb.model.dt.tree(feature_names = NULL, model = NULL, text = NULL, trees = NULL, use_int_id = FALSE, ...)
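Argument | Description |
---|---|
feature_names | character vector of feature names. If the model already contains feature names, those would be used when feature_names is not set. |
model | object of class xgb.Booster |
text | character vector with a model text dump, as generated by xgb.dump (with with_stats = TRUE, so that Quality and Cover are available) |
trees | an integer vector of tree indices that should be parsed. If set to NULL, all trees of the model are parsed. |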
use_int_id | a logical flag indicating whether nodes in columns "Yes", "No", "Missing" should be represented as integers (when TRUE) or as "Tree-Node" character strings (when FALSE). |
... | currently not used. |
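As a rough illustration of the trees and use_int_id arguments, the sketch below trains a small model on the agaricus data used elsewhere on this page and parses only its first tree. It assumes the xgboost package is installed and that tree indices passed to trees are zero-based, like the Tree column; the variable names dt_first and dt_int are made up for this example.

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
bst <- xgb.train(
  params = list(max_depth = 2, eta = 1, nthread = 2,
                objective = "binary:logistic"),
  data = dtrain,
  nrounds = 2
)

# Parse only the first tree (assumption: tree indices are zero-based).
dt_first <- xgb.model.dt.tree(model = bst, trees = 0)

# Same tree, but Yes/No/Missing hold integer node ids within the tree
# instead of "Tree-Node" character strings.
dt_int <- xgb.model.dt.tree(model = bst, trees = 0, use_int_id = TRUE)
```

Restricting trees can keep the table small for models with many boosting rounds.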
A data.table with detailed information about model trees' nodes.
The columns of the data.table are:
Tree
: integer ID of a tree in a model (zero-based index)
Node
: integer ID of a node in a tree (zero-based index)
ID
: character identifier of a node in a model (only when use_int_id=FALSE)
Feature
: for a branch node, it's a feature id or name (when available);
for a leaf node, it simply labels it as 'Leaf'
Split
: location of the split for a branch node (split condition is always "less than")
Yes
: ID of the next node when the split condition is met
No
: ID of the next node when the split condition is not met
Missing
: ID of the next node when branch value is missing
Quality
: either the split gain (change in loss) or the leaf value
Cover
: metric related to the number of observations either seen by a split
or collected by a leaf during training.
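To make the roles of the Split, Yes, No, Missing and Quality columns concrete, here is a minimal sketch that walks one observation down a single tree. The node table is copied by hand from tree 0 in the example output on this page; walk_tree() is a hypothetical helper written for this illustration, not an xgboost function, and it treats any feature absent from the input as missing.

```r
# Nodes of tree 0, copied from the example output on this page.
nodes <- data.frame(
  ID      = c("0-0", "0-1", "0-2", "0-3", "0-4", "0-5", "0-6"),
  Feature = c("odor=none", "stalk-root=club", "spore-print-color=green",
              "Leaf", "Leaf", "Leaf", "Leaf"),
  Split   = c(rep(-9.536743e-07, 3), rep(NA, 4)),
  Yes     = c("0-1", "0-3", "0-5", NA, NA, NA, NA),
  No      = c("0-2", "0-4", "0-6", NA, NA, NA, NA),
  Missing = c("0-1", "0-3", "0-5", NA, NA, NA, NA),
  Quality = c(4000.53101, 1158.21204, 198.173828,
              1.7121772, -1.7004405, -1.9407086, 1.8596492),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow Yes/No/Missing links until a leaf is reached
# and return that leaf's value (stored in the Quality column).
walk_tree <- function(nodes, x, root = "0-0") {
  id <- root
  repeat {
    node <- nodes[nodes$ID == id, ]
    if (node$Feature == "Leaf") {
      return(node$Quality)
    }
    # Features not present in x come back as NA and follow the Missing link.
    value <- unname(x[node$Feature])
    if (is.na(value)) {
      id <- node$Missing
    } else if (value < node$Split) {
      id <- node$Yes   # split condition "less than" is met
    } else {
      id <- node$No
    }
  }
}

leaf_a <- walk_tree(nodes, c("odor=none" = 1))
leaf_b <- walk_tree(nodes, c("odor=none" = NA_real_))
```

In the first call the observation fails the "less than" test at the root, moves to node 0-2, and then follows the Missing link to leaf 0-5; in the second call every feature is missing, so it ends in leaf 0-3.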
When use_int_id=FALSE, columns "Yes", "No", and "Missing" point to model-wide node identifiers
in the "ID" column. When use_int_id=TRUE, those columns point to node identifiers from
the corresponding trees in the "Node" column.
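The two representations carry the same information, so one can be derived from the other. The following sketch, in base R, converts "Tree-Node" strings to within-tree integer ids and back; the vectors are illustrative values in the same shape as the example output, and the variable names are assumptions of this sketch.

```r
# "Yes" targets in the character form produced when use_int_id = FALSE.
tree   <- c(0L, 0L, 0L, 0L, 1L, 1L)
yes_id <- c("0-1", "0-3", "0-5", NA, "1-1", "1-3")

# Drop the "Tree-" prefix to get the node id within each tree.
yes_node <- as.integer(sub("^[0-9]+-", "", yes_id))

# Rebuild the model-wide identifier, keeping NA for leaves.
yes_back <- ifelse(is.na(yes_node), NA_character_,
                   paste(tree, yes_node, sep = "-"))
```

The integer form is only unique within a tree, so it has to be used together with the Tree column when joining rows.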
# Basic use:
data(agaricus.train, package='xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
               eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
#> [1] train-error:0.046522
#> [2] train-error:0.022263
(dt <- xgb.model.dt.tree(colnames(agaricus.train$data), bst))
#>     Tree Node  ID                 Feature         Split  Yes   No Missing
#>  1:    0    0 0-0               odor=none -9.536743e-07  0-1  0-2     0-1
#>  2:    0    1 0-1         stalk-root=club -9.536743e-07  0-3  0-4     0-3
#>  3:    0    2 0-2 spore-print-color=green -9.536743e-07  0-5  0-6     0-5
#>  4:    0    3 0-3                    Leaf            NA <NA> <NA>    <NA>
#>  5:    0    4 0-4                    Leaf            NA <NA> <NA>    <NA>
#>  6:    0    5 0-5                    Leaf            NA <NA> <NA>    <NA>
#>  7:    0    6 0-6                    Leaf            NA <NA> <NA>    <NA>
#>  8:    1    0 1-0       stalk-root=rooted -9.536743e-07  1-1  1-2     1-1
#>  9:    1    1 1-1               odor=none -9.536743e-07  1-3  1-4     1-3
#> 10:    1    2 1-2                    Leaf            NA <NA> <NA>    <NA>
#> 11:    1    3 1-3                    Leaf            NA <NA> <NA>    <NA>
#> 12:    1    4 1-4                    Leaf            NA <NA> <NA>    <NA>
#>          Quality      Cover
#>  1: 4000.5310100 1628.25000
#>  2: 1158.2120400  924.50000
#>  3:  198.1738280  703.75000
#>  4:    1.7121772  812.00000
#>  5:   -1.7004405  112.50000
#>  6:   -1.9407086  690.50000
#>  7:    1.8596492   13.25000
#>  8:  832.5450440  788.85205
#>  9:  569.7250980  768.38971
#> 10:   -6.2362447   20.46239
#> 11:    0.7847176  458.93686
#> 12:   -0.9685304  309.45282

# This bst model already has feature_names stored with it, so those would be used when
# feature_names is not set:
(dt <- xgb.model.dt.tree(model = bst))
#>     Tree Node  ID                 Feature         Split  Yes   No Missing
#>  1:    0    0 0-0               odor=none -9.536743e-07  0-1  0-2     0-1
#>  2:    0    1 0-1         stalk-root=club -9.536743e-07  0-3  0-4     0-3
#>  3:    0    2 0-2 spore-print-color=green -9.536743e-07  0-5  0-6     0-5
#>  4:    0    3 0-3                    Leaf            NA <NA> <NA>    <NA>
#>  5:    0    4 0-4                    Leaf            NA <NA> <NA>    <NA>
#>  6:    0    5 0-5                    Leaf            NA <NA> <NA>    <NA>
#>  7:    0    6 0-6                    Leaf            NA <NA> <NA>    <NA>
#>  8:    1    0 1-0       stalk-root=rooted -9.536743e-07  1-1  1-2     1-1
#>  9:    1    1 1-1               odor=none -9.536743e-07  1-3  1-4     1-3
#> 10:    1    2 1-2                    Leaf            NA <NA> <NA>    <NA>
#> 11:    1    3 1-3                    Leaf            NA <NA> <NA>    <NA>
#> 12:    1    4 1-4                    Leaf            NA <NA> <NA>    <NA>
#>          Quality      Cover
#>  1: 4000.5310100 1628.25000
#>  2: 1158.2120400  924.50000
#>  3:  198.1738280  703.75000
#>  4:    1.7121772  812.00000
#>  5:   -1.7004405  112.50000
#>  6:   -1.9407086  690.50000
#>  7:    1.8596492   13.25000
#>  8:  832.5450440  788.85205
#>  9:  569.7250980  768.38971
#> 10:   -6.2362447   20.46239
#> 11:    0.7847176  458.93686
#> 12:   -0.9685304  309.45282

# How to match feature names of splits that are following a current 'Yes' branch:
merge(dt, dt[, .(ID, Y.Feature=Feature)], by.x='Yes', by.y='ID', all.x=TRUE)[order(Tree,Node)]
#>      Yes Tree Node  ID                 Feature         Split   No Missing
#>  1:  0-1    0    0 0-0               odor=none -9.536743e-07  0-2     0-1
#>  2:  0-3    0    1 0-1         stalk-root=club -9.536743e-07  0-4     0-3
#>  3:  0-5    0    2 0-2 spore-print-color=green -9.536743e-07  0-6     0-5
#>  4: <NA>    0    3 0-3                    Leaf            NA <NA>    <NA>
#>  5: <NA>    0    4 0-4                    Leaf            NA <NA>    <NA>
#>  6: <NA>    0    5 0-5                    Leaf            NA <NA>    <NA>
#>  7: <NA>    0    6 0-6                    Leaf            NA <NA>    <NA>
#>  8:  1-1    1    0 1-0       stalk-root=rooted -9.536743e-07  1-2     1-1
#>  9:  1-3    1    1 1-1               odor=none -9.536743e-07  1-4     1-3
#> 10: <NA>    1    2 1-2                    Leaf            NA <NA>    <NA>
#> 11: <NA>    1    3 1-3                    Leaf            NA <NA>    <NA>
#> 12: <NA>    1    4 1-4                    Leaf            NA <NA>    <NA>
#>          Quality      Cover       Y.Feature
#>  1: 4000.5310100 1628.25000 stalk-root=club
#>  2: 1158.2120400  924.50000            Leaf
#>  3:  198.1738280  703.75000            Leaf
#>  4:    1.7121772  812.00000            <NA>
#>  5:   -1.7004405  112.50000            <NA>
#>  6:   -1.9407086  690.50000            <NA>
#>  7:    1.8596492   13.25000            <NA>
#>  8:  832.5450440  788.85205       odor=none
#>  9:  569.7250980  768.38971            Leaf
#> 10:   -6.2362447   20.46239            <NA>
#> 11:    0.7847176  458.93686            <NA>
#> 12:   -0.9685304  309.45282            <NA>
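
Because Quality holds the split gain for branch nodes, the parsed table can also be summarised per feature. The sketch below is one possible way to do that in base R; the branch rows are copied by hand from the output above, and the names branches and gain_by_feature are assumptions of this example rather than anything provided by xgboost.

```r
# Branch (non-leaf) rows copied from the example output above.
branches <- data.frame(
  Tree    = c(0L, 0L, 0L, 1L, 1L),
  Feature = c("odor=none", "stalk-root=club", "spore-print-color=green",
              "stalk-root=rooted", "odor=none"),
  Quality = c(4000.53101, 1158.21204, 198.173828, 832.545044, 569.725098),
  stringsAsFactors = FALSE
)

# Total split gain per feature across all trees, largest first.
gain_by_feature <- sort(tapply(branches$Quality, branches$Feature, sum),
                        decreasing = TRUE)
```

With the full data.table from the examples, the same summary could be written as dt[Feature != "Leaf", .(Gain = sum(Quality)), by = Feature].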