Parse a boosted tree model text dump

Parse a boosted tree model text dump into a data.table structure.

xgb.model.dt.tree(feature_names = NULL, model = NULL, text = NULL,
  trees = NULL, use_int_id = FALSE, ...)

Arguments

feature_names	character vector of feature names. If the model already contains feature names, those are used when `feature_names = NULL` (the default). A non-NULL `feature_names` overrides the names stored in the model.
model	object of class `xgb.Booster`
text	`character` vector previously generated by the `xgb.dump` function (where parameter `with_stats = TRUE` should have been set). `text` takes precedence over `model`.
trees	an integer vector of tree indices that should be parsed. If set to `NULL`, all trees of the model are parsed. This can be useful, e.g., in multiclass classification to get only the trees of one class. IMPORTANT: the tree index in xgboost models is zero-based (e.g., use `trees = 0:4` for the first 5 trees).
use_int_id	a logical flag indicating whether nodes in columns "Yes", "No", "Missing" should be represented as integers (when TRUE) or as "Tree-Node" character strings (when FALSE).
...	currently not used.
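
As an illustration of the zero-based `trees` argument in the multiclass case, the sketch below computes the indices of the trees for one class. It assumes the trees are stored round-robin by class (one tree per class in each boosting round); `trees_for_class` is a hypothetical helper, not part of xgboost.

```r
# Hypothetical helper (not part of xgboost): zero-based indices of the trees
# belonging to one class, ASSUMING one tree per class per boosting round,
# stored in round-robin order.
trees_for_class <- function(class_id, num_class, nrounds) {
  stopifnot(class_id >= 0, class_id < num_class)
  seq(from = class_id, by = num_class, length.out = nrounds)
}

# Class 1 (zero-based) of a 3-class model trained for 4 rounds
trees_for_class(1, num_class = 3, nrounds = 4)
#> [1]  1  4  7 10
```

Under that assumption, the result could be passed on as `trees = trees_for_class(1, 3, 4)`.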

Value

A data.table with detailed information about model trees' nodes.

The columns of the data.table are:

Tree: integer ID of a tree in a model (zero-based index)
Node: integer ID of a node in a tree (zero-based index)
ID: character identifier of a node in a model (only when use_int_id=FALSE)
Feature: for a branch node, it's a feature id or name (when available); for a leaf node, it simply labels it as 'Leaf'
Split: location of the split for a branch node (split condition is always "less than")
Yes: ID of the next node when the split condition is met
No: ID of the next node when the split condition is not met
Missing: ID of the next node when branch value is missing
Quality: either the split gain (change in loss) or the leaf value
Cover: metric related to the number of observations either seen by a split or collected by a leaf during training.

When use_int_id=FALSE, columns "Yes", "No", and "Missing" point to model-wide node identifiers in the "ID" column. When use_int_id=TRUE, those columns point to node identifiers from the corresponding trees in the "Node" column.
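
The relationship between the two representations can be sketched as follows. The table is a hand-built stand-in for the integer form of a single tree (node numbers derived from the example output on this page, not produced by running the function), and `to_id` is a hypothetical helper that rebuilds the "Tree-Node" strings.

```r
library(data.table)

# Hand-built stand-in for one tree in the use_int_id = TRUE shape:
# Yes/No/Missing hold per-tree node numbers, leaves hold NA.
dt_int <- data.table(
  Tree    = 0L,
  Node    = 0:6,
  Feature = c("odor=none", "stalk-root=club", "spore-print-color=green",
              rep("Leaf", 4)),
  Yes     = c(1L, 3L, 5L, rep(NA_integer_, 4)),
  No      = c(2L, 4L, 6L, rep(NA_integer_, 4)),
  Missing = c(1L, 3L, 5L, rep(NA_integer_, 4))
)

# Hypothetical helper: build a model-wide "Tree-Node" identifier, keeping NA
to_id <- function(tree, node) {
  ifelse(is.na(node), NA_character_, paste(tree, node, sep = "-"))
}

# Convert to the character form: add ID and rewrite the three pointer columns
dt_chr <- copy(dt_int)
dt_chr[, ID := to_id(Tree, Node)]
dt_chr[, Yes := to_id(Tree, Yes)]
dt_chr[, No := to_id(Tree, No)]
dt_chr[, Missing := to_id(Tree, Missing)]
dt_chr
```

In the character form, a pointer such as "0-3" is unique across the whole model, whereas the integer 3 is only meaningful together with the `Tree` column.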

Examples

# Basic use:

data(agaricus.train, package='xgboost')

bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
               eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
#> [1]	train-error:0.046522 
#> [2]	train-error:0.022263 

(dt <- xgb.model.dt.tree(colnames(agaricus.train$data), bst))
#>     Tree Node  ID                 Feature         Split  Yes   No Missing
#>  1:    0    0 0-0               odor=none -9.536743e-07  0-1  0-2     0-1
#>  2:    0    1 0-1         stalk-root=club -9.536743e-07  0-3  0-4     0-3
#>  3:    0    2 0-2 spore-print-color=green -9.536743e-07  0-5  0-6     0-5
#>  4:    0    3 0-3                    Leaf            NA <NA> <NA>    <NA>
#>  5:    0    4 0-4                    Leaf            NA <NA> <NA>    <NA>
#>  6:    0    5 0-5                    Leaf            NA <NA> <NA>    <NA>
#>  7:    0    6 0-6                    Leaf            NA <NA> <NA>    <NA>
#>  8:    1    0 1-0       stalk-root=rooted -9.536743e-07  1-1  1-2     1-1
#>  9:    1    1 1-1               odor=none -9.536743e-07  1-3  1-4     1-3
#> 10:    1    2 1-2                    Leaf            NA <NA> <NA>    <NA>
#> 11:    1    3 1-3                    Leaf            NA <NA> <NA>    <NA>
#> 12:    1    4 1-4                    Leaf            NA <NA> <NA>    <NA>
#>          Quality      Cover
#>  1: 4000.5310100 1628.25000
#>  2: 1158.2120400  924.50000
#>  3:  198.1738280  703.75000
#>  4:    1.7121772  812.00000
#>  5:   -1.7004405  112.50000
#>  6:   -1.9407086  690.50000
#>  7:    1.8596492   13.25000
#>  8:  832.5450440  788.85205
#>  9:  569.7250980  768.38971
#> 10:   -6.2362447   20.46239
#> 11:    0.7847176  458.93686
#> 12:   -0.9685304  309.45282

# This bst model already has feature_names stored with it, so those are used when
# feature_names is not set:
(dt <- xgb.model.dt.tree(model = bst))
#>     Tree Node  ID                 Feature         Split  Yes   No Missing
#>  1:    0    0 0-0               odor=none -9.536743e-07  0-1  0-2     0-1
#>  2:    0    1 0-1         stalk-root=club -9.536743e-07  0-3  0-4     0-3
#>  3:    0    2 0-2 spore-print-color=green -9.536743e-07  0-5  0-6     0-5
#>  4:    0    3 0-3                    Leaf            NA <NA> <NA>    <NA>
#>  5:    0    4 0-4                    Leaf            NA <NA> <NA>    <NA>
#>  6:    0    5 0-5                    Leaf            NA <NA> <NA>    <NA>
#>  7:    0    6 0-6                    Leaf            NA <NA> <NA>    <NA>
#>  8:    1    0 1-0       stalk-root=rooted -9.536743e-07  1-1  1-2     1-1
#>  9:    1    1 1-1               odor=none -9.536743e-07  1-3  1-4     1-3
#> 10:    1    2 1-2                    Leaf            NA <NA> <NA>    <NA>
#> 11:    1    3 1-3                    Leaf            NA <NA> <NA>    <NA>
#> 12:    1    4 1-4                    Leaf            NA <NA> <NA>    <NA>
#>          Quality      Cover
#>  1: 4000.5310100 1628.25000
#>  2: 1158.2120400  924.50000
#>  3:  198.1738280  703.75000
#>  4:    1.7121772  812.00000
#>  5:   -1.7004405  112.50000
#>  6:   -1.9407086  690.50000
#>  7:    1.8596492   13.25000
#>  8:  832.5450440  788.85205
#>  9:  569.7250980  768.38971
#> 10:   -6.2362447   20.46239
#> 11:    0.7847176  458.93686
#> 12:   -0.9685304  309.45282

# For each node, look up the feature of the child node reached via its 'Yes' branch:

merge(dt, dt[, .(ID, Y.Feature=Feature)], by.x='Yes', by.y='ID', all.x=TRUE)[order(Tree,Node)]
#>      Yes Tree Node  ID                 Feature         Split   No Missing
#>  1:  0-1    0    0 0-0               odor=none -9.536743e-07  0-2     0-1
#>  2:  0-3    0    1 0-1         stalk-root=club -9.536743e-07  0-4     0-3
#>  3:  0-5    0    2 0-2 spore-print-color=green -9.536743e-07  0-6     0-5
#>  4: <NA>    0    3 0-3                    Leaf            NA <NA>    <NA>
#>  5: <NA>    0    4 0-4                    Leaf            NA <NA>    <NA>
#>  6: <NA>    0    5 0-5                    Leaf            NA <NA>    <NA>
#>  7: <NA>    0    6 0-6                    Leaf            NA <NA>    <NA>
#>  8:  1-1    1    0 1-0       stalk-root=rooted -9.536743e-07  1-2     1-1
#>  9:  1-3    1    1 1-1               odor=none -9.536743e-07  1-4     1-3
#> 10: <NA>    1    2 1-2                    Leaf            NA <NA>    <NA>
#> 11: <NA>    1    3 1-3                    Leaf            NA <NA>    <NA>
#> 12: <NA>    1    4 1-4                    Leaf            NA <NA>    <NA>
#>          Quality      Cover       Y.Feature
#>  1: 4000.5310100 1628.25000 stalk-root=club
#>  2: 1158.2120400  924.50000            Leaf
#>  3:  198.1738280  703.75000            Leaf
#>  4:    1.7121772  812.00000            <NA>
#>  5:   -1.7004405  112.50000            <NA>
#>  6:   -1.9407086  690.50000            <NA>
#>  7:    1.8596492   13.25000            <NA>
#>  8:  832.5450440  788.85205       odor=none
#>  9:  569.7250980  768.38971            Leaf
#> 10:   -6.2362447   20.46239            <NA>
#> 11:    0.7847176  458.93686            <NA>
#> 12:   -0.9685304  309.45282            <NA>
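
Because `Quality` holds the split gain for branch nodes, the table can also be summarised per feature. The sketch below is one rough way to do that; the `branches` table is hand-copied from the branch rows of the output above (with the real result one would start from `dt[Feature != "Leaf"]`), and this is not necessarily how the package computes its own importance measures.

```r
library(data.table)

# Branch rows copied by hand from the printed output (leaves omitted)
branches <- data.table(
  Feature = c("odor=none", "stalk-root=club", "spore-print-color=green",
              "stalk-root=rooted", "odor=none"),
  Quality = c(4000.53101, 1158.21204, 198.173828, 832.545044, 569.725098)
)

# Total split gain per feature, largest first
gain <- branches[, .(Gain = sum(Quality)), by = Feature][order(-Gain)]

# Each feature's share of the total gain
gain[, Share := Gain / sum(Gain)]
gain
```

A feature that is split on in several trees, such as `odor=none` here, contributes the sum of its gains.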
