xgboost.plot_tree: binary feature interpretation

Question:

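I've built an XGBoost model and am trying to inspect the individual estimators. For reference, this is a binary classification task with discrete and continuous input features. The input feature matrix is a scipy.sparse.csr_matrix.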

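However, when I inspect an individual estimator, I find it hard to interpret the binary input features, as shown below. The real-valued f60150 in the bottom-most chart is easy to interpret: its criterion is within the expected range for that feature. The comparisons made on the binary features, <X> < -9.53674e-07, make no sense to me, though. Each of these features is either 1 or 0. -9.53674e-07 is a very small negative number, and I imagine it is just some floating-point idiosyncrasy of XGBoost or its underlying plotting library, but such a comparison is meaningless when the feature is never negative. Can someone help me understand which direction (i.e. "yes, missing" vs. "no") corresponds to the true side and which to the false side of these binary feature nodes?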

Here is a reproducible example:

import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt

def booleanize_csr_matrix(mat):
    ''' Convert sparse matrix with positive integer elements to 1s '''
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    result = scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape
    )
    return result

### Setup dataset
res = fetch_20newsgroups()

text = res.data
outcome = res.target

### Use default params from CountVectorizer to create initial count matrix
vec = CountVectorizer()
X = vec.fit_transform(text)

# Whether to "booleanize" the input matrix
booleanize = True

# Whether to, after "booleanizing", convert the data type to match what's returned by `vec.fit_transform(text)`
to_int = True

if booleanize and to_int:
    X = booleanize_csr_matrix(X)
    X = X.astype(np.int64)

# Make it a binary classification problem
y = np.where(outcome == 1, 1, 0)

# Random state ensures we will be able to compare trees and their features consistently
model = XGBClassifier(random_state=100)
model.fit(X, y)

plot_tree(model, rankdir='LR'); plt.show()

Running the above with booleanize and to_int set to True yields the following chart:

Running the above with booleanize and to_int set to False yields the following chart:

Heck, even with a really simple example I get the "right" results, regardless of whether X or y are integer or floating-point types.

X = np.matrix(
    [
        [1,0],
        [1,0],
        [0,1],
        [0,1],
        [1,1],
        [1,0],
        [0,0],
        [0,0],
        [1,1],
        [0,1]
    ]
)

y = np.array([1,0,0,0,1,1,1,0,1,1])

model = XGBClassifier(random_state=100)
model.fit(X, y)

plot_tree(model, rankdir='LR'); plt.show()

Tags: python, machine-learning, xgboost

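Comment: I think xgboost used to be poor at handling categorical variables. That is one of the reasons catboost (catboost.ai/en/docs) was introduced. Version 1.5 of xgboost added experimental support for categorical variables (xgboost.readthedocs.io/en/stable/tutorials/categorical.html), although as of today it still appears to be experimental. I don't expect this comment to help you, since more than 4 years have passed since you posted, but I hope it helps someone with a similar problem.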

Answer:

0 votes Bilal Asghar 11/8/2023 #1

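The comparison values you see in the XGBoost tree visualization are the thresholds a decision tree uses to split the data into two branches. For binary features, such as those created by the booleanize_csr_matrix function, the comparison is indeed what determines which branch (True or False) a sample follows.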

In the context of an XGBoost binary classification model:

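If a binary feature (such as f60150) has a comparison such as <X> < -9.53674e-07, the tree is splitting samples according to whether the value of f60150 is less than -9.53674e-07.
If the feature is never negative (i.e. it is 1 or 0), the comparison <X> < -9.53674e-07 should still be read the same way. It is essentially equivalent to asking "is f60150 equal to 0 or 1?"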
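To make the direction concrete, here is a minimal sketch of how a single split node could route one feature value. It is not XGBoost code: `route` is a hypothetical helper, and it assumes the plotted labels mean what they say (a value satisfying `value < threshold` goes to "yes", and a missing value follows the branch labelled "yes, missing"). It also rests on my understanding that XGBoost treats entries that are not stored in a sparse matrix as missing rather than as 0, which is worth checking against the XGBoost documentation for your version.

```python
import math

# Threshold shown on the binary-feature nodes in the question's plot.
THRESHOLD = -9.53674e-07


def route(value, threshold, missing_branch="yes"):
    """Return the branch label one feature value follows at a split node.

    Hypothetical helper for illustration only; not part of XGBoost.
    """
    # A missing value (None or NaN) follows the node's default direction.
    if value is None or math.isnan(value):
        return missing_branch
    # Otherwise the node tests `value < threshold`: true -> "yes", false -> "no".
    return "yes" if value < threshold else "no"


for v in (None, 0.0, 1.0):
    print(v, "->", route(v, THRESHOLD))
```

Under these assumptions, neither an explicit 0 nor a 1 can be smaller than the negative threshold, so every stored value goes to "no" and only missing entries go to "yes, missing". With a CSR matrix whose stored values are all 1, that would make "no" the "feature is 1" side and "yes, missing" the "feature is absent (0)" side.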

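XGBoost may represent binary features in a way that uses floating-point values for splits, but the comparison still serves to separate samples into different branches based on the feature's value. In practice, this small negative number is used as the split threshold, and it does not change how the binary feature is interpreted.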

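So, for a binary feature, the comparison value is used to determine whether the feature is 0 or 1, and that is how the tree makes its decision during classification. For binary features these comparisons are usually straightforward: they divide the samples into two groups, one where the feature is 0 and one where it is 1.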
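One way to check which child each branch label points at, without relying on the picture, is to read the tree's text dump. The sketch below parses a hand-written sample line in the layout that `Booster.get_dump()` produces as far as I know (`node:[feature<threshold] yes=..,no=..,missing=..`). The node ids and leaf values in `SAMPLE_DUMP` are made up for illustration, and `parse_splits` is a hypothetical helper, not an XGBoost function.

```python
import re

# Hand-written sample in the text-dump layout; node ids and leaf values are made up.
SAMPLE_DUMP = (
    "0:[f60150<-9.53674e-07] yes=1,no=2,missing=1\n"
    "\t1:leaf=0.1\n"
    "\t2:leaf=-0.1\n"
)

# Matches split nodes only; leaf lines have no "[feature<threshold]" part.
SPLIT_RE = re.compile(
    r"(?P<node>\d+):\[(?P<feature>[^<\]]+)<(?P<threshold>[^\]]+)\]\s+"
    r"yes=(?P<yes>\d+),no=(?P<no>\d+),missing=(?P<missing>\d+)"
)


def parse_splits(dump_text):
    """Extract split nodes and their yes/no/missing child ids from one tree dump."""
    splits = []
    for m in SPLIT_RE.finditer(dump_text):
        splits.append(
            {
                "node": int(m["node"]),
                "feature": m["feature"],
                "threshold": float(m["threshold"]),
                "yes": int(m["yes"]),
                "no": int(m["no"]),
                "missing": int(m["missing"]),
            }
        )
    return splits


for split in parse_splits(SAMPLE_DUMP):
    same = "yes" if split["missing"] == split["yes"] else "no"
    print(f"{split['feature']} < {split['threshold']}: missing follows the '{same}' branch")
```

With a fitted model, the strings to parse would come from `model.get_booster().get_dump()`, which returns one dump per tree. A node whose `missing` id equals its `yes` id corresponds to the "yes, missing" edge in the plot.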
