xgboost Random Forest with sparse matrix data and multinomial Y

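Question:

I'm not sure whether it is possible to combine several of the nice features of xgboost in the way I need (?), but what I'm trying to do is run a Random Forest with sparse data predictors on a multi-class dependent variable.

I know that xgboost can do any one of those things:

Random Forest via tweaking xgboost parameters: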

bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "binary:logistic")

Sparse matrix predictors:

bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4, eta = 1, nthread = 2, nround = 10,objective = "binary:logistic")

Multinomial (multiclass) dependent variable models via multi:softmax or multi:softprob:

xgboost(data = data, label = multinomial_vector, max.depth = 4, eta = 1, nthread = 2, nround = 10,objective = "multi:softmax")

However, when I try to do all of them at once, I run into an error about non-conforming lengths:

sparse_matrix     <- sparse.model.matrix(TripType~.-1, data = train)
Y                 <- train$TripType
bst               <- xgboost(data = sparse_matrix, label = Y, max.depth = 4, num_parallel_tree = 100, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "multi:softmax")
Error in xgb.setinfo(dmat, names(p), p[[1]]) : 
  The length of labels must equal to the number of rows in the input data
length(Y)
[1] 647054
length(sparse_matrix)
[1] 66210988200
nrow(sparse_matrix)
[1] 642925

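The length error I'm getting compares the length of my single multi-class dependent vector (call it n) with the length of the sparse matrix index, which I believe is j * n for j predictors.

The specific use case here is the Kaggle.com Walmart competition (the data is public, but very large by default: about 650,000 rows and several thousand candidate features). I've been running multinomial RF models on it via H2O, but it sounds like a lot of other people have been using xgboost, so I wonder whether this is possible.

If it isn't possible, would it make sense to estimate each level of the dependent variable separately and then combine the results?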
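
One thing worth checking against the output above: nrow(sparse_matrix) is 642925 while length(Y) is 647054, so the matrix has 4129 fewer rows than the label vector. A common cause is that sparse.model.matrix builds a model frame with the default na.action (na.omit), which drops rows containing NA; whether that is what happens with this particular data is an assumption. The sketch below uses a hypothetical toy data frame in place of train, and the names train_cc, Y0 and num_class are introduced only for illustration. It also recodes the labels, since multi:softmax expects integer classes starting at 0 together with a num_class parameter.

```r
library(Matrix)

# Hypothetical toy stand-in for `train`; row 3 has a missing predictor value.
train <- data.frame(
  TripType = c(3, 5, 3, 8, 5),
  dept     = factor(c("a", "b", NA, "a", "c")),
  count    = c(1, 2, 3, 4, 5)
)

# With the default na.action (na.omit), the row containing NA is dropped,
# so the matrix ends up with fewer rows than the label vector.
sm_raw <- sparse.model.matrix(TripType ~ . - 1, data = train)
c(rows = nrow(sm_raw), labels = length(train$TripType))

# Build predictors and labels from the same complete-case subset.
train_cc      <- train[complete.cases(train), ]
sparse_matrix <- sparse.model.matrix(TripType ~ . - 1, data = train_cc)

# multi:softmax expects integer labels 0 .. num_class - 1.
Y0        <- as.integer(factor(train_cc$TripType)) - 1L
num_class <- length(unique(Y0))

# Rows and labels now line up.
stopifnot(nrow(sparse_matrix) == length(Y0))
```

With rows and labels aligned, sparse_matrix and Y0 could be passed to the xgboost call from the question, adding num_class = num_class alongside objective = "multi:softmax".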
