XGBoost Bayesian Optimisation in tidymodels

Asked by arne maa on 7/28/2023 · Last edited by arne maa · Modified 7/29/2023 · Viewed 96 times

Q:

I am trying to apply Bayesian optimisation to a binary classification problem (XGBoost) within the tidymodels framework.

  1. Are there any flaws in my code? The model has been running for 2 days now on a 2-CPU (72-core) Linux machine. My dataset is fairly large, roughly 6 GB (about 1 million rows and 2,000 columns), but I lack the experience to judge whether this computation time is within the expected range or whether there is a problem I am not seeing.
  • Update: after 2 days I got this output:
❯  Generating a set of 10 initial parameter results
✓ Initialization complete


── Iteration 1 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

i Current best:     roc_auc=0.7634 (@iter 0)
i Gaussian process model
x Gaussian process model: Error in fit_gp(mean_stats %>% dplyr::select(-.iter), pset = param_info, : argument is missing, with no default
Error in `check_gp_failure()`:
! Gaussian process model was not fit.
Run `rlang::last_trace()` to see where the error occurred.
✖ Optimization stopped prematurely; returning current results.
> stopCluster(cl)
> doParallel::stopImplicitCluster()
> show_best(myxgb_res, n=25)
# A tibble: 10 × 11
   learn_rate tree_depth min_n loss_reduction .metric .estimator  mean     n std_err .config               .iter
        <dbl>      <int> <int>          <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 <int>
 1       1.34          8     5           34.1 roc_auc binary     0.763    10 0.00125 Preprocessor1_Model01     0
 2       1.49          6     7           64.1 roc_auc binary     0.751    10 0.00165 Preprocessor1_Model05     0
 3       1.53          5     3           12.1 roc_auc binary     0.727    10 0.00204 Preprocessor1_Model03     0
 4       1.35          7     3          356.  roc_auc binary     0.721    10 0.00214 Preprocessor1_Model07     0
 5       1.17          9     4          548.  roc_auc binary     0.704    10 0.00167 Preprocessor1_Model04     0
 6       1.04          9     6         1195.  roc_auc binary     0.685    10 0.00100 Preprocessor1_Model06     0
 7       1.80          6     7        54577.  roc_auc binary     0.5      10 0       Preprocessor1_Model02     0
 8       1.21          4     1         5707.  roc_auc binary     0.5      10 0       Preprocessor1_Model08     0
 9       1.98          8     9        32672.  roc_auc binary     0.5      10 0       Preprocessor1_Model09     0
10       1.74          4     9        10550.  roc_auc binary     0.5      10 0       Preprocessor1_Model10     0

Also, when I run the code on the example "mtcars" dataset, I get the following error, so I think I must be missing something:

❯  Generating a set of 10 initial parameter results
→ A | warning: No control observations were detected in `truth` with control level '1'.
There were issues with some computations   A: x1
✓ Initialization complete


── Iteration 1 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

i Current best:     roc_auc=0.9167 (@iter 0)
i Gaussian process model
x Gaussian process model: Error in fit_gp(mean_stats %>% dplyr::select(-.iter), pset = param_info, : argument is missing, with no default
Error in `check_gp_failure()`:
! Gaussian process model was not fit.
Run `rlang::last_trace()` to see where the error occurred.
✖ Optimization stopped prematurely; returning current results.
> show_best(myxgb_res, n=10)
# A tibble: 10 × 11
   learn_rate tree_depth min_n loss_reduction .metric .estimator  mean     n std_err .config               .iter
        <dbl>      <int> <int>          <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 <int>
 1     0.108           9     2           3.32 roc_auc binary     0.917     9  0.0589 Preprocessor1_Model02     0
 2     0.0771          7     2           1.22 roc_auc binary     0.917     9  0.0589 Preprocessor1_Model04     0
 3     0.0340          5     4           4.38 roc_auc binary     0.5       9  0      Preprocessor1_Model01     0
 4     0.250           4     7           2.24 roc_auc binary     0.5       9  0      Preprocessor1_Model03     0
 5     0.237          10     5           3.78 roc_auc binary     0.5       9  0      Preprocessor1_Model05     0
 6     0.161           4     6           4.66 roc_auc binary     0.5       9  0      Preprocessor1_Model06     0
 7     0.211           6    10           1.40 roc_auc binary     0.5       9  0      Preprocessor1_Model07     0
 8     0.149           7     9           3.94 roc_auc binary     0.5       9  0      Preprocessor1_Model08     0
 9     0.0417          3     3           2.17 roc_auc binary     0.5       9  0      Preprocessor1_Model09     0
10     0.290           8     8           2.76 roc_auc binary     0.5       9  0      Preprocessor1_Model10     0
> 

Follow-up question, more general:

  2. If my target variable is slightly imbalanced (roughly a 1:3 ratio), which metric should I use to optimise the model? I would think F1 (f_meas) should be the most relevant?
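As a hedged sketch, a metric set built around F1 could look like the following (all metric functions are from yardstick; tune_bayes()/tune_grid() optimise the first metric listed). The reprex below still optimises roc_auc only.

# Sketch only: track F1 alongside ROC AUC and PR AUC for an imbalanced outcome
imbalance_metrics <- metric_set(f_meas, roc_auc, pr_auc)
# ...which would then be passed as `metrics = imbalance_metrics` to the tuning call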
# Load necessary libraries
library(tidymodels)
library(doParallel)
library(xgboost)


options(scipen = 999)
# Load example data
data(mtcars)

# Create a binary target
mtcars$am <- as.factor(mtcars$am)

set.seed(123)
df_split <- initial_split(mtcars, strata = am)

df_train <- training(df_split)
df_test <- testing(df_split)
df_train_folds <- vfold_cv(df_train, strata = am)


# /////////////////////////////////////////////////////////////////////////////

#prep
recipe_df <- recipe(am ~ ., data=df_train) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors())

xgb_prep <- prep(recipe_df, verbose = TRUE)


# Create model

xgb_spec <- boost_tree(
  trees = 100,
  tree_depth = tune(),
  min_n = tune(),
  learn_rate = tune(),
  loss_reduction = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")


# set ranges for parameters
params <- parameters(
  learn_rate(),
  tree_depth(), 
  min_n(), 
  loss_reduction()
) %>%
  update(
    learn_rate = learn_rate(c(0.01, 0.3), trans=NULL),  # range for the learning rate
    tree_depth = tree_depth(c(3, 10)),  # range for the tree depth
    min_n = min_n(c(1, 10)),  # range for the minimum number of observations
    loss_reduction = loss_reduction(c(1, 5), trans=NULL)  # range for the loss reduction
  ) %>%
  finalize(df_train)

# Merge into workflow
xgb_wf <- workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(xgb_prep)




#Parallel Processing
# gc()
# numCores = 30 #detectCores() 72
# cl = parallel::makeCluster(numCores)
# doParallel::registerDoParallel(cl)
options(tidymodels.dark = TRUE)

myxgb_res <- tune_bayes(
  xgb_wf,
  resamples = df_train_folds,
  param_info = params,
  initial = 10,
  iter = 30, 
  metrics = metric_set(roc_auc),
  control = control_bayes(
    no_improve = 5, 
    save_pred = TRUE,
    verbose = TRUE,
    seed = 123
  ),
  parallel_over = "everything",
)

# stopCluster(cl)
# doParallel::stopImplicitCluster()

show_best(myxgb_res, n=10)

Thanks for any advice!
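One hedged guess at the Gaussian process failure, offered as a sketch rather than a confirmed fix: parallel_over is not a formal argument of tune_bayes(), so it (together with the trailing comma after it) falls into `...`, and tune_bayes() forwards `...` to GPfit::GP_fit(). Assuming a version of tune whose control_bayes() accepts parallel_over, the call could be restructured like this:

# Sketch: move parallel_over inside control_bayes() and drop the trailing comma
myxgb_res <- tune_bayes(
  xgb_wf,
  resamples = df_train_folds,
  param_info = params,
  initial = 10,
  iter = 30,
  metrics = metric_set(roc_auc),
  control = control_bayes(
    no_improve = 5,
    save_pred = TRUE,
    verbose = TRUE,
    seed = 123,
    parallel_over = "everything"
  )
)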

xgboost confusion-matrix hyperparameters tidymodels gaussian-process

Comments

0 · topepo · 7/28/2023
How severe is the class imbalance? I suspect that all of your initial values have the same metric value (or that a lot of them do). The best ROC in the initial run is 1/2 (really bad), and the warning makes me think that you are only giving it data from one class.
0 · arne maa · 7/28/2023
The class imbalance in my original data is 1:3; in the reproducible mtcars data it is 2:3.
0 · topepo · 7/28/2023
Could you isolate the initial values by calling tune_grid() and showing those results? Almost the same code as tune_bayes() (plus setting a seed before running them).
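A hedged sketch of that check, reusing the workflow, parameter set and resamples from the question (grid = 10 mirrors initial = 10 above):

# Sketch: evaluate the initial candidates with tune_grid() instead of tune_bayes()
set.seed(123)
myxgb_grid_res <- tune_grid(
  xgb_wf,
  resamples = df_train_folds,
  param_info = params,
  grid = 10,
  metrics = metric_set(roc_auc),
  control = control_grid(save_pred = TRUE, verbose = TRUE)
)
show_best(myxgb_grid_res, n = 10)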
1 · topepo · 7/29/2023
So we need some way for any of your models to have minimal predictive ability; right now there is no sign that they are picking anything up. You could try downsampling with the themis package, or pass scale_pos_weight as an engine argument, so that your model does not overfit to the majority class. You could also stick with grid search; XGBoost is easy to optimise (once you have some signal), and sequential optimisation is probably a waste of time here.
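Hedged sketches of both suggestions; step_downsample() comes from the themis package, and scale_pos_weight = 3 assumes the 1:3 imbalance mentioned above with the minority class as the event/positive class:

# Option 1 (sketch): downsample the majority class inside the recipe
library(themis)
recipe_df <- recipe(am ~ ., data = df_train) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_downsample(am)

# Option 2 (sketch): weight the positive (minority) class via an xgboost engine argument
xgb_spec <- boost_tree(
  trees = 100, tree_depth = tune(), min_n = tune(),
  learn_rate = tune(), loss_reduction = tune()
) %>%
  set_engine("xgboost", scale_pos_weight = 3) %>%
  set_mode("classification")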
1 · topepo · 7/29/2023
Also, your learning rates are really far too high. The units in the learn_rate(c(0.01, 0.3)) call should be log10 units; your range goes from 10^0.01 (≈ 1.02) to roughly 10^0.3 (≈ 2). Values somewhere around 0.05 to 0.1 are usually a good place to be.
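For reference, a hedged sketch of two ways to express such a range with dials::learn_rate():

# Default behaviour: the range is given in log10 units, i.e. roughly 0.05 to 0.1
learn_rate(range = c(-1.3, -1))

# Natural-scale version with the log10 transformation switched off
# (candidate values are then sampled uniformly on the raw scale)
learn_rate(range = c(0.05, 0.1), trans = NULL)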

A: No answers yet