Asked by: arne maa · Asked: 7/28/2023 · Last edited by: arne maa · Updated: 7/29/2023 · Views: 96
XGBoost Bayesian Optimisation in tidymodels
Q:
I am trying to apply Bayesian optimisation to a binary classification problem (XGBoost) in the tidymodels framework.
- Are there any flaws in my code? The model has now been running for 2 days on a 72-CPU Linux machine. My dataset is fairly large, roughly 6 GB (about 1 million rows and 2,000 columns), but I lack the experience to judge whether this computation time is within the expected range or whether there is a problem I am not seeing.
- Update: after 2 days, I got this output:
❯ Generating a set of 10 initial parameter results
✓ Initialization complete
── Iteration 1 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
i Current best: roc_auc=0.7634 (@iter 0)
i Gaussian process model
x Gaussian process model: Error in fit_gp(mean_stats %>% dplyr::select(-.iter), pset = param_info, : argument is missing, with no default
Error in `check_gp_failure()`:
! Gaussian process model was not fit.
Run `rlang::last_trace()` to see where the error occurred.
✖ Optimization stopped prematurely; returning current results.
> stopCluster(cl)
> doParallel::stopImplicitCluster()
> show_best(myxgb_res, n=25)
# A tibble: 10 × 11
learn_rate tree_depth min_n loss_reduction .metric .estimator mean n std_err .config .iter
<dbl> <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <int>
1 1.34 8 5 34.1 roc_auc binary 0.763 10 0.00125 Preprocessor1_Model01 0
2 1.49 6 7 64.1 roc_auc binary 0.751 10 0.00165 Preprocessor1_Model05 0
3 1.53 5 3 12.1 roc_auc binary 0.727 10 0.00204 Preprocessor1_Model03 0
4 1.35 7 3 356. roc_auc binary 0.721 10 0.00214 Preprocessor1_Model07 0
5 1.17 9 4 548. roc_auc binary 0.704 10 0.00167 Preprocessor1_Model04 0
6 1.04 9 6 1195. roc_auc binary 0.685 10 0.00100 Preprocessor1_Model06 0
7 1.80 6 7 54577. roc_auc binary 0.5 10 0 Preprocessor1_Model02 0
8 1.21 4 1 5707. roc_auc binary 0.5 10 0 Preprocessor1_Model08 0
9 1.98 8 9 32672. roc_auc binary 0.5 10 0 Preprocessor1_Model09 0
10 1.74 4 9 10550. roc_auc binary 0.5 10 0 Preprocessor1_Model10 0
Also, when I run the code on the example "mtcars" dataset, I get the error below, so I think I must be missing something:
❯ Generating a set of 10 initial parameter results
→ A | warning: No control observations were detected in `truth` with control level '1'.
There were issues with some computations A: x1
✓ Initialization complete
── Iteration 1 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
i Current best: roc_auc=0.9167 (@iter 0)
i Gaussian process model
x Gaussian process model: Error in fit_gp(mean_stats %>% dplyr::select(-.iter), pset = param_info, : argument is missing, with no default
Error in `check_gp_failure()`:
! Gaussian process model was not fit.
Run `rlang::last_trace()` to see where the error occurred.
✖ Optimization stopped prematurely; returning current results.
> show_best(myxgb_res, n=10)
# A tibble: 10 × 11
learn_rate tree_depth min_n loss_reduction .metric .estimator mean n std_err .config .iter
<dbl> <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <int>
1 0.108 9 2 3.32 roc_auc binary 0.917 9 0.0589 Preprocessor1_Model02 0
2 0.0771 7 2 1.22 roc_auc binary 0.917 9 0.0589 Preprocessor1_Model04 0
3 0.0340 5 4 4.38 roc_auc binary 0.5 9 0 Preprocessor1_Model01 0
4 0.250 4 7 2.24 roc_auc binary 0.5 9 0 Preprocessor1_Model03 0
5 0.237 10 5 3.78 roc_auc binary 0.5 9 0 Preprocessor1_Model05 0
6 0.161 4 6 4.66 roc_auc binary 0.5 9 0 Preprocessor1_Model06 0
7 0.211 6 10 1.40 roc_auc binary 0.5 9 0 Preprocessor1_Model07 0
8 0.149 7 9 3.94 roc_auc binary 0.5 9 0 Preprocessor1_Model08 0
9 0.0417 3 3 2.17 roc_auc binary 0.5 9 0 Preprocessor1_Model09 0
10 0.290 8 8 2.76 roc_auc binary 0.5 9 0 Preprocessor1_Model10 0
>
Follow-up question, more generally:
- If my target variable is slightly imbalanced, at roughly a 1/3 ratio, which metric should I use to optimise the model? I would think F1 (f_meas) should be the most relevant? (A rough sketch of how I imagine swapping the metric is shown right below.)
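For illustration only, here is an untested sketch of how I assume a different metric set could be plugged into the same tuning call (f_meas, roc_auc and pr_auc are standard yardstick metrics; everything else reuses the objects defined in the reproducible example further down):
# Untested sketch: tune_bayes() optimises the first metric in the metric set,
# so putting f_meas first would make F1 the objective while roc_auc and
# pr_auc are still recorded for comparison.
imbalance_metrics <- metric_set(f_meas, roc_auc, pr_auc)
myxgb_res_f1 <- tune_bayes(
  xgb_wf,
  resamples = df_train_folds,
  param_info = params,
  initial = 10,
  iter = 30,
  metrics = imbalance_metrics,
  control = control_bayes(no_improve = 5, save_pred = TRUE, verbose = TRUE, seed = 123)
)
show_best(myxgb_res_f1, metric = "f_meas", n = 10)
The full reproducible example (on mtcars) follows: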
# Load necessary libraries
library(tidymodels)
library(doParallel)
library(xgboost)
options(scipen = 999)
# Load example data
data(mtcars)
# Create a binary target
mtcars$am <- as.factor(mtcars$am)
set.seed(123)
df_split <- initial_split(mtcars, strata = am)
df_train <- training(df_split)
df_test <- testing(df_split)
df_train_folds <- vfold_cv(df_train, strata = am)
# /////////////////////////////////////////////////////////////////////////////
#prep
recipe_df <- recipe(am ~ ., data=df_train) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
xgb_prep <- prep(recipe_df, verbose = TRUE)
# Create model
xgb_spec <- boost_tree(
trees = 100,
tree_depth = tune(),
min_n = tune(),
learn_rate = tune(),
loss_reduction = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# set ranges for parameters
params <- parameters(
learn_rate(),
tree_depth(),
min_n(),
loss_reduction()
) %>%
update(
learn_rate = learn_rate(c(0.01, 0.3), trans=NULL), # range for the learning rate
tree_depth = tree_depth(c(3, 10)), # range for the tree depth
min_n = min_n(c(1, 10)), # range for the minimum number of observations
loss_reduction = loss_reduction(c(1, 5), trans=NULL) # range for the loss reduction
) %>%
finalize(df_train)
# Merge into workflow
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_recipe(xgb_prep)
#Parallel Processing
# gc()
# numCores = 30 #detectCores() 72
# cl = parallel::makeCluster(numCores)
# doParallel::registerDoParallel(cl)
options(tidymodels.dark = TRUE)
myxgb_res <- tune_bayes(
xgb_wf,
resamples = df_train_folds,
param_info = params,
initial = 10,
iter = 30,
metrics = metric_set(roc_auc),
control = control_bayes(
no_improve = 5,
save_pred = T,
verbose = T,
seed = 123
),
parallel_over = "everything",
)
# stopCluster(cl)
# doParallel::stopImplicitCluster()
show_best(myxgb_res, n=10)
Thanks for any advice!
A: No answers yet.
Comments: the comments on the post reference tune_grid(), tune_bayes(), scale_pos_weight, and learn_rate(c(0.01, 0.3)).