Saving sklearn warnings to a dataframe

Asked by Bradley Sutliff on 6/8/2023 · Updated 6/8/2023 · Viewed 36 times

Q:

I am using sklearn's GridSearchCV to optimize the parameters of an AdaBoost classifier on several different datasets. I then create/append to a DataFrame with information such as the dataset name, best_params_, and best_score_.

Sometimes I get warnings, such as a ConvergenceWarning, or just a notice that something is deprecated. They don't necessarily hurt anything, but I would like to add them as a column.

This post (Write scikit-learn verbose log into an external file) seems to get close with bluesummers's and mbil's answers, but I don't really want to write out a file just to read it back into my DataFrame.

Here is a minimal working example. The 'warning' column of the DataFrame at the end is currently filled with NA. However, because I am using AdaBoostClassifier(base_estimator=RandomForestClassifier()) instead of AdaBoostClassifier(estimator=RandomForestClassifier()), I should get a bunch of warnings, and I would like to grab those and save them in the warning column.

from sklearn.model_selection import GridSearchCV, KFold, cross_val_score,StratifiedKFold
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
import numpy as np
import tqdm as tq
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_params = pd.DataFrame(columns=['learning_rate', 'n_estimators', 'accuracy', 'warning'])
abc = AdaBoostClassifier(base_estimator=RandomForestClassifier())

parameters = {'n_estimators':[5,10],
              'learning_rate':[0.01,0.2]}

a = np.random.random((50, 3))
b = np.random.random((70, 3))
c = np.random.random((50, 5))


for i, data in tq.tqdm(enumerate([a,b,c])):
    X = data
    sc = StandardScaler()
    X = sc.fit_transform(X)
    y = ['foo', 'bar']*int(len(X)/2)
    
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=None)
    clf = GridSearchCV(abc, parameters, cv=skf, scoring='accuracy', n_jobs=-1,)
    clf.fit(X,y)
    
    dict_best_params = clf.best_params_.copy()
    dict_best_params['accuracy'] = clf.best_score_
    best_params = pd.DataFrame(dict_best_params, index=[i])
    df_params = pd.concat([df_params, best_params], ignore_index=False)

df_params.head()
python-3.x scikit-learn warnings error-logging



A:

0 votes · Corralien · 6/8/2023 · #1

IIUC, you can use catch_warnings:

import warnings  # HERE
import numpy as np
import tqdm as tq
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.preprocessing import StandardScaler

df_params = pd.DataFrame(columns=['learning_rate', 'n_estimators', 'accuracy', 'warning'])
abc = AdaBoostClassifier(base_estimator=RandomForestClassifier())

parameters = {'n_estimators':[5,10],
              'learning_rate':[0.01,0.2]}

a = np.random.random((50, 3))
b = np.random.random((70, 3))
c = np.random.random((50, 5))


for i, data in tq.tqdm(enumerate([a,b,c])):
    with warnings.catch_warnings(record=True) as cx_manager:  # HERE
        X = data
        sc = StandardScaler()
        X = sc.fit_transform(X)
        y = ['foo', 'bar']*int(len(X)/2)
    
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=None)
        clf = GridSearchCV(abc, parameters, cv=skf, scoring='accuracy', n_jobs=-1,)
        clf.fit(X,y)
    
        dict_best_params = clf.best_params_.copy()
        dict_best_params['accuracy'] = clf.best_score_
        dict_best_params['warning'] = [w.message for w in cx_manager]  # HERE
        best_params = pd.DataFrame(dict_best_params, index=[i])
        df_params = pd.concat([df_params, best_params], ignore_index=False)

Output:

>>> df_params
   learning_rate n_estimators  accuracy                                            warning
0           0.20           10  0.520000  `base_estimator` was renamed to `estimator` in...
1           0.20           10  0.514286  `base_estimator` was renamed to `estimator` in...
2           0.01            5  0.440000  `base_estimator` was renamed to `estimator` in...
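A small follow-up, not part of the original answer: cx_manager may record zero or several warnings for a single fit, and a list whose length is not exactly 1 will not line up with the one-row index=[i] passed to pd.DataFrame. A minimal, self-contained sketch of collapsing whatever was captured into a single string (the variable names caught and joined are only for illustration; the same join expression could replace the list comprehension in the loop above):

import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeated ones
    warnings.warn("example deprecation", DeprecationWarning)

# zero, one, or many messages collapse into a single scalar value
joined = "; ".join(str(w.message) for w in caught)
print(joined or "<no warnings>")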

Comments

1 vote · Bradley Sutliff · 6/8/2023
That's exactly what I was looking for! Thank you!