GridSearchCV Machine learning

Asked by: Pieter Jansen | Asked: 7/29/2023 | Last edited by: desertnaut | Updated: 7/30/2023 | Views: 59

Q:

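I am using GridSearch to find the relatively best hyperparameters for this decision tree (and K-Fold CV to evaluate the model's performance). Please look at the "The best results" line in the code and in the output.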

Why doesn't it give me any information about the criterion (e.g., whether to use entropy or Gini)?

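When I run the test with other code I wrote, it works, but the information it gives is incorrect. For example, according to GridSearch, entropy is better for this model, whereas in my manual tests Gini gives better accuracy and recall. (Entropy is better for precision, but the result should be based on accuracy, as specified in the code.) Also, for the maximum depth it suggests the value 7, whereas in practice 9 or more gives better results.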

import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, classification_report
from matplotlib import pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
column_names = ['file_path', '50', '100', '250', '500', '1000', 'r50', 'r100', 'r250', 'r500', 'r1000', 'rfile', 'class2']
df = pd.read_csv("C:/Folder/deftxt - copy.csv", sep = ';', header = 0, names = column_names)
    
x = df.drop(['class2', 'file_path'], axis=1)
df['class2'] = df['class2'].astype(int)
y = df['class2'].values
    
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, shuffle = True, random_state = 100)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
    
model = DecisionTreeClassifier(random_state=100)
model.fit(x_train, y_train)
model.get_params()
    
k_fold_acc = cross_val_score(model, x_train, y_train, cv=10)
k_fold_mean = k_fold_acc.mean()
for i in k_fold_acc:
    print(i)
print("accuracy K Fold CV:" + str(k_fold_mean))
    
param_dist={
    "criterion":["gini", "entropy"],
    "max_depth":[1,2,3,4,5,6,7, None],
    "min_samples_split":[2,3,4,5],
}
grid = GridSearchCV(model, param_grid=param_dist, cv=10, n_jobs=-1, scoring='accuracy', verbose=1)
grid.fit(x_train, y_train)
    
print("The best results:" + str(grid.best_estimator_))
    
fn = ['50', '100', '250', '500', '1000', '-50', '-100', '-250', '-500', '-1000', 'total']
cn = ['ClassA', 'ClassB']
    
grid_predictions = grid.predict(x_test)
print(classification_report(y_test, grid_predictions))

Output:

(1369, 11) (587, 11) (1369,) (587,)
0.9927007299270073
0.9927007299270073
0.9781021897810219
0.9927007299270073
0.9927007299270073
0.9854014598540146
0.9854014598540146
0.9927007299270073
0.9781021897810219
0.9779411764705882
accuracy K Fold CV:0.9868452125375698
Fitting 10 folds for each of 64 candidates, totalling 640 fits
The best results:DecisionTreeClassifier(max_depth=7, random_state=100)
              precision    recall  f1-score   support

           0       0.98      0.97      0.97       174
           1       0.99      0.99      0.99       413

    accuracy                           0.98       587
   macro avg       0.98      0.98      0.98       587
weighted avg       0.98      0.98      0.98       587
    
    
Process finished with exit code 0
python machine-learning scikit-learn decision-tree gridsearchcv

Comments

0 votes | Ben Reiniger | 7/29/2023
You should limit your post here to one question. For your first, see stackoverflow.com/q/66373570/10495893; for your second, datascience.stackexchange.com/q/82028/55122

A:

0 votes | Nick ODell | 7/30/2023 | #1

Why doesn't it give me any information about the criterion (e.g., whether to use entropy or Gini)?

When a scikit-learn model is converted to a string, only its non-default parameters are shown.

Example:

from sklearn.tree import DecisionTreeClassifier
print(str(DecisionTreeClassifier(max_depth=7, random_state=100, criterion="entropy")))
print(str(DecisionTreeClassifier(max_depth=7, random_state=100, criterion="gini")))

This prints:

DecisionTreeClassifier(criterion='entropy', max_depth=7, random_state=100)
DecisionTreeClassifier(max_depth=7, random_state=100)

The criterion="gini" parameter is not printed because it is the default value.
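As a quick check of that claim, the sketch below reads the default straight from an estimator built with no arguments and confirms the value is still set on the model even though the string form leaves it out. The variable names are only illustrative, and the behaviour shown assumes a scikit-learn version where printing only changed parameters is the default.

```python
from sklearn.tree import DecisionTreeClassifier

# An estimator built with no arguments carries the library defaults.
default_params = DecisionTreeClassifier().get_params()
default_criterion = default_params["criterion"]
print(default_criterion)

# The string form omits criterion when it equals the default,
# but the attribute is still set on the estimator.
clf = DecisionTreeClassifier(max_depth=7, random_state=100, criterion="gini")
shown_in_repr = "criterion" in repr(clf)
print(shown_in_repr, clf.criterion)
```

So a missing criterion in the printed best estimator most likely means the search picked the default, not that the criterion was ignored.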

To see all parameters, you can print the following:

print(str(DecisionTreeClassifier(max_depth=7, random_state=100, criterion="gini").get_params()))
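
For the grid search in the question, two other options may be more convenient: `grid.best_params_` lists every searched parameter whether or not it equals the default, and `sklearn.config_context(print_changed_only=False)` makes the string form show all parameters. The sketch below is only an illustration; it uses synthetic data from `make_classification` as a stand-in for the question's CSV (an assumption), with a smaller grid and fewer folds to keep it quick.

```python
from sklearn import config_context
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the question's CSV is not available).
X, y = make_classification(n_samples=300, n_features=11, random_state=100)

# Reduced grid for illustration only.
param_grid = {"criterion": ["gini", "entropy"], "max_depth": [3, 7, None]}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=100),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)

# best_params_ includes every searched parameter, even ones equal to the default.
print(grid.best_params_)

# Temporarily make the string form show all parameters, not only changed ones.
with config_context(print_changed_only=False):
    full_repr = repr(grid.best_estimator_)
print(full_repr)
```

With the question's own `grid` object, printing `grid.best_params_` after `grid.fit(x_train, y_train)` should show the selected criterion directly.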