Asked by: Pieter Jansen · Asked: 7/29/2023 · Last edited by: desertnaut, Pieter Jansen · Updated: 7/30/2023 · Views: 59
GridSearchCV Machine learning
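Q:
I am using GridSearch to find the relatively best hyperparameters for this decision tree (and K-Fold CV to evaluate the model's performance). Please look at the code and at the "The best results" line in the output.
Why doesn't it give me any information about the criterion (for example, whether to use entropy or Gini)?
When I ran a test with other code I wrote, it worked but gave incorrect information. For example, according to GridSearch, entropy is better for this model, whereas in my manual tests Gini gave better accuracy and recall. (Entropy was better for precision, but the result should be based on accuracy, as specified in the code.) Also, for the maximum depth it suggests a value of 7, whereas in practice 9 or more gave better results.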
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, classification_report
from matplotlib import pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
column_names = ['file_path', '50', '100', '250', '500', '1000', 'r50', 'r100', 'r250', 'r500', 'r1000', 'rfile', 'class2']
df = pd.read_csv("C:/Folder/deftxt - copy.csv", sep = ';', header = 0, names = column_names)
x = df.drop(['class2', 'file_path'], axis=1)
df['class2'] = df['class2'].astype(int)
y = df['class2'].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, shuffle = True, random_state = 100)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
model = DecisionTreeClassifier(random_state=100)
model.fit(x_train, y_train)
model.get_params()
k_fold_acc = cross_val_score(model, x_train, y_train, cv=10)
k_fold_mean = k_fold_acc.mean()
for i in k_fold_acc:
    print(i)
print("accuracy K Fold CV:" + str(k_fold_mean))
param_dist = {
    "criterion": ["gini", "entropy"],
    "max_depth": [1, 2, 3, 4, 5, 6, 7, None],
    "min_samples_split": [2, 3, 4, 5],
}
grid = GridSearchCV(model, param_grid=param_dist, cv=10, n_jobs=-1, scoring='accuracy', verbose=1)
grid.fit(x_train, y_train)
print("The best results:" + str(grid.best_estimator_))
fn = ['50', '100', '250', '500', '1000', '-50', '-100', '-250', '-500', '-1000', 'total']
cn = ['ClassA', 'ClassB']
grid_predictions = grid.predict(x_test)
print(classification_report(y_test, grid_predictions))
Output:
(1369, 11) (587, 11) (1369,) (587,)
0.9927007299270073
0.9927007299270073
0.9781021897810219
0.9927007299270073
0.9927007299270073
0.9854014598540146
0.9854014598540146
0.9927007299270073
0.9781021897810219
0.9779411764705882
accuracy K Fold CV:0.9868452125375698
Fitting 10 folds for each of 64 candidates, totalling 640 fits
The best results:DecisionTreeClassifier(max_depth=7, random_state=100)
              precision    recall  f1-score   support
           0       0.98      0.97      0.97       174
           1       0.99      0.99      0.99       413
    accuracy                           0.98       587
   macro avg       0.98      0.98      0.98       587
weighted avg       0.98      0.98      0.98       587
Process finished with exit code 0
A:
0 votes
Nick ODell
7/30/2023
#1
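Why doesn't it give me any information about the criterion (for example, whether to use entropy or Gini)?
When a scikit-learn model is converted to a string, only the parameters that differ from their defaults are shown.
Example: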
from sklearn.tree import DecisionTreeClassifier
print(str(DecisionTreeClassifier(max_depth=7, random_state=100, criterion="entropy")))
print(str(DecisionTreeClassifier(max_depth=7, random_state=100, criterion="gini")))
This prints:
DecisionTreeClassifier(criterion='entropy', max_depth=7, random_state=100)
DecisionTreeClassifier(max_depth=7, random_state=100)
The criterion="gini" parameter is not printed because it is the default value.
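As a side note, this abbreviated display appears to be governed by scikit-learn's print_changed_only configuration option. The sketch below assumes a reasonably recent scikit-learn version and uses a variable name (clf) that is only illustrative; it switches the option off temporarily so the full parameter list shows up in the repr.

```python
from sklearn import config_context
from sklearn.tree import DecisionTreeClassifier

# Illustrative estimator; criterion is left at its default ("gini").
clf = DecisionTreeClassifier(max_depth=7, random_state=100)

# Default behaviour: only parameters that differ from the defaults are shown.
short_repr = repr(clf)

# Inside this context every parameter is shown, including criterion='gini'.
with config_context(print_changed_only=False):
    full_repr = repr(clf)

print(short_repr)
print(full_repr)
```

Using the context manager keeps the change local; sklearn.set_config(print_changed_only=False) would apply it for the rest of the session instead.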
To see all parameters, you can print the following:
print(str(DecisionTreeClassifier(max_depth=7, random_state=100, criterion="gini").get_params()))
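The same idea can be applied to a fitted GridSearchCV object. The following is a minimal sketch, not the asker's actual setup: it assumes synthetic data from make_classification in place of the CSV file, and a smaller grid and fewer folds to keep it quick. It reads the selected values from best_params_ and lists the mean cross-validated score of every candidate from cv_results_.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): the CSV from the question is not available.
X, y = make_classification(n_samples=300, n_features=11, random_state=100)

# Reduced grid for illustration only.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [1, 3, 5, 7, None],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=100),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)

# best_params_ lists every searched parameter, including ones equal to the default.
print(grid.best_params_)

# The refitted estimator exposes its full parameter set through get_params().
print(grid.best_estimator_.get_params()["criterion"])

# Mean cross-validated accuracy for each candidate, to compare gini and entropy directly.
for params, score in zip(grid.cv_results_["params"], grid.cv_results_["mean_test_score"]):
    print(params, round(score, 4))
```

Because best_params_ is a plain dict keyed by the searched parameter names, the chosen criterion is always visible there, whatever the printed repr of best_estimator_ shows.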