Asked by: Amina Umar · Asked: 11/12/2023 · Last edited by: desertnaut, Amina Umar · Updated: 11/22/2023 · Views: 38
StackingClassifier with base-models trained on feature subsets
Q:
A synthetic dataset best describes my goal. Suppose I have the following:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3, n_informative=3)
df = pd.DataFrame(X, columns=list('ABCDEFGHIJ'))
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, random_state=42)
X_train.head()
A B C D E F G H I J
541 -0.277848 1.022357 -0.950125 -2.100213 0.883638 0.821387 1.154613 0.075376 1.176242 -0.470087
440 1.089665 0.841446 -1.701004 -1.036256 -1.229357 0.345068 1.876470 -0.750067 0.080685 -1.318271
482 0.016010 0.025488 -1.189296 -1.052935 -0.623029 0.669521 1.518927 0.690019 -0.045486 -0.494186
422 -0.133358 -2.16219 1.170989 -0.942150 1.933444 -0.55118 -0.059908 -0.938672 -0.924097 -0.796185
778 0.901954 1.479360 -2.639176 -2.588845 -0.753915 -1.650621 2.727146 0.075260 1.330432 -0.941594
After a feature-importance analysis, it turns out that each of the 3 classes in the dataset is best predicted using a subset of the features, rather than all of them. For example:
class | optimal predictors
-------+-------------------
0 | A, B, C
1 | D, E, F, G
2 | G, H, I, J
-------+-------------------
At this point, I would like to train sub-models using 3 classifiers, one per class, each trained with that class's optimal predictors (as base models) in a one-vs-rest fashion, and then make the final prediction with a StackingClassifier.

I am familiar with StackingClassifier, which can train different base models (e.g. DT, SVC, KNN, etc.) plus a meta-classifier using another model, e.g. Logistic Regression.

However, in this case the base models would all be DT classifiers, except that each one is trained using the feature subset best suited to its class, as described above. Then, finally, predict on X_test.

But I am not sure how to do this, which is why I described my work with the pseudo-data above.

How can I design this to train the base models and make the final prediction?
A:
-2 votes
Kinjal
11/27/2023
#1
You need to use make_pipeline and FunctionTransformer on top of custom functions (check this answer) to filter the data down to the relevant columns and manipulate the target values.
Here is a demo code:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier
X, y = load_iris(return_X_y=True)
# Functions to filter X down to the important features
def select_dim1(X):
    red_X1 = X[:, :2]
    return red_X1

def select_dim2(X):
    red_X2 = X[:, 2:]
    return red_X2

def select_dim3(X):
    red_X3 = X[:, 3:]
    return red_X3

# Functions to label-binarise the separate classes in y
def y_0(y):
    y[y == 0] = 1
    return y

def y_1(y):
    y[(y == 0) | (y == 2)] = 0
    return y

def y_2(y):
    y[y < 2] = 0
    return y

# Converting them to function transformers
from sklearn.preprocessing import FunctionTransformer
select_dim1_tr = FunctionTransformer(select_dim1)
select_dim2_tr = FunctionTransformer(select_dim2)
select_dim3_tr = FunctionTransformer(select_dim3)
select_y_0_tr = FunctionTransformer(y_0)
select_y_1_tr = FunctionTransformer(y_1)
select_y_2_tr = FunctionTransformer(y_2)

estimators = [
    ('dt1', make_pipeline(select_y_0_tr, select_dim1_tr, DecisionTreeClassifier(random_state=42))),
    ('dt2', make_pipeline(select_y_1_tr, select_dim2_tr, DecisionTreeClassifier(random_state=42))),
    ('dt3', make_pipeline(select_y_2_tr, select_dim3_tr, DecisionTreeClassifier(random_state=42)))
]
clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
clf.fit(X_train, y_train).score(X_test, y_test)
Comments:
0 votes
Amina Umar
11/27/2023
But this is not at all what the question asks.
0 votes
Kinjal
11/27/2023
How so? You need to stack 3 decision trees built on subsets of the training set, each with its own one-vs-rest target. I was just showing that it differs from the answer you accepted.
1 vote
Ben Reiniger
11/27/2023
I don't think this will work, since a FunctionTransformer cannot modify y within a pipeline?
0 votes
Kinjal
11/28/2023
You are right. Will fix it.
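The point raised in the last two comments can be checked with a small sketch of my own (the probe function below is made up purely to record what the pipeline hands to a transformer step): in a scikit-learn Pipeline, transformer steps only ever receive X; y is passed untouched to the final estimator, so the y_* steps above never see the target.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier

seen = {}

def probe(data):
    # Record what the pipeline actually hands to this transformer step
    seen['shape'] = data.shape
    return data

pipe = make_pipeline(FunctionTransformer(probe),
                     DecisionTreeClassifier(random_state=0))
X = np.random.RandomState(0).rand(20, 4)
y = np.array([0, 1] * 10)
pipe.fit(X, y)

print(seen['shape'])  # (20, 4) -- the step received X, never y
```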
-1 vote
seralouk
11/27/2023
#2
You can do what you describe programmatically, but I am not sure what the benefit would be over a simple random forest, which does all of this (feature sub-selection, fitting, etc.) internally.

Below is an implementation of what you described. I used exactly the same base and stacking models that you mentioned:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
def select_columns(X, columns):
    return X[columns]
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3, n_informative=3)
df = pd.DataFrame(X, columns=list('ABCDEFGHIJ'))
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=42)
feature_subsets = {
    0: ['A', 'B', 'C'],
    1: ['D', 'E', 'F', 'G'],
    2: ['G', 'H', 'I', 'J']
}
# Base model
base_dt_model = DecisionTreeClassifier(random_state=42)
# One-vs-rest classifiers with feature subsets
classifiers = []
for class_label, features in feature_subsets.items():
    model = clone(base_dt_model)
    # select features, then apply the individual model
    pipeline = Pipeline([
        ('feature_selection', FunctionTransformer(select_columns, kw_args={'columns': features})),
        ('classifier', model)
    ])
    classifiers.append(('dt_class_' + str(class_label), pipeline))
# Logistic Regression as the metaclassifier
stack = StackingClassifier(estimators=classifiers, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
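The column-selection building block used here can be sanity-checked in isolation; a minimal sketch with a tiny made-up frame, showing that FunctionTransformer with kw_args keeps only the requested DataFrame columns:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def select_columns(X, columns):
    return X[columns]

# Tiny made-up frame to show the transformer keeps only the requested columns
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'G': [5, 6]})
tr = FunctionTransformer(select_columns, kw_args={'columns': ['A', 'G']})
out = tr.fit_transform(df)
print(list(out.columns))  # ['A', 'G']
```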
Comments:
1 vote
Amina Umar
11/27/2023
Awesome! Thank you very much for your answer.
1 vote
Kinjal
11/27/2023
This code does not treat the target as one-vs-all. It trains on the same target with different feature subsets.
1 vote
seralouk
11/27/2023
@Kinjal That is what the OP described, and it was accepted as the solution.
1 vote
seralouk
11/27/2023
I guess it is just a matter of wording/interpretation here. That would contradict "However, in this case the base models would all be DT classifiers, except that each one is trained using the feature subset best suited to its class", which is the ultimately desired behaviour. That is what I proposed.
2 votes
Ben Reiniger
11/27/2023
The asker has posted several variants of this question, but this one seems to be specifically about one-vs-rest targets, so I agree that this does not answer it.