Permutation feature importance on features transformed within a pipeline (sklearn)

Asked by: victoris_93 Asked: 10/26/2023 Updated: 10/26/2023 Views: 21

Q:

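A similar question has been asked before. I need to compute feature importance for preprocessed features using sklearn.inspection.permutation_importance. The preprocessing is implemented in a pipeline. The code is as follows: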

import numpy as np
import pandas as pd
import os
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.inspection import permutation_importance

###
# data load (omitted): defines all_features, diagnosis and the *_cols column lists
###

X_train, X_test, y_train, y_test = train_test_split(all_features, diagnosis, test_size=0.25, random_state=42)

pca_conn = Pipeline(
    steps = [("group_whiten", StandardScaler()),
             ('pca_conn', PCA(n_components = 100)),
            ("pca_whiten", StandardScaler())]
)

pca_grad = Pipeline(
    steps = [("group_whiten", StandardScaler()),
             ('pca_grad', PCA(n_components = 100)),
            ("pca_whiten", StandardScaler())]
)

pca_centroid_disp_pca = Pipeline(
    steps = [("group_whiten", StandardScaler()),
             ('pca_grad', PCA(n_components = 10)),
            ("pca_whiten", StandardScaler())]
)

pca_cortex_disp_pca = Pipeline(
    steps = [("group_whiten", StandardScaler()),
             ('pca_grad', PCA(n_components = 100)),
            ("pca_whiten", StandardScaler())]
)

cat_encoder = Pipeline(
    steps = [("cat_encoder", OneHotEncoder(handle_unknown="ignore"))]
)
whiten = Pipeline(
    steps = [("whiten", StandardScaler())]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("pca_conn", pca_conn, conn_cols),
        ("pca_grad", pca_grad, grad_cols),
        ("pca_centroid_disp_pca", pca_centroid_disp_pca, centroid_disp_cols),
        ("pca_cortex_disp_pca", pca_cortex_disp_pca, cortex_disp_cols),
        ("encode_dataset", cat_encoder, ["dataset"]),
        ("encode_sex", cat_encoder, ["sex"]),
        ("whiten_fd", whiten, ["mean_fd"]),
        ("whiten_age", whiten, ["age"])
    ]
)

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42, max_iter=10000)
clf = Pipeline([('preprocessor', preprocessor),
                ('lr', lr)])

trained_logreg = clf.fit(X_train, y_train)
trained_logreg.score(X_test, y_test)

perm_acc = permutation_importance(trained_logreg, X_test, y_test,n_repeats=100, random_state=42, n_jobs = -1)

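By default, permutation importance appears to be computed for the original (raw) features. Has anyone tried implementing permutation importance for the transformed features? Any tips? I don't think doing the preprocessing separately is an option (data leakage).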

python scikit-learn pipeline pca feature-selection

Comments

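0 upvotes Ben Reiniger 10/27/2023
What data leakage would there be? (This is off-topic here and better suited to stats.SE or datascience.SE, but once you have settled on how you need/want permutation importance to be measured within the pipeline, the programming question may still be answerable.)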
0 upvotes Ben Reiniger 10/27/2023
See also stackoverflow.com/q/62106204/10495893
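0 upvotes victoris_93 10/27/2023
The general recommendation is to keep transformations inside the pipeline so that they are fit only on the training set before the model is fit. As I understand it, transforming all the features first and then fitting the model would go against that recommendation. In my case, permutation importance on the original features is not feasible because the original dataset contains more than a million features. So I need feature importance on the transformed data.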
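To make the leakage point in this exchange concrete, here is a small check, offered as an illustration on made-up data rather than as part of the original thread. It fits a pipeline on the training split, then confirms that the scaler's learned mean matches the training data and that calling transform on the test split leaves the learned statistics untouched. The data and step names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; only the fit/transform behaviour matters here.
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, size=(200, 4))
y = (X[:, 0] > 3.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

clf = Pipeline([("scale", StandardScaler()),
                ("lr", LogisticRegression(max_iter=1000))])
clf.fit(X_train, y_train)

scaler = clf.named_steps["scale"]
mean_before = scaler.mean_.copy()

# transform() applies the stored statistics; it does not refit.
Xt_test = scaler.transform(X_test)

# The stored mean comes from the training split only ...
train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
# ... and is unchanged after transforming the test split.
unchanged = np.array_equal(mean_before, scaler.mean_)
print(train_only, unchanged)
```

Leakage would arise from calling fit or fit_transform on data that includes the test split, not from transforming the test split with a preprocessor that was fit on training data.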

A: No answers yet