Asked by: medo0070 · Asked: 11/8/2023 · Modified: 11/8/2023 · Viewed: 15 times
Accelerate and Optimize a GA Algorithm as Feature Selection for a Regression Problem
Q:
I am trying to compare ANOVA with a genetic algorithm (GA) for feature selection, then apply the selected feature set to several ML models and compare them on MAE, RMSE, and R². I use the GA for feature selection on a regression problem. My dataset has 78 features, 1 target, and 1,016 rows. I am facing three problems:
- The program takes a very long time to process one GA generation.
- Since I am new to GAs, I am not sure whether the fitness function I am using is sound.
- When the GA-selected features are compared with ANOVA-selected features on the ML models, the GA gives worse results in terms of MAE, RMSE, and R².
Any suggestions regarding the fitness function for the problems above? Thanks in advance.
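For reference, one common fitness design for GA feature selection minimizes the held-out error plus a small penalty per selected feature, which pushes the search toward compact feature sets. A minimal, self-contained sketch — the `alpha` penalty weight, the `LinearRegression` surrogate, and the synthetic data are all illustrative assumptions, not part of the original setup:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative synthetic data: only feature 0 actually drives the target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

def evaluate(individual, alpha=0.01):
    """Fitness to MINIMIZE: test MAE plus a small penalty per selected feature."""
    sel = [i for i, bit in enumerate(individual) if bit]
    if not sel:                      # an empty feature set cannot be fitted
        return (float("inf"),)
    model = LinearRegression()       # cheap surrogate; illustrative choice
    model.fit(X_tr[:, sel], y_tr)
    mae = np.mean(np.abs(y_te - model.predict(X_te[:, sel])))
    # Penalty scales with the fraction of features used
    return (mae + alpha * len(sel) / len(individual),)
```

With DEAP this minimizing fitness pairs with `weights=(-1.0,)`; the sparse individual (only the informative feature) should score better than the all-features one once the penalty is included.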
Here is part of my code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import power_transform
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, BaggingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import StackingRegressor
from deap import base, creator, tools, algorithms
import random
import warnings
import os
import tensorflow as tf
from multiprocessing import Pool
# Initialize TPU
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
warnings.filterwarnings('ignore')
# Load the dataset
df = pd.read_csv("CBECS_Office_Subset.csv")
original_feature_names = df.columns[:-1] # Exclude the target variable
# Normalize the data
scaler = MinMaxScaler()
# Transform the data and ignore warnings during this process
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    df = power_transform(df, method='yeo-johnson')
df = scaler.fit_transform(df)
X = df[:, :-1]
y = df[:, -1]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Models
models = {
    "Linear Regression": LinearRegression(),
    "Support Vector Machine": LinearSVR(),
    "Random Forest": RandomForestRegressor(),
    "Extra Trees Regressor": ExtraTreesRegressor(),
    "Adaboost Regressor": AdaBoostRegressor(),
    "MLP Regressor": MLPRegressor(),
    "Bagging Regressor": BaggingRegressor(),
    "Stacking Regressor": StackingRegressor(estimators=[
        ('lr', LinearRegression()),
        ('svm', LinearSVR()),
        ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
        ('etr', ExtraTreesRegressor(n_estimators=100, random_state=42)),
        ('ada', AdaBoostRegressor(n_estimators=100, random_state=42)),
        ('mlp', MLPRegressor()),
    ], final_estimator=LinearRegression())
}
# Define the GA optimization setup
# Create a fitness that minimizes the error metric
# (in DEAP, minimization requires a negative weight)
creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", list, fitness=creator.FitnessMin)
# Define genetic operators
toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1) # Binary representation for feature selection
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, n=len(X[0]))
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
# Define the evaluation function (fitness function)
def evaluate_individual(individual, model):
    selected_features = [i for i, bit in enumerate(individual) if bit]
    if not selected_features:  # an empty feature set cannot be fitted
        return (np.inf,)
    X_train_subset = X_train[:, selected_features]
    X_test_subset = X_test[:, selected_features]
    model.fit(X_train_subset, y_train)
    y_pred = model.predict(X_test_subset)
    mae = np.mean(np.abs(y_test - y_pred))
    rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
    r2 = 1.0 - (np.sum((y_test - y_pred) ** 2) / np.sum((y_test - np.mean(y_test)) ** 2))
    mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
    # Aggregation method: currently only MAE is used as the fitness value
    fitness = mae
    return fitness,
A: No answers yet