How to use a histogram or table DataFrame as a predictor in linear regression?

Asked by: Raeed Mundow  Asked: 11/11/2023  Last edited by: desertnaut, Raeed Mundow  Updated: 11/12/2023  Views: 65

Q:

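I am trying to build a model that computes the concentration of dark matter particles in a halo. There are already packages and methods that generate particles on a radial grid according to a specific density distribution called the NFW profile. I turn each realization of particles into a histogram, and I want to use that histogram as my predictor: each histogram in the predictor set corresponds to one response, the concentration parameter c. This should be easy, since I have done similar work before, but here I ran into a problem. Every histogram in the set is divided into 100 bins, but the bin size is not the same from one histogram to the next. For example, this histogram corresponds to c=5: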

    Bin Edges  Histogram Values
0    0.000486                21
1    0.002544                39
2    0.004602                73
3    0.006660                60
4    0.008718                64
..        ...               ...
95   0.195999                83
96   0.198057                64
97   0.200115                63
98   0.202173                74
99   0.204231                70

And this histogram corresponds to c=20:

    Bin Edges  Histogram Values
0    0.000085                76
1    0.002147               188
2    0.004209               205
3    0.006271               216
4    0.008333               230
..        ...               ...
95   0.195968                40
96   0.198030                36
97   0.200092                45
98   0.202154                40
99   0.204215                42

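For clarity, I converted the histograms to tables. As you can see, the bin edges differ between predictors, so I cannot reduce this 2-dimensional data (bin edges plus counts) to a single dimension.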
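
One possible way around the mismatched edges, sketched under assumptions: `np.histogram` accepts an explicit array of bin edges, so passing the same edges for every realization makes bin i cover the same radial interval in every histogram, and the counts alone then form a 1-D feature vector. The sample arrays and `r_max` below are hypothetical; the uniform draws only illustrate the shapes and are not NFW realizations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two particle realizations (not real NFW draws).
radii_a = rng.uniform(0.0, 0.20, size=5000)
radii_b = rng.uniform(0.0, 0.21, size=5000)

num_bins = 100
r_max = 0.25  # assumed upper limit that covers every realization

# One set of edges, reused for every histogram.
shared_edges = np.linspace(0.0, r_max, num_bins + 1)

counts_a, edges_a = np.histogram(radii_a, bins=shared_edges)
counts_b, edges_b = np.histogram(radii_b, bins=shared_edges)

# Each realization is now a single row of 100 counts.
X = np.vstack([counts_a, counts_b])
print(X.shape)  # (2, 100)
```

With shared edges the "Bin Edges" column carries no per-sample information, so it can be dropped from the features.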

My attempt looks like this:

import numpy as np
from halotools import empirical_models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


# Define the range for concentration parameter (c) and radius (r)
min_c = 4.0
max_c = 40.0
M = 1E15  # Halo mass (halotools works in Msun/h; Mpc is a length unit)
z = 0.0  # Redshift
num_samples = 300  # Number of samples
num_bins = 100  # Number of bins in each histogram


# Generate random concentration values (c) within the specified range
concentration_values = np.random.uniform(min_c, max_c, num_samples)

# Initialize lists to store features (X) and target values (y)
X = []
y = []

# Generate NFW density profiles and extract relevant information
for c in concentration_values:
    # Generate a realization of particles
    nfw_profile = empirical_models.NFWProfile()
    nfw_radial_positions = nfw_profile.mc_generate_nfw_radial_positions(halo_mass=M, conc=c)
    # Make a histogram from the realization, divided into 100 bins
    nfw_hist, bin_edges = np.histogram(nfw_radial_positions, num_bins)
    # Create a DataFrame from the histogram values and bin edges
    hist_table = pd.DataFrame({'Bin Edges': bin_edges[:-1], 'Histogram Values': nfw_hist})
    X.append(hist_table)
    y.append(c)


# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)

# Predict concentration parameters on the validation set
y_val_pred = model.predict(X_val)

# Evaluate the model's performance on the validation set
mse = mean_squared_error(y_val, y_val_pred)
mae = mean_absolute_error(y_val, y_val_pred)
r2 = r2_score(y_val, y_val_pred)

# Print the evaluation metrics
print("Mean Squared Error (MSE):", mse)
print("Mean Absolute Error (MAE):", mae)
print("R-squared (R^2) Score:", r2)

But I keep getting this error:

ValueError: Found array with dim 3. LinearRegression expected <= 2.
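
A likely reading of this message, shown as a minimal sketch: each element of `X` is a 100 x 2 table, so a list of 300 of them becomes a 3-D array of shape (300, 100, 2) when it is converted, whereas scikit-learn estimators expect `X` with shape (n_samples, n_features). The zero-filled arrays below are placeholders standing in for the histogram tables.

```python
import numpy as np

num_samples, num_bins = 300, 100

# Placeholder tables: column 0 = bin edge, column 1 = count.
tables = [np.zeros((num_bins, 2)) for _ in range(num_samples)]

stacked = np.asarray(tables)
print(stacked.shape, stacked.ndim)  # (300, 100, 2) 3

# Keeping only the counts column gives one row per sample.
counts_only = stacked[:, :, 1]
print(counts_only.shape, counts_only.ndim)  # (300, 100) 2
```

Dropping the edges column this way is only meaningful if every histogram uses the same edges; otherwise column i would mean a different radial interval in each row.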

Any ideas?

Edit: here is the full error

Traceback (most recent call last):
  File "C:\Users\Raeed\AppData\Local\Programs\Python\Python39\lib\contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\Raeed\PycharmProjects\NFW_profile\lib\site-packages\sklearn\_config.py", line 353, in config_context
    yield
  File "C:\Users\Raeed\PycharmProjects\NFW_profile\lib\site-packages\sklearn\base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "C:\Users\Raeed\PycharmProjects\NFW_profile\lib\site-packages\sklearn\linear_model\_base.py", line 678, in fit
    X, y = self._validate_data(
  File "C:\Users\Raeed\PycharmProjects\NFW_profile\lib\site-packages\sklearn\base.py", line 622, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "C:\Users\Raeed\PycharmProjects\NFW_profile\lib\site-packages\sklearn\utils\validation.py", line 1146, in check_X_y
    X = check_array(
  File "C:\Users\Raeed\PycharmProjects\NFW_profile\lib\site-packages\sklearn\utils\validation.py", line 951, in check_array
    raise ValueError(
ValueError: Found array with dim 3. LinearRegression expected <= 2.
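
A hedged end-to-end sketch of one way the pipeline could produce a 2-D feature matrix that `LinearRegression` accepts. To stay self-contained it does not call halotools: `sample_nfw_scaled_radii` is a hypothetical stand-in that draws x = r / R_vir by numerically inverting the NFW enclosed-mass fraction, and the fixed edges on [0, 1] are an assumption that only works because the radii are rescaled.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def sample_nfw_scaled_radii(conc, num_pts, rng):
    """Draw x = r / R_vir in [0, 1] from an NFW profile.

    Hypothetical stand-in for the halotools sampler: inverts the
    enclosed-mass fraction m(x) = ln(1 + c x) - c x / (1 + c x) on a grid.
    """
    x_grid = np.linspace(0.0, 1.0, 2001)
    cx = conc * x_grid
    mass = np.log1p(cx) - cx / (1.0 + cx)  # unnormalized enclosed mass
    cdf = mass / mass[-1]                  # increases from 0 to 1
    u = rng.uniform(0.0, 1.0, size=num_pts)
    return np.interp(u, cdf, x_grid)


rng = np.random.default_rng(42)
num_samples, num_bins, num_pts = 300, 100, 10_000

concentrations = rng.uniform(4.0, 40.0, num_samples)
shared_edges = np.linspace(0.0, 1.0, num_bins + 1)  # same edges for all samples

# One row of counts per realization: shape (n_samples, n_features).
X = np.empty((num_samples, num_bins))
for i, c in enumerate(concentrations):
    radii = sample_nfw_scaled_radii(c, num_pts, rng)
    counts, _ = np.histogram(radii, bins=shared_edges)
    X[i] = counts

X_train, X_val, y_train, y_val = train_test_split(
    X, concentrations, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_val, model.predict(X_val))

print(X.shape)  # (300, 100)
print("R^2 on validation set:", r2)
```

With the real halotools output, the same idea would need either radii rescaled by the halo radius or one fixed physical range for the edges; which is more appropriate depends on the data and is not settled here.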

python machine-learning scikit-learn linear-regression

A: No answers yet