Asked by: Raeed Mundow · Asked: 11/11/2023 · Last edited by: desertnaut, Raeed Mundow · Updated: 11/12/2023 · Views: 65
How to use Histogram or table DataFrame as a predictor in linear regression?
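Q:
I am trying to build a model that can compute the concentration of dark matter particles in a halo. There are already packages and methods that generate particles on a radial grid according to a specific density distribution called the NFW profile. I turn each realization of particles into a histogram, and I want to use that histogram as my predictor: each histogram in the predictor set corresponds to a response called the concentration parameter c. This should be easy, since I have done similar work before, but here I ran into a problem. Every histogram in the set is divided into 100 bins, but the bin size is not the same from one histogram to the next. For example, this histogram corresponds to c=5: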
Bin Edges Histogram Values
0 0.000486 21
1 0.002544 39
2 0.004602 73
3 0.006660 60
4 0.008718 64
.. ... ...
95 0.195999 83
96 0.198057 64
97 0.200115 63
98 0.202173 74
99 0.204231 70
And this histogram corresponds to c=20:
Bin Edges Histogram Values
0 0.000085 76
1 0.002147 188
2 0.004209 205
3 0.006271 216
4 0.008333 230
.. ... ...
95 0.195968 40
96 0.198030 36
97 0.200092 45
98 0.202154 40
99 0.204215 42
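For clarity I converted the histograms to tables. As you can see, the bin_edges differ between predictors, so I cannot reduce this 2-dimensional data to a single dimension.
My attempt looks like this: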
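One possible way around the differing bin edges, sketched here under assumptions rather than taken from the question: pass the same explicit edge array to `np.histogram` for every realization, so that bin i covers the same radial interval in every sample and the counts alone form a 1-D feature vector. The range `[0, 1]`, the helper name `histogram_features`, and the beta-distributed stand-in radii are all hypothetical; real NFW draws would need a range that covers every halo.

```python
import numpy as np

rng = np.random.default_rng(0)
num_bins = 100

# Assumption: all radii fall inside [0, 1]; pick a range that suits the real data.
shared_edges = np.linspace(0.0, 1.0, num_bins + 1)

def histogram_features(radii, edges=shared_edges):
    """Count the radii on the shared grid and return the counts as a 1-D vector."""
    counts, _ = np.histogram(radii, bins=edges)
    return counts

# Stand-in data (hypothetical): beta-distributed radii replace the NFW realizations.
samples = [rng.beta(2.0, b, size=5000) for b in (2.0, 4.0, 8.0)]

# One row per realization, one column per shared bin.
X = np.vstack([histogram_features(r) for r in samples])
print(X.shape)  # (3, 100)
```

With shared edges the 'Bin Edges' column carries no per-sample information, so it can be dropped from the features.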
import numpy as np
from halotools import empirical_models
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Define the range for concentration parameter (c) and radius (r)
min_c = 4.0
max_c = 40.0
M = 1E15  # Halo mass in Msun/h
z = 0.0 # Redshift
num_samples = 300 # Number of samples
num_bins = 100 # Number of bins in each histogram
# Generate random concentration values (c) within the specified range
concentration_values = np.random.uniform(min_c, max_c, num_samples)
# Initialize lists to store features (X) and target values (y)
X = []
y = []
# Generate NFW density profiles and extract relevant information
for c in concentration_values:
    # Generate a realization of particles
    nfw_profile = empirical_models.NFWProfile()
    nfw_radial_positions = nfw_profile.mc_generate_nfw_radial_positions(halo_mass=M, conc=c)
    # Make a histogram from the realization, divided into 100 bins
    nfw_hist, bin_edges = np.histogram(nfw_radial_positions, num_bins)
    # Create a DataFrame from the histogram values and bin edges
    hist_table = pd.DataFrame({'Bin Edges': bin_edges[:-1], 'Histogram Values': nfw_hist})
    X.append(hist_table)
    y.append(c)
# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Predict concentration parameters on the validation set
y_val_pred = model.predict(X_val)
# Evaluate the model's performance on the validation set
mse = mean_squared_error(y_val, y_val_pred)
mae = mean_absolute_error(y_val, y_val_pred)
r2 = r2_score(y_val, y_val_pred)
# Print the evaluation metrics
print("Mean Squared Error (MSE):", mse)
print("Mean Absolute Error (MAE):", mae)
print("R-squared (R^2) Score:", r2)
But I keep getting this error:
ValueError: Found array with dim 3. LinearRegression expected <= 2.
Any ideas?
Edit: here is the full error
Traceback (most recent call last):
  File "C:\Users\Raeed\AppData\Local\Programs\Python\Python39\lib\contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\Raeed\PycharmProjects\NFW_profile\lib\site-packages\sklearn\_config.py", line 353, in config_context
    yield
  File "C:\Users\Raeed\PycharmProjects\NFW_profile\lib\site-packages\sklearn\base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "C:\Users\Raeed\PycharmProjects\NFW_profile\lib\site-packages\sklearn\linear_model\_base.py", line 678, in fit
    X, y = self._validate_data(
  File "C:\Users\Raeed\PycharmProjects\NFW_profile\lib\site-packages\sklearn\base.py", line 622, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "C:\Users\Raeed\PycharmProjects\NFW_profile\lib\site-packages\sklearn\utils\validation.py", line 1146, in check_X_y
    X = check_array(
  File "C:\Users\Raeed\PycharmProjects\NFW_profile\lib\site-packages\sklearn\utils\validation.py", line 951, in check_array
    raise ValueError(
ValueError: Found array with dim 3. LinearRegression expected <= 2.
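A likely reading of this error, offered as a sketch rather than a confirmed diagnosis: X is a list of 100×2 tables, so stacking it gives an array of shape (samples, 100, 2), while the estimator expects a 2-D array of shape (samples, features). The snippet below reproduces the shapes with stand-in data (uniform random values instead of NFW realizations, which is an assumption made so that it runs without halotools) and shows two ways to get one row per histogram.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
num_samples, num_bins = 5, 100

# Stand-in tables (hypothetical data) with the same two columns as in the question.
tables = []
for _ in range(num_samples):
    counts, edges = np.histogram(rng.random(1000), bins=num_bins)
    tables.append(pd.DataFrame({'Bin Edges': edges[:-1], 'Histogram Values': counts}))

# Stacking the tables yields (samples, bins, columns): three dimensions.
stacked = np.stack([t.to_numpy() for t in tables])
print(stacked.shape)  # (5, 100, 2)

# Option A: keep only the counts, one row of 100 features per histogram.
X_counts = np.stack([t['Histogram Values'].to_numpy() for t in tables])
print(X_counts.shape)  # (5, 100)

# Option B: flatten each table into one row of 200 features (edges and counts).
X_flat = stacked.reshape(num_samples, -1)
print(X_flat.shape)  # (5, 200)
```

Either 2-D array has the (samples, features) layout that `LinearRegression.fit` accepts. Option A is only meaningful when bin i covers a comparable interval in every sample; Option B keeps the edges as features at the cost of doubling the feature count.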
A: No answers yet