
Can the k-index in either of the two individual locations be used to predict the estimated kp-index?

Asked by S. M. on 11/18/2023 · Updated 11/18/2023 · Viewed 43 times

Q:

The k-index measures conditions in the magnetosphere. It is typically averaged over three-hour intervals, so there are 8 measurements per day.

  • The planetary k-index (kp-index) is the average of measurements taken at 15 locations worldwide,
  • The estimated kp-index is the average over only 8 locations. The data is available here:

https://www.swpc.noaa.gov/products/planetary-k-index

If you look at their data for the past 30 days:

https://services.swpc.noaa.gov/text/daily-geomagnetic-indices.txt

You can see that they give the k-index measured at two different locations (Fredericksburg and College), as well as the estimated kp-index. (You can ignore the columns labeled "A".)
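For reference, here is a minimal parsing sketch for that file. The column layout is my assumption (3 date fields, then an A value plus 8 K values for Fredericksburg, the same for College, then the planetary A value plus 8 estimated Kp values, with header lines starting with ':' or '#'), so verify the positions against the actual file:

import urllib.request
import numpy as np

URL = "https://services.swpc.noaa.gov/text/daily-geomagnetic-indices.txt"

rows = []
with urllib.request.urlopen(URL) as resp:
    for line in resp.read().decode().splitlines():
        # Skip header/comment lines; keep only whitespace-separated data rows
        if line and line[0] not in ":#":
            fields = line.split()
            if len(fields) == 30:  # assumed layout: date(3) + A,K*8 per station + A,Kp*8
                rows.append(fields)

data = np.array(rows, dtype=float)
fredericksburg_k = data[:, 4:12]   # 8 Fredericksburg K values per day (assumed positions)
college_k = data[:, 13:21]         # 8 College K values per day (assumed positions)
estimated_kp = data[:, 22:30]      # 8 estimated planetary Kp values per day (assumed positions)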

A single training point is the 8-value vector of Fredericksburg k-indices for some day, and its label is the 8-value vector of the estimated kp-index for the same day. After the algorithm has been trained on a large number of sample points, a new query is the Fredericksburg k-index vector for another day (with no label), and the algorithm must produce an 8-value vector that is its guess at the estimated kp-index for that day. For example, if the query vector is

2 2 1 2 1 2 1 2

(yesterday's Fredericksburg k-index vector), then the answer

2.00 0.67 0.67 0.67 0.67 1.33 1.00 1.33

would have zero error, because that is exactly the estimated kp-index for that day. Of course, we cannot expect the algorithm to produce a zero-error solution.
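To make the error criterion concrete, here is a tiny sketch using mean squared error (one common choice) to compare a guess against the actual estimated kp vector from the example above:

import numpy as np

query = np.array([2, 2, 1, 2, 1, 2, 1, 2])  # Fredericksburg k-index vector for the day
actual_kp = np.array([2.00, 0.67, 0.67, 0.67, 0.67, 1.33, 1.00, 1.33])

guess = actual_kp.copy()  # a perfect guess reproduces the estimated kp exactly
print(np.mean((guess - actual_kp) ** 2))  # 0.0 -> zero error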

My intention:

An interesting question is whether the k-index from either of the two individual locations can be used to predict the estimated kp-index.

My thoughts:

My idea is that this could be done by training a learning model (e.g., SVR) on part of the data (where a point is the 8-vector from Fredericksburg or College representing some day, and its label is the 8-vector of the estimated kp-index for that day), and then testing its error on the remaining data. I am programming in Python, and I would probably want to use sklearn.multioutput.MultiOutputRegressor.
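As a minimal sketch of that setup (with placeholder random data standing in for the real k-index vectors):

import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

X = np.random.rand(100, 8)  # 100 days, 8 k-index values each (placeholder data)
y = np.random.rand(100, 8)  # 100 matching 8-value estimated kp vectors (placeholder data)

# MultiOutputRegressor fits one independent SVR per output column
model = MultiOutputRegressor(SVR()).fit(X, y)
print(model.predict(X[:1]).shape)  # -> (1, 8)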

My answer:

Given dataset.txt:

#Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
import warnings

#Suppressing warnings for a cleaner output
warnings.filterwarnings("ignore")

# Load the dataset
dataset_path = "dataset.txt"
dataset = pd.read_csv(dataset_path, sep="\t", header=None)

# Split the space-separated data into columns
data = dataset[0].str.split(expand=True)

# Exclude unwanted columns: the 4th-7th fields (0-indexed 3-6) and the last 5 columns
data = data.drop(columns=[3, 4, 5, 6] + list(range(data.shape[1]-5, data.shape[1])))


# Extracting date, kp-index, and Fredericksburg columns
date_columns = data.iloc[:, :3].copy()  # .copy() avoids a SettingWithCopyWarning below
kp_index_columns = data.iloc[:, 3:11]
fredericksburg_columns = data.iloc[:, 11:19]

# Convert individual date columns to a single date column
date_columns['date'] = date_columns[0] + '-' + date_columns[1] + '-' + date_columns[2]

# Transforming Fredericksburg columns to float type
fredericksburg_transformed = fredericksburg_columns.astype(float)

# Defining the features (X) and target (y)
X = fredericksburg_transformed
y = kp_index_columns.astype(float)

# Splitting the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Initialize the SVR model with a linear kernel
svr = SVR(kernel='linear', C=1)

# Using MultiOutputRegressor to handle multiple target columns
model = MultiOutputRegressor(svr).fit(X_train, y_train)


# Predict on the test data
y_pred = model.predict(X_test)


# Calculate the Mean Squared Error (MSE) and R^2 score for evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Displaying the MSE and R^2 scores 
mse, r2 #(1.6405386761259917, 0.04202137156755004)

# Selecting a random sample from our features
random_index = np.random.randint(0, len(X))
sample_fredericksburg_random = X.iloc[random_index].values.reshape(1, -1)

# Predicting the kp-index for the randomly chosen Fredericksburg k-index vector
predicted_kp_index_random = model.predict(sample_fredericksburg_random)

# Retrieving the actual kp-index for the chosen sample
actual_kp_index_random = y.iloc[random_index].values.reshape(1, -1)


# Scaling the features and target data for better performance during grid search
scaler_X = StandardScaler().fit(X_train)
scaler_y = StandardScaler().fit(y_train)
X_train_scaled = scaler_X.transform(X_train)
y_train_scaled = scaler_y.transform(y_train)
X_test_scaled = scaler_X.transform(X_test)
y_test_scaled = scaler_y.transform(y_test)

# Defining the hyperparameter grid for SVR
param_grid = {
    'estimator__C': [0.1, 1, 10, 100],
    'estimator__epsilon': [0.01, 0.1, 1],
    'estimator__kernel': ['linear', 'rbf']
}

# Performing grid search to find the best hyperparameters
grid_search = GridSearchCV(MultiOutputRegressor(SVR()), param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train_scaled, y_train_scaled)


# Using the best estimator found during grid search
best_svr = grid_search.best_estimator_

# Training the model with the best hyperparameters
best_svr.fit(X_train_scaled, y_train_scaled)
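# (Note: GridSearchCV refits the best estimator on the full training set by
# default (refit=True), so this explicit fit is redundant but harmless.)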


# Making predictions on the scaled test data
y_pred_scaled = best_svr.predict(X_test_scaled)

# Transforming the predicted data back to the original scale
y_pred_best = scaler_y.inverse_transform(y_pred_scaled)

# Calculating evaluation metrics for the optimized model
mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)

# Displaying the improvements in MSE and R^2
mse_improvement = mse - mse_best
r2_improvement = r2_best - r2

# Displaying the results
mse_best, r2_best, mse_improvement, r2_improvement #(0.2357857371499288, 0.8438206360172018, 1.404752938976063, 0.8017992644496518)


# Predicting the kp-index for the previously chosen Fredericksburg k-index vector using the optimized model
sample_fredericksburg_scaled = scaler_X.transform(sample_fredericksburg_random)
predicted_kp_index_scaled_optimized = best_svr.predict(sample_fredericksburg_scaled)

# Transforming the predicted data back to the original scale
predicted_kp_index_optimized = scaler_y.inverse_transform(predicted_kp_index_scaled_optimized)

# Displaying the optimized prediction and actual kp-index
predicted_kp_index_optimized, actual_kp_index_random 
#(array([[3.31721741, 2.61251221, 1.63346665, 2.87117847, 1.03796742,
#         0.98798582, 1.65333422, 0.67988594]]),
# array([[3.333, 2.667, 1.667, 2.667, 1.   , 1.   , 1.667, 0.667]]))



# Extracting data for the last 30 days
date_last_30 = date_columns['date'].tail(30)
X_last_30 = X.tail(30)
y_last_30 = y.tail(30)

# Predicting KP-index for the last 30 days
y_pred_last_30_scaled = best_svr.predict(scaler_X.transform(X_last_30))
y_pred_last_30 = scaler_y.inverse_transform(y_pred_last_30_scaled)


# Visualizing the actual vs. predicted KP-index for the last 30 days
plt.figure(figsize=(20, 15))

# Looping through each KP-index column
for i in range(8):
    plt.subplot(4, 2, i+1)
    #Plotting the actual values
    plt.plot(date_last_30, y_last_30.iloc[:, i].values, 'r', label='Actual')
    #Plotting the predicted values
    plt.plot(date_last_30, y_pred_last_30[:, i], 'b--', label='Predicted')
    plt.title(f'KP-Index Column {i+1}')
    plt.xticks(rotation=45)
    if i == 0:
        plt.legend()

plt.tight_layout()
plt.suptitle('Actual vs Predicted KP-Index for Last 30 Days with Optimized Model', fontsize=16, y=1.05)
plt.show()


# Initializing the dataframe with the 'date' column
df_compact = pd.DataFrame()
df_compact['date'] = date_last_30.values

# Storing Fredericksburg values, predicted KP-index values, and actual KP-index values as lists
df_compact['fredericksburg'] = X_last_30.values.tolist()
df_compact['predicted_kp_index'] = y_pred_last_30.tolist()
df_compact['actual_kp_index'] = y_last_30.values.tolist()

# Rounding the values in 'predicted_kp_index' and 'actual_kp_index' columns to 2 decimal places
df_compact['predicted_kp_index'] = df_compact['predicted_kp_index'].apply(lambda x: [round(i, 2) for i in x])
df_compact['actual_kp_index'] = df_compact['actual_kp_index'].apply(lambda x: [round(i, 2) for i in x])

# Exporting the df_compact dataframe to a CSV file
csv_path = "pred_vs_act_last_30_days.csv"
df_compact.to_csv(csv_path, index=False)
df_compact.head(10) # Displaying the first few rows of the compact dataframe

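# Note: the last 30 rows overlap with the training split (train_test_split
# shuffled 80% of all rows into training), so most of these days were seen
# during fitting and the metrics below are optimistic compared to the
# held-out test scores computed earlier.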
# Calculating the evaluation metrics for the last 30 days' predictions
mae = mean_absolute_error(y_last_30, y_pred_last_30)
mse = mean_squared_error(y_last_30, y_pred_last_30)
r2 = r2_score(y_last_30, y_pred_last_30)
 
# Displaying the metrics
mae, mse, r2 #(0.06567479284845214, 0.010493205685711132, 0.9934506422188698)

My output images are:

[Output figures: actual vs. predicted KP-index for the last 30 days, one subplot per 3-hour column]

My question:

Is it possible to make the code more concise and less complex? Or is there another way to solve this problem? Although I am getting almost 100% accuracy with the algorithm above.
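For instance, one way I can imagine shortening the code is to use a regressor that handles multi-output targets natively, so that MultiOutputRegressor is not needed. A minimal sketch with KernelRidge (a different model, shown only to illustrate the idea; I have not verified it matches SVR's accuracy here):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error, r2_score

# X_train, X_test, y_train, y_test as produced by train_test_split above
pipe = make_pipeline(StandardScaler(), KernelRidge(kernel='rbf', alpha=1.0))
pipe.fit(X_train, y_train)  # KernelRidge accepts a 2-D y directly
y_pred = pipe.predict(X_test)
print(mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred))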

python-3.x pandas csv machine-learning data-science

Comments


A: No answers yet.