Asked by: Martha's Vineyard | Asked: 11/7/2023 | Updated: 11/7/2023 | Views: 48
Train-test split with test dataset that contains no values in the target variable
Q:
I have a dataset in Python that looks like this (just a small excerpt; there are 20 features in total):
| State | Representatives | Employees | Score |
|---|---|---|---|
| Alabama | 4 | 3 | 5 |
| Rhode Island | 7 | 4 | 2 |
| Maryland | 6 | 8 | 3 |
| Texas | 7 | 5 | 5 |
| Florida | 6 | 5 | 2 |
The Score value is categorical; it can only take the values 1, 2, 3, 4, or 5.
I preprocessed the data and encoded the categorical features (such as State) with a LabelEncoder.
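A minimal sketch of that encoding step (not the asker's actual code), reusing the sample states from the table above; the column name `State` is taken from the question:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"State": ["Alabama", "Rhode Island", "Maryland", "Texas", "Florida"]})
le = LabelEncoder()
# LabelEncoder assigns integer labels in alphabetical order of the classes
df["State_encoded"] = le.fit_transform(df["State"])
print(df["State_encoded"].tolist())
```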
Now I want to do a train-test split as follows: all rows that have a Score value should go into the training set, and all rows with NA in the Score column should go into the test set.
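A sketch of that split, assuming the data lives in a pandas DataFrame: rows with a Score form the training set and rows with NA in Score form the set to predict later (the column names and values here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Representatives": [4, 7, 6, 7, 6],
    "Employees": [3, 4, 8, 5, 5],
    "Score": [5, 2, np.nan, 5, np.nan],
})

# Rows with a known Score -> training set; rows with NA -> "test" set
train = df[df["Score"].notna()]
test = df[df["Score"].isna()]

X_train, y_train = train.drop(columns="Score"), train["Score"].astype(int)
X_test = test.drop(columns="Score")  # there is no ground-truth y for these rows
```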
I use a RandomForestClassifier to find the n most important features, which I will use later on.
After that I used a KNeighborsClassifier and a RandomForestClassifier.
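One common way to do this feature selection is to rank columns by the fitted forest's `feature_importances_` attribute; the following sketch uses `make_classification` as a stand-in for the real 20-feature dataset, and `n_features_to_select` is an illustrative value:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X, y)

n_features_to_select = 5  # illustrative value; the question varies this from 1 to 20
# Indices of the n columns with the highest importance, most important first
top_idx = np.argsort(rf.feature_importances_)[::-1][:n_features_to_select]
X_subset = X[:, top_idx]  # keep only the n most important columns
```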
But when I do cross-validation, the scores for these models are very low (around 0.5); see the code here:
### 5) Check models' performances

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, cross_val_score

clf = RandomForestClassifier(max_depth=best_max_depth, random_state=42)
clf.fit(X_train_subset, y_train)
knn = KNeighborsClassifier(n_neighbors=best_k_value)
knn.fit(X_train_subset, y_train)

knn_test_score = knn.score(X_test_subset, y_test)
clf_test_score = clf.score(X_test_subset, y_test)
knn_train_score = knn.score(X_train_subset, y_train)
clf_train_score = clf.score(X_train_subset, y_train)

print(f"TEST data - kNN Score for {n_features_to_select} selected features: {knn_test_score:.3f}")
print(f"TEST data - RF Score for {n_features_to_select} selected features: {clf_test_score:.3f}")
print(f"TRAINING data - kNN Score for {n_features_to_select} selected features: {knn_train_score:.3f}")
print(f"TRAINING data - RF Score for {n_features_to_select} selected features: {clf_train_score:.3f}")
print("*" * 70)

# Perform Cross Validation to avoid overfitting
# Source: https://scikit-learn.org/stable/modules/cross_validation.html
# Define the number of folds for cross-validation
# Smaller values for n_folds mean larger validation sets ("test" sets out of the training data) and smaller training sets for each iteration -> more variability in the assessment
# Higher values for n_folds mean smaller validation sets ("test" sets out of the training data) and larger training sets for each iteration -> lower variability in the assessment
# Recommended are values between 5 and 10
# Source: https://machinelearningmastery.com/k-fold-cross-validation/
n_folds = 10

# Create a k-fold cross-validation iterator
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

# Initialize models
clf = RandomForestClassifier(max_depth=best_max_depth, random_state=42)
knn = KNeighborsClassifier(n_neighbors=best_k_value)

# Perform k-fold cross-validation for each model
clf_train_scores = cross_val_score(clf, X_train_subset, y_train, cv=kf)
knn_train_scores = cross_val_score(knn, X_train_subset, y_train, cv=kf)
clf_test_scores = cross_val_score(clf, X_test_subset, y_test, cv=kf)
knn_test_scores = cross_val_score(knn, X_test_subset, y_test, cv=kf)

# Print the mean and standard deviation of the cross-validation scores
print(f"Random Forest Classifier (RF) Cross-Validation Scores for {n_features_to_select} selected features:")
print(f"Mean RF Score: {round(clf_train_scores.mean(), 3)}")
print(f"Standard Deviation RF Score: {round(clf_train_scores.std(), 3)}")
print("*" * 70)
print(f"k-Nearest Neighbors (kNN) Cross-Validation Scores for {n_features_to_select} selected features:")
print(f"Mean kNN Score: {round(knn_train_scores.mean(), 3)}")
print(f"Standard Deviation kNN Score: {round(knn_train_scores.std(), 3)}")
print("/" * 100)
```
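For reference, a minimal, self-contained illustration of the cross-validation pattern used above; `make_classification` is only a stand-in for the real dataset, and note that `cross_val_score` refits and evaluates the model entirely within whatever data it is given:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=150, n_features=10, random_state=42)
kf = KFold(n_splits=10, shuffle=True, random_state=42)
# One accuracy score per fold; each fold is trained on the other 9 folds
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=kf)
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")
```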
This produces the following values for the given numbers of selected features (the values are similar for every number of selected features, which can range from 1 to 20):
I don't understand why the "RF Score (training)" value is so high, but not when I perform cross-validation.
Can you help me figure out what I'm doing wrong here?
Are the predictions that bad because I have no "true" y, since y_test consists of NA?
A: No answers yet