Asked by: Ruth Walker · Asked: 8/17/2023 · Last edited by: desertnaut, Ruth Walker · Updated: 8/18/2023 · Views: 43
Splitting data into training, test and validation sets based on a grouping variable for machine learning
Q:
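I am trying to split my data into training, testing and validation sets. I have 2 treatment groups, Control and TP, and within them a secondary variable called Bio, which takes the values 1-4 in both groups.
I need the split to respect the treatment group (Control or TP) and then keep each Bio level together, so that if Control 1 is in the training set, the training set contains all of the Control 1 samples and all of the TP 1 samples. Although the example data below has the same number of samples in each Bio grouping (3), this is not the case in my other data, where the number of samples differs between Bio levels.
Please see the minimal dataset below: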
Sample Treatment Bio 285.945846 286.9638976 288.1004758 288.8109355
Control1_A13 Control 1 0.003535191 0.001777255 0.004729780 0.002364995
Control1_A14 Control 1 0.005063256 0.000110063 0.006249624 0.001041584
Control1_A15 Control 1 0.004262099 0.000836256 0.004277461 0.002699177
Control2_B13 Control 2 0.002411720 0.000466887 0.001129674 0.001109870
Control2_B14 Control 2 0.003085647 0.001831629 0.002482230 0.000000000
Control2_B15 Control 2 0.001996473 0.001060616 0.003995243 0.001369387
Control3_C13 Control 3 0.000299744 0.000851944 0.002808119 0.004065315
Control3_C14 Control 3 0.003187073 0.000591202 0.006833653 0.001713096
Control3_C15 Control 3 0.003692511 0.000262144 0.004673039 0.000126174
Control4_D13 Control 4 0.003369294 0.001087459 0.005171894 0.000675702
Control4_D14 Control 4 0.003818057 0.000838719 0.005513885 0.000458708
Control4_D15 Control 4 0.002572840 0.000257058 0.003537029 0.000009040
LX2+TP1_E1 TP 1 0.003347067 0.001231945 0.008181087 0.004436654
LX2+TP1_E2 TP 1 0.001552547 0.001463769 0.008864838 0.002728083
LX2+TP1_E3 TP 1 0.003224648 0.000812735 0.008518836 0.004303950
LX2+TP2_F1 TP 2 0.001705551 0.000182659 0.000911028 0.000240785
LX2+TP2_F2 TP 2 0.000760944 0.000759464 0.002486596 0.002377735
LX2+TP2_F3 TP 2 0.001034440 0.000647382 0.008146538 0.001028800
LX2+TP3_G1 TP 3 0.003660741 0.001260433 0.008046637 0.003182006
LX2+TP3_G2 TP 3 0.001802459 0.000547580 0.004882082 0.004121552
LX2+TP3_G3 TP 3 0.003590003 0.000089100 0.002801237 0.000403527
LX2+TP4_H1 TP 4 0.002831592 0.001534135 0.009151124 0.003021942
LX2+TP4_H2 TP 4 0.001863099 0.000959953 0.008284829 0.005169246
LX2+TP4_H3 TP 4 0.005649448 0.001959382 0.011814467 0.004110110
I have tried 2 different ways to do this:
- Method 1
library(caret)  # provides createDataPartition()
set.seed(1234)
# 60% of the rows go to training, stratified on Treatment
inTraining <- createDataPartition(vis_data2$Treatment, p = 0.6, list = FALSE)
training.set <- vis_data2[inTraining, ]
Totalvalidation.set <- vis_data2[-inTraining, ]
# Split the remaining 40% in half: 20% testing and 20% validation
inValidation <- createDataPartition(Totalvalidation.set$Treatment, p = 0.5, list = FALSE)
testing.set <- Totalvalidation.set[inValidation, ]
validation.set <- Totalvalidation.set[-inValidation, ]
However, this does not take the second variable, the Bio grouping, into account.
- Method 2
library(caTools)  # provides sample.split()
set.seed(1)
#Split into training and validation data sets
Y1 = vis_data2[,1] #defining treatment/ variable column
g1 = vis_data2[,3] #defines group column
final_vis_data <- sample.split(Y1,SplitRatio = 0.5,group = g1)
table(Y1,final_vis_data) #get correct split ratios
split(final_vis_data,g1) #while keeping samples with the same group label together
full_train_set <- vis_data2[ final_vis_data,]
test.set <- vis_data2[!final_vis_data,]
#Split training data set into training and testing data sets
Y2 = full_train_set[,1] #defining treatment/ variable column
g2 = full_train_set[,3] #defines group column
final_vis_data2 <- sample.split(Y2,SplitRatio = 0.5,group = g2)
table(Y2,final_vis_data2) #get correct split ratios
split(final_vis_data2,g2) #while keeping samples with the same group label together
test.set <- full_train_set[final_vis_data2,1:3]
validation.set <- full_train_set[!final_vis_data2,1:3]
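However, when I run this I often get "NA" values in my validation.index, and when I check the split, the Bio data is often not split correctly.
How can I get this to work?
A:
0 votes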
Seth
8/17/2023
#1
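This answer uses functions from rsample rather than caret's partitioning functions. It should help you create the initial splits for model fitting.
To demonstrate splitting the test data further, as you describe for the validation set, I needed to create some additional groups.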
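set.seed(123)
library(rsample)
# df: the example data with the additional Bio groups (its construction is not shown)
# Keep each Bio group intact; roughly 60% of the data goes to training
df_split <- group_initial_split(df, group = Bio, prop = 0.6)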
df_training <- training(df_split)
df_testing <- testing(df_split)
df_validation <- group_validation_split(df_testing, group = Bio, prop = 0.5)
df_analysis <- analysis(df_validation$splits[[1]])
df_assessment <- assessment(df_validation$splits[[1]])
levels(factor(df_training$Bio))
#> [1] "2" "3" "6" "8" "9" "10"
levels(factor(df_testing$Bio))
#> [1] "1" "4" "5" "7"
levels(factor(df_analysis$Bio))
#> [1] "1" "5"
levels(factor(df_assessment$Bio))
#> [1] "4" "7"
Created on 2023-08-17 with reprex v2.0.2
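The construction of `df` is not shown in the answer. As a rough, hypothetical stand-in, a toy data frame with ten Bio levels and both treatments in every level could be built as below. The `Replicate` and `value` columns, and the choice of three replicates per combination, are assumptions made for illustration and are not the answerer's data.

```r
# Hypothetical stand-in for the answer's `df` (its real construction is not shown).
# Assumption: 10 Bio levels, both treatments in each level, 3 replicates per combination.
set.seed(123)
df <- expand.grid(
  Replicate = 1:3,
  Treatment = c("Control", "TP"),
  Bio       = 1:10,
  stringsAsFactors = FALSE
)
df$value <- runif(nrow(df), min = 0, max = 0.01)  # placeholder measurement column

# Cross-tabulate to confirm every Bio level holds 3 Control and 3 TP rows.
table(df$Bio, df$Treatment)
```

A data frame shaped like this should be accepted by the grouped split calls above, since they only need a `Bio` column to group on.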
Comments
0 votes
Ruth Walker
8/19/2023
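Hi @Seth, when I try this code it splits on Bio the way I need, but my overall groups in the Treatment column (Control, TP) are now very unbalanced across the splits. Is there a way to edit it to correct this?
0 votes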
Seth
8/19/2023
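You can include an equivalent proportion of each level of Treatment in each resample by including strata = Treatment in the split. However, stratifying on one variable while keeping groups together on another requires strata to be constant within every group. You can only test this with your full dataset.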
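In the question's sample data every Bio level contains both Control and TP, so Treatment is not constant within Bio. The sketch below is a base-R alternative, not the rsample method: it assigns whole Bio levels to sets at random and then tabulates Treatment per set. The toy data, the 2/1/1 allocation of Bio levels, and the names `toy`, `set_of_bio` and `strata_constant` are assumptions made for illustration.

```r
# Toy data mirroring the question's layout (assumption: 4 Bio levels,
# 3 replicates per Treatment x Bio combination).
toy <- expand.grid(
  Replicate = 1:3,
  Bio       = 1:4,
  Treatment = c("Control", "TP"),
  stringsAsFactors = FALSE
)

# Count distinct treatments per Bio level; a count above 1 means
# Treatment is not constant within that Bio level.
treatments_per_bio <- tapply(toy$Treatment, toy$Bio, function(x) length(unique(x)))
strata_constant <- all(treatments_per_bio == 1)

# Assign whole Bio levels to sets: 2 levels to train, 1 to test, 1 to validation.
set.seed(1234)
bio_levels <- sort(unique(toy$Bio))
set_of_bio <- setNames(
  sample(c("train", "train", "test", "validation")),
  bio_levels
)
toy$set <- unname(set_of_bio[as.character(toy$Bio)])

training.set   <- toy[toy$set == "train", ]
testing.set    <- toy[toy$set == "test", ]
validation.set <- toy[toy$set == "validation", ]

# Each Bio level lands in exactly one set; Treatment counts per set follow
# from how many Control and TP rows each assigned Bio level carries.
table(toy$set, toy$Bio)
table(toy$set, toy$Treatment)
```

With real data where the Bio levels hold different numbers of samples, the Treatment counts per set can drift apart, so the two tables are worth checking after each random assignment.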