Stratified sampling to fixed total size instead of stratified sizes

Question:

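I have a dataset that I would like to downsample as evenly as possible across a given variable. Say the data frame has 54 observations and the fixed total size of the downsampled set is 25. However, because some strata have a small n, trying to select an equal number from each group throws an error: the number of observations in the smallest group is less than the intended per-stratum size (2 < 5 in the example below). Rather than duplicating observations with `replace = TRUE`, I would like a way to select all observations in the smaller groups and then fill up the numbers from the other strata until the specified sample size is reached. That means that once the first group, with only 2 observations, cannot be sampled any further, the counts for the remaining groups increase until 25 rows have been selected. This would give the most even downsampling across strata that is possible without duplicates.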

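Below is my example, along with the error I get when I try to sample the same number of rows from every group. Because I use `group_by` to do this, I cannot specify a total sample size of 25. Is there a better approach, or a function I don't know about, that makes it easy to sample this way? Or is there a way to make some `group_by` + `slice_sample` combination work?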

df <- data.frame(
  strat_group = c(rep("one", 2), rep("two", 10), rep("three", 5), rep("four", 25), rep("five", 12))
)

strat_group_size <- (25 / length(unique(df$strat_group)))

df |>
  dplyr::group_by(strat_group) |>
  dplyr::slice_sample(n = strat_group_size)

Error in `dplyr::slice_sample()`:
! Problem while computing indices.
ℹ The error occurred in group 3: strat_group = "one".
Caused by error in `sample.int()`:
! cannot take a sample larger than the population when 'replace = FALSE'
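
The size of the shortfall behind this error can be checked with a little arithmetic. The following is a minimal base R sketch; the names `counts`, `equal_share` and `capped` are illustrative and do not appear in the original post.

```r
df <- data.frame(
  strat_group = c(rep("one", 2), rep("two", 10), rep("three", 5),
                  rep("four", 25), rep("five", 12))
)

# rows available in each group (named integer vector, groups in alphabetical order)
counts <- vapply(split(df, df$strat_group), nrow, integer(1))

# equal share per group for a total of 25 rows
equal_share <- 25 / length(counts)   # 5

# what each group can actually supply without replacement
capped <- pmin(counts, equal_share)

capped
sum(capped)   # 22, so 3 rows still have to come from the larger groups
```

Simply capping the request at the group size would avoid the error but return only 22 rows, which is why the remaining 3 have to be redistributed.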

What I want is a way to downsample evenly across strata until a specific total (N = 25) is reached. The output would look like this:

df <- data.frame(
  strat_group = c(rep("1", 2), rep("2", 6), rep("3", 5), rep("4", 6), rep("5", 6))
  )
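
One way to arrive at this kind of allocation is to compute the per-group sizes first and sample afterwards. The sketch below is only an illustration under stated assumptions: `allocate_even` is a hypothetical helper written for this example (it is not part of dplyr or base R), and it assumes the requested total does not exceed the number of rows available.

```r
df <- data.frame(
  strat_group = c(rep("one", 2), rep("two", 10), rep("three", 5),
                  rep("four", 25), rep("five", 12))
)

# Hypothetical helper: split `total` as evenly as possible across groups,
# never asking a group for more rows than it has.
allocate_even <- function(counts, total) {
  stopifnot(total <= sum(counts))
  alloc <- counts * 0L        # rows assigned to each group so far
  open <- counts > 0          # groups that can still supply rows
  remaining <- total
  while (remaining > 0) {
    share <- remaining %/% sum(open)
    if (share == 0) {
      # fewer rows left than open groups: give one extra row to randomly chosen open groups
      idx <- which(open)
      lucky <- idx[sample.int(length(idx), remaining)]
      alloc[lucky] <- alloc[lucky] + 1L
      remaining <- 0
    } else {
      # give every open group an equal share, capped at what it has left
      add <- pmin(counts - alloc, share) * open
      alloc <- alloc + add
      remaining <- remaining - sum(add)
      open <- alloc < counts
    }
  }
  alloc
}

counts <- vapply(split(df, df$strat_group), nrow, integer(1))
alloc <- allocate_even(counts, 25)   # five 6, four 6, one 2, three 5, two 6

# draw the allocated number of rows from each group without replacement
picked <- unlist(lapply(names(alloc), function(g) {
  rows <- which(df$strat_group == g)
  rows[sample.int(length(rows), alloc[[g]])]
}))
sampled <- df[sort(picked), , drop = FALSE]
table(sampled$strat_group)
```

Separating the allocation from the sampling keeps the random draw itself trivial, and the allocation can be inspected before any rows are chosen.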

I would greatly appreciate any help! This problem has stumped me for a while.

Tags: r dplyr group-by tidyverse slice

Answer:

N <- 25           # how many rows do we want?
df$sampled <- 0   # mark every row as unselected to begin with

for (i in 1:N) {

  # count the rows already taken from each group, and the rows still available in each group

  df$totalpergroup <- ave(df$sampled, df$strat_group, FUN = sum)
  df$remaining <- ave(1 - df$sampled, df$strat_group, FUN = sum)

  # choose an unselected row from the least represented group that has at least one row left
  # index via sample(length(...), 1) because sample(x, 1) draws from 1:x when x is a single number

  possibleRows <- which((df$totalpergroup == min(df[df$remaining > 0, "totalpergroup"])) & (df$sampled == 0))
  rowToAdd <- possibleRows[sample(length(possibleRows), 1)]

  # select that row

  df$sampled[rowToAdd] <- 1

}

# Here's my subsampled df
# (totalpergroup and remaining were last computed before the final pick, so they lag one step behind)

df[df$sampled == 1, ]

   strat_group sampled totalpergroup remaining
1          one       1             2         0
2          one       1             2         0
3          two       1             5         5
7          two       1             5         5
8          two       1             5         5
9          two       1             5         5
10         two       1             5         5
12         two       1             5         5
13       three       1             5         0
14       three       1             5         0
15       three       1             5         0
16       three       1             5         0
17       three       1             5         0
19        four       1             6        19
21        four       1             6        19
27        four       1             6        19
28        four       1             6        19
36        four       1             6        19
42        four       1             6        19
46        five       1             6         6
47        five       1             6         6
48        five       1             6         6
51        five       1             6         6
52        five       1             6         6
53        five       1             6         6
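
The same "take one row from each group in turn" idea can also be written without an explicit loop. The sketch below is an alternative formulation, not the code from the answer above: it gives every row a random position within its group and keeps the N rows with the lowest positions. The names `within_rank`, `keep` and `sampled` are illustrative, and `df` is rebuilt from the question so the block runs on its own.

```r
df <- data.frame(
  strat_group = c(rep("one", 2), rep("two", 10), rep("three", 5),
                  rep("four", 25), rep("five", 12))
)
N <- 25

# random position of each row within its own group: 1, 2, 3, ...
within_rank <- ave(runif(nrow(df)), df$strat_group, FUN = rank)

# sort by position, break ties between groups at random, keep the first N rows
keep <- order(within_rank, runif(nrow(df)))[seq_len(N)]
sampled <- df[sort(keep), , drop = FALSE]

table(sampled$strat_group)   # five 6, four 6, one 2, three 5, two 6
```

Small groups run out of positions early, so the later positions are filled only by the larger groups. The same steps should translate to dplyr with `group_by()`, `mutate()`, `arrange()` and `slice_head()`.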
