While 循环未遵守条件-解网

问：

我有一个数据框，其中包含一组文本，每个文本都有一个字数。它看起来有点像这样：

df1 <- data.frame(Items = sample(1:495, 495, replace = FALSE), Length.in.words = sample(380:820, 495, replace = TRUE))

我需要将这些文本分配给法官。此赋值有几个参数。首先，每个文本应由三名法官观看。因此，我复制了两次原始数据帧，因此每个文本现在表示了三次：

df2 <- df1
df3 <- df1
df <- rbind(df1, df2, df3)

接下来，我编写了一个小函数来将文本分配给评委。它使用 tidyverse 和 groupdata2：

library(tidyverse)
library(groupdata2)

sample_scripts <- function(x, judges){

all <- fold(x, num_col = "Length.in.words", k = judges)
  judge_list <- split(all, all$.folds)
  while (TRUE %in% lapply(X = judge_list, FUN = duplicated)){
    all <- fold(all, num_col = "Length.in.words", k = judges)
    judge_list <- split(all, all$.folds)
  }

  assign("judge_list", judge_list, envir = globalenv())
}

该函数执行几项操作。它用于在字数方面平衡每个法官的文本。而且，至少在理论上，它用于迭代，直到每个法官的文本列表不包含重复。评委们不能两次看到相同的文本！我像这样运行函数：groupdata2::foldwhile

sample_scripts(df, 33)

但是“while”循环没有按预期工作 - 每次我运行该函数时，我都会在法官的几个文本列表中得到重复。我是这样测试的：

lapply(X = judge_list, FUN = duplicated)

谁能帮我解决这个问题，以便 while 循环迭代文本赋值，直到任何评委列表中没有重复项？

谢谢

彼得

r while循环

我需要两个对象。一个是judge_list，实际上它不仅包含文本编号和字数，还包含以后可能需要分析的几列其他数据。第二项是一个简单的数据框，有 33 列（每列对应一个法官），以及分配给它们的文本的文本编号（创建它的代码在上面的示例中没有）。同样，我上面提到的限制必须得到尊重：（1）每个文本由三个不同的法官看到，（2）每个法官的名单都有相似长度的文本，（3）任何法官都不能多次看到任何单个文本。

0赞 Chris 4/7/2023

将 df 保留为三个列表而不是它们可能会简化。rbind(

答：

1赞 Martin Gal 4/8/2023 #1

我尝试仅使用 R 函数重新编码您的函数：sample_scriptsbase

sample_scripts <- function(data, n_judges = 33, n_judge_per_item = 3) {

  # Default-Values: 
  # 33 Judges
  # each Item has to be revised by 3 judges
  
  # initialize output
  judges_list <- lapply(seq_len(n_judges), 
                        \(n) data.frame(
                            Item = integer(0),
                            Length.in.words = integer(0)
                            )
                        )

  # Check if a judge is still available  
  free_judge <- rep(TRUE, n_judges)
  
  # target_length tries to distribute the text sizes
  target_length <- n_judge_per_item * sum(data$Length.in.words) / n_judges 

  # loop over items data.frame
  for (i in seq_len(nrow(data))) {
    # randomly pick three free judges
    samp_judge <- sample(seq_len(n_judges)[free_judge], size = n_judge_per_item)
    
    # assign items to judge
    for (j in samp_judge) {
      judges_list[[j]] <- 
        rbind(
          judges_list[[j]],
          data.frame(Item = data[["Items"]][i],
                     Length.in.words = data[["Length.in.words"]][i])
          )
      
      # check if judge is still available
      free_judge[j] <- (sum(judges_list[[j]][["Length.in.words"]]) < target_length)
        
    }
  }
  
  # return list
  judges_list
}

您的输入只需要而不是 .例如，在本例中，返回df1df <- rbind(df1, df1, df1)sample_scripts(df1)

[[14]]
   Item Length.in.words
1    12             817
2   245             389
3   372             649
4   413             731
5   329             698
6   193             492
7   405             731
8   376             593
9   297             539
10  356             774
11  125             492
12  359             716
13  472             611
14  128             397
[...]

将 s 替换为 s 返回data.frametibble

[[1]]
# A tibble: 42 × 2
    Item Length.in.words
   <int>           <int>
 1    99             738
 2    15             483
 3   341             763
 4   333             702
 5   407             797
 6   376             593
 7   356             774
 8     4             535
 9    53             739
10   354             570
# … with 32 more rows
# ℹ Use `print(n = ...)` to see more rows

[[2]]
# A tibble: 47 × 2
    Item Length.in.words
   <int>           <int>
 1    86             770
 2   404             550
 3   357             675
 4   331             431
 5   116             589
 6   438             554
 7   459             497
 8   319             651
 9    97             718
10    85             510
# … with 37 more rows
# ℹ Use `print(n = ...)` to see more rows
[...]

您对重复项的检查仍然会给我们lapply(X = judge_list, duplicated)

[[1]]
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

[[2]]
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

通过使用，您可以检查每个评委的字数是否相似。unlist(lapply(X = judge_list, \(x) sum(x$Length.in.words)))

非常感谢。但是在运行代码后，我意识到在我的解释中有一个约束不明确 - 所有法官都需要有相同数量的文本（45）。有没有办法添加此约束？我想也许法官在年满45岁时可以从“自由法官”名单中删除;但我不知所措 - 我不知道如何在不扰乱脚本其余部分的情况下对其进行编码。

2赞 Marcus 4/8/2023 #2

此处左侧字段的解决方案。我认为这可能被认为是一个NP难题，但当我读到它时，它让我印象深刻，这是一个受约束的离散优化问题 - 进行分配以最小化分配给每个法官的页数差异，为每个项目分配三名法官，并确保没有法官被分配相同的项目两次。

运行需要一段时间，但这里有一个使用遗传算法的概率解决方案

library(GA)
library(purrr)
library(dplyr)
library(tidyr)

df1 <- data.frame(
  Items = sample(1:495, 495, replace = FALSE), 
  Length.in.words = sample(380:820, 495, replace = TRUE)
)


sample_scripts <- function(x, judges){
  
  decode <- function(encX) {
    # enumerate the three judges assigned to each text
    judgesAssign <- floor(encX) |> 
      split(seq_along(x$Items))
    
    # check for duplicates
    dups <- map_lgl(judgesAssign, \(.x) any(duplicated(.x))) |> sum()
    
    list(judgesAssign = judgesAssign, dups = dups)
  }
  
  fitness <- function(encX) {
    
    decX <- decode(encX)
    
    # enumerate the texts assigned to each judge
    itemAssign <- decX$judgesAssign |> 
      as_tibble() |> 
      pivot_longer(
        everything(), 
        names_to = "Items",
        values_to = "judge"
      ) |> 
      # match types in df
      mutate(Items = as.integer(Items))
    
    # determine words and items assigned to each count
    assignCounts <-  itemAssign|> 
      left_join(x, by = "Items") |> 
      # ensure every judge is represented
      bind_rows(
        tibble(judge = seq(judges), Length.in.words = 0)
      ) |> 
      group_by(judge) |> 
      # subtract 1 for empty row
      summarise(n_words = sum(Length.in.words), n_items = n() - 1)
    
    # minimize variance
    f <- -(var(assignCounts$n_words)/mean(assignCounts$n_words) + 
             var(assignCounts$n_items)/mean(assignCounts$n_items))
    
    # penalty term for duplicates
    pen <- sqrt(.Machine$double.xmax)
    dupPen <- decX$dups*pen
    
    # fitness function minimizes variance and heavy penalty for duplicates
    f - dupPen
    
  }
  
  scriptCt <- NROW(x)
  
  GA <- ga(
    "real-valued",
    fitness = fitness,
    lower = rep(1, scriptCt*3), upper = rep(judges, scriptCt*3) + 1,
    maxiter = 750, run = 200, seed = 123, pmutation = 0.25,
    suggestions = sample(seq(judges), 3*scriptCt, TRUE),
    maxFitness = 0
  )

  decSol <- decode(GA@solution[1, ])
  finalAssignments <- decSol$judgesAssign |>
    as_tibble() |>
    pivot_longer(
      everything(),
      names_to = "Items",
      values_to = "judge"
    ) |>
    # match types in df
    mutate(Items = as.integer(Items))
  
  list(
    gaRslt = GA,
    fitness = fitness_tot(GA@solution),
    decodedSol = decSol,
    assignmentsByJudge = split(finalAssignments$Items, finalAssignments$judge),
    assignmentsByItem = decSol$judgesAssign,
    anyDuplicates = decSol$dups
  )

}

sample_scripts(df1, 33)

它不能确保每个裁判都是每个裁判都有相同的数量，但它会进行优化以保持数量相对相同。当我运行它时，大多数评委都有 45 到 55 个项目，但也有很长的尾巴。我没有成功尝试将其限制为所有 45 个。您可以使用一些参数来解决问题，这可能会有所帮助，但我认为解决方案空间有很多局部最小值，因此可能很困难。Itemsga

如果您可以选择手动调整，则可以以此为起点，从评委那里拿走一些有很多项目，然后将它们分配给几乎没有的项目。不是全局最优，但可能非常接近。

0赞 Peter Thwaites 4/8/2023 #3

感谢大家的建议。对每位法官的相同项目数的限制是严格的（我显然应该提到这一点）——比尽量减少每个法官每个文本的平均长度的限制更严格。我写了以下内容，它使用“贪婪”的方法对项目进行分组，以确保每个评委的名单中没有重复，给每个评委 45 个项目，并确保每个项目正好被三位评委看到。它不管理项目的平均长度，但它在分配之前随机化项目列表，所以我可以重新运行该过程几次，直到我很高兴范围足够低。

sample_scripts <- function(x, texts_per_judge){
  x1 <- x
  x1 <- x1[sample(1:nrow(x1)), ]
  x2 <- x
  x2 <- x2[sample(1:nrow(x2)), ]
  x3 <- x
  x3 <- x3[sample(1:nrow(x3)), ]
  all <- rbind(x1, x2, x3)
  all <- group(all, method = "greedy", n = texts_per_judge)
  judge_list <- split(all, all$.groups)
  assign("judge_mean_length", value = aggregate(Length.in.words ~ .groups, data = judge_df, FUN = "mean"), envir = globalenv())
  assign("judge_list", judge_list, envir = globalenv())
}

我对这个解决方案非常满意，但我仍然很想听听是否有人能想出一个解决方案，在平均文本长度范围可控之前不需要重新运行该过程。

上一个：循环函数，直到两个数据集之间的长度在 R 中相等

下一个：在 R 中正确存储 while 循环结果

While 循环未遵守条件

While loop failing to respect conditions

评论

评论