提问人:Peter Thwaites 提问时间:4/7/2023 最后编辑:Peter Thwaites 更新时间:4/8/2023 访问量:65
While 循环未遵守条件
While loop failing to respect conditions
问:
我有一个数据框,其中包含一组文本,每个文本都有一个字数。它看起来有点像这样:
df1 <- data.frame(Items = sample(1:495, 495, replace = FALSE), Length.in.words = sample(380:820, 495, replace = TRUE))
我需要将这些文本分配给法官。此赋值有几个参数。首先,每个文本应由三名法官观看。因此,我复制了两次原始数据帧,因此每个文本现在表示了三次:
df2 <- df1
df3 <- df1
df <- rbind(df1, df2, df3)
接下来,我编写了一个小函数来将文本分配给评委。它使用 tidyverse 和 groupdata2:
library(tidyverse)
library(groupdata2)
sample_scripts <- function(x, judges){
all <- fold(x, num_col = "Length.in.words", k = judges)
judge_list <- split(all, all$.folds)
while (TRUE %in% lapply(X = judge_list, FUN = duplicated)){
all <- fold(all, num_col = "Length.in.words", k = judges)
judge_list <- split(all, all$.folds)
}
assign("judge_list", judge_list, envir = globalenv())
}
该函数执行几项操作。它用于在字数方面平衡每个法官的文本。而且,至少在理论上,它用于迭代,直到每个法官的文本列表不包含重复。评委们不能两次看到相同的文本!我像这样运行函数:groupdata2::fold
while
sample_scripts(df, 33)
但是“while”循环没有按预期工作 - 每次我运行该函数时,我都会在法官的几个文本列表中得到重复。我是这样测试的:
lapply(X = judge_list, FUN = duplicated)
谁能帮我解决这个问题,以便 while 循环迭代文本赋值,直到任何评委列表中没有重复项?
谢谢
彼得
答:
我尝试仅使用 R 函数重新编码您的函数:sample_scripts
base
sample_scripts <- function(data, n_judges = 33, n_judge_per_item = 3) {
# Default-Values:
# 33 Judges
# each Item has to be revised by 3 judges
# initialize output
judges_list <- lapply(seq_len(n_judges),
\(n) data.frame(
Item = integer(0),
Length.in.words = integer(0)
)
)
# Check if a judge is still available
free_judge <- rep(TRUE, n_judges)
# target_length tries to distribute the text sizes
target_length <- n_judge_per_item * sum(data$Length.in.words) / n_judges
# loop over items data.frame
for (i in seq_len(nrow(data))) {
# randomly pick three free judges
samp_judge <- sample(seq_len(n_judges)[free_judge], size = n_judge_per_item)
# assign items to judge
for (j in samp_judge) {
judges_list[[j]] <-
rbind(
judges_list[[j]],
data.frame(Item = data[["Items"]][i],
Length.in.words = data[["Length.in.words"]][i])
)
# check if judge is still available
free_judge[j] <- (sum(judges_list[[j]][["Length.in.words"]]) < target_length)
}
}
# return list
judges_list
}
您的输入只需要 而不是 .例如,在本例中,返回df1
df <- rbind(df1, df1, df1)
sample_scripts(df1)
[[14]]
Item Length.in.words
1 12 817
2 245 389
3 372 649
4 413 731
5 329 698
6 193 492
7 405 731
8 376 593
9 297 539
10 356 774
11 125 492
12 359 716
13 472 611
14 128 397
[...]
将 s 替换为 s 返回data.frame
tibble
[[1]]
# A tibble: 42 × 2
Item Length.in.words
<int> <int>
1 99 738
2 15 483
3 341 763
4 333 702
5 407 797
6 376 593
7 356 774
8 4 535
9 53 739
10 354 570
# … with 32 more rows
# ℹ Use `print(n = ...)` to see more rows
[[2]]
# A tibble: 47 × 2
Item Length.in.words
<int> <int>
1 86 770
2 404 550
3 357 675
4 331 431
5 116 589
6 438 554
7 459 497
8 319 651
9 97 718
10 85 510
# … with 37 more rows
# ℹ Use `print(n = ...)` to see more rows
[...]
您对重复项的检查仍然会给我们lapply(X = judge_list, duplicated)
[[1]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
通过使用,您可以检查每个评委的字数是否相似。unlist(lapply(X = judge_list, \(x) sum(x$Length.in.words)))
评论
此处左侧字段的解决方案。我认为这可能被认为是一个NP难题,但当我读到它时,它让我印象深刻,这是一个受约束的离散优化问题 - 进行分配以最小化分配给每个法官的页数差异,为每个项目分配三名法官,并确保没有法官被分配相同的项目两次。
运行需要一段时间,但这里有一个使用遗传算法的概率解决方案
library(GA)
library(purrr)
library(dplyr)
library(tidyr)
df1 <- data.frame(
Items = sample(1:495, 495, replace = FALSE),
Length.in.words = sample(380:820, 495, replace = TRUE)
)
sample_scripts <- function(x, judges){
decode <- function(encX) {
# enumerate the three judges assigned to each text
judgesAssign <- floor(encX) |>
split(seq_along(x$Items))
# check for duplicates
dups <- map_lgl(judgesAssign, \(.x) any(duplicated(.x))) |> sum()
list(judgesAssign = judgesAssign, dups = dups)
}
fitness <- function(encX) {
decX <- decode(encX)
# enumerate the texts assigned to each judge
itemAssign <- decX$judgesAssign |>
as_tibble() |>
pivot_longer(
everything(),
names_to = "Items",
values_to = "judge"
) |>
# match types in df
mutate(Items = as.integer(Items))
# determine words and items assigned to each count
assignCounts <- itemAssign|>
left_join(x, by = "Items") |>
# ensure every judge is represented
bind_rows(
tibble(judge = seq(judges), Length.in.words = 0)
) |>
group_by(judge) |>
# subtract 1 for empty row
summarise(n_words = sum(Length.in.words), n_items = n() - 1)
# minimize variance
f <- -(var(assignCounts$n_words)/mean(assignCounts$n_words) +
var(assignCounts$n_items)/mean(assignCounts$n_items))
# penalty term for duplicates
pen <- sqrt(.Machine$double.xmax)
dupPen <- decX$dups*pen
# fitness function minimizes variance and heavy penalty for duplicates
f - dupPen
}
scriptCt <- NROW(x)
GA <- ga(
"real-valued",
fitness = fitness,
lower = rep(1, scriptCt*3), upper = rep(judges, scriptCt*3) + 1,
maxiter = 750, run = 200, seed = 123, pmutation = 0.25,
suggestions = sample(seq(judges), 3*scriptCt, TRUE),
maxFitness = 0
)
decSol <- decode(GA@solution[1, ])
finalAssignments <- decSol$judgesAssign |>
as_tibble() |>
pivot_longer(
everything(),
names_to = "Items",
values_to = "judge"
) |>
# match types in df
mutate(Items = as.integer(Items))
list(
gaRslt = GA,
fitness = fitness_tot(GA@solution),
decodedSol = decSol,
assignmentsByJudge = split(finalAssignments$Items, finalAssignments$judge),
assignmentsByItem = decSol$judgesAssign,
anyDuplicates = decSol$dups
)
}
sample_scripts(df1, 33)
它不能确保每个裁判都是每个裁判都有相同的数量,但它会进行优化以保持数量相对相同。当我运行它时,大多数评委都有 45 到 55 个项目,但也有很长的尾巴。我没有成功尝试将其限制为所有 45 个。您可以使用一些参数来解决问题,这可能会有所帮助,但我认为解决方案空间有很多局部最小值,因此可能很困难。Items
ga
如果您可以选择手动调整,则可以以此为起点,从评委那里拿走一些有很多项目,然后将它们分配给几乎没有的项目。不是全局最优,但可能非常接近。
感谢大家的建议。对每位法官的相同项目数的限制是严格的(我显然应该提到这一点)——比尽量减少每个法官每个文本的平均长度的限制更严格。我写了以下内容,它使用“贪婪”的方法对项目进行分组,以确保每个评委的名单中没有重复,给每个评委 45 个项目,并确保每个项目正好被三位评委看到。它不管理项目的平均长度,但它在分配之前随机化项目列表,所以我可以重新运行该过程几次,直到我很高兴范围足够低。
sample_scripts <- function(x, texts_per_judge){
x1 <- x
x1 <- x1[sample(1:nrow(x1)), ]
x2 <- x
x2 <- x2[sample(1:nrow(x2)), ]
x3 <- x
x3 <- x3[sample(1:nrow(x3)), ]
all <- rbind(x1, x2, x3)
all <- group(all, method = "greedy", n = texts_per_judge)
judge_list <- split(all, all$.groups)
assign("judge_mean_length", value = aggregate(Length.in.words ~ .groups, data = judge_df, FUN = "mean"), envir = globalenv())
assign("judge_list", judge_list, envir = globalenv())
}
我对这个解决方案非常满意,但我仍然很想听听是否有人能想出一个解决方案,在平均文本长度范围可控之前不需要重新运行该过程。
评论
rbind(