不同长度的数据帧之间的条件匹配(如果 df1 中的两个值在 df2 中,则从 df2 中获取一个值)

Conditional matching between data frames of different length (if two values from df1 are in df2, take a value from df2)

提问人:Andrew 提问时间:10/13/2023 最后编辑:Andrew 更新时间:10/13/2023 访问量:26

问:

我有两个数据帧,df1 比 df2 短。两个数据帧之间有两个值重叠(重复)。按行,如果 df1 中的“Listener”与 df2 中的“Listener”行匹配, df1 中的“Trial”与 df2 中的“Trial”行匹配,则从 df2 中获取“PercCor”并将其放入 df1 中。

向量相当长(df1 是 147200 行,df2 是 510000),所以这是我的一些麻烦。

我希望这里有一个示例数据集,可能有助于阐明这个问题。

df1 <- data.frame(Listener = c("01JA","01JA","02JK","02JK", "03GL", "03GL"), Trial = c("Trial 1", "Trial 2", "Trial 1", "Trial 2", "Trial 1", "Trial 2"),
                  Pupil = c(1,2,4,3,2,6), Time = c(20, 40, 20, 40, 20, 40), percCor = c(NA, NA, NA, NA, NA, NA))
df2 <- data.frame(Listener = c("01JA","01JA","01JA","01JA", "01JA", "02JK","02JK","02JK","02JK", "02JK", "03GL", "03GL","03GL", "03GL", "03GL"), 
                 Trial = c("Trial 1", "Trial 1", "Trial 2", "Trial 2", "Trial 3","Trial 1", "Trial 1", "Trial 2", "Trial 2", "Trial 3","Trial 1", "Trial 1", "Trial 2", "Trial 2", "Trial 3"),
                 Pupil = c(1,1,1,2,4,3,2,6,2,5,6,2,4,5,6), Time = c(50, 100, 150, 50, 100, 150,50, 100, 150, 50, 100, 150, 50, 100, 150), percCor = c(100, 100, 20, 20, 60, 60, 60, 30, 30, 40, 60, 60, 40, 40, 55))

我正在寻找的答案是:

dfans <- data.framedf1 <- data.frame(Listener = c("01JA","01JA","02JK","02JK", "03GL", "03GL"), Trial = c("Trial 1", "Trial 2", "Trial 1", "Trial 2", "Trial 1", "Trial 2"),
                                     Pupil = c(1,2,4,3,2,6), Time = c(20, 40, 20, 40, 20, 40), percCor = c(100, 20, 60, 30, 60, 40))

我尝试了以下解决方案,但每种解决方案都遇到了问题。我似乎缺乏同时评估两个表达式的知识。我也尝试了 data.table 解决方案,但语法一直出错 - 我对它不是很熟悉:(

mdf <- merge(x=df1,y=df2, by.x=c("Listener","Trial"), by.y=c("Listener","Trial"))
#not enough RAM

mdf<-left_join(df1, df2, by= c('Listener','Trial'))%>%
   mutate(percCor=percCor.y)%>%
   select(-c(percCor.x,percCor.y))
#not enough RAM

mdf <-left_join(df1, df2)
#doesn't capture percCor

mdf <-right_join(df1, df2)
#keeps unnecessary trials/data

for(id in 1:nrow(df2)){
  df1$percCor[df1$Listener %in% df2$Listener[id]] <- df2$percCor[id]
}
#doesn't evaluate both Trial and Listener and is slow. 

for(id in 1:nrow(df2)){
  df1$percCor[df1$Listener %in% df2$Listener[id] & df1$Trial %in% df2$Trial] <- df2$percCor[id]
}
#not entirely sure why it seems to not evaluate both expressions at the same time. It only takes the listener's last percCor value for all trials. It's also incredibly slow. 

帮助?这感觉像是我应该知道如何做的事情......但我对 R 非常自学成才,老实说,尽管我已经在一些项目中使用了它,但我还是一个新手。我猜这不会那么困难 - 我花了半周时间查看 stackoverflow 答案 - 所以如果这很明显,我深表歉意。

感谢您的观看!

R 循环合并 条件语句 匹配

评论

1赞 r2evans 10/13/2023
这是一个合并/联接操作,请参阅 stackoverflow.com/q/1299871/3358272stackoverflow.com/q/5706437/3358272、(data.table) stackoverflow.com/q/34598139/3358272但是,这会为每个 in 生成 1+ 行: 例如,在 中,有两行匹配 和 ,应该使用它的两行中的哪一行?最大/最小/第一个/最后一个/随机?df1df201JATrial 1percCor
0赞 PGSA 10/13/2023
欢迎来到 SO!只是为了弄清楚逻辑:如果 中的一行有一个值和一个值都与同一行 in 匹配,则从该行中获取 - 对吗?有重复的可能性吗?(>1 行匹配 和 的组合?df1ListenerTrialdf2PercCordf2df2ListenerTrial
0赞 r2evans 10/13/2023
仅供参考,“内存不足”......也许是受限内存操作的唯一本地选项,但您需要解决前两条注释中提到的重复性质。data.table
0赞 Andrew 10/13/2023
嗨 - 谢谢!是的,“df2”将有多行“01JA”和“试用版 1”,这些行将与“01JA”和“试用版 2”的“df1”中的多行匹配。我应该说得更清楚:“df2”中“01JA”和“Trial 2”的所有实例将始终具有相同的 percCor 值,因此 min/max/first/last/random 都将是相同的值。“df2”有 17 个“侦听器”实例,每个侦听器有 80 个“试用”实例。两者的独特组合将始终具有相同的“percCor”值。我希望这很清楚:) @r2evans

答:

0赞 r2evans 10/13/2023 #1

听起来您明确忽略了生成多行的某些字段。例如,仅查看 的第一行,我们看到:df2df1

merge(df1[1,], df2, by = c("Listener", "Trial"), all.x = TRUE, suffix = c("", ".y"))
#   Listener   Trial Pupil Time percCor Pupil.y Time.y percCor.y
# 1     01JA Trial 1     1   20      NA       1     50       100
# 2     01JA Trial 1     1   20      NA       1    100       100

看看这个(和其他的),是不变的,表明你不在乎 或 ,所以我们可以在去那里之前减少。percCor.yPupilTimedf2

也许:

merge(df1, unique(df2[, c("Listener", "Trial", "percCor")]), by = c("Listener", "Trial"), all.x = TRUE, suffix = c("", ".y")) |>
  transform(percCor = ifelse(is.na(percCor), percCor.y, percCor)) |>
  subset(select = -percCor.y)
#   Listener   Trial Pupil Time percCor
# 1     01JA Trial 1     1   20     100
# 2     01JA Trial 2     2   40      20
# 3     02JK Trial 1     4   20      60
# 4     02JK Trial 2     3   40      30
# 5     03GL Trial 1     2   20      60
# 6     03GL Trial 2     6   40      40

我假设在连接之前可能具有有效值,需要合并/-ing值。如果不是这种情况,则在联接之前将其删除,并且不需要(或更低)。percCorifelseifelsecoalesce

变种:

### dplyr
library(dplyr)
df2 %>%
  distinct(Listener, Trial, percCor) %>%
  right_join(df1, by = c("Listener", "Trial"), suffix = c(".y", "")) %>%
  mutate(percCor = coalesce(percCor, percCor.y)) %>%
  select(-percCor.y)
#   Listener   Trial Pupil Time percCor
# 1     01JA Trial 1     1   20     100
# 2     01JA Trial 2     2   40      20
# 3     02JK Trial 1     4   20      60
# 4     02JK Trial 2     3   40      30
# 5     03GL Trial 1     2   20      60
# 6     03GL Trial 2     6   40      40

### data.table
library(data.table)
setDT(df1)
setDT(df2)
df1[unique(df2, by = c("Listener", "Trial", "percCor")),
    percCor.y := i.percCor, on = .(Listener, Trial)
  ][, percCor := fcoalesce(as.numeric(percCor), as.numeric(percCor.y))
  ][, percCor.y := NULL][]
#    Listener   Trial Pupil  Time percCor
#      <char>  <char> <num> <num>   <num>
# 1:     01JA Trial 1     1    20     100
# 2:     01JA Trial 2     2    40      20
# 3:     02JK Trial 1     4    20      60
# 4:     02JK Trial 2     3    40      30
# 5:     03GL Trial 1     2    20      60
# 6:     03GL Trial 2     6    40      40

使用是因为对班级安全非常严格,这一直做得很差,最近放松了(恕我直言,这不是一件好事)。as.numeric(.)fcoalesceifelsedplyr::if_else

关于以下事项的说明:如果您从未先填写过,那么您可以在初始加入中执行并跳过和清理步骤。data.tablepercCordf1percCor := i.percCorfcoalesce

评论

1赞 Andrew 10/13/2023
谢谢!我现在在上班,一个病人刚刚出现......所以我必须稍后尝试一下。但我真的很感谢你的帮助!:)
1赞 Andrew 10/13/2023
美丽,非常感谢。data.table 与我更大的数据集配合得非常出色,并且快速高效。我真的很感谢你帮助我!@r2evans