提问人:Andrew 提问时间:10/13/2023 最后编辑:Andrew 更新时间:10/13/2023 访问量:26
不同长度的数据帧之间的条件匹配(如果 df1 中的两个值在 df2 中,则从 df2 中获取一个值)
Conditional matching between data frames of different length (if two values from df1 are in df2, take a value from df2)
问:
我有两个数据帧,df1 比 df2 短。两个数据帧之间有两个值重叠(重复)。按行,如果 df1 中的“Listener”与 df2 中的“Listener”行匹配,而 df1 中的“Trial”与 df2 中的“Trial”行匹配,则从 df2 中获取“PercCor”并将其放入 df1 中。
向量相当长(df1 是 147200 行,df2 是 510000),所以这是我的一些麻烦。
我希望这里有一个示例数据集,可能有助于阐明这个问题。
df1 <- data.frame(Listener = c("01JA","01JA","02JK","02JK", "03GL", "03GL"), Trial = c("Trial 1", "Trial 2", "Trial 1", "Trial 2", "Trial 1", "Trial 2"),
Pupil = c(1,2,4,3,2,6), Time = c(20, 40, 20, 40, 20, 40), percCor = c(NA, NA, NA, NA, NA, NA))
df2 <- data.frame(Listener = c("01JA","01JA","01JA","01JA", "01JA", "02JK","02JK","02JK","02JK", "02JK", "03GL", "03GL","03GL", "03GL", "03GL"),
Trial = c("Trial 1", "Trial 1", "Trial 2", "Trial 2", "Trial 3","Trial 1", "Trial 1", "Trial 2", "Trial 2", "Trial 3","Trial 1", "Trial 1", "Trial 2", "Trial 2", "Trial 3"),
Pupil = c(1,1,1,2,4,3,2,6,2,5,6,2,4,5,6), Time = c(50, 100, 150, 50, 100, 150,50, 100, 150, 50, 100, 150, 50, 100, 150), percCor = c(100, 100, 20, 20, 60, 60, 60, 30, 30, 40, 60, 60, 40, 40, 55))
我正在寻找的答案是:
dfans <- data.framedf1 <- data.frame(Listener = c("01JA","01JA","02JK","02JK", "03GL", "03GL"), Trial = c("Trial 1", "Trial 2", "Trial 1", "Trial 2", "Trial 1", "Trial 2"),
Pupil = c(1,2,4,3,2,6), Time = c(20, 40, 20, 40, 20, 40), percCor = c(100, 20, 60, 30, 60, 40))
我尝试了以下解决方案,但每种解决方案都遇到了问题。我似乎缺乏同时评估两个表达式的知识。我也尝试了 data.table 解决方案,但语法一直出错 - 我对它不是很熟悉:(
mdf <- merge(x=df1,y=df2, by.x=c("Listener","Trial"), by.y=c("Listener","Trial"))
#not enough RAM
mdf<-left_join(df1, df2, by= c('Listener','Trial'))%>%
mutate(percCor=percCor.y)%>%
select(-c(percCor.x,percCor.y))
#not enough RAM
mdf <-left_join(df1, df2)
#doesn't capture percCor
mdf <-right_join(df1, df2)
#keeps unnecessary trials/data
for(id in 1:nrow(df2)){
df1$percCor[df1$Listener %in% df2$Listener[id]] <- df2$percCor[id]
}
#doesn't evaluate both Trial and Listener and is slow.
for(id in 1:nrow(df2)){
df1$percCor[df1$Listener %in% df2$Listener[id] & df1$Trial %in% df2$Trial] <- df2$percCor[id]
}
#not entirely sure why it seems to not evaluate both expressions at the same time. It only takes the listener's last percCor value for all trials. It's also incredibly slow.
帮助?这感觉像是我应该知道如何做的事情......但我对 R 非常自学成才,老实说,尽管我已经在一些项目中使用了它,但我还是一个新手。我猜这不会那么困难 - 我花了半周时间查看 stackoverflow 答案 - 所以如果这很明显,我深表歉意。
感谢您的观看!
答:
听起来您明确忽略了生成多行的某些字段。例如,仅查看 的第一行,我们看到:df2
df1
merge(df1[1,], df2, by = c("Listener", "Trial"), all.x = TRUE, suffix = c("", ".y"))
# Listener Trial Pupil Time percCor Pupil.y Time.y percCor.y
# 1 01JA Trial 1 1 20 NA 1 50 100
# 2 01JA Trial 1 1 20 NA 1 100 100
看看这个(和其他的),是不变的,表明你不在乎 或 ,所以我们可以在去那里之前减少。percCor.y
Pupil
Time
df2
也许:
merge(df1, unique(df2[, c("Listener", "Trial", "percCor")]), by = c("Listener", "Trial"), all.x = TRUE, suffix = c("", ".y")) |>
transform(percCor = ifelse(is.na(percCor), percCor.y, percCor)) |>
subset(select = -percCor.y)
# Listener Trial Pupil Time percCor
# 1 01JA Trial 1 1 20 100
# 2 01JA Trial 2 2 40 20
# 3 02JK Trial 1 4 20 60
# 4 02JK Trial 2 3 40 30
# 5 03GL Trial 1 2 20 60
# 6 03GL Trial 2 6 40 40
我假设在连接之前可能具有有效值,需要合并/-ing值。如果不是这种情况,则在联接之前将其删除,并且不需要(或更低)。percCor
ifelse
ifelse
coalesce
变种:
### dplyr
library(dplyr)
df2 %>%
distinct(Listener, Trial, percCor) %>%
right_join(df1, by = c("Listener", "Trial"), suffix = c(".y", "")) %>%
mutate(percCor = coalesce(percCor, percCor.y)) %>%
select(-percCor.y)
# Listener Trial Pupil Time percCor
# 1 01JA Trial 1 1 20 100
# 2 01JA Trial 2 2 40 20
# 3 02JK Trial 1 4 20 60
# 4 02JK Trial 2 3 40 30
# 5 03GL Trial 1 2 20 60
# 6 03GL Trial 2 6 40 40
### data.table
library(data.table)
setDT(df1)
setDT(df2)
df1[unique(df2, by = c("Listener", "Trial", "percCor")),
percCor.y := i.percCor, on = .(Listener, Trial)
][, percCor := fcoalesce(as.numeric(percCor), as.numeric(percCor.y))
][, percCor.y := NULL][]
# Listener Trial Pupil Time percCor
# <char> <char> <num> <num> <num>
# 1: 01JA Trial 1 1 20 100
# 2: 01JA Trial 2 2 40 20
# 3: 02JK Trial 1 4 20 60
# 4: 02JK Trial 2 3 40 30
# 5: 03GL Trial 1 2 20 60
# 6: 03GL Trial 2 6 40 40
使用是因为对班级安全非常严格,这一直做得很差,最近放松了(恕我直言,这不是一件好事)。as.numeric(.)
fcoalesce
ifelse
dplyr::if_else
关于以下事项的说明:如果您从未先填写过,那么您可以在初始加入中执行并跳过和清理步骤。data.table
percCor
df1
percCor := i.percCor
fcoalesce
评论
df1
df2
01JA
Trial 1
percCor
df1
Listener
Trial
df2
PercCor
df2
df2
Listener
Trial
data.table