Asked by: rajvijay Asked: 5/18/2016 Last edited by: M--rajvijay Updated: 6/23/2023 Views: 39670
dplyr left_join by less than, greater than condition
Q:
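This question is somewhat related to the questions "Efficiently merging two data frames on a non-trivial criteria" and "Checking if date is between two dates in r", and to the one I posted here asking whether the feature exists: GitHub issue.
I am looking to join two data frames using dplyr::left_join(). The condition I use to join is less-than, greater-than, i.e. <= and >. Does dplyr::left_join() support this feature, or do the keys only take the = operator between them? This is straightforward to run from SQL (assuming I have the data frames in the database).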
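Here is a MWE: I have two datasets, one with firm-years (fdata), while the second is survey data that is collected every five years. So for all years in fdata that fall between two survey years, I join the corresponding survey-year data.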
library(dplyr)
id <- c(1,1,1,1,
2,2,2,2,2,2,
3,3,3,3,3,3,
5,5,5,5,
8,8,8,8,
13,13,13)
fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,
1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,
1998,1999,2000,2001,1998,1999,2000)
byear <- c(1990,1995,2000,2005)
eyear <- c(1995,2000,2005,2010)
val <- c(3,1,5,6)
sdata <- tbl_df(data.frame(byear, eyear, val))
fdata <- tbl_df(data.frame(id, fyear))
test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear"))
I get
Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds
Or can left_join handle this condition, and my syntax is just missing something?
A:
One option is to join row-wise as a list column, and then unnest that column:
# evaluate each row individually
fdata %>%
rowwise() %>%
# insert list column of single row of sdata based on conditions
mutate(s = list(sdata %>% filter(fyear >= byear, fyear < eyear))) %>%
# drop the row-wise grouping, then unnest the list column by name
ungroup() %>%
tidyr::unnest(s)
# Source: local data frame [27 x 5]
#
# id fyear byear eyear val
# (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1 1998 1995 2000 1
# 2 1 1999 1995 2000 1
# 3 1 2000 2000 2005 5
# 4 1 2001 2000 2005 5
# 5 2 1998 1995 2000 1
# 6 2 1999 1995 2000 1
# 7 2 2000 2000 2005 5
# 8 2 2001 2000 2005 5
# 9 2 2002 2000 2005 5
# 10 2 2003 2000 2005 5
# .. ... ... ... ... ...
Comments
LEFT JOIN
fyear==2011
fyear==2011
SELECT * FROM fdata LEFT JOIN sdata ON fyear >= year AND fyear < eyear
data.table added non-equi joins starting from v1.9.8:
library(data.table) #v>=1.9.8
setDT(sdata); setDT(fdata) # converting to data.table in place
fdata[sdata, on = .(fyear >= byear, fyear < eyear), nomatch = 0,
.(id, x.fyear, byear, eyear, val)]
# id x.fyear byear eyear val
# 1: 1 1998 1995 2000 1
# 2: 2 1998 1995 2000 1
# 3: 3 1998 1995 2000 1
# 4: 5 1998 1995 2000 1
# 5: 8 1998 1995 2000 1
# 6: 13 1998 1995 2000 1
# 7: 1 1999 1995 2000 1
# 8: 2 1999 1995 2000 1
# 9: 3 1999 1995 2000 1
#10: 5 1999 1995 2000 1
#11: 8 1999 1995 2000 1
#12: 13 1999 1995 2000 1
#13: 1 2000 2000 2005 5
#14: 2 2000 2000 2005 5
#15: 3 2000 2000 2005 5
#16: 5 2000 2000 2005 5
#17: 8 2000 2000 2005 5
#18: 13 2000 2000 2005 5
#19: 1 2001 2000 2005 5
#20: 2 2001 2000 2005 5
#21: 3 2001 2000 2005 5
#22: 5 2001 2000 2005 5
#23: 8 2001 2000 2005 5
#24: 2 2002 2000 2005 5
#25: 3 2002 2000 2005 5
#26: 2 2003 2000 2005 5
#27: 3 2003 2000 2005 5
# id x.fyear byear eyear val
You could also get this to work with foverlaps in 1.9.6, with a little more effort.
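The join shown in this answer uses nomatch = 0, so firm-years with no matching survey window are dropped (inner-join behaviour). As a minimal sketch of a left-join variant, the example below puts fdata in the i position and keeps unmatched rows with nomatch = NA. The small tables and the 2011 firm-year are assumptions made here for illustration; they are not part of the question's data.

```r
library(data.table)  # non-equi joins need data.table >= 1.9.8

# Survey windows from the question: [byear, eyear)
sdata <- data.table(byear = c(1990, 1995, 2000, 2005),
                    eyear = c(1995, 2000, 2005, 2010),
                    val   = c(3, 1, 5, 6))

# Smaller firm-year table; the 2011 row is hypothetical and matches no window
fdata <- data.table(id = c(1, 1, 2), fyear = c(1998, 2000, 2011))

# sdata[fdata, ...] looks up every fdata row in sdata.
# nomatch = NA keeps fdata rows without a match (left-join behaviour).
# The x. and i. prefixes pick the original columns of sdata and fdata.
res <- sdata[fdata,
             on = .(byear <= fyear, eyear > fyear),
             nomatch = NA,
             .(id, fyear = i.fyear, byear = x.byear, eyear = x.eyear, val)]

print(res)
```

The explicit column list in j is used because, without it, the join columns of a non-equi join are reported with the values from fdata rather than the window bounds.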
Comments
If someone wants to turn their dataset back into a plain data.frame, they can use setDF afterwards.
tidyr
dplyr
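As another answer points out, the original answer below is out of date. With newer versions of dplyr, you can simply use the following. (Note that this syntax also works with dbplyr.)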
fdata %>%
left_join(sdata,
join_by(fyear >= byear, fyear < eyear))
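As a self-contained check of the join_by() form, the sketch below rebuilds a smaller version of the question's data and adds one firm-year (2011) that falls outside every survey window. That extra row is an assumption made here, purely to show that left_join() keeps unmatched rows and fills them with NA. It assumes dplyr >= 1.1.0.

```r
library(dplyr)  # join_by() requires dplyr >= 1.1.0

# Survey windows from the question: [byear, eyear)
sdata <- tibble(
  byear = c(1990, 1995, 2000, 2005),
  eyear = c(1995, 2000, 2005, 2010),
  val   = c(3, 1, 5, 6)
)

# Smaller firm-year table; the 2011 row is hypothetical, no window covers it
fdata <- tibble(
  id    = c(1, 1, 1, 2),
  fyear = c(1998, 2000, 2004, 2011)
)

# Non-equi left join: every fdata row is kept
res <- left_join(fdata, sdata, join_by(fyear >= byear, fyear < eyear))

# The same condition written with the between() join helper
res2 <- left_join(fdata, sdata,
                  join_by(between(fyear, byear, eyear, bounds = "[)")))

# The 2011 row is present, with NA in byear, eyear and val
print(res)
```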
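When the original answer was written, there was no easy way to do this using dplyr.
Original answer
Use a filter. (But note that this answer does not produce a correct LEFT JOIN; but the MWE gives the right result with an INNER JOIN instead.)
The dplyr package isn't happy if asked to merge two tables without something to merge on, so in the following I create a dummy variable in both tables for this purpose, then filter, then drop dummy: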
fdata %>%
mutate(dummy=TRUE) %>%
left_join(sdata %>% mutate(dummy=TRUE)) %>%
filter(fyear >= byear, fyear < eyear) %>%
select(-dummy)
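To see why the dummy-and-filter approach behaves like an INNER JOIN, the sketch below runs it on a small table that includes a hypothetical 2011 firm-year (not in the question's data), then restores the dropped row with anti_join() and bind_rows(). The restoring step is one possible workaround and is not part of the original answer.

```r
library(dplyr)

# Survey windows from the question: [byear, eyear)
sdata <- tibble(
  byear = c(1990, 1995, 2000, 2005),
  eyear = c(1995, 2000, 2005, 2010),
  val   = c(3, 1, 5, 6)
)

# The 2011 row is hypothetical and falls outside every window
fdata <- tibble(id = c(1, 1, 2), fyear = c(1998, 2000, 2011))

# Over-inclusive join on a constant key, then filter.
# dplyr >= 1.1.0 warns about a many-to-many relationship here.
matched <- fdata %>%
  mutate(dummy = TRUE) %>%
  left_join(sdata %>% mutate(dummy = TRUE), by = "dummy") %>%
  filter(fyear >= byear, fyear < eyear) %>%
  select(-dummy)

# filter() removed the 2011 row, so find the fdata rows with no match ...
unmatched <- anti_join(fdata, matched, by = c("id", "fyear"))

# ... and add them back; bind_rows() fills byear, eyear and val with NA
result <- bind_rows(matched, unmatched) %>%
  arrange(id, fyear)

print(result)
```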
Note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy variable, as shown by the following two query explanations:
> fdata %>%
+ mutate(dummy=TRUE) %>%
+ left_join(sdata %>% mutate(dummy=TRUE)) %>%
+ filter(fyear >= byear, fyear < eyear) %>%
+ select(-dummy) %>%
+ explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"
LEFT JOIN
(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"
USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"
<PLAN>
Nested Loop (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)
Doing this more cleanly in SQL gives exactly the same result:
> tbl(pg, sql("
+ SELECT *
+ FROM fdata
+ LEFT JOIN sdata
+ ON fyear >= byear AND fyear < eyear")) %>%
+ explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
SELECT *
FROM fdata
LEFT JOIN sdata
ON fyear >= byear AND fyear < eyear) AS "zzz140"
<PLAN>
Nested Loop Left Join (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)
Comments
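This looks like the kind of task that the fuzzyjoin package addresses. Its various functions look and work much like the dplyr join functions.
In this case one of the fuzzy_*_join functions will work for you. The main difference between dplyr::left_join and fuzzyjoin::fuzzy_left_join is that you supply a list of functions to use in the matching process via the match_fun argument. Note that the by argument is still written the same way as in left_join.
Below is an example. The functions I used for matching are >= and <, for the fyear to byear comparison and the fyear to eyear comparison, respectively.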
library(fuzzyjoin)
fuzzy_left_join(fdata, sdata,
by = c("fyear" = "byear", "fyear" = "eyear"),
match_fun = list(`>=`, `<`))
Source: local data frame [27 x 5]
id fyear byear eyear val
(dbl) (dbl) (dbl) (dbl) (dbl)
1 1 1998 1995 2000 1
2 1 1999 1995 2000 1
3 1 2000 2000 2005 5
4 1 2001 2000 2005 5
5 2 1998 1995 2000 1
6 2 1999 1995 2000 1
7 2 2000 2000 2005 5
8 2 2001 2000 2005 5
9 2 2002 2000 2005 5
10 2 2003 2000 2005 5
.. ... ... ... ... ...
Comments
fyear >= byear-20
fyear < eyear+5
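dplyr v1.1.0 now includes the ability to perform non-equi joins like this, with syntax almost identical to what you tried. For data with many partial matches, this will perform much better than using fuzzyjoin, or a filter step after an over-inclusive join.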
# Relies on dplyr >=1.1.0, released Jan 2023
library(dplyr)
left_join(fdata, sdata, join_by(fyear >= byear, fyear < eyear))
Comments
join_by(between(fyear, byear, eyear, bounds = "[)"))
Comments
left_join(fdata, sdata, join_by(fyear >= byear,fyear < eyear))