Overlapping join with dplyr when reference tibble contains NA / missing values

Asked by: Nova Asked: 9/27/2023 Updated: 9/27/2023 Views: 23

Q:

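I want to join two data frames: the reference data, refdata, which describes when a person was wearing a collar; and movedata, which describes where a collar was at a given time. Collars can move between people, and if a collar is still being worn, timeend is not defined (NA) for that row.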

Here is some example data and one of the joins I have tried:

library(dplyr)

refdata <- tibble(person = c("Jess", "Andy", "Cody"),
                  collar = c("a", "b", "a"),
                  timestart = c(1, 2, 5),
                  timeend = c(3, 4, NA))

movedata <- tibble(collar = c("a", "a", "a", "a", "a", "a", "a",
                              "b", "b", "b", "b", "b", "b", "b", "c", "c", "c"),
                   time = c(1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3),
                   location = c("skatepark", "home", "library", "street", "crib", "crib", "bassinet",
                                "street", "work", "work", "home", "street", "street", "street",
                                "street", "street", "street"))

by <- join_by(collar,
              between(time, timestart, timeend))
check <- full_join(movedata, refdata, by)

# View input data
refdata
# A tibble: 3 × 4
  person collar timestart timeend
  <chr>  <chr>      <dbl>   <dbl>
1 Jess   a              1       3
2 Andy   b              2       4
3 Cody   a              5      NA

movedata
# A tibble: 17 × 3
   collar  time location 
   <chr>  <dbl> <chr>    
 1 a          1 skatepark
 2 a          2 home     
 3 a          3 library  
 4 a          4 street   
 5 a          5 crib     
 6 a          6 crib     
 7 a          7 bassinet 
 8 b          1 street   
 9 b          2 work     
10 b          3 work     
11 b          4 home     
12 b          5 street   
13 b          6 street   
14 b          7 street   
15 c          1 street   
16 c          2 street   
17 c          3 street

# result of join (not desired output):
check 
# A tibble: 18 × 6
   collar  time location  person timestart timeend
   <chr>  <dbl> <chr>     <chr>      <dbl>   <dbl>
 1 a          1 skatepark Jess           1       3
 2 a          2 home      Jess           1       3
 3 a          3 library   Jess           1       3
 4 a          4 street    NA            NA      NA
 5 a          5 crib      NA            NA      NA
 6 a          6 crib      NA            NA      NA
 7 a          7 bassinet  NA            NA      NA
 8 b          1 street    NA            NA      NA
 9 b          2 work      Andy           2       4
10 b          3 work      Andy           2       4
11 b          4 home      Andy           2       4
12 b          5 street    NA            NA      NA
13 b          6 street    NA            NA      NA
14 b          7 street    NA            NA      NA
15 c          1 street    NA            NA      NA
16 c          2 street    NA            NA      NA
17 c          3 street    NA            NA      NA
18 a         NA NA        Cody           5      NA

But the result I want should show that Cody has put the collar on and is still wearing it:

# Expected result
# A tibble: 17 × 6
   collar  time location  person timestart timeend
   <chr>  <dbl> <chr>     <chr>      <dbl>   <dbl>
 1 a          1 skatepark Jess           1       3
 2 a          2 home      Jess           1       3
 3 a          3 library   Jess           1       3
 4 a          4 street    NA             5      NA
 5 a          5 crib      Cody           5      NA
 6 a          6 crib      Cody           5      NA
 7 a          7 bassinet  Cody           5      NA
 8 b          1 street    NA            NA      NA
 9 b          2 work      Andy           2       4
10 b          3 work      Andy           2       4
11 b          4 home      Andy           2       4
12 b          5 street    NA            NA      NA
13 b          6 street    NA            NA      NA
14 b          7 street    NA            NA      NA
15 c          1 street    NA            NA      NA
16 c          2 street    NA            NA      NA
17 c          3 street    NA            NA      NA

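I am interested in two solutions: one that creates the expected table above, and another that drops any rows that are not associated with a person (so rows 8 and 12-17 above).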

join dplyr na overlap matching

A:

2 votes Maël 9/27/2023 #1

One way is to replace the NA values in timeend with Inf:

refdata$timeend <- replace(refdata$timeend, is.na(refdata$timeend), Inf)
by <- join_by(collar,
              between(time, timestart, timeend))
full_join(movedata, refdata, by)

# # A tibble: 17 × 6
#    collar  time location  person timestart timeend
#    <chr>  <dbl> <chr>     <chr>      <dbl>   <dbl>
#  1 a          1 skatepark Jess           1       3
#  2 a          2 home      Jess           1       3
#  3 a          3 library   Jess           1       3
#  4 a          4 street    NA            NA      NA
#  5 a          5 crib      Cody           5     Inf
#  6 a          6 crib      Cody           5     Inf
#  7 a          7 bassinet  Cody           5     Inf
#  8 b          1 street    NA            NA      NA
#  9 b          2 work      Andy           2       4
# 10 b          3 work      Andy           2       4
# 11 b          4 home      Andy           2       4
# 12 b          5 street    NA            NA      NA
# 13 b          6 street    NA            NA      NA
# 14 b          7 street    NA            NA      NA
# 15 c          1 street    NA            NA      NA
# 16 c          2 street    NA            NA      NA
# 17 c          3 street    NA            NA      NA

For the second expected output, use inner_join:

inner_join(movedata, refdata, by)

# # A tibble: 9 × 6
#   collar  time location  person timestart timeend
#   <chr>  <dbl> <chr>     <chr>      <dbl>   <dbl>
# 1 a          1 skatepark Jess           1       3
# 2 a          2 home      Jess           1       3
# 3 a          3 library   Jess           1       3
# 4 a          5 crib      Cody           5     Inf
# 5 a          6 crib      Cody           5     Inf
# 6 a          7 bassinet  Cody           5     Inf
# 7 b          2 work      Andy           2       4
# 8 b          3 work      Andy           2       4
# 9 b          4 home      Andy           2       4
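
A short sketch of why the replacement works, assuming that between(time, timestart, timeend) behaves like time >= timestart & time <= timeend. In R a comparison against NA is itself NA rather than TRUE, which is consistent with Cody's row never matching in the question's output. The names na_bound and inf_bound below are only illustrative:

```r
# A comparison against a missing upper bound is NA, so it can never count as a match
na_bound <- (5 >= 5) & (5 <= NA)
na_bound

# With Inf as the upper bound, every finite time at or after timestart satisfies the condition
inf_bound <- (5 >= 5) & (5 <= Inf)
inf_bound
```

Replacing NA with Inf therefore turns "no end recorded" into "no upper limit" for the purposes of the join.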

Comments

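0 votes Nova 9/30/2023
Yes! This does both tasks correctly. That was easy - I do wonder whether there is any way to specify this in the join itself, so that I don't have to change my data frame.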
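
Regarding the follow-up question, here is one possible sketch rather than a confirmed feature. As far as I know, join_by() has no option that treats a missing bound as open-ended, so this variant fills the bound on a temporary copy and restores NA afterwards. The names refdata_open and result are hypothetical, and the data is the example data from the question:

```r
library(dplyr)

refdata <- tibble(person = c("Jess", "Andy", "Cody"),
                  collar = c("a", "b", "a"),
                  timestart = c(1, 2, 5),
                  timeend = c(3, 4, NA))

movedata <- tibble(collar = rep(c("a", "b", "c"), times = c(7, 7, 3)),
                   time = as.numeric(c(1:7, 1:7, 1:3)),
                   location = c("skatepark", "home", "library", "street", "crib", "crib", "bassinet",
                                "street", "work", "work", "home", "street", "street", "street",
                                "street", "street", "street"))

by <- join_by(collar, between(time, timestart, timeend))

# Fill open-ended intervals on a temporary copy; refdata itself is left unchanged
refdata_open <- mutate(refdata, timeend = coalesce(timeend, Inf))

result <- full_join(movedata, refdata_open, by) %>%
  # Turn Inf back into NA so the output still reads as "collar still worn"
  mutate(timeend = na_if(timeend, Inf))

result
```

Swapping full_join for inner_join in the same pipeline should give the second variant, with the rows that have no person dropped.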