Finding Overlapping Dates for Matching Brands in R

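Asked by: Niklas Schröder Asked: 3/23/2023 Last edited by: Niklas Schröder Updated: 3/23/2023 Views: 59

Q:

I have an R data frame and am trying to find rows where the entries in column data_recalls$Attributed_Brand match and the time periods overlap. The time periods are given by the columns Before_Date_Recall and After_Date_Recall. A possible match looks like this: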

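(1) data_recalls$Attributed_Brand: Nike; Before_Date_Recall: 2018-09-22; After_Date_Recall: 2018-10-21

(2) data_recalls$Attributed_Brand: Nike; Before_Date_Recall: 2018-10-19; After_Date_Recall: 2018-10-24

Unfortunately, the only solution I could come up with looks for matches in column data_recalls$Attributed_Brand first and then looks for overlapping dates, which does not produce the desired result. My data frame looks like this: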

data_recalls <- data.frame(Attributed_Brand = c("Nike", "Adidas", "Nike", "Puma"),
                       Before_Date_Recall = c("2018-09-22", "2018-09-20", "2018-10-19", "2018-11-01"),
                       After_Date_Recall = c("2018-10-21", "2018-10-24", "2018-10-31", "2018-11-10"),
                       stringsAsFactors = FALSE)

Thanks for any suggestions!

r loops overlapping-matches

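Comments

2 upvotes r2evans 3/23/2023
This sounds like a non-equi merge, using merges of data.table (always), dplyr::join_by (dplyr >= 1.1.0), fuzzyjoin::, or sqldf::. Lacking any sample data or anything else, that's about the most I think I can suggest. See stackoverflow.com/q/64362881/3358272, stackoverflow.com/q/64539945/3358272. If you need more help here, I suggest making this a reproducible question with sample data and code; please read stackoverflow.com/q/5963269, minimal reproducible example, and stackoverflow.com/tags/r/info.
0 upvotes Niklas Schröder 3/23/2023
Thanks for the reply! I have added some sample data. The code I tried is: library(dplyr) data_recalls %>% group_by(Attributed_Brand) %>% filter( any(Before_Date_Recall <= max(Before_Date_Recall)) & any(After_Date_Recall >= min(After_Date_Recall))) However, with it I could not find any overlapping dates where the brands in data_recalls$Attributed_Brand match.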
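
For reference, a minimal sketch of the non-equi self-join suggested in the first comment, assuming dplyr >= 1.1.0 and the sample data from the question with the date columns converted to Date. The names recalls, id, id.other and overlapping are made up for illustration. Note that the filter() attempt in the comment above compares each value with the maximum or minimum of its own group, so both any() conditions are true for every non-empty group and no row is ever dropped.

```r
library(dplyr, warn.conflicts = FALSE)

# Sample data from the question; converting to Date is an assumption
data_recalls <- data.frame(
  Attributed_Brand = c("Nike", "Adidas", "Nike", "Puma"),
  Before_Date_Recall = as.Date(c("2018-09-22", "2018-09-20", "2018-10-19", "2018-11-01")),
  After_Date_Recall = as.Date(c("2018-10-21", "2018-10-24", "2018-10-31", "2018-11-10"))
)

# A row id (hypothetical helper column) lets us discard self-matches
recalls <- data_recalls %>% mutate(id = row_number())

overlapping <- recalls %>%
  inner_join(
    recalls,
    by = join_by(
      Attributed_Brand,
      # inclusive bounds: sharing a single day counts as an overlap
      overlaps(
        x$Before_Date_Recall, x$After_Date_Recall,
        y$Before_Date_Recall, y$After_Date_Recall,
        bounds = "[]"
      )
    ),
    suffix = c("", ".other")
  ) %>%
  filter(id != id.other) %>%  # every row overlaps itself, so drop those pairs
  distinct(id, Attributed_Brand, Before_Date_Recall, After_Date_Recall)

print(overlapping)
```

With this sample data only the two Nike rows remain, since Adidas and Puma each appear once.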

A:

0 upvotes Davis Vaughan 3/23/2023 #1

It sounds like the ivs package would work for you. The group column lets you identify overlapping rows within each brand name.

library(ivs)
library(dplyr, warn.conflicts = FALSE)

data_recalls <- tibble(
  Attributed_Brand = c(
    "Nike", "Nike", "Nike", 
    "Adidas", "Adidas", "Adidas", "Adidas"
  ),
  Before_Date_Recall = as.Date(c(
    "2018-09-22", "2018-10-19", "2018-10-26",
    "2018-09-01", "2018-10-05", "2018-09-03", "2018-10-02"
  )),
  After_Date_Recall = as.Date(c(
    "2018-10-21", "2018-10-24", "2018-10-30",
    "2018-09-05", "2018-10-15", "2018-09-07", "2018-10-10"
  ))
)

data_recalls %>%
  mutate(Date_Recall = iv(Before_Date_Recall, After_Date_Recall), .keep = "unused") %>%
  mutate(group = iv_identify_group(Date_Recall), .by = Attributed_Brand) %>%
  filter(n() > 1, .by = c(Attributed_Brand, group))
#> # A tibble: 6 × 3
#>   Attributed_Brand              Date_Recall                    group
#>   <chr>                          <iv<date>>               <iv<date>>
#> 1 Nike             [2018-09-22, 2018-10-21) [2018-09-22, 2018-10-24)
#> 2 Nike             [2018-10-19, 2018-10-24) [2018-09-22, 2018-10-24)
#> 3 Adidas           [2018-09-01, 2018-09-05) [2018-09-01, 2018-09-07)
#> 4 Adidas           [2018-10-05, 2018-10-15) [2018-10-02, 2018-10-15)
#> 5 Adidas           [2018-09-03, 2018-09-07) [2018-09-01, 2018-09-07)
#> 6 Adidas           [2018-10-02, 2018-10-10) [2018-10-02, 2018-10-15)

Here is another alternative that, depending on the structure of your actual dataset, may be faster:

data_recalls %>%
  mutate(Date_Recall = iv(Before_Date_Recall, After_Date_Recall), .keep = "unused") %>%
  mutate(count = iv_count_overlaps(Date_Recall, Date_Recall), .by = Attributed_Brand) %>%
  filter(count > 1)
#> # A tibble: 6 × 3
#>   Attributed_Brand              Date_Recall count
#>   <chr>                          <iv<date>> <int>
#> 1 Nike             [2018-09-22, 2018-10-21)     2
#> 2 Nike             [2018-10-19, 2018-10-24)     2
#> 3 Adidas           [2018-09-01, 2018-09-05)     2
#> 4 Adidas           [2018-10-05, 2018-10-15)     2
#> 5 Adidas           [2018-09-03, 2018-09-07)     2
#> 6 Adidas           [2018-10-02, 2018-10-10)     2
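
One detail worth checking against the real data: iv() builds half-open intervals [start, end), as the printed output above shows, so two recalls that only touch (one ends on the day the next begins) are not counted as overlapping. The sketch below illustrates this with two made-up intervals and, assuming the end dates are meant to be inclusive, shifts the end by one day. The variable names are illustrative and not part of the answer.

```r
library(ivs)

# Two hypothetical intervals that share only the boundary date 2018-10-05
starts <- as.Date(c("2018-10-01", "2018-10-05"))
ends   <- as.Date(c("2018-10-05", "2018-10-10"))

# Half-open [start, end): the shared boundary day is not part of the first interval
half_open <- iv(starts, ends)
touching_counts <- iv_count_overlaps(half_open, half_open)
print(touching_counts)  # each interval overlaps only itself

# Assumption: end dates are inclusive, so extend each end by one day
closed_as_half_open <- iv(starts, ends + 1)
inclusive_counts <- iv_count_overlaps(closed_as_half_open, closed_as_half_open)
print(inclusive_counts)  # now each interval also overlaps the other
```

Whether to add the extra day depends on how Before_Date_Recall and After_Date_Recall are defined in the real data.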

Comments

0 upvotes Niklas Schröder 3/23/2023
Thanks for the reply! Based on your sample dataset, my preferred matched dataset would look like this: df_matches <- data.frame( Attributed_Brand = c("Adidas", "Adidas", "Adidas", "Adidas", "Nike", "Nike"), Before_Date_Recall = as.Date(c("2018-10-05", "2018-10-02", "2018-09-03", "2018-09-01", "2018-10-19", "2018-09-22")), After_Date_Recall = as.Date(c("2018-10-15", "2018-10-10", "2018-09-07", "2018-09-05", "2018-10-24", "2018-10-21"))) The third row of the data would not be a match, because the interval from 2018-10-26 to 2018-10-30 does not overlap with any other Nike dates.
0 upvotes Davis Vaughan 3/23/2023
I have updated the answer with an additional filter() that should produce what you are looking for.

0 upvotes r2evans 3/23/2023 #2

Try this:

sqldf::sqldf("
  select distinct t1.*
  from data_recalls t1
    inner join data_recalls t2 on t1.Attributed_Brand = t2.Attributed_Brand
      and ((t1.Before_Date_Recall > t2.Before_Date_Recall and t1.Before_Date_Recall < t2.After_Date_Recall)
           or (t1.After_Date_Recall > t2.Before_Date_Recall and t1.After_Date_Recall < t2.After_Date_Recall))")
#   Attributed_Brand Before_Date_Recall After_Date_Recall
# 1             Nike         2018-09-22        2018-10-21
# 2             Nike         2018-10-19        2018-10-31

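I first tried using SQL's between, but it is inclusive, which causes every row to match itself. We can prevent that by adding a few more pieces, but this is a start.

To allow inclusive overlaps without rows matching themselves, your data needs a unique ID for each row: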

data_recalls$id <- seq_len(nrow(data_recalls))

From there, we can use between and then add an inequality on the id field.

sqldf::sqldf("
  select distinct t1.*
  from data_recalls t1
    inner join data_recalls t2 on t1.Attributed_Brand = t2.Attributed_Brand
      and (t1.Before_Date_Recall between t2.Before_Date_Recall and t2.After_Date_Recall
           or t1.After_Date_Recall between t2.Before_Date_Recall and t2.After_Date_Recall)
      and t1.id <> t2.id")
#   Attributed_Brand Before_Date_Recall After_Date_Recall id
# 1             Nike         2018-09-22        2018-10-21  1
# 2             Nike         2018-10-19        2018-10-31  3

Comments

0 upvotes Niklas Schröder 3/23/2023
The code works for the sample data provided. However, it does not work if I have a data frame like the following: data_recalls <- data.frame(Attributed_Brand = c("Adidas", "Adidas", "Nike", "Nike"), Before_Date_Recall = c("2018-09-22", "2018-09-20", "2018-11-02", "2018-11-03"), After_Date_Recall = c("2018-10-21", "2018-10-24", "2018-11-10", "2018-11-09"), stringsAsFactors = FALSE) In this case, the result should look like the data frame itself, since the brand names match and the dates overlap.
0 upvotes r2evans 3/23/2023
Maybe stackoverflow.com/a/74679562/3358272
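
The data frame in the comment above is a containment case: in each brand, one interval lies entirely inside the other, so neither endpoint of the outer interval falls between the endpoints of the inner one, and the outer row is missed by a test that only checks endpoints. A minimal base R sketch of the usual two-sided overlap test (start of one <= end of the other, and end of one >= start of the other) is shown below. It assumes inclusive end dates and converts the columns to Date; the names pairs, is_overlap, matching_ids and result are illustrative.

```r
# Data from the comment above; converting to Date is an assumption
data_recalls <- data.frame(
  Attributed_Brand = c("Adidas", "Adidas", "Nike", "Nike"),
  Before_Date_Recall = as.Date(c("2018-09-22", "2018-09-20", "2018-11-02", "2018-11-03")),
  After_Date_Recall = as.Date(c("2018-10-21", "2018-10-24", "2018-11-10", "2018-11-09"))
)
data_recalls$id <- seq_len(nrow(data_recalls))

# Self-join on brand: every row is paired with every row of the same brand
pairs <- merge(data_recalls, data_recalls,
               by = "Attributed_Brand", suffixes = c("", ".other"))

# Two intervals overlap when each one starts no later than the other ends;
# the id inequality removes pairs of a row with itself
is_overlap <- pairs$id != pairs$id.other &
  pairs$Before_Date_Recall <= pairs$After_Date_Recall.other &
  pairs$After_Date_Recall >= pairs$Before_Date_Recall.other

matching_ids <- sort(unique(pairs$id[is_overlap]))
result <- data_recalls[data_recalls$id %in% matching_ids, ]
print(result)
```

For this data all four rows are returned. The same pair of inequalities, together with t1.id <> t2.id, could presumably replace the two between clauses in the sqldf query above, though that variant is not tested here.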