Asked by: Niklas Schröder Asked: 3/23/2023 Last edited by: Niklas Schröder Updated: 3/23/2023 Views: 59
Finding Overlapping Dates for Matching Brands in R
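Q:
I have an R data frame and am trying to find rows where the entries in the column data_recalls$Attributed_Brand match and, at the same time, the time periods overlap. The time periods are given by the columns Before_Date_Recall and After_Date_Recall. A possible match would look like this:
(1) data_recalls$Attributed_Brand: Nike; Before_Date_Recall: 2018-09-22; After_Date_Recall: 2018-10-21
(2) data_recalls$Attributed_Brand: Nike; Before_Date_Recall: 2018-10-19; After_Date_Recall: 2018-10-24
Unfortunately, the only solution I could come up with first looks for matches in the column data_recalls$Attributed_Brand and then looks for overlapping dates, which does not produce the desired result. My data frame looks like this: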
data_recalls <- data.frame(Attributed_Brand = c("Nike", "Adidas", "Nike", "Puma"),
Before_Date_Recall = c("2018-09-22", "2018-09-20", "2018-10-19", "2018-11-01"),
After_Date_Recall = c("2018-10-21", "2018-10-24", "2018-10-31", "2018-11-10"),
stringsAsFactors = FALSE)
Thanks for any suggestions!
A:
0 votes
Davis Vaughan
3/23/2023
#1
It sounds like the ivs package would work for you. The group column lets you identify overlapping rows within each brand name.
library(ivs)
library(dplyr, warn.conflicts = FALSE)
data_recalls <- tibble(
Attributed_Brand = c(
"Nike", "Nike", "Nike",
"Adidas", "Adidas", "Adidas", "Adidas"
),
Before_Date_Recall = as.Date(c(
"2018-09-22", "2018-10-19", "2018-10-26",
"2018-09-01", "2018-10-05", "2018-09-03", "2018-10-02"
)),
After_Date_Recall = as.Date(c(
"2018-10-21", "2018-10-24", "2018-10-30",
"2018-09-05", "2018-10-15", "2018-09-07", "2018-10-10"
))
)
data_recalls %>%
mutate(Date_Recall = iv(Before_Date_Recall, After_Date_Recall), .keep = "unused") %>%
mutate(group = iv_identify_group(Date_Recall), .by = Attributed_Brand) %>%
filter(n() > 1, .by = c(Attributed_Brand, group))
#> # A tibble: 6 × 3
#> Attributed_Brand Date_Recall group
#> <chr> <iv<date>> <iv<date>>
#> 1 Nike [2018-09-22, 2018-10-21) [2018-09-22, 2018-10-24)
#> 2 Nike [2018-10-19, 2018-10-24) [2018-09-22, 2018-10-24)
#> 3 Adidas [2018-09-01, 2018-09-05) [2018-09-01, 2018-09-07)
#> 4 Adidas [2018-10-05, 2018-10-15) [2018-10-02, 2018-10-15)
#> 5 Adidas [2018-09-03, 2018-09-07) [2018-09-01, 2018-09-07)
#> 6 Adidas [2018-10-02, 2018-10-10) [2018-10-02, 2018-10-15)
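A side note on the printed intervals: the [start, end) notation means the end date is excluded, so two intervals that merely touch at an endpoint do not count as overlapping. The base R sketch below is not ivs code, and the two touching intervals are made-up values; it only contrasts the half-open rule with a closed-interval rule. If After_Date_Recall is meant to be inclusive, one option might be to build the interval with After_Date_Recall + 1, but that is an assumption about the data, not something stated in the question.

```r
# Hypothetical pair of intervals that share only an endpoint.
start <- as.Date(c("2018-09-01", "2018-09-05"))
end   <- as.Date(c("2018-09-05", "2018-09-10"))

# Half-open rule, [start, end): the end date itself is not part of the interval.
overlap_half_open <- start[1] < end[2] && start[2] < end[1]

# Closed rule, [start, end]: the end date counts as part of the interval.
overlap_closed <- start[1] <= end[2] && start[2] <= end[1]

overlap_half_open  # FALSE
overlap_closed     # TRUE
```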
Here is another alternative, which may be faster depending on the structure of your actual dataset:
data_recalls %>%
mutate(Date_Recall = iv(Before_Date_Recall, After_Date_Recall), .keep = "unused") %>%
mutate(count = iv_count_overlaps(Date_Recall, Date_Recall), .by = Attributed_Brand) %>%
filter(count > 1)
#> # A tibble: 6 × 3
#> Attributed_Brand Date_Recall count
#> <chr> <iv<date>> <int>
#> 1 Nike [2018-09-22, 2018-10-21) 2
#> 2 Nike [2018-10-19, 2018-10-24) 2
#> 3 Adidas [2018-09-01, 2018-09-05) 2
#> 4 Adidas [2018-10-05, 2018-10-15) 2
#> 5 Adidas [2018-09-03, 2018-09-07) 2
#> 6 Adidas [2018-10-02, 2018-10-10) 2
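For readers who cannot install extra packages, the grouping idea from the first approach can be roughly sketched in base R. This is not how ivs is implemented: overlap_group is a hypothetical helper, intervals are treated as half-open, and intervals that merely touch start a new group here, which may differ from how ivs handles abutting intervals by default.

```r
# Same example data as in the answer, built with base R only.
data_recalls <- data.frame(
  Attributed_Brand = c("Nike", "Nike", "Nike", "Adidas", "Adidas", "Adidas", "Adidas"),
  Before_Date_Recall = as.Date(c(
    "2018-09-22", "2018-10-19", "2018-10-26",
    "2018-09-01", "2018-10-05", "2018-09-03", "2018-10-02"
  )),
  After_Date_Recall = as.Date(c(
    "2018-10-21", "2018-10-24", "2018-10-30",
    "2018-09-05", "2018-10-15", "2018-09-07", "2018-10-10"
  ))
)

# Hypothetical helper: label runs of connected intervals with an integer group.
# A new group starts when an interval begins at or after every earlier end.
overlap_group <- function(start, end) {
  ord <- order(start, end)
  s <- as.numeric(start[ord])
  e <- as.numeric(end[ord])
  # Latest end seen among all earlier intervals (in sorted order).
  prev_max_end <- c(-Inf, cummax(e)[-length(e)])
  grp <- integer(length(s))
  grp[ord] <- cumsum(s >= prev_max_end)
  grp
}

# Assign groups separately within each brand.
data_recalls$group <- NA_integer_
for (brand in unique(data_recalls$Attributed_Brand)) {
  rows <- data_recalls$Attributed_Brand == brand
  data_recalls$group[rows] <- overlap_group(
    data_recalls$Before_Date_Recall[rows],
    data_recalls$After_Date_Recall[rows]
  )
}

# Keep rows whose (brand, group) combination occurs more than once.
group_size <- ave(
  seq_len(nrow(data_recalls)),
  data_recalls$Attributed_Brand, data_recalls$group,
  FUN = length
)
matches <- data_recalls[group_size > 1, ]
matches
```

On this example data the sketch keeps the same six rows as the ivs output above and drops the third Nike row, whose interval overlaps no other Nike interval.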
Comments
0 votes
Niklas Schröder
3/23/2023
Thanks for your reply! Based on your example dataset, my preferred matched dataset would look like this: df_matches <- data.frame( Attributed_Brand = c("Adidas", "Adidas", "Adidas", "Adidas", "Nike", "Nike"), Before_Date_Recall = as.Date(c("2018-10-05", "2018-10-02", "2018-09-03", "2018-09-01", "2018-10-19", "2018-09-22")), After_Date_Recall = as.Date(c("2018-10-15", "2018-10-10", "2018-09-07", "2018-09-05", "2018-10-24", "2018-10-21"))) The third row of the data would not be a match, because the interval from 2018-10-26 to 2018-10-30 does not overlap any other Nike dates.
0 votes
Davis Vaughan
3/23/2023
I have updated the answer with an extra filter() that should produce what you are looking for.
0 votes
r2evans
3/23/2023
#2
Try this:
sqldf::sqldf("
select distinct t1.*
from data_recalls t1
inner join data_recalls t2 on t1.Attributed_Brand = t2.Attributed_Brand
and ((t1.Before_Date_Recall > t2.Before_Date_Recall and t1.Before_Date_Recall < t2.After_Date_Recall)
or (t1.After_Date_Recall > t2.Before_Date_Recall and t1.After_Date_Recall < t2.After_Date_Recall))")
# Attributed_Brand Before_Date_Recall After_Date_Recall
# 1 Nike 2018-09-22 2018-10-21
# 2 Nike 2018-10-19 2018-10-31
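I first tried using SQL's between, but it is inclusive, which causes every row to match itself. We can prevent that by adding a few more conditions, but this is a start.
To allow for overlaps, your data needs a unique ID for each row: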
data_recalls$id <- seq_len(nrow(data_recalls))
From there, we can use between and then add an inequality on the id field.
sqldf::sqldf("
select distinct t1.*
from data_recalls t1
inner join data_recalls t2 on t1.Attributed_Brand = t2.Attributed_Brand
and (t1.Before_Date_Recall between t2.Before_Date_Recall and t2.After_Date_Recall
or t1.After_Date_Recall between t2.Before_Date_Recall and t2.After_Date_Recall)
and t1.id <> t2.id")
# Attributed_Brand Before_Date_Recall After_Date_Recall id
# 1 Nike 2018-09-22 2018-10-21 1
# 2 Nike 2018-10-19 2018-10-31 3
Comments
0 votes
Niklas Schröder
3/23/2023
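The code works for the provided example data. However, it does not work if I have a data frame like the following: data_recalls <- data.frame(Attributed_Brand = c("Adidas", "Adidas", "Nike", "Nike"), Before_Date_Recall = c("2018-09-22", "2018-09-20", "2018-11-02", "2018-11-03"), After_Date_Recall = c("2018-10-21", "2018-10-24", "2018-11-10", "2018-11-09"), stringsAsFactors = FALSE) In this case, the result should be similar to the data frame itself, since the brand names match and the dates overlap.
0 votes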
r2evans
3/23/2023
Perhaps stackoverflow.com/a/74679562/3358272
Comments
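A likely reason the query misses rows for the data frame in the comment above: it only tests whether an endpoint of t1 falls inside t2, so a row whose interval completely contains the other one is never returned. A minimal base R sketch of the usual symmetric overlap test follows. It assumes the dates are converted to Date and treats intervals as closed; the names pairs, overlapping and matches are illustrative, not from the original answer.

```r
# Data frame from the comment, where one interval per brand contains the other.
data_recalls <- data.frame(
  Attributed_Brand = c("Adidas", "Adidas", "Nike", "Nike"),
  Before_Date_Recall = as.Date(c("2018-09-22", "2018-09-20", "2018-11-02", "2018-11-03")),
  After_Date_Recall = as.Date(c("2018-10-21", "2018-10-24", "2018-11-10", "2018-11-09"))
)
data_recalls$id <- seq_len(nrow(data_recalls))

# Self-join on brand; merge() adds the default suffixes ".x" and ".y".
pairs <- merge(data_recalls, data_recalls, by = "Attributed_Brand")

# Two closed intervals overlap when each one starts no later than the other ends.
# This single test also covers the case where one interval contains the other.
overlapping <- pairs$id.x != pairs$id.y &
  pairs$Before_Date_Recall.x <= pairs$After_Date_Recall.y &
  pairs$After_Date_Recall.x >= pairs$Before_Date_Recall.y

# Keep every original row that overlaps at least one other row of the same brand.
matches <- data_recalls[data_recalls$id %in% pairs$id.x[overlapping], ]
matches
```

With this data all four rows are kept, which is the result the comment asks for. The same condition could be written in the SQL join as t1.Before_Date_Recall <= t2.After_Date_Recall and t1.After_Date_Recall >= t2.Before_Date_Recall.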
data.table
dplyr::join_by
fuzzyjoin::
sqldf::
library(dplyr)
data_recalls %>%
  group_by(Attributed_Brand) %>%
  filter(
    any(Before_Date_Recall <= max(Before_Date_Recall)) &
      any(After_Date_Recall >= min(After_Date_Recall))
  )