How to merge two dataframes on two variables - the first an exact match on factor variable, the second a fuzzy match for numeric variable

Q:

Here are the data frames (please do not edit this part)

library(dplyr)
set.seed(123)
id <- rep(c("A", "B", "C"), each = 5)
score <- sample(1:50, 15)
label <- paste(sample(LETTERS, 15 * 5, replace = TRUE), collapse = "")
label <- substring(label, seq(1, 71, by = 5), seq(5, 75, by = 5))
df1 <- data.frame(id, score, label)
df1 <- df1[order(df1$id, df1$score), ]
row.names(df1) <- 1:nrow(df1)
df1

   id score label
1   A     3 JMGII
2   A    14 KGULO
3   A    15 ISDNQ
4   A    31 CHZGJ
5   A    42 JWUGU
6   B    25 OZPTF
7   B    26 KHVVG
8   B    37 LMRAY
9   B    43 FYBEH
10  B    48 YFUOI
11  C     5 BDMEV
12  C     9 PLYNC
13  C    27 PQVRQ
14  C    28 NYWCH
15  C    40 SYTVY

id <- rep(c("A", "B", "C"), each = 3)
score <- sample(1:50, 9)
df2 <- data.frame(id, score)
df2 <- df2[order(df2$id, df2$score), ]
df2 <- df2 %>% group_by(id) %>% mutate(order = row_number())
row.names(df2) <- 1:nrow(df2)
df2

# A tibble: 9 × 3
# Groups:   id [3]
  id    score order
* <chr> <int> <int>
1 A        14     1
2 A        29     2
3 A        32     3
4 B         3     1
5 B         7     2
6 B        23     3
7 C        15     1
8 C        21     2
9 C        37     3

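I have two columns with the same names in both data frames (df1 and df2).

The first column is a factor variable, "id". The second is "score", a numeric variable.

I want to do an inner join on two conditions: 1) an exact match on id, and 2) for each score in df2, join all observations in df1 for that id that have the closest score (with the difference capped at 5).

For the second condition, this means that each row of df2 should be joined with the df1 row(s) that have the smallest score difference, up to a difference of 5.

Example: the first row of df2 should join only with row 2 of df1 (a difference of 0, since both are 14), and not with row 3 of df1 (a difference of 1).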
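
The rule described above, applied to a single df2 row, can be sketched in base R as follows. This is only an illustration of the requirement, not a full join: the vector holds the id "A" scores from df1, and the names `df1_scores`, `target` and `max_diff` are made up for this sketch.

```r
# Scores of df1 for id "A", retyped from the example data
df1_scores <- c(3, 14, 15, 31, 42)
target <- 14     # score of the first df2 row (id "A")
max_diff <- 5    # largest difference still allowed

# Absolute distance from every df1 score to the df2 score
diffs <- abs(df1_scores - target)

# Keep every df1 score at the minimum distance, but only if it is within the cap
matched <- df1_scores[diffs == min(diffs) & diffs <= max_diff]
matched
```

Because the comparison is `diffs == min(diffs)` rather than picking a single position, several df1 scores at the same minimum distance would all be kept.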

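Notes:

I tried various packages such as fuzzyjoin, but I can't seem to get the fuzzy-matching part to work at the same time as the exact match, using max_dist = 2, distance_col = "distance".

I have searched Stack Overflow for solutions on similar topics, but so far none of them has worked - they don't have the same combination of one fuzzy numeric match and one exact factor match.

r fuzzy-join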

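Maybe try a full_join(by = ID), then group_by(ID) and filter(score_x <= score_y + 2 & score_x >= score_y - 2) with slice_max(n=1)? Without a minimal reproducible example, it's hard to know what will work. Please see how to make a great R reproducible example to make this question easier to answer.

0 upvotes user22746157 10/16/2023

Hi, I have now added a reproducible example. In this case I can't use slice_max(n=1), because there can be several observations with the same score difference, and I need all of them.

0 upvotes jared_mamrot 10/17/2023

Thanks for adding the reproducible example @user22746157; I have posted an answer, but if I have misunderstood your expected result please leave a comment and I will fix it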
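
On the tie concern raised in the comments: dplyr's slice_min() and slice_max() keep tied rows by default (with_ties = TRUE), so a slice-based approach does not necessarily lose observations at the same distance. A minimal sketch on made-up candidate pairs (the data frame `candidates` and the column `diff` are hypothetical, not part of the question's data):

```r
library(dplyr)

# Hypothetical candidate pairs: two df1 scores are equally close to a df2 score of 29
candidates <- data.frame(id = "A", score.x = c(27, 31, 42), score.y = 29)

closest <- candidates %>%
  mutate(diff = abs(score.x - score.y)) %>%
  filter(diff <= 5) %>%                        # enforce the cap first
  group_by(id, score.y) %>%
  slice_min(diff, n = 1, with_ties = TRUE) %>% # with_ties = TRUE is the default
  ungroup()
closest
```

Here both 27 and 31 are returned, since each is 2 away from 29.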

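A:

1 upvote neilfws 10/16/2023 #1

Edit

Thanks for providing example data and more details.

I still think dplyr::join_by is useful here: we can specify that df2$score lies within +/- 5 of df1$score.

I don't know whether the minimum score difference can be specified as part of the join, but you can apply filter after the join.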

library(dplyr)


df1 %>% 
  mutate(score_lower = score - 5, 
         score_upper = score + 5) %>% 
  inner_join(df2, 
             by = join_by(id, between(y$score, x$score_lower, x$score_upper))) %>% 
  group_by(id) %>% 
  filter(abs(score.x - score.y) == min(abs(score.x - score.y))) %>% 
  ungroup()

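Result, keeping all columns to show that the join works as intended. You can add a dplyr::select at the end of the code to keep and/or rename the columns you need.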

# A tibble: 3 × 7
  id    score.x label score_lower score_upper score.y order
  <chr>   <int> <chr>       <dbl>       <dbl>   <int> <int>
1 A          14 KGULO           9          19      14     1
2 B          25 OZPTF          20          30      23     3
3 C          40 SYTVY          35          45      37     3
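
Note that the code above groups by id only, so it returns the single closest pair per id. If the intent is the closest df1 row(s) for every df2 row, one possible variant is to group by id and score.y instead. The sketch below assumes dplyr >= 1.1.0 (for join_by) and uses a small retyped subset of the data (id "A" only); the names `x`, `y`, `per_row`, `score_df1` and `score_df2` are made up for illustration. It also shows a select at the end to keep and rename columns.

```r
library(dplyr)

# Toy subset of the question's data (id "A" only), retyped for illustration
x <- data.frame(id = "A", score = c(3, 14, 15, 31, 42))
y <- data.frame(id = "A", score = c(14, 29, 32))

per_row <- x %>%
  mutate(score_lower = score - 5,
         score_upper = score + 5) %>%
  # exact match on id, y$score within +/- 5 of x$score
  inner_join(y, by = join_by(id, between(y$score, x$score_lower, x$score_upper))) %>%
  # one group per df2 row, so the minimum is taken per df2 score
  group_by(id, score.y) %>%
  filter(abs(score.x - score.y) == min(abs(score.x - score.y))) %>%
  ungroup() %>%
  # keep and rename only the columns of interest
  select(id, score_df1 = score.x, score_df2 = score.y)
per_row
```

On this subset the df2 scores 14, 29 and 32 are paired with the df1 scores 14, 31 and 31 respectively.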

library(dplyr)

set.seed(123)
id <- rep(c("A", "B", "C"), each = 5)
score <- sample(1:50, 15)
label <- paste(sample(LETTERS, 15 * 5, replace = TRUE), collapse = "")
label <- substring(label, seq(1, 71, by = 5), seq(5, 75, by = 5))
df1 <- data.frame(id, score, label)
df1 <- df1[order(df1$id, df1$score), ]
row.names(df1) <- 1:nrow(df1)
df1
#>    id score label
#> 1   A     3 JMGII
#> 2   A    14 KGULO
#> 3   A    15 ISDNQ
#> 4   A    31 CHZGJ
#> 5   A    42 JWUGU
#> 6   B    25 OZPTF
#> 7   B    26 KHVVG
#> 8   B    37 LMRAY
#> 9   B    43 FYBEH
#> 10  B    48 YFUOI
#> 11  C     5 BDMEV
#> 12  C     9 PLYNC
#> 13  C    27 PQVRQ
#> 14  C    28 NYWCH
#> 15  C    40 SYTVY

id <- rep(c("A", "B", "C"), each = 3)
score <- sample(1:50, 9)
df2 <- data.frame(id, score)
df2 <- df2[order(df2$id, df2$score), ]
df2 <- df2 %>% group_by(id) %>% mutate(order = row_number())
row.names(df2) <- 1:nrow(df2)
#> Warning: Setting row names on a tibble is deprecated.
df2
#> # A tibble: 9 × 3
#> # Groups:   id [3]
#>   id    score order
#> * <chr> <int> <int>
#> 1 A        14     1
#> 2 A        29     2
#> 3 A        32     3
#> 4 B         3     1
#> 5 B         7     2
#> 6 B        23     3
#> 7 C        15     1
#> 8 C        21     2
#> 9 C        37     3

Possible solution:

df1 %>%
  full_join(df2, by = join_by(id), relationship = "many-to-many") %>%
  group_by(id, score.y) %>%
  filter(abs(score.y - score.x) <= 5 & 
           abs(score.y - score.x) == min(abs(score.y - score.x)))
#> # A tibble: 5 × 5
#> # Groups:   id, score.y [5]
#>   id    score.x label score.y order
#>   <chr>   <int> <chr>   <int> <int>
#> 1 A          14 KGULO      14     1
#> 2 A          31 CHZGJ      29     2
#> 3 A          31 CHZGJ      32     3
#> 4 B          25 OZPTF      23     3
#> 5 C          40 SYTVY      37     3

In the example, df2$score == 29 'matches' df1$score == 31; adding the value df1$score == 27 to the data frame shows that all values at the same distance from df2$score are kept:

df1[5,2] <- 27

df1 %>%
  full_join(df2, by = join_by(id), relationship = "many-to-many") %>%
  group_by(id, score.y) %>%
  filter(abs(score.y - score.x) <= 5 & 
           abs(score.y - score.x) == min(abs(score.y - score.x)))
#> # A tibble: 6 × 5
#> # Groups:   id, score.y [5]
#>   id    score.x label score.y order
#>   <chr>   <dbl> <chr>   <int> <int>
#> 1 A          14 KGULO      14     1
#> 2 A          31 CHZGJ      29     2
#> 3 A          31 CHZGJ      32     3
#> 4 A          27 JWUGU      29     2
#> 5 B          25 OZPTF      23     3
#> 6 C          40 SYTVY      37     3

Created on 2023-10-17 with reprex v2.0.2
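
For readers without dplyr, the same join-then-filter idea can be approximated in base R with merge() and ave(). This is a sketch on made-up toy data, not the answer's own code; the names `x`, `y`, `merged`, `min_diff` and `result` are hypothetical.

```r
# Hypothetical toy data with a tie (27 and 31 are both 2 away from 29)
x <- data.frame(id = c("A", "A", "A", "B"), score = c(14, 27, 31, 25))
y <- data.frame(id = c("A", "A", "B", "B"), score = c(14, 29, 3, 23))

# Exact match on id; the clashing score columns become score.x and score.y
merged <- merge(x, y, by = "id", suffixes = c(".x", ".y"))
merged$diff <- abs(merged$score.x - merged$score.y)

# Smallest difference within each (id, df2 score) group
min_diff <- ave(merged$diff, paste(merged$id, merged$score.y), FUN = min)

# Keep rows at the group minimum, provided the difference is at most 5
result <- merged[merged$diff <= 5 & merged$diff == min_diff, ]
result
```

The df2 row with id "B" and score 3 has no df1 score within 5, so it drops out, which mirrors the inner-join behaviour asked for in the question.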
