Joining dataframes on text strings using fuzzy string matching (stringdist_join())

Asked by: JRock · Asked: 3/7/2023 · Last edited by: JRock · Updated: 3/17/2023 · Views: 432



df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", 
                  "bcd0201"), PrimConstruct_a = c("cohesion", "cognition", 
                  "cohesion", "cognition")) 
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", 
                  "bcd0201"), PrimConstruct_b = c("cohesion", "cognition", 
                  "commitment", "Cohesion", "cognitiion")) 
# df2 has misspelling, different capitalization, 
# and entry with no close match

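I want the fuzzy matching to happen only within the same SampleID, so that "cohesion" for abc0101 is not matched with "cohesion" for bcd0201. Ultimately I want a dataset that keeps both PrimConstruct columns and all of their values, with matched values on the same row: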

desireddf <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
                        PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", NA),
                        PrimConstruct_b = c("cohesion", "cognition", "Cohesion", "cognitiion", "commitment"))


joined <- stringdist_join(df1, df2,
                         by = c("PrimConstruct_a" = "PrimConstruct_b",
                                "SampleID_a" = "SampleID_b"),
                         mode = "full",
                         method = "jw",
                         max_dist = 2,
                         ignore_case = TRUE)


0 votes · DPH · 3/7/2023 · #1

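A join on several columns with a different match function per column (in your case one equi join and one fuzzy join on string distance) can be done with the fuzzyjoin package like this: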

# dummy data ... I introduced a non matching case
df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "proof"), 
                  PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", "00000")) 
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"), 
                  PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion")) 

library(dplyr)  # attaches %>%

# build row identification
dplyr::mutate(df1, rn = dplyr::row_number()) %>%
    # perform the join (left join: keeps every row of df1)
    fuzzyjoin::fuzzy_left_join(df2,
                               # define join columns
                               by = c("SampleID_a" = "SampleID_b",
                                      "PrimConstruct_a" = "PrimConstruct_b"),
                               # one match function per column pair: exact equality for the sample IDs ...
                               match_fun = list(`==`,
                                                # ... and a function returning a boolean vector that is TRUE where the Levenshtein distance is below 2
                                                function(x, y) stringdist::stringdist(x, y, method = "lv") < 2))

  SampleID_a PrimConstruct_a rn SampleID_b PrimConstruct_b
1    abc0101        cohesion  1    abc0101        cohesion
2    abc0101       cognition  2    abc0101       cognition
3    bcd0201        cohesion  3    bcd0201        Cohesion
4    bcd0201       cognition  4    bcd0201      cognitiion
5      proof           00000  5       <NA>            <NA>

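As you may have noticed, the match can return multiple rows per input row, because more than one candidate may satisfy the condition. You therefore have to process the data by group, which is why I introduced "rn": you probably only want the closest match. To show how to solve this, here is an alternative approach that does not use the fuzzyjoin package but the stringdist package (fuzzyjoin is built on stringdist), since it may help you understand how things work: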

# build a row-number variable to identify each df1 row uniquely
res <- dplyr::mutate(df1, rn = dplyr::row_number()) %>%
    # left join on the column that must match exactly (all rows of df1 and all matching rows of df2)
    dplyr::left_join(df2, by = c("SampleID_a" = "SampleID_b")) %>%
    # compute the string distance and set it to 0 where it is NA (no left join match) for the next step
    dplyr::mutate(dist = dplyr::coalesce(stringdist::stringdist(PrimConstruct_a,
                                                                PrimConstruct_b), 0)) %>%
    # group by the df1 row id
    dplyr::group_by(rn) %>%
    # keep the row with the smallest dist in each group
    # (this is why dist was set to 0 where it was NA)
    # with_ties = FALSE returns only one match even if two share the same distance
    dplyr::slice_min(order_by = dist, n = 1, with_ties = FALSE) %>% 
    # release the grouping to prevent unwanted behaviour downstream
    dplyr::ungroup() %>%
    # set dist back to NA where there was no match (you might select or compute more here)
    dplyr::mutate(dist = ifelse(is.na(PrimConstruct_b), NA, dist))

# A tibble: 5 × 5
  SampleID_a PrimConstruct_a    rn PrimConstruct_b  dist
  <chr>      <chr>           <int> <chr>           <dbl>
1 abc0101    cohesion            1 cohesion            0
2 abc0101    cognition           2 cognition           0
3 bcd0201    cohesion            3 Cohesion            1
4 bcd0201    cognition           4 cognitiion          1
5 proof      00000               5 NA                 NA


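  1. You can use tolower() or toupper() (same result) inside the match function of the fuzzy join, or apply it to PrimConstruct_a and PrimConstruct_b before the join or the distance calculation. This removes any string distance caused by upper versus lower case, so only actual letter differences are counted.

  2. Read up on the different string distance functions that the stringdist package (and therefore fuzzyjoin) implements. Depending on the use case, some of the differences between them may be helpful.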
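
As a minimal sketch of both points, the snippet below compares the two non-identical pairs from the example data with and without case folding, and then with a second distance method. The variable names (a, b, d_raw, d_lower, d_jw) are only illustrative and not part of the answer's code; it assumes the stringdist package is installed.

```r
library(stringdist)

a <- c("cohesion", "cognition")
b <- c("Cohesion", "cognitiion")

# Levenshtein distance on the raw strings: the capital "C" counts as one edit
d_raw <- stringdist(a, b, method = "lv")
d_raw    # 1 1

# Lower-casing first removes the case difference; only the extra "i" remains
d_lower <- stringdist(tolower(a), tolower(b), method = "lv")
d_lower  # 0 1

# "jw" is scaled to the range 0 to 1, so a threshold that suits "lv"
# (such as 2) would accept every pair when used with "jw"
d_jw <- stringdist(tolower(a), tolower(b), method = "jw")
d_jw
```

The same tolower() call can go inside the anonymous match function passed to the fuzzy join, so the original columns keep their capitalization in the output.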


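You are looking for something like a full join, which is not as straightforward. One way to solve it is to assign the data from the previous step to a new variable (called "res"), identify the unmatched cases in df2, and bind them back onto the result like this: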

mis <- df2 %>% 
    # an anti join of df2 against the result identifies the non-matched cases
    dplyr::anti_join(res, by = c("SampleID_b" = "SampleID_a", 
                                 "PrimConstruct_b")) %>% 
    # give the data the same columns (and column types) as res so the two can be unioned/bound
    dplyr::transmute(SampleID_a = SampleID_b, 
                     PrimConstruct_a = NA_character_, 
                     rn = NA_integer_, 
                     PrimConstruct_b,
                     dist = NA_real_)

  SampleID_a PrimConstruct_a rn PrimConstruct_b dist
1    bcd0201              NA NA      commitment   NA

dplyr::union(res, mis)
# A tibble: 6 x 5
   SampleID_a PrimConstruct_a    rn PrimConstruct_b  dist
  <chr>      <chr>           <int> <chr>           <dbl>
1 abc0101    cohesion            1 cohesion            0
2 abc0101    cognition           2 cognition           0
3 bcd0201    cohesion            3 Cohesion            1
4 bcd0201    cognition           4 cognitiion          1
5 proof      00000               5 NA                 NA
6 bcd0201    NA                 NA commitment         NA

1 vote · moodymudskipper · 3/17/2023 · #2

You can use {powerjoin}:

df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", 
                                 "bcd0201"), PrimConstruct_a = c("cohesion", "cognition", 
                                                                 "cohesion", "cognition")) 
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", 
                                 "bcd0201"), PrimConstruct_b = c("cohesion", "cognition", 
                                                                 "commitment", "Cohesion", "cognitiion")) 

powerjoin::power_full_join(df1, df2, by = c(SampleID_a = "SampleID_b", ~ stringdist::stringdist(.x$PrimConstruct_a, .y$PrimConstruct_b) < 2))
#>   SampleID_a PrimConstruct_a PrimConstruct_b
#> 1    abc0101        cohesion        cohesion
#> 2    abc0101       cognition       cognition
#> 3    bcd0201        cohesion        Cohesion
#> 4    bcd0201       cognition      cognitiion
#> 5    bcd0201            <NA>      commitment

Created on 2023-03-16 with reprex v2.0.2

Note that I renamed the second column of df2 to PrimConstruct_b.