Joining dataframes on text strings using fuzzy string matching (stringdist_join())

Asked by: JRock  Asked: 3/7/2023  Last edited by: JRock  Updated: 3/17/2023  Views: 432

Q:

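I am trying to join two datasets based on the values of two variables. Both datasets have the same variable names and number of columns, but may have different numbers of rows. I want to join them on a grouping variable ("SampleID") and a variable containing text strings ("PrimConstruct"). I would like to use fuzzy matching to account for slight misspellings and capitalization differences in "PrimConstruct". It is also possible that one dataset has a row with a text value that the other dataset lacks, in which case I want it kept as a separate row.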

df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", 
                  "bcd0201"), PrimConstruct_a = c("cohesion", "cognition", 
                  "cohesion", "cognition")) 
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", 
                  "bcd0201"), PrimConstruct_b = c("cohesion", "cognition", 
                  "commitment", "Cohesion", "cognitiion")) 
# df2 has misspelling, different capitalization, 
# and entry with no close match

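I want the fuzzy matching to happen only within the same SampleID, so that "cohesion" for abc0101 is not matched with "cohesion" for bcd0201. Ultimately I want a dataset that keeps both PrimConstruct columns and all values, with matching values on the same row: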

desireddf <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
                        PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", NA),
                        PrimConstruct_b = c("cohesion", "cognition", "Cohesion", "cognitiion", "commitment"))

I tried stringdist_join, but it somehow ends up duplicating and adding a bunch of rows.

joined <- stringdist_join(df1,
                         df2,
                         by = c("PrimConstruct_a" = "PrimConstruct_b",
                                "SampleID_a" = "SampleID_b"),
                         mode = "full",
                         method = "jw",
                         max_dist = 2,
                         ignore_case = T)

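I am a bit confused by the different matching methods, but I do not expect the text strings to differ by much more than slight misspellings and capitalization, and I thought the capitalization part would be covered by ignore_case.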

r stringdist fuzzyjoin

A:

0 votes DPH 3/7/2023 #1

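A join on multiple columns with a different match function per column (in your case one equi join and one fuzzy join on string distance) can be done with the fuzzyjoin package like this: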

# dummy data ... I introduced a non matching case
df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "proof"), 
                  PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", "00000")) 
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"), 
                  PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion")) 

library(dplyr)  # attaches the %>% pipe used below

# build row identification
dplyr::mutate(df1, rn = dplyr::row_number()) %>%
    # perform join
    fuzzyjoin::fuzzy_left_join(df2,
                               # define join columns
                               by = c("SampleID_a" = "SampleID_b",
                                      "PrimConstruct_a" = "PrimConstruct_b"),
                               # list of match functions, one per column pair (the first is an exact match)
                               match_fun = list(`==`,
                                                # returns a logical vector: TRUE where the Levenshtein
                                                # distance is below 2, i.e. at most one edit
                                                function(x, y)
                                                    stringdist::stringdist(x, y, method = "lv") < 2)
    )

  SampleID_a PrimConstruct_a rn SampleID_b PrimConstruct_b
1    abc0101        cohesion  1    abc0101        cohesion
2    abc0101       cognition  2    abc0101       cognition
3    bcd0201        cohesion  3    bcd0201        Cohesion
4    bcd0201       cognition  4    bcd0201      cognitiion
5      proof           00000  5       <NA>            <NA>

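As you may have noticed, the match can return multiple rows per input row, because more than one candidate may satisfy the condition. You may therefore want to process the result by group afterwards, which is why I introduced "rn": you probably only want the closest match. To show you how to solve this, I will leave an alternative approach that does not use the fuzzyjoin package but the stringdist package directly (fuzzyjoin is built on stringdist), as it may help you better understand how things work: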

# build a row number variable to identify each df1 row uniquely
res <- dplyr::mutate(df1, rn = dplyr::row_number()) %>%
    # left join on the column that matches exactly (all rows from df1, all matching rows from df2)
    dplyr::left_join(df2, by = c("SampleID_a" = "SampleID_b")) %>%
    # compute the string distance and set it to 0 if NA (no left join match) for the next step
    dplyr::mutate(dist = dplyr::coalesce(stringdist::stringdist(PrimConstruct_a,
                                                                PrimConstruct_b), 0)) %>%
    # group by the df1 row id
    dplyr::group_by(rn) %>%
    # keep the first row per group ordered by the calculated dist column
    # (this is why dist was set to 0 if NA before);
    # with_ties = FALSE returns only one match even if two share the same distance
    dplyr::slice_min(order_by = dist, n = 1, with_ties = FALSE) %>% 
    # release the grouping to prevent unwanted behaviour downstream
    dplyr::ungroup() %>%
    # set dist back to NA where there was no match (you might select columns or do more calculations here)
    dplyr::mutate(dist = ifelse(is.na(PrimConstruct_b), NA, dist))

res
# A tibble: 5 × 5
  SampleID_a PrimConstruct_a    rn PrimConstruct_b  dist
  <chr>      <chr>           <int> <chr>           <dbl>
1 abc0101    cohesion            1 cohesion            0
2 abc0101    cognition           2 cognition           0
3 bcd0201    cohesion            3 Cohesion            1
4 bcd0201    cognition           4 cognitiion          1
5 proof      00000               5 NA                 NA
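
One thing to keep in mind with the pipeline above: it keeps the closest df2 row for every df1 row, however far away that closest row is. The following is only a sketch of adding a cutoff. The threshold `max_d`, the result name `capped`, and the extra "trust" row are hypothetical and not part of the answer above; recent dplyr versions may also warn about a many-to-many relationship in the left join, which is expected here.

```r
library(dplyr)

max_d <- 1  # hypothetical cutoff, chosen only for illustration

# small made-up inputs: "trust" has no close counterpart in df2
df1 <- data.frame(SampleID_a = c("bcd0201", "bcd0201"),
                  PrimConstruct_a = c("cohesion", "trust"))
df2 <- data.frame(SampleID_b = c("bcd0201", "bcd0201"),
                  PrimConstruct_b = c("Cohesion", "commitment"))

capped <- df1 %>%
    mutate(rn = row_number()) %>%
    left_join(df2, by = c("SampleID_a" = "SampleID_b")) %>%
    mutate(dist = stringdist::stringdist(PrimConstruct_a, PrimConstruct_b)) %>%
    group_by(rn) %>%
    slice_min(order_by = dist, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    # blank out the df2 side when even the closest candidate is too far away
    mutate(PrimConstruct_b = if_else(dist <= max_d, PrimConstruct_b, NA_character_),
           dist = if_else(dist <= max_d, dist, NA_real_))
```

With these inputs, "cohesion" should keep "Cohesion" as its match, while "trust" should end up with NA on the df2 side instead of being paired with an unrelated string.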

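Depending on your use case, a few adjustments are possible:

  1. You can use tolower() or toupper() (the result is the same) inside the match function of the fuzzy join, or apply it to PrimConstruct_a and PrimConstruct_b before the join or the distance calculation. This removes any string distance caused by upper versus lower case, so only genuine letter differences are counted.

  2. Read up on the different string-distance functions implemented by the stringdist package (and therefore available through fuzzyjoin). Depending on the use case, some of their differences may be helpful.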
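
To make both points concrete, here is a small sketch. The vectors `a` and `b` are made up from the values in the question, and the variable names are illustrative only.

```r
library(stringdist)

a <- c("cohesion", "cognition")
b <- c("Cohesion", "cognitiion")

# Levenshtein distance counts the case difference as one edit
d_raw <- stringdist(a, b, method = "lv")

# lower-casing first leaves only genuine spelling differences
d_lower <- stringdist(tolower(a), tolower(b), method = "lv")

# Jaro-Winkler distances are scaled to the range [0, 1]
d_jw <- stringdist(tolower(a), tolower(b), method = "jw")
```

Because the "jw" distance never exceeds 1, a threshold such as max_dist = 2 accepts every pair of strings, which would explain the extra rows seen in the question. With method = "jw" a cutoff well below 1 is needed, whereas edit-based methods such as "lv" take a cutoff expressed as a number of edits.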

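EDIT

You are looking for something like a full join, which is not quite as straightforward. One way to solve this is to assign the data from the prior step to a new variable (called "res"), identify the unmatched cases from df2, and merge them back into the result like this: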

mis <- df2 %>% 
    # anti join df2 against the result to identify the cases that were not matched
    dplyr::anti_join(res, by = c("SampleID_b" = "SampleID_a", 
                                 "PrimConstruct_b")) %>% 
    # give the data the same columns as res so the two can be unioned/bound together
    dplyr::transmute(SampleID_a = SampleID_b, 
                     PrimConstruct_a = NA, 
                     rn = NA, 
                     PrimConstruct_b,
                     dist = NA)

mis
  SampleID_a PrimConstruct_a rn PrimConstruct_b dist
1    bcd0201              NA NA      commitment   NA


dplyr::union(res, mis)
# A tibble: 6 x 5
  SampleID_a PrimConstruct_a    rn PrimConstruct_b  dist
  <chr>      <chr>           <int> <chr>           <dbl>
1 abc0101    cohesion            1 cohesion            0
2 abc0101    cognition           2 cognition           0
3 bcd0201    cohesion            3 Cohesion            1
4 bcd0201    cognition           4 cognitiion          1
5 proof      00000               5 NA                 NA
6 bcd0201    NA                 NA commitment         NA
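
As an alternative to the anti join and union, fuzzyjoin also provides fuzzy_full_join(), which accepts the same by and match_fun arguments as fuzzy_left_join(). The following is only a sketch on the question's original data. The helper name `close_enough` and the result name `full` are made up for illustration, and lower-casing inside the match function is an assumption about what the asker wants.

```r
library(fuzzyjoin)

df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201"),
                  PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition"))
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
                  PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion"))

# hypothetical helper: TRUE when the case-insensitive Levenshtein distance is at most 1
close_enough <- function(x, y) {
    stringdist::stringdist(tolower(x), tolower(y), method = "lv") < 2
}

full <- fuzzy_full_join(df1, df2,
                        by = c("SampleID_a" = "SampleID_b",
                               "PrimConstruct_a" = "PrimConstruct_b"),
                        # exact match on the ID, fuzzy match on the text column
                        match_fun = list(`==`, close_enough))
```

Rows of df2 without a counterpart in df1, such as "commitment", should then appear with NA in the df1 columns. Unlike the slice_min() approach, this keeps every pair that satisfies the condition, so duplicates are still possible when several candidates are close enough.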

1 vote moodymudskipper 3/17/2023 #2

You can use {powerjoin}:

df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", 
                                 "bcd0201"), PrimConstruct_a = c("cohesion", "cognition", 
                                                                 "cohesion", "cognition")) 
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", 
                                 "bcd0201"), PrimConstruct_b = c("cohesion", "cognition", 
                                                                 "commitment", "Cohesion", "cognitiion")) 

powerjoin::power_full_join(df1, df2, by = c(SampleID_a = "SampleID_b", ~ stringdist::stringdist(.x$PrimConstruct_a, .y$PrimConstruct_b) < 2))
#>   SampleID_a PrimConstruct_a PrimConstruct_b
#> 1    abc0101        cohesion        cohesion
#> 2    abc0101       cognition       cognition
#> 3    bcd0201        cohesion        Cohesion
#> 4    bcd0201       cognition      cognitiion
#> 5    bcd0201            <NA>      commitment

Created on 2023-03-16 with reprex v2.0.2

Note that I renamed the second column of df2 to PrimConstruct_b.