Asked by: JRock · Asked: 3/7/2023 · Last edited by: JRock · Updated: 3/17/2023 · Views: 432
Joining dataframes on text strings using fuzzy string matching (stringdist_join())
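Q:
I am trying to join two datasets based on the values of two variables. The two datasets have the same variable names/number of columns, but may have different numbers of rows. I want to join them on a grouping variable ("SampleID") and a variable containing text strings ("PrimConstruct"). I would like to use fuzzy matching to account for slight misspellings and differences in capitalization in "PrimConstruct". It is also possible that one dataset has a row with a text value that the other does not, in which case I want it kept as a separate row.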
df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201",
"bcd0201"), PrimConstruct_a = c("cohesion", "cognition",
"cohesion", "cognition"))
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201",
"bcd0201"), PrimConstruct_b = c("cohesion", "cognition",
"commitment", "Cohesion", "cognitiion"))
# df2 has misspelling, different capitalization,
# and entry with no close match
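I want the fuzzy matching to happen only within the same SampleID, so "cohesion" for abc0101 would not be matched with "cohesion" for bcd0201. Ultimately I would like a dataset that keeps both PrimConstruct columns and all values, with matched values on the same row:
desireddf <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
                        PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", NA),
                        PrimConstruct_b = c("cohesion", "cognition", "Cohesion", "cognitiion", "commitment"))
I tried stringdist_join, but it somehow ends up duplicating and adding a bunch of rows.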
joined <- stringdist_join(df1,
df2,
by = c("PrimConstruct_a" = "PrimConstruct_b",
"SampleID_a" = "SampleID_b"),
mode = "full",
method = "jw",
max_dist = 2,
ignore_case = T)
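I am a bit confused by the different matching methods, but I don't expect the text strings to differ by much more than slight misspellings and capitalization, which I thought ignore_case would cover.
A:
A string-distance join on multiple columns with a different match function per column (in your case an equi join plus a fuzzy join) can be done with the fuzzyjoin package like this: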
# dummy data ... I introduced a non matching case
df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "proof"),
PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", "00000"))
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion"))
library(magrittr)  # provides the %>% pipe used below

# add a row id so each df1 row can be identified after the join
dplyr::mutate(df1, rn = dplyr::row_number()) %>%
  # perform the join
  fuzzyjoin::fuzzy_left_join(df2,
    # join columns
    by = c("SampleID_a" = "SampleID_b",
           "PrimConstruct_a" = "PrimConstruct_b"),
    # one match function per join column: exact equality for SampleID,
    # Levenshtein distance below 2 (i.e. at most 1 edit) for PrimConstruct
    match_fun = list(`==`,
                     function(x, y)
                       stringdist::stringdist(x, y, method = "lv") < 2)
  )
SampleID_a PrimConstruct_a rn SampleID_b PrimConstruct_b
1 abc0101 cohesion 1 abc0101 cohesion
2 abc0101 cognition 2 abc0101 cognition
3 bcd0201 cohesion 3 bcd0201 Cohesion
4 bcd0201 cognition 4 bcd0201 cognitiion
5 proof 00000 5 <NA> <NA>
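As you may have noticed, the match can return multiple rows, because more than one candidate may satisfy the condition. You will then want to process the data by group, which is why I introduced "rn", since you probably only want the closest match. To show you how to solve this, I will leave an alternative approach that does not use the fuzzyjoin package but the stringdist package (which fuzzyjoin is based on), as it may help you understand how this works: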
# add a row number to identify each df1 row uniquely
res <- dplyr::mutate(df1, rn = dplyr::row_number()) %>%
  # left join on the column that must match exactly
  # (all rows from df1 plus every df2 row with the same SampleID)
  dplyr::left_join(df2, by = c("SampleID_a" = "SampleID_b")) %>%
  # compute the string distance; set it to 0 if NA (no left join match) for the next step
  dplyr::mutate(dist = dplyr::coalesce(stringdist::stringdist(PrimConstruct_a,
                                                              PrimConstruct_b), 0)) %>%
  # group by the df1 row id
  dplyr::group_by(rn) %>%
  # keep the row with the smallest dist per group
  # (this is why dist was set to 0 if NA before);
  # with_ties = FALSE returns only one match even if two share the same distance
  dplyr::slice_min(order_by = dist, n = 1, with_ties = FALSE) %>%
  # release the grouping to prevent unwanted behaviour downstream
  dplyr::ungroup() %>%
  # set dist back to NA where there was no match
  # (at this point you might select columns or perform more calculations)
  dplyr::mutate(dist = ifelse(is.na(PrimConstruct_b), NA, dist))

res
# A tibble: 5 × 5
SampleID_a PrimConstruct_a rn PrimConstruct_b dist
<chr> <chr> <int> <chr> <dbl>
1 abc0101 cohesion 1 cohesion 0
2 abc0101 cognition 2 cognition 0
3 bcd0201 cohesion 3 Cohesion 1
4 bcd0201 cognition 4 cognitiion 1
5 proof 00000 5 NA NA
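Note that the closest-match approach keeps the nearest candidate within each SampleID no matter how far away it is. A minimal sketch of one way to guard against that, assuming a cutoff of 2 (the hypothetical `max_dist` below), lower-casing before measuring, and a made-up extra row "conflict" that has no close counterpart, blanks out pairings at or beyond the cutoff instead of keeping them:

```r
library(dplyr)

# Example data; the "conflict" row is invented for this sketch
df1 <- data.frame(
  SampleID_a = c("abc0101", "bcd0201", "bcd0201"),
  PrimConstruct_a = c("cohesion", "cognition", "conflict")
)
df2 <- data.frame(
  SampleID_b = c("abc0101", "bcd0201", "bcd0201"),
  PrimConstruct_b = c("Cohesion", "cognitiion", "commitment")
)

max_dist <- 2  # assumed cutoff: only pairs with distance below 2 count as matches

res_cut <- df1 %>%
  mutate(rn = row_number()) %>%
  # dplyr >= 1.1 may warn about a many-to-many relationship here; that is expected
  left_join(df2, by = c("SampleID_a" = "SampleID_b")) %>%
  # Levenshtein distance on lower-cased strings, so case differences cost nothing
  mutate(dist = stringdist::stringdist(tolower(PrimConstruct_a),
                                       tolower(PrimConstruct_b),
                                       method = "lv")) %>%
  group_by(rn) %>%
  slice_min(order_by = dist, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # drop the pairing (but keep the df1 row) when the closest candidate is too far away
  mutate(
    PrimConstruct_b = if_else(dist < max_dist, PrimConstruct_b, NA_character_),
    dist = if_else(dist < max_dist, dist, NA_real_)
  )

res_cut
```

Here "cohesion" and "cognition" keep their partners, while "conflict" stays in the result with NA in PrimConstruct_b. The cutoff is a judgment call and depends on how long and how noisy your strings are.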
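Depending on your use case, a few adjustments may be in order:
You can use tolower() or toupper() (same result) inside the match function of the fuzzy join, or apply it to PrimConstruct_a and PrimConstruct_b before the join or the distance calculation. This removes all string distance caused by upper vs. lower case, so that only differences in the letters themselves are counted.
Read up on stringdist and fuzzyjoin, and on the different string distance functions these packages implement. Depending on the use case, the differences between them can be helpful.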
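To make these two points concrete, here is a small sketch using stringdist::stringdist() on strings from the question. It also hints at why method = "jw" combined with max_dist = 2 lets nearly everything match: Jaro-Winkler distances are scaled between 0 and 1, so a cutoff of 2 never excludes a pair, whereas Levenshtein ("lv") counts single-character edits and suits small integer cutoffs.

```r
library(stringdist)

# Case differences count as edits unless the strings are normalised first
stringdist("cohesion", "Cohesion", method = "lv")                    # 1
stringdist(tolower("cohesion"), tolower("Cohesion"), method = "lv")  # 0

# Levenshtein counts single-character edits
stringdist("cognition", "cognitiion", method = "lv")                 # 1

# Jaro-Winkler is bounded by 1, even for clearly different strings
jw <- stringdist("cohesion", "commitment", method = "jw")
jw <= 1                                                              # TRUE
```

If you stay with "jw", the cutoff would need to be well below 1 to filter anything out.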
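Edit
You are looking for something like a full join, though that is not quite as easy. One way to solve it is to assign the data from the prior step to a new variable (called "res"), identify the unmatched cases from df2, and bind them back onto the result like this: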
mis <- df2 %>%
  # an anti join of df2 against the result identifies the df2 rows that were not matched
  dplyr::anti_join(res, by = c("SampleID_b" = "SampleID_a",
                               "PrimConstruct_b")) %>%
  # give the data the same columns as res so the two can be unioned/bound together
  dplyr::transmute(SampleID_a = SampleID_b,
                   PrimConstruct_a = NA,
                   rn = NA,
                   PrimConstruct_b,
                   dist = NA)

mis
SampleID_a PrimConstruct_a rn PrimConstruct_b dist
1 bcd0201 NA NA commitment NA
dplyr::union(res, mis)
# A tibble: 6 x 5
SampleID_a PrimConstruct_a rn PrimConstruct_b dist
<chr> <chr> <int> <chr> <dbl>
1 abc0101 cohesion 1 cohesion 0
2 abc0101 cognition 2 cognition 0
3 bcd0201 cohesion 3 Cohesion 1
4 bcd0201 cognition 4 cognitiion 1
5 proof 00000 5 NA NA
6 bcd0201 NA NA commitment NA
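As an alternative to stitching the unmatched rows back on by hand, fuzzyjoin also provides fuzzy_full_join(). The sketch below is not part of the approach above; it reuses the per-column match functions on the question's original data and, as an assumption of this sketch, lower-cases both sides and keeps the cutoff of 2. Unlike the closest-match approach, it keeps every pair below the cutoff, so duplicated rows are possible when several candidates are close.

```r
# Data as posted in the question
df1 <- data.frame(
  SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201"),
  PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition")
)
df2 <- data.frame(
  SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"),
  PrimConstruct_b = c("cohesion", "cognition", "commitment", "Cohesion", "cognitiion")
)

joined <- fuzzyjoin::fuzzy_full_join(
  df1, df2,
  by = c("SampleID_a" = "SampleID_b",
         "PrimConstruct_a" = "PrimConstruct_b"),
  match_fun = list(
    # SampleID must match exactly
    `==`,
    # PrimConstruct may differ by at most one edit, ignoring case
    function(x, y) {
      stringdist::stringdist(tolower(x), tolower(y), method = "lv") < 2
    }
  )
)

joined
```

Rows of df2 without a partner (here "commitment") are kept with NA in the df1 columns, and unmatched df1 rows would likewise be kept with NA in the df2 columns.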
You can use {powerjoin}:
df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201",
"bcd0201"), PrimConstruct_a = c("cohesion", "cognition",
"cohesion", "cognition"))
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201",
"bcd0201"), PrimConstruct_b = c("cohesion", "cognition",
"commitment", "Cohesion", "cognitiion"))
powerjoin::power_full_join(
  df1, df2,
  by = c(SampleID_a = "SampleID_b",
         ~ stringdist::stringdist(.x$PrimConstruct_a, .y$PrimConstruct_b) < 2)
)
#> SampleID_a PrimConstruct_a PrimConstruct_b
#> 1 abc0101 cohesion cohesion
#> 2 abc0101 cognition cognition
#> 3 bcd0201 cohesion Cohesion
#> 4 bcd0201 cognition cognitiion
#> 5 bcd0201 <NA> commitment
Created on 2023-03-16 with reprex v2.0.2
Note that I renamed the second column of df2 to PrimConstruct_b.