问：

我有两个 DataFrame，我想根据一列进行合并。但是，由于拼写不同、空格数量不同、没有/存在变音符号，我希望能够合并，只要它们彼此相似。

任何相似性算法都可以（soundex、Levenshtein、difflib's）。

假设一个 DataFrame 具有以下数据：

df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

       number
one         1
two         2
three       3
four        4
five        5

df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

      letter
one        a
too        b
three      c
fours      d
five       e

然后我想得到得到的 DataFrame

       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

Python 熊猫

1赞 citynorman 1/20/2019

是的，使用 d6tjoin 请参阅 MergeTop1 笔记本

2赞 yatu 9/15/2020

在未找到接近匹配项的情况下，接受的解决方案将失败。有关简单的解决方法，请参阅此替代方法

答：

7赞 jassinm 12/1/2012 #1

http://pandas.pydata.org/pandas-docs/dev/merging.html 没有钩子功能来即时执行此操作。不过会很好...

我只会做一个单独的步骤，并使用 difflib getclosest_matches 在 2 个数据帧之一中创建一个新列，并在模糊匹配列上进行合并/连接

4赞 Andy Hayden 12/3/2012

您能解释一下如何使用创建这样的列然后合并吗？difflib.get_closest_matches

109赞 Andy Hayden 12/3/2012 #2

与@locojay建议类似，您可以将 difflib 的 get_close_matches 应用于的索引，然后应用连接：df2

In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]: 
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]: 
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e

如果这些是列，同样，您可以应用于列，然后合并：

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)

1赞 As3adTintin 10/30/2015

有谁知道是否有办法在一列的行之间执行此操作？我正在尝试查找可能有错别字的重复项

2赞 Bastian 7/13/2016

可以使用 n=1 将结果限制为 1。docs.python.org/3/library/......

4赞 famargar 1/23/2018

如果两个数据帧的长度不同，该怎么办？

3赞 Maxim 10/14/2020

对于那些说它失败的人来说，我认为这更像是一个如何在你的管道中实现它的问题，而不是解决方案的错，它简单而优雅。

1赞 F. Leone 10/23/2020

我使用了类似的解决方案，但用于调整结果列表的大小并确保它不会抛出[:1]get_close_matchesKeyError

2赞 Luke 8/8/2014 #3

请注意，这基本上是有效的，除非没有找到匹配项，或者任一列中都有 NaN。与其直接应用，我发现应用以下功能更容易。NaN 替代品的选择很大程度上取决于您的数据集。get_close_matches

def fuzzy_match(a, b):
    left = '1' if pd.isnull(a) else a
    right = b.fillna('2')
    out = difflib.get_close_matches(left, right)
    return out[0] if out else np.NaN

15赞 lostsoul29 5/29/2016 #4

我会使用 Jaro-Winkler，因为它是目前可用的性能最高、最准确的近似字符串匹配算法之一 [Cohen， et al.]， [Winkler]。

这就是我用水母包中的 Jaro-Winkler 做的方式：

def get_closest_match(x, list_strings):

  best_match = None
  highest_jw = 0

  for current_string in list_strings:
    current_score = jellyfish.jaro_winkler(x, current_string)

    if(current_score > highest_jw):
      highest_jw = current_score
      best_match = current_string

  return best_match

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

df1.join(df2)

输出：

    number  letter
one     1   a
two     2   b
three   3   c
four    4   d
five    5   e

0赞 andreabedini 8/31/2016

def get_closest_match（x， list_strings）怎么样： return sorted（list_strings， key=lambda y： jellyfish.jaro_winkler（x， y）， reverse=True）[0]

3赞 citynorman 11/9/2017

有什么方法可以加快速度吗？此代码不能很好地扩展。

24赞 RobinL 12/2/2017 #5

我编写了一个旨在解决这个问题的 Python 包：

pip install fuzzymatcher

您可以在此处找到存储库，在此处找到文档。

基本用法：

给定两个要模糊连接的数据帧和，可以编写以下内容：df_leftdf_right

from fuzzymatcher import link_table, fuzzy_left_join

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)

或者，如果您只想链接到最接近的匹配项：

fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)

3赞 Erfan 5/27/2019

老实说，如果它没有那么多依赖项，那就太棒了，首先我必须安装 Visual Studio 构建工具，现在我收到错误：no such module: fts4

2赞 Fatima 1/28/2020

name 'fuzzymatcher' is not defined

2赞 TaL 10/11/2020

@RobinL请详细说明如何解决：问题？我一直在尝试这样做，但收效甚微。no such module: fts4

0赞 Anakin Skywalker 12/6/2021

@TaL，你解决了吗？

1赞 TaL 12/7/2021

@AnakinSkywalker - 我想我使用了 reddy 下面的答案。但是我花了很多汗水来解决这个问题

2赞 citynorman 8/15/2018 #6

你可以使用 d6tjoin

import d6tjoin.top1
d6tjoin.top1.MergeTop1(df1.reset_index(),df2.reset_index(),
       fuzzy_left_on=['index'],fuzzy_right_on=['index']).merge()['merged']

index number index_right letter 0 one 1 one a 1 two 2 too b 2 three 3 three c 3 four 4 fours d 4 five 5 five e

它具有多种附加功能，例如：

检查加入质量、加入前和加入后
自定义相似度功能，例如编辑距离与汉明距离
指定最大距离
多核计算

有关详细信息，请参阅

MergeTop1 示例 - 最佳匹配联接示例笔记本
预联接示例 - 用于诊断联接问题的示例

0赞 Erfan 5/27/2019

刚刚测试了一下，它给了我奇怪的结果，例如它与匹配，有没有办法配置匹配分数的阈值？governmentbusiness

0赞 citynorman 5/29/2019

是的，请参阅您可以传递的参考文档，并且可能还想更改为倾向于提供更好匹配的文档。top_limitfun_difffun_diff=[affinegap.affineGapDistance]

73赞 Erfan 5/27/2019 #7

用`fuzzywuzzy`

由于该包没有示例，以下是我编写的一个函数，它将根据您可以作为用户设置的阈值返回所有匹配项：fuzzywuzzy

示例 datframe

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

# df1
          Key
0       Apple
1      Banana
2      Orange
3  Strawberry

# df2
        Key
0      Aple
1     Mango
2      Orag
3     Straw
4  Bannanna
5     Berry

模糊匹配功能

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the amount of matches that will get returned, these are sorted high to low
    :return: dataframe with boths keys and matches
    """
    s = df_2[key2].tolist()
    
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m
    
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    
    return df_1

在数据帧上使用我们的函数：#1

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)

          Key       matches
0       Apple          Aple
1      Banana      Bannanna
2      Orange          Orag
3  Strawberry  Straw, Berry

在数据帧上使用我们的函数：#2

df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})

fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)

        Col1  matches
0  Microsoft  Mcrsoft
1     Google    gogle
2     Amazon   Amason
3        IBM

安装：

果仁

pip install fuzzywuzzy

蟒蛇

conda install -c conda-forge fuzzywuzzy

7赞 Tinkinc 9/4/2019

有没有办法将 DF2 的所有专栏都带到比赛中？假设 C 是您希望保留的表 2 （DF2）的主键或外键

0赞 Fatima 1/28/2020

你@Tinkinc想出怎么做？

1赞 Umar.H 2/1/2020

嘿 Erfan，当你得到一个 mo 时，你认为你可以更新它以与 pandas 1.0 一起使用吗？我想知道如果您将引擎更改为适用于 Cython 或 Numba 的引擎，它会得到什么样的性能提升

0赞 Django0602 2/11/2020

对于我的问题来说，这个解决方案看起来也非常有希望。但是，您能否解释一下当我在两个数据集中都没有公共列时，这将如何工作？如何在为我提供分数的两个数据集之一中创建匹配列？我已经使用了您的#2解决方案。我不确定为什么运行需要这么多时间。

1赞 suricactus 3/22/2020

如果您还需要匹配的密钥，则可以使用s = df_2.to_dict()[key2]

5赞 reddy 7/13/2019 #8

我使用了 Fuzzymatcher 包，这对我来说效果很好。有关此方面的更多详细信息，请访问此链接。

使用以下命令进行安装

pip install fuzzymatcher

下面是示例代码（上面已经由 RobinL 提交）

from fuzzymatcher import link_table, fuzzy_left_join

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)

您可能会遇到的错误

ZeroDivisionError：浮点数除以零--->请参阅此链接解决它
OperationalError：没有这样的模块：fts4 --> downlaod sqlite3.dll 从这里替换 python 或 anaconda 中的 DLL 文件 DLL 文件夹。

优点：

工作速度更快。 就我而言，我将一个具有 3000 行的数据帧与具有 170,000 条记录的 anohter dataframe 进行了比较。这也使用跨文本的 SQLite3 搜索。比许多人都快
可以跨多个列和 2 个数据帧进行检查。就我而言，我正在根据地址和公司名称寻找最接近的匹配项。有时，公司名称可能相同，但地址也是检查的好东西。
为同一记录的所有最接近的匹配项提供分数。您可以选择截止分数。

缺点：

原始软件包安装有问题
还安装了所需的 C++ 和可视化工作室
不适用于 64 位 anaconda/Python

0赞 Parseltongue 3/20/2020

谢谢雷迪...目前正在一个有 6000 行的数据集上运行它，与一个有 300 万行的数据集相匹配，并祈祷......你认为这会比 fuzzywuzzy 运行得更快吗？

1赞 reddy 4/20/2020

嗨，@Parseltongue：这个数据对你来说是巨大的。我不认为任何 fuzzywuzzy 似乎对超过 100 万有效，但你绝对可以尝试一下。我跑了 6000 行，对 80 万行，相当不错。

0赞 Anakin Skywalker 12/6/2021

如何安装sqlite模型？尝试了所有可能的选项 - 仍然不起作用:(

1赞 RobinL 12/7/2021

@AnakinSkywalker sqlite 模块是内置的 python，所以你不需要安装！

0赞 Vikranth 4/26/2022

@reddy 我无法弄清楚零浮点除法错误。我在使用 pip 在 colab 中安装后将其安装进去，您能帮我吗？

5赞 camnesia 3/11/2020 #9

有一个名为的包，可以使用、和方法。这里有一些很好的例子fuzzy_pandaslevenshteinjarometaphonebilenco

import pandas as pd
import fuzzy_pandas as fpd

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

results = fpd.fuzzy_merge(df1, df2,
            left_on='Key',
            right_on='Key',
            method='levenshtein',
            threshold=0.6)

results.head()

  Key    Key
0 Apple  Aple
1 Banana Bannanna
2 Orange Orag

11赞 yatu 3/29/2020 #10

对于一般方法：`fuzzy_merge`

对于更一般的场景，我们要合并包含略有不同的字符串的两个数据帧中的列，以下函数使用difflib.get_close_matches和合并来模仿 pandas 的功能，但具有模糊匹配：merge

import difflib 

def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    df_other= df2.copy()
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff) 
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)

def get_closest_match(x, other, cutoff):
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None

以下是两个示例数据帧的一些用例：

print(df1)

     key   number
0    one       1
1    two       2
2  three       3
3   four       4
4   five       5

print(df2)

                 key_close  letter
0                    three      c
1                      one      a
2                      too      b
3                    fours      d
4  a very different string      e

通过上面的例子，我们会得到：

fuzzy_merge(df1, df2, left_on='key', right_on='key_close')

     key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d

我们可以做一个左连接：

fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')

     key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d
4   five       5       NaN    NaN

对于右连接，我们将在左侧数据帧中拥有所有不匹配的键，以便：None

fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')

     key  number                key_close letter
0    one     1.0                      one      a
1    two     2.0                      too      b
2  three     3.0                    three      c
3   four     4.0                    fours      d
4   None     NaN  a very different string      e

另请注意，如果截止时间内没有匹配的项目，difflib.get_close_matches将返回一个空列表。在共享示例中，如果我们将最后一个索引更改为：df2

print(df2)

                          letter
one                          a
too                          b
three                        c
fours                        d
a very different string      e

我们会得到一个错误：index out of range

df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

IndexError：列出超出范围的索引

为了解决这个问题，上述函数将仅当它实际包含任何匹配项时，才会对返回的列表进行索引，从而返回最接近的匹配项。get_closest_matchdifflib.get_close_matches

0赞 irene 6/13/2020

我建议使用它来使其更快：applydf_other[left_on] = df_other[right_on].apply(lambda x: get_closest_match(x, df1[left_on], cutoff))

0赞 yatu 6/13/2020

Apply 并不比 List comps 快@irene :)检查 stackoverflow.com/questions/16476924/...

0赞 irene 6/13/2020

嗯。。。我只是尝试了相同的代码，对于我拥有的数据来说，它明显更快。也许它依赖于数据？

0赞 yatu 6/13/2020

通常，为了获得可靠的时间，您需要对大样本量进行基准测试。但根据我的经验，list-comps 通常@irene一样快或更快还要注意，apply 基本上也只是在行上循环

1赞 irene 6/15/2020

明白了，下次会尝试列表理解对我来说是慢下来。谢谢！apply

0赞 Bex T. 11/19/2020 #11

对于更复杂的用例，以匹配具有多列的行，可以使用 package。提供所有工具，用于模糊匹配数据帧之间的行，这有助于在合并时对数据进行重复数据删除。我在这里写了一篇关于该软件包的详细文章recordlinkagerecordlinkagepandas

0赞 mccandar 2/28/2021 #12

如果连接轴是数字轴，则还可以用于匹配具有指定容差的索引：

def fuzzy_left_join(df1, df2, tol=None):
    index1 = df1.index.values
    index2 = df2.index.values

    diff = np.abs(index1.reshape((-1, 1)) - index2)
    mask_j = np.argmin(diff, axis=1)  # min. of each column
    mask_i = np.arange(mask_j.shape[0])

    df1_ = df1.iloc[mask_i]
    df2_ = df2.iloc[mask_j]

    if tol is not None:
        mask = np.abs(df2_.index.values - df1_.index.values) <= tol
        df1_ = df1_.loc[mask]
        df2_ = df2_.loc[mask]

    df2_.index = df1_.index

    out = pd.concat([df1_, df2_], axis=1)
    return out

2赞 Matt 3/27/2021 #13

我以非常少的方式使用，同时匹配 in 的现有行为和关键字。fuzzywuzzmergepandas

只需指定您接受的匹配（在和之间）：threshold0100

from fuzzywuzzy import process

def fuzzy_merge(df, df2, on=None, left_on=None, right_on=None, how='inner', threshold=80):
    
    def fuzzy_apply(x, df, column, threshold=threshold):
        if type(x)!=str:
            return None
        
        match, score, *_ = process.extract(x, df[column], limit=1)[0]
            
        if score >= threshold:
            return match

        else:
            return None
    
    if on is not None:
        left_on = on
        right_on = on

    # create temp column as the best fuzzy match (or None!)
    df2['tmp'] = df2[right_on].apply(
        fuzzy_apply, 
        df=df, 
        column=left_on, 
        threshold=threshold
    )

    merged_df = df.merge(df2, how=how, left_on=left_on, right_on='tmp')
    
    del merged_df['tmp']
    
    return merged_df

使用示例数据进行尝试：

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})

df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

fuzzy_merge(df, df2, on='Key', threshold=80)

0赞 maxbachmann 3/28/2021

可以直接使用 process.extractOne 代替限制为 1 的 process.extract，它只提取最佳匹配项。

4赞 philipnye 3/28/2022 #14

用`thefuzz`

使用 SeatGeek 的出色软件包，它利用了 Levenshtein 距离。这适用于列中保存的数据。它将匹配项添加为行而不是列，以保持整洁的数据集，并允许轻松地将其他列拉取到输出数据帧。thefuzz

示例数据

df1 = pd.DataFrame({'col_a':['one','two','three','four','five'], 'col_b':[1, 2, 3, 4, 5]})

    col_a   col_b
0   one     1
1   two     2
2   three   3
3   four    4
4   five    5

df2 = pd.DataFrame({'col_a':['one','too','three','fours','five'], 'col_b':['a','b','c','d','e']})

    col_a   col_b
0   one     a
1   too     b
2   three   c
3   fours   d
4   five    e

用于进行匹配的函数

def fuzzy_match(
    df_left, df_right, column_left, column_right, threshold=90, limit=1
):
    # Create a series
    series_matches = df_left[column_left].apply(
        lambda x: process.extract(x, df_right[column_right], limit=limit)            # Creates a series with id from df_left and column name _column_left_, with _limit_ matches per item
    )

    # Convert matches to a tidy dataframe
    df_matches = series_matches.to_frame()
    df_matches = df_matches.explode(column_left)     # Convert list of matches to rows
    df_matches[
        ['match_string', 'match_score', 'df_right_id']
    ] = pd.DataFrame(df_matches[column_left].tolist(), index=df_matches.index)       # Convert match tuple to columns
    df_matches.drop(column_left, axis=1, inplace=True)      # Drop column of match tuples

    # Reset index, as in creating a tidy dataframe we've introduced multiple rows per id, so that no longer functions well as the index
    if df_matches.index.name:
        index_name = df_matches.index.name     # Stash index name
    else:
        index_name = 'index'        # Default used by pandas
    df_matches.reset_index(inplace=True)
    df_matches.rename(columns={index_name: 'df_left_id'}, inplace=True)       # The previous index has now become a column: rename for ease of reference

    # Drop matches below threshold
    df_matches.drop(
        df_matches.loc[df_matches['match_score'] < threshold].index,
        inplace=True
    )

    return df_matches

使用函数和合并数据

import pandas as pd
from thefuzz import process

df_matches = fuzzy_match(
    df1,
    df2,
    'col_a',
    'col_a',
    threshold=60,
    limit=1
)

df_output = df1.merge(
    df_matches,
    how='left',
    left_index=True,
    right_on='df_left_id'
).merge(
    df2,
    how='left',
    left_on='df_right_id',
    right_index=True,
    suffixes=['_df1', '_df2']
)

df_output.set_index('df_left_id', inplace=True)       # For some reason the first merge operation wrecks the dataframe's index. Recreated from the value we have in the matches lookup table

df_output = df_output[['col_a_df1', 'col_b_df1', 'col_b_df2']]      # Drop columns used in the matching
df_output.index.name = 'id'

id  col_a_df1   col_b_df1   col_b_df2
0   one         1           a
1   two         2           b
2   three       3           c
3   four        4           d
4   five        5           e

0赞 George 8/8/2022 #15

TheFuzz 是 fuzzywuzzy 的新版本

为了在两个大表中模糊连接字符串元素，您可以这样做：

使用 apply 逐行执行
使用更快的并行、加速和可视化默认应用功能（带有彩色进度条）
使用集合中的 OrderedDict 去除合并输出中的重复项并保留初始顺序
增加限制以查看更多合并选项（存储在具有相似度百分比的元组列表中）thefuzz.process.extract

“*”可以代替仅返回一个最匹配的项（不指定任何限制）。但是，请注意，多个结果可能具有相同的相似度百分比，并且您只会得到其中一个结果。thefuzz.process.extractOnethefuzz.process.extract

'**' 不知何故，在开始实际应用之前，速度更快需要一两分钟。如果需要处理小表，可以跳过此步骤，改用progress_apply

from thefuzz import process
from collections import OrderedDict
import swifter    


def match(x):
    matches = process.extract(x, df1, limit=6)
    matches = list(OrderedDict((x, True) for x in matches).keys())
    print(f'{x:20} : {matches}')

    return str(matches)


df1 = df['name'].values
df2['matches'] = df2['name'].swifter.apply(lambda x: match(x))

0赞 Mehdi Rostami 3/28/2023 #16

我发现这非常有效。功能描述中的详细信息：

from fuzzywuzzy.process import extract

def efficient_matching(df1,
                       col1,
                       df2,
                       col2,
                       limit=3,
                       length_diff=3,
                       first_letter_match=2
                       ):
    """
    For each name that we want to find matches for, it's more efficient to only look at a subset of potential matches. 
    One way to narrow down all the matches to potential matches is length. Here are 2 methods:
    1. If the name is "Markos", we don't need to check how similar markos is to names with length less than 4 or 
    more than 8. This window is determined by length_diff.
    2. We consider names from the corpus whose first 2 letters are similar the first letters of the name we want to find
    the match for.

    limit: Gives how many closest matches to return.
    """
    df1[col1] = df1[col1].astype(str)
    df2[col2] = df2[col2].astype(str)

    df1['_len_'] = df1[col1].apply(len)
    df2['_len_'] = df2[col2].apply(len)
    df2 = df2[df2['_len_'] >= 2]

    matches = df1[[col1, '_len_']].apply(lambda x: 
                                        extract(x[0], 
                                                df2[
                                                    ((df2['_len_'] - x[1]).abs() < length_diff) &
                                                    (df2[col2].str[:first_letter_match]==x[0][:first_letter_match])
                                                    ][col2].tolist(), 
                                                limit = limit
                                                ), 
                                        axis=1
                                        ) 
    return matches

上一个：将 Dictionary 的 String 表示形式转换为字典

下一个：如何使用 pandas 将温度和湿度数据存储在 excel 文件中，并在一段时间后更新该数据，然后将更新的数据存储在每行中？

是否可以使用 Python Pandas 进行模糊匹配合并？

is it possible to do fuzzy match merge with python pandas?

评论

评论

评论

评论

评论

评论

用`fuzzywuzzy`

安装：

评论

评论

对于一般方法：`fuzzy_merge`

评论

评论

用`thefuzz`

是否可以使用 Python Pandas 进行模糊匹配合并？

is it possible to do fuzzy match merge with python pandas?

评论

评论

评论

评论

评论

评论

用fuzzywuzzy

安装：

评论

评论

对于一般方法：fuzzy_merge

评论

评论

用thefuzz

用`fuzzywuzzy`

对于一般方法：`fuzzy_merge`

用`thefuzz`