Grouping fuzzy matches within same column in a data frame

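Question:

I am trying to group similar company names based on fuzzy matching within a single column. However, the names are not grouped correctly, and the resulting dataset does not have the same number of rows as the original. Because of one-to-many matches, it has more rows than the original data.

Sample input file (the real file contains more records)

**Code**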

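import pandas as pd
from fuzzywuzzy import fuzz

# df is the DataFrame loaded from the input file
df.loc[:,'Account Name Copy'] = df['Account Name']

# Every ordered pair of names: n names produce n * n rows
compare = pd.MultiIndex.from_product([df['Account Name'],
                                      df['Account Name Copy']]).to_series()

def metrics(tup):
    # Score one (name, name copy) pair with two fuzzywuzzy metrics
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare.apply(metrics)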
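
The row growth described in the question follows from how `pd.MultiIndex.from_product` works: it builds every ordered pair of names, so n names yield n * n rows, including self-pairs and mirrored duplicates. A minimal sketch of the same counting with the standard library, using hypothetical company names that are not taken from the asker's file:

```python
from itertools import combinations, product

# Hypothetical names, for illustration only
names = ["Acme Corp", "ACME Corp.", "Globex Inc"]

# Cartesian product: every name paired with every name, itself included
all_pairs = list(product(names, names))

# Unordered pairs of distinct names: the comparisons that actually matter
distinct_pairs = list(combinations(names, 2))

print(len(names), len(all_pairs), len(distinct_pairs))
```

Scoring pairs is only an intermediate step. To get back to one row per original record, each name still has to be assigned to exactly one group.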

Current output

P.S. The final output should have the same number of rows as the original data, with similar company names grouped together.
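
One way to meet this requirement is to map every name to a single group label instead of producing one row per matching pair. The following is a minimal sketch under stated assumptions, not the thread's accepted method. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio` (the scores are similar in spirit but not guaranteed to be identical). The helper names `similarity` and `group_names`, the threshold of 80, and the sample names are all hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_names(names, threshold=80):
    """Greedily map each name to the first earlier group label it resembles."""
    labels = []    # one label per group, in order of first appearance
    mapping = {}   # name -> group label
    for name in names:
        if name in mapping:
            continue  # exact duplicate, already assigned
        for label in labels:
            if similarity(name, label) >= threshold:
                mapping[name] = label
                break
        else:
            # No existing group is close enough: the name starts a new group
            labels.append(name)
            mapping[name] = name
    return mapping

# Hypothetical names, for illustration only
names = ["Acme Corp", "ACME Corp.", "Globex Inc", "Globex, Inc.", "Initech", "Acme Corp"]
mapping = group_names(names)

# One label per input row, so the row count is unchanged
grouped = [mapping[name] for name in names]
```

With pandas, the same mapping could be attached as a new column, for example `df["Grouped"] = df["Account Name"].map(group_names(df["Account Name"].unique()))`, which adds a label to each row without adding rows. The greedy pass depends on input order and compares each new name against every existing label, so on a large file it may need blocking (for example by first letter) to stay fast.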

Desired output

I have looked at the following threads, but they did not give me the desired output:

https://stackoverflow.com/questions/54865890/fuzzy-match-strings-in-one-column-and-create-new-dataframe-using-fuzzywuzzy

https://stackoverflow.com/questions/71427827/fuzzy-matching-and-grouping

https://stackoverflow.com/questions/60987641/check-if-there-is-a-similar-string-in-the-same-column

https://stackoverflow.com/questions/62085777/fuzzy-match-within-the-same-column-python

Please help!

python-3.x python-3.7 fuzzywuzzy fuzzy-comparison

Answer:

import pandas as pd
from itertools import product
from fuzzywuzzy import fuzz

df = pd.read_excel("file.xlsx")

RATIO = 80 # <-- adjust the ratio here

# Pair every unique name with the first token of every name
tups = list(product(df["Account Name"].unique(),
                    df["Account Name"].str.split(r"[-\s]").str[0].unique()))

# Keep (group label, name) pairs whose partial ratio reaches the threshold
matches = [(pair[1].title(), pair[0]) for pair in tups
           if fuzz.partial_ratio(pair[1].lower(), pair[0].lower()) >= RATIO]

out = pd.DataFrame(index=pd.MultiIndex.from_tuples(set(matches),
                   names=["Grouped", "Account Name"])).sort_index()

Output:

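Thanks for your help, but this does not seem to produce the expected output: many rows end up grouped on just the first letter. For example, AB China Life Insurance, ABC China Fintech, ABC China Finance, and others are all grouped under "A". I am open to any other platform or tool (such as Alteryx, VBA, or Visual Studio) if it would help in this case.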
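
One plausible explanation for this behaviour, offered as an assumption because the full data is not shown: the answer's group labels are the first token of each name, and `fuzz.partial_ratio` scores the best-aligned substring, so a very short token that occurs verbatim inside a longer name gets a perfect score. The sketch below uses a plain substring test as a stand-in for that case. "A Plus Holdings" is a hypothetical name, and `MIN_TOKEN_LEN` is a hypothetical guard, not something from the thread.

```python
import re

def first_token(name: str) -> str:
    """First chunk of a name, split on whitespace or hyphen as in the answer."""
    return re.split(r"[-\s]", name)[0]

def contains(token: str, name: str) -> bool:
    """True when the token occurs verbatim inside the name, ignoring case."""
    return token.lower() in name.lower()

# "A Plus Holdings" is hypothetical; the other two names come from the comment
names = ["A Plus Holdings", "AB China Life Insurance", "ABC China Fintech"]

tokens = [first_token(name) for name in names]

# Every name contains the letter "a", so the label "A" would absorb all of them
absorbed_by_a = [name for name in names if contains("A", name)]

# Hypothetical guard: ignore labels too short to be meaningful
MIN_TOKEN_LEN = 3
usable_tokens = [token for token in tokens if len(token) >= MIN_TOKEN_LEN]
```

If this is indeed the cause, comparing full names with `fuzz.ratio` or `fuzz.token_sort_ratio`, rather than first tokens with `fuzz.partial_ratio`, may behave better on this kind of data.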

0 upvotes Timeless 4/20/2023

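Oops! But as you probably know, I wrote the code based on the sample you provided, and I could not find the names you mention in your comment in that sample. Anyway, good luck ;)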

0 upvotes ss_0708 4/21/2023

Sorry, but the dataset is huge. I have edited the question to include a few such records. Please take a look.
