Python: Removing similar strings in column

Asked by: Tyler D · Asked: 3/10/2020 · Last edited by: Tyler D · Updated: 11/6/2021 · Views: 1886

Q:

I have a DataFrame in which one column consists of strings:

import pandas as pd

d = pd.DataFrame({'text': ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
                           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
                           "where are you going",
                           "i'm going to the zoo to pet the animals",
                           "where are you going jane"]})

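Problem: some of these strings can be very similar, differing in only one or two words. I want to remove all such "duplicates", i.e. drop every text that is similar to another one. In the example above, rows 1 and 2 are essentially the same, so I only want to keep the first. Likewise, rows 3 and 5 are similar, and I only want to keep row 3. The actual DataFrame has about 100k rows.

My attempt: I figured a good starting point would be to convert the strings to sets, so that they can be compared easily and efficiently: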

d["text"].str.split().apply(set)

Next, I wrote a function that compares each row with all the other rows and removes it if it is at least 90% similar to another row. This is what I did:

def find_duplicates(df):
    df = df.str.split().apply(set)
    ls_duplicates = []
    for i in range(len(df)):
        doc_i = df.iloc[i]
        for j in range(i+1, len(df)):
            doc_j = df.iloc[j]
            score = len(doc_i.intersection(doc_j)) / len(doc_i)
            if score > 0.9:
                ls_duplicates.append(i)
    return ls_duplicates

find_duplicates(d['text'])

This works for my purposes, but it runs very slowly. Is there a way to optimize it?

Tags: python, pandas

A:

2 votes · ipj · 3/10/2020 · #1

Comparing texts is a broad topic, but to pick the best matches from a list of strings you can try:

import difflib

phrases =  ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
      "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
      "where are you going",
      "i'm going to the zoo to pet the animals",
      "where are you going jane"]

difflib.get_close_matches('where are you going', phrases)

The results are sorted by similarity score:

['where are you going', 'where are you going jane']

The get_close_matches function performs fuzzy string matching.
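As a small illustration of how get_close_matches can be tuned, the sketch below uses its two optional parameters: n (the maximum number of matches returned, default 3) and cutoff (the minimum similarity ratio between 0 and 1, default 0.6). The shortened phrase list and the variable names best and strict are only for this example.

```python
import difflib

phrases = ["where are you going",
           "i'm going to the zoo to pet the animals",
           "where are you going jane"]

# n=1 returns only the single best-scoring candidate.
best = difflib.get_close_matches("where are you going", phrases, n=1, cutoff=0.6)

# A stricter cutoff filters out "where are you going jane",
# whose similarity ratio to the query is about 0.88.
strict = difflib.get_close_matches("where are you going", phrases, n=3, cutoff=0.95)

print(best)    # ['where are you going']
print(strict)  # ['where are you going']
```

Note that the query itself is in the list here, so it always comes back as an exact match; that is why the DataFrame version excludes the current row from the candidates.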

You can also apply the function to the DataFrame:

d['text_similar'] = d.text.apply(lambda row: difflib.get_close_matches(row, list(d[d.text!=row].text), cutoff = 0.8))

Output:

                                                text                                       text_similar
0  hello, this is a test. we want to remove entri...  [hello, this is a test. we want to remove entr...
1  hello, this is a test. we want to remove entri...  [hello, this is a test. we want to remove entr...
2                                where are you going                         [where are you going jane]
3            i'm going to the zoo to pet the animals                                                 []
4                           where are you going jane                              [where are you going]

In the example above, no sufficiently similar string is found for "i'm going to the zoo to pet the animals" when cutoff = 0.8.
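The column above only lists the similar texts; it does not yet drop anything. One possible way to go from there to "keep the first of each similar group" is a greedy pass that compares each text only against the texts already kept. This is a rough sketch, not part of the original answer: the helper name drop_similar is made up for illustration, and the pass is still quadratic in the worst case (when almost nothing is similar).

```python
import difflib

def drop_similar(texts, cutoff=0.8):
    """Return the indices of the texts to keep (first of each similar group)."""
    kept_idx, kept_texts = [], []
    for i, text in enumerate(texts):
        # Keep the text only if nothing already kept is close enough to it.
        if not difflib.get_close_matches(text, kept_texts, n=1, cutoff=cutoff):
            kept_idx.append(i)
            kept_texts.append(text)
    return kept_idx

phrases = ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
           "where are you going",
           "i'm going to the zoo to pet the animals",
           "where are you going jane"]

keep = drop_similar(phrases)
print(keep)  # [0, 2, 3]
```

With the question's DataFrame, the same indices could be used as d.iloc[drop_similar(d['text'].tolist())].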

0 votes · Naga kiran · 3/10/2020 · #2

You can use difflib.SequenceMatcher and filter the text rows by their similarity ratio to the other rows, using a threshold (thr):

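import difflib

df = d.copy()  # assumes d is the DataFrame from the question
# Threshold filter based on percentage similarity
thr = 0.85
df['Flag'] = 0
for text in df['text'].tolist():
    # Similarity of every row to the current text
    df['temp'] = [difflib.SequenceMatcher(None, text1,text).ratio() for text1 in df['text'].tolist()]
    # Rows similar to the current text get a new, shared group id
    df.loc[df['temp'].gt(thr),['Flag']] = df['Flag'].max()+1
# drop() returns a new DataFrame, so assign the result
df = df.drop(columns='temp')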

df.loc[~df['Flag'].duplicated(keep='first')]

Out:

    text                                                 Flag   
0   hello, this is a test. we want to remove entri...   2   
2   where are you going                                 5   
3   i'm going to the zoo to pet the animals             4   
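To see why rows 2 and 4 end up in the same group at thr = 0.85, it can help to compute the ratio for that single pair. SequenceMatcher.ratio() returns 2*M/T, where M is the number of matching characters and T is the combined length of both strings. The sketch below also calls quick_ratio() and real_quick_ratio(), which difflib documents as cheaper upper bounds on ratio(); using them as a pre-filter before the full comparison is a suggestion here, not something the answer does.

```python
import difflib

a = "where are you going"        # 19 characters
b = "where are you going jane"   # 24 characters

matcher = difflib.SequenceMatcher(None, a, b)

# All 19 characters of `a` match the start of `b`, so ratio = 2*19 / (19+24).
ratio = matcher.ratio()
print(round(ratio, 4))  # 0.8837

# Cheap upper bounds: if one of these is already below the threshold,
# the more expensive ratio() call can be skipped for that pair.
upper_fast = matcher.real_quick_ratio()
upper = matcher.quick_ratio()
```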

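Ideally, this problem should be handled with a clustering model, filtering the texts by their distance to the cluster centers.

If you are concerned about time complexity, you will need a more involved approach: apply clustering to one-hot encoded vectors of the text.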
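The answer does not name a specific clustering model, so the following is only a minimal stdlib sketch of the idea under some loud assumptions: each text becomes a binary word-presence vector (stored sparsely as a set of vocabulary indices), and a simple single-pass "leader" clustering with Jaccard similarity stands in for a real clustering library. The function names one_hot and leader_clusters and the 0.75 threshold are hypothetical choices for illustration.

```python
def one_hot(texts):
    """Encode each text as the set of vocabulary indices it contains
    (a sparse form of a binary one-hot word vector)."""
    vocab = {}
    vectors = []
    for text in texts:
        vectors.append({vocab.setdefault(w, len(vocab)) for w in text.split()})
    return vectors

def leader_clusters(vectors, threshold=0.75):
    """Single-pass leader clustering: a vector joins the first cluster whose
    leader has Jaccard similarity >= threshold, otherwise it starts a new one."""
    leaders, labels = [], []
    for i, v in enumerate(vectors):
        for c, lead in enumerate(leaders):
            u = vectors[lead]
            union = len(u | v)
            if union and len(u & v) / union >= threshold:
                labels.append(c)
                break
        else:
            # No existing leader is close enough: this text leads a new cluster.
            leaders.append(i)
            labels.append(len(leaders) - 1)
    return leaders, labels

phrases = ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
           "where are you going",
           "i'm going to the zoo to pet the animals",
           "where are you going jane"]

leaders, labels = leader_clusters(one_hot(phrases))
print(leaders)  # [0, 2, 3]
print(labels)   # [0, 0, 1, 2, 1]
```

Each text is compared only against the cluster leaders, so the cost is roughly (number of texts) x (number of clusters) rather than all pairs. The leaders are the first member of each group, which matches the "keep the first one" requirement from the question.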