Python: Removing similar strings in column

Question:

I have a DataFrame in which one column consists of strings:

import pandas as pd

d = pd.DataFrame({'text': ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
                           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
                           "where are you going",
                           "i'm going to the zoo to pet the animals",
                           "where are you going jane"]})

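Problem: some of these strings can be very similar, differing by only one or two words. I want to remove all "duplicates", i.e. drop every text that is similar to another one. In the example above, rows 1 and 2 are nearly identical, so I only want to keep the first. Likewise, rows 3 and 5 are similar, and I only want to keep row 3. The actual DataFrame has about 100k rows.

My attempt: I thought a good starting point would be to convert the strings to sets, so that they can be compared easily and efficiently: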

d["text"].str.split().apply(set)

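Next, I wrote a function that compares each row with all the other rows and flags it for removal if it is at least 90% similar to another row. Here is my approach: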

def find_duplicates(df):
    df = df.str.split().apply(set)
    ls_duplicates = []
    for i in range(len(df)):
        doc_i = df.iloc[i]
        for j in range(i+1, len(df)):
            doc_j = df.iloc[j]
            score = len(doc_i.intersection(doc_j)) / len(doc_i)
            if score > 0.9:
                ls_duplicates.append(i)
    return ls_duplicates

find_duplicates(d['text'])

This works for my purposes, but it runs very slowly. Is there a way to optimize it?

Tags: python, pandas

import difflib

phrases =  ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
      "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
      "where are you going",
      "i'm going to the zoo to pet the animals",
      "where are you going jane"]

difflib.get_close_matches('where are you going', phrases)

The results are sorted by similarity score:

['where are you going', 'where are you going jane']

The get_close_matches function performs fuzzy string matching.

You can also apply the function to the DataFrame:
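To make the matching behaviour more concrete, here is a minimal sketch of the two optional parameters of `difflib.get_close_matches` (`n`, the maximum number of matches returned, and `cutoff`, the minimum similarity ratio) together with the underlying `difflib.SequenceMatcher.ratio()` score. The shortened `phrases` list and the chosen parameter values are only illustrative.

```python
import difflib

phrases = [
    "where are you going",
    "i'm going to the zoo to pet the animals",
    "where are you going jane",
]

# n caps how many matches are returned; cutoff is the minimum ratio (0..1)
# a candidate must reach to be considered a match.
matches = difflib.get_close_matches("where are you going", phrases, n=1, cutoff=0.9)

# ratio() is 2 * M / T, where M is the number of matching characters and
# T is the combined length of both strings (here 2 * 19 / (19 + 24)).
score = difflib.SequenceMatcher(
    None, "where are you going", "where are you going jane"
).ratio()

print(matches)
print(round(score, 3))
```

With `cutoff=0.9` the "jane" variant no longer counts as a match, since its ratio is roughly 0.88.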

d['text_similar'] = d.text.apply(lambda row: difflib.get_close_matches(row, list(d[d.text!=row].text), cutoff = 0.8))

Output:

                                                text                                       text_similar
0  hello, this is a test. we want to remove entri...  [hello, this is a test. we want to remove entr...
1  hello, this is a test. we want to remove entri...  [hello, this is a test. we want to remove entr...
2                                where are you going                         [where are you going jane]
3            i'm going to the zoo to pet the animals                                                 []
4                           where are you going jane                              [where are you going]

In the example above, there is no sufficiently similar string for "i'm going to the zoo to pet the animals" when cutoff = 0.8.
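The matching step does not yet remove anything. One possible way to turn it into the "keep only the first occurrence" behaviour the question asks for is sketched below. The helper name `drop_similar` is hypothetical (it is not part of the original answer), and the approach is still quadratic in the worst case, so it is unlikely to be fast enough for 100k rows without further work.

```python
import difflib

def drop_similar(texts, cutoff=0.8):
    """Return the indices of texts to keep, dropping later near-duplicates.

    Hypothetical helper: a text is kept only if its SequenceMatcher ratio
    against every previously kept text stays below `cutoff`.
    """
    kept = []
    for i, text in enumerate(texts):
        is_dup = any(
            difflib.SequenceMatcher(None, texts[j], text).ratio() >= cutoff
            for j in kept
        )
        if not is_dup:
            kept.append(i)
    return kept

phrases = [
    "hello, this is a test. we want to remove entries, where the text is similar to other texts",
    "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
    "where are you going",
    "i'm going to the zoo to pet the animals",
    "where are you going jane",
]

keep = drop_similar(phrases)
print(keep)
```

With the DataFrame from the question, the returned positions could then be used as `d.iloc[drop_similar(d['text'].tolist())]`.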

0 votes Naga kiran 3/10/2020 #2

You can use difflib.SequenceMatcher and filter the text rows by a similarity threshold (thr) computed against the other rows.

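import difflib

df = d.copy()  # d is the DataFrame from the question
# Threshold filter based on percentage similarity
thr = 0.85
df['Flag'] = 0
for text in df['text'].tolist():
    # Similarity of every row to the current text
    df['temp'] = [difflib.SequenceMatcher(None, text1, text).ratio() for text1 in df['text'].tolist()]
    # Rows above the threshold share a new group number
    df.loc[df['temp'].gt(thr), ['Flag']] = df['Flag'].max() + 1
df = df.drop(columns='temp')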

df.loc[~df['Flag'].duplicated(keep='first')]

Out:

    text                                                 Flag   
0   hello, this is a test. we want to remove entri...   2   
2   where are you going                                 5   
3   i'm going to the zoo to pet the animals             4

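In practice, this problem is better handled with a clustering model, filtering the texts by their distance to the cluster centers.

If you are concerned about reducing time complexity, you will have to make the problem more complex by applying clustering on one-hot encoded vectors of the text.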
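The answer does not show what such an approach would look like. As a rough, simplified stand-in (not an actual clustering model), the sketch below treats each text as a set of words, which corresponds to a binary bag-of-words vector, and uses an inverted index so that a text is only compared with previously kept texts that share at least one word. The function name `dedupe_by_token_sets`, the Jaccard measure, and the 0.75 threshold are all assumptions made for illustration.

```python
from collections import defaultdict

def dedupe_by_token_sets(texts, threshold=0.75):
    """Greedy grouping on word sets; returns indices of texts to keep.

    Hypothetical helper: a text is dropped when its Jaccard similarity
    (shared words / all words) to an already kept text reaches `threshold`.
    """
    kept = []                   # indices of kept representatives
    kept_sets = {}              # index -> word set of that representative
    index = defaultdict(set)    # word -> indices of kept texts containing it
    for i, text in enumerate(texts):
        words = set(text.split())
        # Only kept texts sharing at least one word can have similarity > 0.
        candidates = set()
        for w in words:
            candidates |= index.get(w, set())
        is_dup = any(
            len(words & kept_sets[j]) / len(words | kept_sets[j]) >= threshold
            for j in candidates
        )
        if not is_dup:
            kept.append(i)
            kept_sets[i] = words
            for w in words:
                index[w].add(i)
    return kept

phrases = [
    "hello, this is a test. we want to remove entries, where the text is similar to other texts",
    "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
    "where are you going",
    "i'm going to the zoo to pet the animals",
    "where are you going jane",
]

keep_idx = dedupe_by_token_sets(phrases)
print(keep_idx)
```

The index reduces the number of comparisons only when texts share few words; if very common words appear in most texts, the candidate sets grow and the cost approaches the quadratic case again, so filtering out such words first may be worth considering.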
