Python: Compare every single row in a column to all other entries

Question:

I have a pandas DataFrame in which one column contains strings like these:

import pandas as pd

d = pd.DataFrame({'text': ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
                           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
                           "where are you going",
                           "i'm going to the zoo to pet the animals",
                           "where are you going jane",
                           "where are you going asd"]})

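I want to remove rows whose sentence is similar to that of an earlier row. Here, "similar" means that the two sentences share 75% of their words.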
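As a small illustration of this definition, here is the score for two of the sentences above. It assumes, as in the loop further down, that the overlap is measured relative to the word set of the later sentence; the variable names are only for illustration.

```python
# Word sets for an earlier and a later sentence from the example data
earlier = set("where are you going".split())
later = set("where are you going jane".split())

# Fraction of the later sentence's words that also occur in the earlier one
score = len(earlier & later) / len(later)
print(score)
```

Four of the five words in the later sentence also appear in the earlier one, so this pair counts as similar under either a 0.7 or a 0.75 cutoff.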

Here is my current approach (using for loops):

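def find_duplicates(df):
    # df is the 'text' column (a Series); turn each sentence into a set of words
    df = df.str.split().apply(set)
    ls_duplicates = []
    for i in range(len(df)):
        doc_i = df.iloc[i]
        # compare row i against every later row j
        for j in range(i+1, len(df)):
            doc_j = df.iloc[j]
            # fraction of row j's words that also appear in row i
            score = len(doc_i.intersection(doc_j)) / len(doc_j)
            # note: the cutoff here is 0.7, not the 75% stated above
            if score > 0.7:
                ls_duplicates.append(j)
    return ls_duplicates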

d.iloc[find_duplicates(d['text'])]

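This gives the desired output, namely the rows flagged as duplicates (row 5 appears twice because it matches both row 2 and row 4):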

                                                text
1  hello, this is a test. we want to remove entri...
4                           where are you going jane
5                            where are you going asd
5                            where are you going asd

However, when my DataFrame is large (>10k rows), this runs very slowly. Is there a way to optimize the for loops?

Tags: python, pandas

Answer:

import pandas as pd

df = pd.DataFrame({'text': ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
                           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
                           "where are you going",
                           "i'm going to the zoo to pet the animals",
                           "where are you going jane",
                           "where are you going asd"]})


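# shift(1) pairs each row with the row before it; the first row gets NaN
df['prev_text'] = df.text.shift(1)
# replace that NaN with a placeholder so .split() works
df.fillna('NA', inplace=True)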

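def find_duplicates(x):
    # x is one row; compare its words with those in the prev_text column
    text = set(x.text.split())
    prev_text = set(x.prev_text.split())

    # fraction of prev_text's words that also occur in text
    return len(text.intersection(prev_text))/len(prev_text)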

df['score'] = df.apply(find_duplicates, axis=1)

print(df)

# keep only the rows whose score is below the 0.7 cutoff
print(df[df.score < 0.7].text)

When I tested it, it was 65% faster.
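Note that the shift-based code above compares each row only with its immediate neighbour, whereas the loop in the question compares every pair of rows. One possible way to keep the all-pairs semantics while skipping pairs that share no words is an inverted index (word to row numbers). The following is only a sketch under that assumption: the function name `find_duplicates_indexed` and its `threshold` parameter are made up for illustration, and it uses plain Python rather than any pandas feature.

```python
from collections import defaultdict

def find_duplicates_indexed(texts, threshold=0.7):
    """Return the index j once for every earlier row i whose words cover
    more than `threshold` of row j's word set (same score as in the question)."""
    docs = [set(t.split()) for t in texts]
    index = defaultdict(list)  # word -> indices of earlier rows containing it
    duplicates = []
    for j, doc_j in enumerate(docs):
        shared = defaultdict(int)  # earlier row i -> number of words shared with row j
        for word in doc_j:
            for i in index[word]:
                shared[i] += 1
        # only rows sharing at least one word with row j are ever scored
        for i in sorted(shared):
            if shared[i] / len(doc_j) > threshold:
                duplicates.append(j)
        # make row j visible to the rows that follow it
        for word in doc_j:
            index[word].append(j)
    return duplicates

texts = [
    "hello, this is a test. we want to remove entries, where the text is similar to other texts",
    "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
    "where are you going",
    "i'm going to the zoo to pet the animals",
    "where are you going jane",
    "where are you going asd",
]
dups = find_duplicates_indexed(texts)
print(dups)  # [1, 4, 5, 5]
```

With a DataFrame, the column can be passed directly, e.g. `find_duplicates_indexed(d['text'])`, and the flagged rows could then be dropped with something like `d.drop(d.index[sorted(set(dups))])`. The result is ordered by j rather than by i, so for other data the order may differ from the original loop. How much this helps depends on the data: if most rows share a few very common words, nearly every pair is still visited, so the worst case remains quadratic.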
