Python: Compare every single row in a column to all other entries

Asked by: Tyler D Asked: 3/11/2020 Updated: 3/11/2020 Views: 1015

Q:

I have a pandas DataFrame in which one column contains strings like these:

import pandas as pd

d = pd.DataFrame({'text': ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
                           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
                           "where are you going",
                           "i'm going to the zoo to pet the animals",
                           "where are you going jane",
                           "where are you going asd"]})

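I want to remove rows whose sentence is similar to that of a previous row. In this case, "similar" means they share 75% of the same words.

Here is how I currently do it (using for loops):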

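def find_duplicates(df):
    # df is the text column (a Series); turn each sentence into a set of words
    df = df.str.split().apply(set)
    ls_duplicates = []
    for i in range(len(df)):
        doc_i = df.iloc[i]
        # compare row i with every later row j
        for j in range(i+1, len(df)):
            doc_j = df.iloc[j]
            # fraction of row j's words that also occur in row i
            score = len(doc_i.intersection(doc_j)) / len(doc_j)
            if score > 0.7:
                ls_duplicates.append(j)
    return ls_duplicates

d.iloc[find_duplicates(d['text'])]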

This gives the desired output:

                                                text
1  hello, this is a test. we want to remove entri...
4                           where are you going jane
5                            where are you going asd
5                            where are you going asd

Now, when my DataFrame is large (>10k rows), this runs very slowly. Is there a way to optimize the for loops?

python pandas

Comments

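1 upvote ak_slick 3/11/2020
Do the words need to be in the same order to be considered similar? For example, is "I am a cat" compared with the same words in a different order a 100% or a 0% match?
1 upvote Tyler D 3/11/2020
That would be a 100% match.
0 upvotes ak_slick 3/11/2020
OK, give me a few minutes. I think I have a good approach for you.
0 upvotes ak_slick 3/11/2020
How do you want to handle strings of different lengths? For example, "hi, how are you" and a shorter string made up of only some of those words. Comparing the first to the second, I would cover 100% of the words, but the two would not have the same number of words. How do you want to treat that?
0 upvotes ak_slick 3/11/2020
Oh, I just realized it only goes back to previous rows.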

A:

2 upvotes Diablo 3/11/2020 #1
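import pandas as pd

df = pd.DataFrame({'text': ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
                           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
                           "where are you going",
                           "i'm going to the zoo to pet the animals",
                           "where are you going jane",
                           "where are you going asd"]})


# Note: shift(-1) places the NEXT row's text alongside each row;
# shift(1) would place the previous row's text there instead.
df['prev_text'] = df.text.shift(-1)
# The row left without a neighbour after the shift gets a placeholder
df.fillna('NA', inplace=True)

def find_duplicates(x):
    # Fraction of the neighbouring row's unique words that also occur in this row
    text = set(x.text.split())
    prev_text = set(x.prev_text.split())

    return len(text.intersection(prev_text))/len(prev_text)

df['score'] = df.apply(find_duplicates, axis=1)

print(df)

# Keep only the rows whose score is below the threshold
print(df[df.score < 0.7].text)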

In my tests this was 65% faster.
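The code in this answer compares each row with a single neighbouring row only. If every earlier row has to be considered, as in the question, one possible approach is an inverted index from words to the rows that contain them, so that only pairs sharing at least one word are ever scored. The following is a rough sketch, not a benchmarked solution: the function name `find_similar_to_earlier` and its `threshold` parameter are made up for illustration, the score is the same one used in the question (shared words divided by the number of unique words in the later row), and each flagged index is reported once. It can still be slow when many rows share very common words.

```python
from collections import defaultdict

def find_similar_to_earlier(texts, threshold=0.7):
    """Return indices j whose word set overlaps some earlier row i with
    len(words_i & words_j) / len(words_j) > threshold (hypothetical helper)."""
    word_sets = [set(t.split()) for t in texts]
    index = defaultdict(list)  # word -> indices of earlier rows containing it
    duplicates = []
    for j, words in enumerate(word_sets):
        shared = defaultdict(int)  # earlier row i -> number of words shared with row j
        for w in words:
            for i in index[w]:
                shared[i] += 1
        # flag row j if any earlier row covers enough of its words
        if words and any(n / len(words) > threshold for n in shared.values()):
            duplicates.append(j)
        # only now make row j visible to the rows that follow it
        for w in words:
            index[w].append(j)
    return duplicates

texts = [
    "hello, this is a test. we want to remove entries, where the text is similar to other texts",
    "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
    "where are you going",
    "i'm going to the zoo to pet the animals",
    "where are you going jane",
    "where are you going asd",
]
result = find_similar_to_earlier(texts)
print(result)  # [1, 4, 5]
```

With the DataFrame from the question, something like `d.drop(d.index[find_similar_to_earlier(d['text'])])` should then drop the flagged rows.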

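Comments

0 upvotes ak_slick 3/11/2020
This answer is good.
0 upvotes Tyler D 3/12/2020
This is a good answer, but why shift by only 1? I think we would have to vary the shift from 1 up to the length of the array?
0 upvotes Diablo 3/12/2020
I think shifting by -1 will give you the previous row.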
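On the sign of the shift discussed in these comments, here is a small check of how `Series.shift` behaves. The three-element Series is a toy example that is not part of the thread. A positive period moves values down, so each row sees the previous row's value, while a negative period moves values up, so each row sees the next row's value.

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])

prev_row = s.shift(1)    # row i holds the value from row i-1; the first row is missing
next_row = s.shift(-1)   # row i holds the value from row i+1; the last row is missing

print(pd.DataFrame({"value": s, "shift(1)": prev_row, "shift(-1)": next_row}))
```

So to line each row up with the row before it, `shift(1)` appears to be the call to use.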