Asked by: Tyler D  Asked: 3/11/2020  Updated: 3/11/2020  Views: 1015
Python: Compare every single row in a column to all other entries
Q:
I have a pandas DataFrame in which one column contains strings like these:
import pandas as pd

d = pd.DataFrame({'text': ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
                           "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
                           "where are you going",
                           "i'm going to the zoo to pet the animals",
                           "where are you going jane",
                           "where are you going asd"]})
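I want to remove rows whose sentence is similar to a previous row. Here, "similar" means that they share 75% of the same words.
This is how I currently do it (using for loops):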
def find_duplicates(df):
    df = df.str.split().apply(set)
    ls_duplicates = []
    for i in range(len(df)):
        doc_i = df.iloc[i]
        for j in range(i+1, len(df)):
            doc_j = df.iloc[j]
            score = len(doc_i.intersection(doc_j)) / len(doc_j)
            if score > 0.7:
                ls_duplicates.append(j)
    return ls_duplicates

d.iloc[find_duplicates(d['text'])]
This gives the desired output:
text
1 hello, this is a test. we want to remove entri...
4 where are you going jane
5 where are you going asd
5 where are you going asd
However, when my DataFrame is large (>10k rows), this runs very slowly. Is there a way to optimize the for loop?
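For reference, index 5 shows up twice in the output above because row 5 passes the threshold against both row 2 and row 4. A minimal pure-Python sketch of the same score, followed by de-duplicating the flagged positions before removing them. The names `overlap`, `flagged`, `to_drop` and `kept` are only illustrative and are not part of the original code.

```python
texts = [
    "hello, this is a test. we want to remove entries, where the text is similar to other texts",
    "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
    "where are you going",
    "i'm going to the zoo to pet the animals",
    "where are you going jane",
    "where are you going asd",
]
word_sets = [set(t.split()) for t in texts]


def overlap(i, j):
    """Share of row j's distinct words that also occur in row i."""
    return len(word_sets[i] & word_sets[j]) / len(word_sets[j])


print(overlap(2, 5))  # 4 shared words out of 5 -> 0.8
print(overlap(4, 5))  # also 0.8, so row 5 gets flagged a second time

flagged = [1, 4, 5, 5]          # positions returned by find_duplicates for this data
to_drop = sorted(set(flagged))  # unique positions: [1, 4, 5]
kept = [t for pos, t in enumerate(texts) if pos not in to_drop]
print(kept)
```

With the DataFrame from the question, something like `d.drop(d.index[to_drop])` should then remove those rows instead of selecting them.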
A:
2 upvotes
Diablo
3/11/2020
#1
import pandas as pd

df = pd.DataFrame({'text': ["hello, this is a test. we want to remove entries, where the text is similar to other texts",
                            "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
                            "where are you going",
                            "i'm going to the zoo to pet the animals",
                            "where are you going jane",
                            "where are you going asd"]})

# Note: shift(-1) places the text of the row *below* next to each row.
df['prev_text'] = df.text.shift(-1)
df.fillna('NA', inplace=True)  # the last row has no neighbour, so it gets 'NA'

def find_duplicates(x):
    text = set(x.text.split())
    prev_text = set(x.prev_text.split())
    return len(text.intersection(prev_text))/len(prev_text)

df['score'] = df.apply(find_duplicates, axis=1)
print(df)
# Rows whose score stays below the threshold
print(df[df.score < 0.7].text)
When I tested it, it was 65% faster.
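Note that the code in this answer scores each row against a single neighbouring row, whereas the loop in the question compares every pair of rows. One possible way to keep the all-pairs behaviour while skipping pairs that share no words at all is an inverted index (word to row positions). The following is a minimal pure-Python sketch under that assumption, not a benchmarked solution. `find_similar_later_rows` and its parameters are hypothetical names, and the score is the same one used in the question.

```python
from collections import Counter, defaultdict


def find_similar_later_rows(texts, threshold=0.7):
    """Return each position j once per earlier row i whose words cover
    more than `threshold` of row j's distinct words."""
    word_sets = [set(t.split()) for t in texts]

    # Inverted index: word -> positions of the rows containing it (ascending).
    index = defaultdict(list)
    for row, words in enumerate(word_sets):
        for w in words:
            index[w].append(row)

    flagged = []
    for j, words_j in enumerate(word_sets):
        shared = Counter()  # earlier row i -> number of words shared with row j
        for w in words_j:
            for i in index[w]:
                if i >= j:
                    break  # only earlier rows count; the list is ascending
                shared[i] += 1
        for i in sorted(shared):
            if shared[i] / len(words_j) > threshold:
                flagged.append(j)
    return flagged


texts = [
    "hello, this is a test. we want to remove entries, where the text is similar to other texts",
    "hello, this is a test. we want to remove entries, where the text is similar to other texts because",
    "where are you going",
    "i'm going to the zoo to pet the animals",
    "where are you going jane",
    "where are you going asd",
]
print(find_similar_later_rows(texts))  # [1, 4, 5, 5]
```

With a pandas column, the call would be `find_similar_later_rows(d['text'].tolist())`. Very common words still produce long index lists, so the worst case remains quadratic. Any gain depends on how much vocabulary the rows actually share.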
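Comments
0 upvotes
ak_slick
3/11/2020
This is a good answer.
0 upvotes
Tyler D
3/12/2020
This is a nice answer, but why shift by only 1? I think we would have to vary the integer from 1 up to the length of the array?
0 upvotes
Diablo
3/12/2020
I think shifting by -1 gives you the previous row.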
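For reference, a small sketch of how `Series.shift` behaves in pandas. A positive period moves values down, so each row sees the value of the previous row. A negative period moves values up, so each row sees the value of the next row. The series `s` and the column names below are made up for illustration.

```python
import pandas as pd

s = pd.Series(["row0", "row1", "row2"])

demo = pd.DataFrame({
    "text": s,
    "shift_plus_1": s.shift(1),    # value from the previous row; first row is NaN
    "shift_minus_1": s.shift(-1),  # value from the next row; last row is NaN
})
print(demo)
```

A single shift only ever lines up rows that are a fixed distance apart, so comparing each row with all earlier rows would mean repeating the comparison for every shift from 1 up to `len(s) - 1`.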