Asked by: Luiscri Asked: 11/27/2019 Updated: 1/30/2020 Views: 2410
spaCy - Most efficient way to sort entities by label
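Q:
I am using a spaCy pipeline to extract all the entities from an article. I need to store these entities in separate variables depending on the label they were tagged with. I currently have the solution below, but I don't think it is the most suitable one, because it iterates over all the entities once per label: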
import spacy
nlp = spacy.load("es_core_news_md")
text = ""  # I upload my text here
doc = nlp(text)
personEntities = list(set([e.text for e in doc.ents if e.label_ == "PER"]))
locationEntities = list(set([e.text for e in doc.ents if e.label_ == "LOC"]))
organizationEntities = list(set([e.text for e in doc.ents if e.label_ == "ORG"]))
Is there a direct way in spaCy to get all the entities for each label, or do I need to write a loop like for ent in ents: if... elif... elif...?
A:
5 upvotes
Wiktor Stribiżew
11/27/2019
#1
I suggest using groupby from itertools:
from itertools import groupby
#...
entities = {key: list(g) for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_)}
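The sorted() call in the line above matters because groupby only merges items that sit next to each other. A minimal sketch of the difference, using hypothetical (text, label) tuples in place of doc.ents so that it runs without a spaCy model:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (text, label) pairs standing in for doc.ents, in document order.
ents = [
    ("Ford", "ORG"),
    ("Madrid", "LOC"),
    ("Renault", "ORG"),
]

label_of = itemgetter(1)

# Without sorting, ORG appears as two separate runs and the second run
# overwrites the first one in the dict.
unsorted_groups = {
    key: [text for text, _ in g] for key, g in groupby(ents, label_of)
}

# Sorting by the same key first puts equal labels next to each other.
sorted_groups = {
    key: [text for text, _ in g]
    for key, g in groupby(sorted(ents, key=label_of), label_of)
}

print(unsorted_groups)  # {'ORG': ['Renault'], 'LOC': ['Madrid']}
print(sorted_groups)    # {'LOC': ['Madrid'], 'ORG': ['Ford', 'Renault']}
```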
Or, if you only need to extract the unique values:
entities = {key: list(set(map(lambda x: str(x), g))) for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_)}
Then, you can use
print(entities['ORG'])
If you need the unique occurrences of the entity objects themselves, not just strings, you can use
import spacy
from itertools import groupby
nlp = spacy.load("en_core_web_sm")
s = "Hello, Mr. Wood! We are in New York. Mrs. Winston is not coming, John hasn't sent her any invite. They will meet in California next time. General Motors and Toyota are companies."
doc = nlp(s * 2)  # repeat the text so every entity occurs twice
entities = dict()
# groupby only groups adjacent items, so sort by label first
for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_):
    seen = set()
    l = []
    for ent in g:
        # keep only the first span for each distinct entity text
        if ent.text not in seen:
            seen.add(ent.text)
            l.append(ent)
    entities[key] = l
The output of print(entities['GPE'][0].text) here is
New York
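As an alternative that avoids both the per-label passes from the question and the sort, the grouping can also be done in a single loop. This is only a sketch, not something taken from the answer: Ent is a hypothetical stand-in for a spaCy entity that exposes just .text and .label_, and group_unique_texts is an illustrative helper name.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a spaCy entity span: only .text and .label_ are used.
Ent = namedtuple("Ent", ["text", "label_"])


def group_unique_texts(ents):
    """Map each label to its unique entity texts, in first-seen order."""
    # label -> {text: None}; dict keys act as an insertion-ordered set
    grouped = defaultdict(dict)
    for ent in ents:
        grouped[ent.label_].setdefault(ent.text, None)
    return {label: list(texts) for label, texts in grouped.items()}


sample = [
    Ent("IDAE", "ORG"),
    Ent("Madrid", "LOC"),
    Ent("Ford", "ORG"),
    Ent("IDAE", "ORG"),
]
by_label = group_unique_texts(sample)
print(by_label)  # {'ORG': ['IDAE', 'Ford'], 'LOC': ['Madrid']}
```

Since the helper reads only .text and .label_, passing doc.ents from a real pipeline should behave the same way.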
Comments
0 upvotes
Luiscri
11/27/2019
How can I get the entities['ORG'] list without duplicate elements? I tried doing the same thing I did before, but it doesn't seem to work:
list(set(entities['ORG']))
0 upvotes
Wiktor Stribiżew
11/27/2019
@Luiscri It should work, can you show the output of entities['ORG']?
0 upvotes
Luiscri
11/27/2019
print(entities['ORG']) returns the following: [Asociación Española de Fabricantes de Automóviles y Camiones, PSA, Ford, Renault, Unión Europea, UE, IDAE, Ejecutivo, IDAE, Instituto para la Diversificación, IDAE, Administración]
1 upvote
Wiktor Stribiżew
11/27/2019
Not sure, but please also try list(set(map(lambda x: str(x), entities['ORG']))). That said, the code was tested on a short sentence and it worked.
1 upvote
Wiktor Stribiżew
11/27/2019
@Luiscri Yes, list(g) => list(set(map(lambda x: str(x), g)))
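A likely reason list(set(entities['ORG'])) still shows IDAE three times is that set() removes duplicates by object equality, and, as far as I know, two spaCy spans at different positions in a document are not equal even when their text matches. The sketch below illustrates that under a stated assumption: FakeSpan is a hypothetical class whose equality depends on position, not a spaCy type.

```python
class FakeSpan:
    """Hypothetical stand-in for a span whose equality depends on position."""

    def __init__(self, text, start):
        self.text = text
        self.start = start

    def __eq__(self, other):
        return (self.text, self.start) == (other.text, other.start)

    def __hash__(self):
        return hash((self.text, self.start))

    def __repr__(self):
        return self.text


org = [FakeSpan("IDAE", 10), FakeSpan("Ejecutivo", 25), FakeSpan("IDAE", 40)]

# set() keeps both IDAE spans because they sit at different positions.
distinct_objects = set(org)

# Deduplicate on the text instead, keeping the first span seen for each text.
first_by_text = {}
for span in org:
    first_by_text.setdefault(span.text, span)
unique_spans = list(first_by_text.values())

print(len(distinct_objects))  # 3
print(unique_spans)           # [IDAE, Ejecutivo]
```

Converting to str before calling set(), as suggested in the comments, sidesteps the issue in the same way but returns strings instead of span objects.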