spaCy - most efficient way to sort entities by label

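Question:

I am using a spaCy pipeline to extract all the entities from an article. I need to save these entities in variables depending on the label they were tagged with. For now I have this solution, but I don't think it is the most suitable one, since I need to iterate over all the entities once for each label: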

import spacy

nlp = spacy.load("es_core_news_md")
text = "..."  # placeholder: I upload my text here
doc = nlp(text)

personEntities = list(set([e.text for e in doc.ents if e.label_ == "PER"]))
locationEntities = list(set([e.text for e in doc.ents if e.label_ == "LOC"]))
organizationEntities = list(set([e.text for e in doc.ents if e.label_ == "ORG"]))

Is there a direct method in spaCy to get all the entities for each label, or would I need to write a for ent in ents: if... elif... elif... to achieve that?

python spacy named-entity-recognition

Comments

0 upvotes Wiktor Stribiżew 11/27/2019

Use itertools groupby for the entities.

0 upvotes Dani Mesejo 11/27/2019

Use a dictionary to group the entities, where the keys are the entity types
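
The dictionary idea from this comment can be sketched as a single pass over the entities with collections.defaultdict. The Ent tuple, the group_by_label helper, and the sample values are hypothetical stand-ins so the snippet runs without a spaCy model; with a real pipeline you would presumably pass doc.ents instead.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a spaCy entity: only .text and .label_ are used.
Ent = namedtuple("Ent", ["text", "label_"])


def group_by_label(ents):
    """Group entity texts by label in one pass, skipping repeated texts."""
    groups = defaultdict(list)  # label -> texts in first-seen order
    seen = defaultdict(set)     # label -> texts already collected
    for ent in ents:
        if ent.text not in seen[ent.label_]:
            seen[ent.label_].add(ent.text)
            groups[ent.label_].append(ent.text)
    return dict(groups)


# Made-up sample data; with spaCy this would be doc.ents.
sample = [
    Ent("Madrid", "LOC"),
    Ent("Ford", "ORG"),
    Ent("Madrid", "LOC"),
    Ent("Renault", "ORG"),
]
grouped = group_by_label(sample)
print(grouped.get("LOC", []))
print(grouped.get("ORG", []))
```

Using .get with a default avoids a KeyError for labels that never occur in a given article, such as "PER" in this sample.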

0 upvotes AMC 11/27/2019

@WiktorStribiżew Could you elaborate on how you would use groupby for this?

0 upvotes Wiktor Stribiżew 11/27/2019

Add from itertools import *. Then use entities = {key: list(g) for key, g in groupby(doc.ents, lambda x: x.label_)}. Later, you will be able to use it like print(entities['DATE']), etc.

1 upvote Wiktor Stribiżew 11/27/2019

@DanielMesejo No problem, we can sort them first: entities = {key: list(g) for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_)}
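
Why the sort matters: itertools.groupby only merges consecutive items with the same key, so unsorted input yields several runs per label and a dict comprehension keeps only the last run. A small sketch with made-up (text, label) pairs standing in for spaCy entities (label_of is a hypothetical helper):

```python
from itertools import groupby

# Made-up (text, label) pairs standing in for spaCy entities.
pairs = [("Ford", "ORG"), ("Madrid", "LOC"), ("Renault", "ORG")]


def label_of(pair):
    """Return the label part of a (text, label) pair."""
    return pair[1]


# Without sorting there are two separate "ORG" runs; the later run
# overwrites the earlier one when the dict is built.
unsorted_groups = {
    key: [text for text, _ in group] for key, group in groupby(pairs, label_of)
}

# Sorting by the same key first makes each label one contiguous run.
sorted_groups = {
    key: [text for text, _ in group]
    for key, group in groupby(sorted(pairs, key=label_of), label_of)
}

print(unsorted_groups)
print(sorted_groups)
```

In the unsorted version "Ford" is silently lost, which is the kind of problem the sorted variant avoids.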

Answer:

5 upvotes Wiktor Stribiżew 11/27/2019 #1

I suggest using the groupby method from itertools:

from itertools import groupby
#...
entities = {key: list(g) for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_)}

Or, if you only need to extract unique values:

entities = {key: list(set(map(lambda x: str(x), g))) for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_)}

Then you can print the entities for a given label with

print(entities['ORG'])

If you need to get unique occurrences of the entity objects, not just strings, you can use

import spacy
from itertools import groupby

nlp = spacy.load("en_core_web_sm")
s = "Hello, Mr. Wood! We are in New York. Mrs. Winston is not coming, John hasn't sent her any invite. They will meet in California next time. General Motors and Toyota are companies."
doc = nlp(s * 2)

entities = dict()
for key, g in groupby(sorted(doc.ents, key=lambda x: x.label_), lambda x: x.label_):
    seen = set()  # entity texts already collected for this label
    l = []        # first entity object seen for each distinct text
    for ent in g:
        if ent.text not in seen:
            seen.add(ent.text)
            l.append(ent)
    entities[key] = l

The output of print(entities['GPE'][0].text) here is New York.

Comments

0 upvotes Luiscri 11/27/2019

How can I get the entities['ORG'] list output without duplicate elements? I tried doing the same thing I did before, list(set(entities['ORG'])), but it doesn't seem to work

0 upvotes Wiktor Stribiżew 11/27/2019

@Luiscri It should work, can you show the output of entities['ORG']?

0 upvotes Luiscri 11/27/2019

print(entities['ORG']) returns the following:

[Asociación Española de Fabricantes de Automóviles y Camiones, PSA, Ford, Renault, Unión Europea, UE, IDAE, Ejecutivo, IDAE, Instituto para la Diversificación, IDAE, Administración]

1 upvote Wiktor Stribiżew 11/27/2019

Not sure, but please also try list(set(map(lambda x: str(x), entities['ORG']))). However, the code was tested on a small sentence, and it worked.

1 upvote Wiktor Stribiżew 11/27/2019

@Luiscri Yes, list(g) => list(set(map(lambda x: str(x), g)))
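
A plausible reason list(set(entities['ORG'])) kept the repeated IDAE entries is that the list holds entity objects rather than strings, and two mentions at different positions in the text do not compare equal even when their text matches. The sketch below illustrates that assumption with FakeSpan, a hypothetical stand-in class whose equality depends on position; it is not spaCy's actual class.

```python
from dataclasses import dataclass


# Hypothetical stand-in for an entity object: equality and hashing depend
# on the position in the text as well as on the text itself (an assumption
# made for illustration, not spaCy's real implementation).
@dataclass(frozen=True)
class FakeSpan:
    text: str
    start: int

    def __str__(self):
        return self.text


mentions = [FakeSpan("IDAE", 10), FakeSpan("Ejecutivo", 20), FakeSpan("IDAE", 30)]

# A set of objects keeps both "IDAE" mentions because their positions differ.
as_objects = set(mentions)

# Converting to strings first collapses them, as the comment above suggests.
as_text = set(map(str, mentions))

# dict.fromkeys also removes duplicates but keeps first-seen order.
unique_in_order = list(dict.fromkeys(map(str, mentions)))

print(len(as_objects))
print(unique_in_order)
```

The dict.fromkeys variant may be preferable when the order of first appearance in the article matters, since a set does not preserve it.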
