Map BERT token indices to Spacy token indices

Asked by: lrthistlethwaite  Asked: 10/25/2023  Last edited by: lrthistlethwaite  Updated: 10/26/2023  Views: 48

Q:

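I am trying to map the token indices (not the ids, the token positions) produced by BERT's (bert-base-uncased) tokenizer to the token indices produced by spaCy's tokenizer. In the example below my approach fails, because spaCy's tokenization behaves in a more complicated way than I expected. Any ideas on how to solve this?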

import spacy
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nlp = spacy.load("en_core_web_sm")

sent = nlp("BRITAIN'S railways cost £20.7bn during the 2020-21 financial year, with £2.5bn generated through fares and other income, £1.3bn through other sources and £16.9bn from government, figures released by the regulator the Office of Rail and Road (ORR) on November 30 revealed.")
# For each BERT token (tokenizing word by word), record the index of the spaCy word it came from
wd_to_tok_map = [wd.i for wd in sent for el in tokenizer.encode(wd.text, add_special_tokens=False)]
len(sent) # 55
len(wd_to_tok_map) # 67     <- Should be 65

input_ids = tokenizer.encode(sent.text, add_special_tokens=False)
len(input_ids) # 65

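I could print both tokenizations and look for exact text matches, but the problem I run into is: what if a word occurs twice in the sentence? Looking up a word match would then return two indices, in different parts of the sentence.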

[el.text for el in sent]
['BRITAIN', "'S", 'railways', 'cost', '£', '20.7bn', 'during', 'the', '2020', '-', '21', 'financial', 'year', ',', 'with', '£', '2.5bn', 'generated', 'through', 'fares', 'and', 'other', 'income', ',', '£', '1.3bn', 'through', 'other', 'sources', 'and', '£', '16.9bn', 'from', 'government', ',', 'figures', 'released', 'by', 'the', 'regulator', 'the', 'Office', 'of', 'Rail', 'and', 'Road', '(', 'ORR', ')', 'on', 'November', '30', 'revealed', '.']

[tokenizer.ids_to_tokens[el] for el in input_ids]
['britain', "'", 's', 'railways', 'cost', '£2', '##0', '.', '7', '##bn', 'during', 'the', '2020', '-', '21', 'financial', 'year', ',', 'with', '£2', '.', '5', '##bn', 'generated', 'through', 'fares', 'and', 'other', 'income', ',', '£1', '.', '3', '##bn', 'through', 'other', 'sources', 'and', '£1', '##6', '.', '9', '##bn', 'from', 'government', ',', 'figures', 'released', 'by', 'the', 'regulator', 'the', 'office', 'of', 'rail', 'and', 'road', '(', 'orr', ')', 'on', 'november', '30', 'revealed', '.']

decode() doesn't seem to give me what I want, because I'm after indices.

python mapping spacy tokenize bert-language-model

A:

0 votes  aab  10/26/2023  #1

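Use a fast tokenizer with return_offsets_mapping=True to get character offsets directly from the transformers tokenizer, then map those offsets to the spaCy tokens as needed: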

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "BRITAIN'S railways cost £20.7bn"
output = tokenizer([text], return_offsets_mapping=True)

print(output["input_ids"])
# [[101, 3725, 1005, 1055, 7111, 3465, 21853, 2692, 1012, 1021, 24700, 102]]

print(tokenizer.convert_ids_to_tokens(output["input_ids"][0]))
# ['[CLS]', 'britain', "'", 's', 'railways', 'cost', '£2', '##0', '.', '7', '##bn', '[SEP]']

print(output["offset_mapping"])
# [[(0, 0), (0, 7), (7, 8), (8, 9), (10, 18), (19, 23), (24, 26), (26, 27), (27, 28), (28, 29), (29, 31), (0, 0)]]
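
The last step, mapping those offsets onto spaCy tokens, could look roughly like the sketch below. It is only one possible approach, not something prescribed by either library: `align_offsets` is a hypothetical helper, and the spaCy character spans are hard-coded here (worked out by hand from the token texts shown in the question) so the snippet runs without downloading a model. With a real `Doc` you would presumably build them as `[(t.idx, t.idx + len(t)) for t in doc]`. Because it compares character positions rather than token text, a word that occurs twice in the sentence is not a problem.

```python
def align_offsets(bert_offsets, spacy_spans):
    """For each BERT token, list the indices of the spaCy tokens it overlaps.

    Both arguments are lists of (start, end) character offsets into the same text.
    """
    mapping = []
    for b_start, b_end in bert_offsets:
        if b_start == b_end:
            # Zero-width offsets such as (0, 0) belong to special tokens ([CLS], [SEP]).
            mapping.append([])
            continue
        # Two half-open spans overlap when each one starts before the other ends.
        overlaps = [
            i for i, (s_start, s_end) in enumerate(spacy_spans)
            if b_start < s_end and s_start < b_end
        ]
        mapping.append(overlaps)
    return mapping


# Offsets copied from output["offset_mapping"][0] above.
bert_offsets = [(0, 0), (0, 7), (7, 8), (8, 9), (10, 18), (19, 23),
                (24, 26), (26, 27), (27, 28), (28, 29), (29, 31), (0, 0)]

# Assumed spaCy spans for "BRITAIN'S railways cost £20.7bn", i.e. the tokens
# 'BRITAIN', "'S", 'railways', 'cost', '£', '20.7bn' (computed by hand).
spacy_spans = [(0, 7), (7, 9), (10, 18), (19, 23), (24, 25), (25, 31)]

bert_to_spacy = align_offsets(bert_offsets, spacy_spans)
print(bert_to_spacy)
# [[], [0], [1], [1], [2], [3], [4, 5], [5], [5], [5], [5], []]
```

Note that the mapping is not one-to-one: the BERT token '£2' straddles the spaCy tokens '£' and '20.7bn', so it maps to both indices. How to resolve such cases (first overlap, largest overlap, or keeping all of them) depends on what the indices are used for.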