DistilBert tokenization does not add pounds (##) at the start of in-word tokens after increasing vocabulary

Question:

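I am enriching the DistilBert tokenizer with new tokens from a new corpus. DistilBert uses a WordPiece tokenizer and, according to the Huggingface NLP course, inference works by finding the longest possible token starting at the beginning of the word, splitting it off, and then repeating the same procedure on the rest of the word.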
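For reference, the longest-match-first procedure described above can be sketched in plain Python. This is only an illustration of the idea on a toy vocabulary, not the library's actual implementation: the helper `wordpiece_tokenize` and the set `toy_vocab` are hypothetical names made up for this sketch, and the real tokenizer also applies normalization and pre-tokenization that are left out here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word (illustrative only)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the "##" prefix.
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens


# Hypothetical toy vocabulary, not the real DistilBert vocabulary.
toy_vocab = {"insp", "ec", "##ec", "##t"}

print(wordpiece_tokenize("inspect", toy_vocab))
print(wordpiece_tokenize("inspect", toy_vocab | {"inspect"}))
```

With this toy vocabulary the first call yields `['insp', '##ec', '##t']` and the second yields `['inspect']`, which is the behaviour I was expecting from the real tokenizer.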

However, my tokenizer contains the tokens inspect, insp, ec, ##ec and ##t, yet when I check the tokenization, the tokenizer produces the following tokens: ['insp', 'ec', '##t']

I expected the tokenizer to return a single token: 'inspect'. Even if it does split the word, I would expect it to return at least ['insp', '##ec', '##t'].

Is this a bug, or is some part of my code incorrect?

Minimal working example:

>> from transformers import AutoTokenizer

>> model_checkpoint = 'elastic/distilbert-base-uncased-finetuned-conll03-english'
>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

>> ('inspect' in tokenizer.vocab, 'insp' in tokenizer.vocab, 'ec' in tokenizer.vocab, '##ec' in tokenizer.vocab, '##t' in tokenizer.vocab)
# (True, False, True, True, True)
>> tokenizer.convert_ids_to_tokens(tokenizer.encode('inspect'))
# ['[CLS]', 'inspect', '[SEP]']

>> tokenizer.add_tokens(['insp'])
# 1
>> ('inspect' in tokenizer.vocab, 'insp' in tokenizer.vocab, 'ec' in tokenizer.vocab, '##ec' in tokenizer.vocab, '##t' in tokenizer.vocab)
# (True, True, True, True, True)
>> tokenizer.convert_ids_to_tokens(tokenizer.encode('inspect'))
# ['[CLS]', 'insp', 'ec', '##t', '[SEP]']

python nlp huggingface-transformers tokenize huggingface-tokenizers
