Asked by: alvas | Asked: 6/15/2023 | Last edited by: alvas | Updated: 8/9/2023 | Views: 1749
How does `enforce_stop_tokens` work in LangChain with Huggingface models?
Question:
When we look at how the HuggingFaceHub model is used in langchain, there's this part where the author doesn't know how to stop the generation, https://github.com/hwchase17/langchain/blob/master/langchain/llms/huggingface_pipeline.py#L182:
class HuggingFacePipeline(LLM):
    ...
    def _call(
        ...
        if stop is not None:
            # This is a bit hacky, but I can't figure out a better way to enforce
            # stop tokens when making calls to huggingface_hub.
            text = enforce_stop_tokens(text, stop)
        return text
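What should I use to add the stop tokens to the end of the template?
If we look at https://github.com/hwchase17/langchain/blob/master/langchain/llms/utils.py, it's simply a regex split: it splits the input string on a list of stop words and then takes the first partition of the re.split: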
re.split("|".join(stop), text)[0]
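One consequence of this one-liner is that the stop strings are joined into a regex alternation without escaping, so a stop string containing regex metacharacters (for example `.` or `?`) is treated as a pattern rather than as literal text. The sketch below illustrates this in plain Python. The helper `enforce_stop_tokens_escaped` and the sample text are hypothetical and used only for illustration; they are not part of LangChain.

```python
import re


def enforce_stop_tokens(text, stop):
    """The one-liner quoted above: split on any stop word, keep the first piece."""
    return re.split("|".join(stop), text)[0]


def enforce_stop_tokens_escaped(text, stop):
    """Hypothetical variant: escape each stop string so it is matched literally."""
    pattern = "|".join(re.escape(s) for s in stop)
    return re.split(pattern, text)[0]


# Made-up sample text, for illustration only.
text = "Question: what is 2+2? Answer: 4. Question: next"

# Unescaped, "." matches any character, so the first piece is empty.
print(repr(enforce_stop_tokens(text, ["."])))          # ''

# Escaped, the text is cut at the first literal period.
print(repr(enforce_stop_tokens_escaped(text, ["."])))  # 'Question: what is 2+2? Answer: 4'
```

With plain alphabetic stop words such as "up" or "then", both versions behave the same.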
Let's try to get a generation output from a Huggingface model, e.g.
from transformers import pipeline
from transformers import GPT2LMHeadModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
output = generator("Hey Pizza! ")
output
[out]:
[{'generated_text': 'Hey Pizza! 」\n\n「Hurry up, leave the place! 」\n\n「Oi! 」\n\nWhile eating pizza and then, Yuigahama came in contact with Ruriko in the middle of the'}]
If we apply the re.split:
import re

def enforce_stop_tokens(text, stop):
    """Cut off the text as soon as any stop words occur."""
    return re.split("|".join(stop), text)[0]

stop = ["up", "then"]
text = output[0]['generated_text']
re.split("|".join(stop), text)
[out]:
['Hey Pizza! 」\n\n「Hurry ',
', leave the place! 」\n\n「Oi! 」\n\nWhile eating pizza and ',
', Yuigahama came in contact with Ruriko in the middle of the']
But that isn't useful; I want to split at the point where the generation ends. What tokens do I use with `enforce_stop_tokens`?
Answers:
2 upvotes
Jess
8/9/2023
#1
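You can do this by setting eos_token_id to your stop term(s); in my testing it seemed to work with a list. See below: the regex cuts the text off before the stop word, while eos_token_id cuts it off right after the stop word ("once upon a time" vs. "once upon a").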
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import regex as re

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Define your custom stop terms
stop_terms = ["right", "time"]

# Ensure the stop terms are in the tokenizer's vocabulary
for term in stop_terms:
    if term not in tokenizer.get_vocab():
        tokenizer.add_tokens([term])
        model.resize_token_embeddings(len(tokenizer))

def enforce_stop_tokens(text, stop):
    """Cut off the text as soon as any stop words occur."""
    return re.split("|".join(stop), text)[0]

# Get the token IDs for your custom stop terms
eos_token_ids_custom = [tokenizer.encode(term, add_prefix_space=True)[0] for term in stop_terms]

# Generate text
input_text = "Once upon "
input_ids = tokenizer.encode(input_text, return_tensors='pt')
output_ids = model.generate(input_ids, eos_token_id=eos_token_ids_custom, max_length=50)

# Decode the output IDs to text
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_text)  # Once upon a time

print("ENFORCE STOP TOKENS")
truncated_text = enforce_stop_tokens(generated_text, stop_terms)
print(truncated_text)  # Once upon a
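The difference between the two cut-off points can also be reproduced on a plain string, without loading a model. The following is a minimal sketch, assuming two hypothetical helpers (`truncate_before_stop` and `truncate_after_stop`) and a made-up sample string. It only imitates the "stop word kept" result of eos_token_id after the fact, and is not how `generate` works internally.

```python
import re


def truncate_before_stop(text, stop):
    """Drop the first stop word and everything after it (regex-split behaviour)."""
    pattern = "|".join(re.escape(s) for s in stop)
    return re.split(pattern, text)[0]


def truncate_after_stop(text, stop):
    """Keep the first stop word, drop everything after it (eos-like result)."""
    pattern = "|".join(re.escape(s) for s in stop)
    match = re.search(pattern, text)
    return text if match is None else text[:match.end()]


# Made-up sample string, for illustration only.
generated = "Once upon a time, there was a right answer"
stop_terms = ["right", "time"]

print(repr(truncate_before_stop(generated, stop_terms)))  # 'Once upon a '
print(repr(truncate_after_stop(generated, stop_terms)))   # 'Once upon a time'
```

Both helpers cut at whichever stop word occurs first in the text, regardless of its position in the list. The regex version leaves a trailing space where the stop word used to be.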
Comments
0 upvotes
alvas
8/9/2023
Wouldn't this always end the generation after a single sentence?
0 upvotes
Jess
8/10/2023
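@alvas I don't think so. In my colab (colab.research.google.com/drive/...), with the input text "I am" and no stop token enforcement, the output was: "I am not a fan of the idea of a 'big budget' movie. I think it's a waste of money. I think it's a waste of time......" With the code above plus the stop words ["money", "time"], it ends at the second sentence. See the HF docs.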