Asked by: DTJ  Asked: 11/17/2023  Updated: 11/17/2023  Views: 16
Falcon fine-tuned model not working properly
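Question:
I am new to machine learning, and to LLMs in particular. I am fine-tuning a model based on the pretrained Falcon 7B to answer SAT exam questions.
This is the dataset I used to train the model: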
DatasetDict({
    train: Dataset({
        features: ['Problem', 'Rationale', 'category', 'correct', 'id', 'options'],
        num_rows: 3597
    })
})
This is the data I use to evaluate the fine-tuned model:
Dataset({
    features: ['Problem', '_id', 'category', 'options'],
    num_rows: 4720
})
Everything went fine until I tried to load the model for evaluation, at which point it failed :\
Here is my code:
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

PEFT_MODEL = "/content/SAT_AI_CHALLENGE_2023"

# 4-bit NF4 quantization settings
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    # config.base_model_name_or_path,
    pretrained_model_name_or_path="vilsonrodrigues/falcon-7b-instruct-sharded",
    device_map='cuda:0',
    return_dict=True,
    quantization_config=bnb_config,
    # device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# df holds the evaluation data and is defined earlier in the notebook (not shown)
columns_to_tokenize = ['Problem', '_id', 'category', 'options']
columns_list = [f'{column}' for column in columns_to_tokenize]
for column in columns_list:
    # tokenized_data is overwritten on every iteration, so only the last column survives
    tokenized_data = tokenizer.batch_encode_plus(
        df[column].astype(str).values.tolist(), truncation=True
    )
tokenizer.pad_token = tokenizer.eos_token  # special token used to pad sequences to a consistent length during tokenization
# The next line is where KeyError: 'Problem' is raised
max_token_length = max(len(tokenized_data) for token_list in tokenized_data['Problem'])
model = PeftModel.from_pretrained(model, PEFT_MODEL)
The errors I get, most of which concern the tokenization step:
Some weights of FalconForCausalLM were not initialized from the model checkpoint at vilsonrodrigues/falcon-7b-instruct-sharded and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 33>:33 │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:238 in │
│ __getitem__ │
│ │
│ 235 │ │ If the key is an integer, get the `tokenizers.Encoding` for batch item with inde │
│ 236 │ │ """ │
│ 237 │ │ if isinstance(item, str): │
│ ❱ 238 │ │ │ return self.data[item] │
│ 239 │ │ elif self._encodings is not None: │
│ 240 │ │ │ return self._encodings[item] │
│ 241 │ │ else: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'Problem'
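I am running this on Google Colab with a GPU runtime. I would really appreciate it if someone could explain what I am doing wrong, what the suggestion in the error message means, and how I can fix this.
Thanks in advance.
What I tried: I tried changing which data is pulled from the dataset and which data is passed to the tokenization function, but most of these attempts failed.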
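A note on the KeyError above, offered as a sketch rather than a confirmed fix: the object returned by the tokenizer is indexed by keys such as "input_ids" and "attention_mask", not by dataset column names, so tokenized_data['Problem'] has nothing to look up. The loop also overwrites tokenized_data on each pass. One possible arrangement is to keep one encoding per column in an ordinary dict and read lengths from "input_ids". In the code below, WhitespaceTokenizer, tokenize_columns, max_token_length and the sample rows are all hypothetical names and data introduced only so the example runs without downloading a model.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in so the sketch runs without downloading a model.

    It mimics only the part of the Hugging Face tokenizer interface used here:
    calling it on a list of strings returns a mapping with "input_ids" and
    "attention_mask", not with the names of the dataset columns.
    """

    def __call__(self, texts, truncation=True):
        # Fake ids: one integer per whitespace-separated word.
        input_ids = [[len(word) for word in text.split()] for text in texts]
        attention_mask = [[1] * len(ids) for ids in input_ids]
        return {"input_ids": input_ids, "attention_mask": attention_mask}


def tokenize_columns(tokenizer, rows, columns):
    """Tokenize each column separately and keep the results keyed by column name."""
    encoded = {}
    for column in columns:
        texts = [str(row[column]) for row in rows]
        encoded[column] = tokenizer(texts, truncation=True)
    return encoded


def max_token_length(encoding):
    """Length of the longest tokenized sequence in one column's encoding."""
    return max(len(ids) for ids in encoding["input_ids"])


# Made-up sample rows using two of the column names from the evaluation set.
rows = [
    {"Problem": "What is 2 + 2 ?", "options": "a ) 3 , b ) 4"},
    {"Problem": "Solve x + 1 = 3", "options": "a ) 1 , b ) 2"},
]
encoded = tokenize_columns(WhitespaceTokenizer(), rows, ["Problem", "options"])
longest_problem = max_token_length(encoded["Problem"])
```

With the real tokenizer, the stand-in would be replaced by the object returned from AutoTokenizer.from_pretrained, and the rows by the evaluation data. If padding is requested during tokenization, tokenizer.pad_token presumably needs to be assigned before the tokenizer is called rather than after.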
Answers: no answers yet