Fine-tuned Falcon model not working properly

Asked by: DTJ  Asked: 11/17/2023  Updated: 11/17/2023  Views: 16

Q:

I'm new to machine learning, and especially to LLMs. I'm fine-tuning a pretrained model, Falcon 7B, for the task of answering SAT exam questions.

This is the dataset I used to train the model:

DatasetDict({
    train: Dataset({
        features: ['Problem', 'Rationale', 'category', 'correct', 'id', 'options'],
        num_rows: 3597
    })
})

Here is the data I'm using to evaluate the fine-tuned model:

Dataset({
    features: ['Problem', '_id', 'category', 'options'],
    num_rows: 4720
})
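
(The code below indexes a variable df that is not shown above; a minimal sketch of how such a DataFrame could be built from this evaluation set, assuming the Hugging Face datasets library is already in use. The file path and variable names here are only illustrative.)

import pandas as pd
from datasets import load_dataset

# Assumed: the evaluation split above was loaded as a datasets.Dataset;
# converting it to pandas gives a df whose columns can be handed to the
# tokenizer as plain Python lists.
eval_dataset = load_dataset("json", data_files="sat_eval.json")["train"]  # illustrative path
df = eval_dataset.to_pandas()
print(df.columns.tolist())  # expected: ['Problem', '_id', 'category', 'options']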

Everything was going well until I tried to load the fine-tuned model for evaluation, and it would not load.

Here is my code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftConfig, PeftModel

PEFT_MODEL = "/content/SAT_AI_CHALLENGE_2023"

# 4-bit NF4 quantization config for loading the base model on a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    # config.base_model_name_or_path,
    pretrained_model_name_or_path="vilsonrodrigues/falcon-7b-instruct-sharded",
    device_map='cuda:0',
    return_dict=True,
    quantization_config=bnb_config,
    # device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
columns_to_tokenize = ['Problem', '_id', 'category', 'options']
columns_list = [f'{column}' for column in columns_to_tokenize]
for column in columns_list:
    tokenized_data = tokenizer.batch_encode_plus(
        df[column].astype(str).values.tolist(), truncation=True
    )

tokenizer.pad_token = tokenizer.eos_token  # special token used to pad sequences to a consistent length during tokenization
max_token_length = max(len(tokenized_data) for token_list in tokenized_data['Problem'])

# Attach the fine-tuned PEFT adapter to the quantized base model
model = PeftModel.from_pretrained(model, PEFT_MODEL)


The error I get (most of it relates to the tokenization process):


Some weights of FalconForCausalLM were not initialized from the model checkpoint at vilsonrodrigues/falcon-7b-instruct-sharded and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 33>:33                                                                            │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:238 in           │
│ __getitem__                                                                                      │
│                                                                                                  │
│    235 │   │   If the key is an integer, get the `tokenizers.Encoding` for batch item with inde  │
│    236 │   │   """                                                                               │
│    237 │   │   if isinstance(item, str):                                                         │
│ ❱  238 │   │   │   return self.data[item]                                                        │
│    239 │   │   elif self._encodings is not None:                                                 │
│    240 │   │   │   return self._encodings[item]                                                  │
│    241 │   │   else:                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'Problem'
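
(For reference on what that lookup sees: batch_encode_plus returns a BatchEncoding keyed by tokenizer fields such as 'input_ids' and 'attention_mask', not by the original column names, and the loop above also overwrites tokenized_data on each iteration, so tokenized_data['Problem'] has no key to find. A minimal sketch that reproduces the same KeyError, reusing the tokenizer repo from the question:)

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("vilsonrodrigues/falcon-7b-instruct-sharded")
enc = tok.batch_encode_plus(["What is 2 + 2?", "Solve for x: x + 3 = 5"], truncation=True)
print(list(enc.keys()))  # tokenizer fields, e.g. ['input_ids', 'attention_mask']; no 'Problem'
enc["Problem"]           # raises KeyError: 'Problem', matching the traceback above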


I'm running this in Google Colab with a GPU. I'd really appreciate someone explaining what I'm doing wrong, what the error message is suggesting, and how I can fix it.

Thanks in advance.

What I've tried: I've tried adjusting the data pulled from the dataset and the data passed to the tokenization step, but most of those attempts failed.
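
(One possible shape for such an adjustment, sketched purely as an illustration and assuming df holds the evaluation columns as strings: keep each column's encodings under its own key so that a per-column lookup like the one above has something to index.)

# Hypothetical rework of the tokenization loop, not the original code:
# store each column's BatchEncoding separately instead of overwriting tokenized_data.
tokenized_by_column = {}
for column in ['Problem', '_id', 'category', 'options']:
    tokenized_by_column[column] = tokenizer.batch_encode_plus(
        df[column].astype(str).values.tolist(), truncation=True
    )

# Longest 'Problem' sequence, measured in tokens.
max_token_length = max(len(ids) for ids in tokenized_by_column['Problem']['input_ids'])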

machine-learning huggingface-transformers large-language-model falcon

Comments


A: No answers yet