WisdomShell / codeshell

A series of large language models for code, developed by PKU-KCL

Home Page: http://se.pku.edu.cn/kcl

CodeShell-7B-Chat fine-tuning error

RoseKindom opened this issue

I ran LoRA fine-tuning with the run_finetune.sh script, with minor modifications:

#!/bin/bash

export WANDB_DISABLED=true

project_dir=$(cd "$(dirname $0)"; pwd)

model=$1
data_path=$2
exp_id=$3

output_dir=${project_dir}/output_models/${exp_id}
log_dir=${project_dir}/log/${exp_id}
mkdir -p ${output_dir} ${log_dir}

# deepspeed_args="--master_port=23333 --hostfile=${project_dir}/configs/hostfile.txt --master_addr=10.0.0.16"      # Default argument
# deepspeed_args="--master_port=$((10000 + RANDOM % 20000)) --include=localhost:0,1,2,3"      # Default argument
deepspeed_args="--master_port=$((10000 + RANDOM % 20000)) --include=localhost:1,2,3,4,5,6,7"      # Default argument

deepspeed ${deepspeed_args} ${project_dir}/finetune.py \
    --use_lora \
    --deepspeed ${project_dir}/ds_config_zero3.json \
    --model_name_or_path ${model} \
    --data_path ${data_path} \
    --model_max_length 4096 \
    --output_dir ${output_dir} \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing True \
    --lr_scheduler_type cosine \
    --logging_steps 2 \
    --save_steps 100 \
    --learning_rate 1e-5 \
    --num_train_epochs 20 \
    --fp16 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err

The error output is as follows:
0%| | 0/420 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
Traceback (most recent call last):
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 748, in convert_to_tensors
tensor = as_tensor(value)
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 720, in as_tensor
return torch.tensor(value)
ValueError: expected sequence of length 353 at dim 1 (got 563)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/root/train/codeshell/finetune/finetune.py", line 220, in
train()
File "/root/train/codeshell/finetune/finetune.py", line 214, in train
trainer.train()
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
return inner_training_loop(
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/transformers/trainer.py", line 1870, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/accelerate/data_loader.py", line 448, in iter
current_batch = next(dataloader_iter)
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in next
data = self._next_data()
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/transformers/trainer_utils.py", line 737, in call
return self.data_collator(features)
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/transformers/data/data_collator.py", line 249, in call
batch = self.tokenizer.pad(
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3303, in pad
return BatchEncoding(batch_outputs, tensor_type=return_tensors)
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 223, in init
self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
File "/root/miniconda3/envs/codeshell/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 764, in convert_to_tensors
raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (labels in this case) have excessive nesting (inputs type list where type int is expected).

The transformers version is 4.34.0. I haven't been able to find a solution anywhere; any help would be appreciated.
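
Not an official fix, just a minimal sketch of one direction the error message itself points at: the default collator cannot stack labels of different lengths (353 vs. 563 in this run), so a collator that pads both input_ids and labels to a common length, such as DataCollatorForSeq2Seq, may avoid the ValueError. The names tokenizer, model, training_args, and train_dataset below are placeholders for whatever finetune.py actually constructs.

from transformers import DataCollatorForSeq2Seq, Trainer

# Hypothetical names: tokenizer, model, training_args, train_dataset stand in
# for the objects finetune.py actually builds before creating the Trainer.
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    padding=True,             # pad input_ids to the longest sequence in each batch
    label_pad_token_id=-100,  # padded label positions are ignored by the loss
    return_tensors="pt",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,  # replaces the default collator that raised the ValueError
)
trainer.train()

Alternatively, padding every example to a fixed length during tokenization (truncation=True, padding="max_length") would also give same-length tensors with the default collator, at the cost of extra padding tokens per batch.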