haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Home Page: https://llava.hliu.cc

[Usage] Errors when restoring a checkpoint during LoRA fine-tuning

wenyisir opened this issue

Describe the issue

Issue:
When resuming from a checkpoint with LoRA fine-tuning enabled, an error occurs; resuming from a checkpoint without LoRA fine-tuning works fine. Can you explain why? How should I modify the code to save more parameters?

Command:

/home/wyxu/miniconda3/envs/llava/bin/deepspeed --master_port 25675 \
          --include localhost:3,4,5,6 \
          /home/wyxu/LLaVA/llava/train/train_mem.py \
          --lora_enable True \
          --deepspeed /home/wyxu/LLaVA/scripts/zero2.json \
          --model_name_or_path /data/wyxu/LLaVA/checkpoints/vicuna-7b-v1.3 \
          --version v1 \
          --data_path /data/wyxu/MIC_sampled/data/ \
          --image_folder /data/wyxu/MIC_sampled/data/ \
          --vision_tower /data/wyxu/LLaVA/checkpoints/clip-vit-large-patch14 \
          --pretrain_mm_mlp_adapter /data/wyxu/LLaVA/checkpoints/llava-vicuna-7b-v1.3-pretrain/mm_projector.bin \
          --mm_vision_select_layer -2 \
          --mm_use_im_start_end False \
          --mm_use_im_patch_token False \
          --bf16 True \
          --output_dir /data/wyxu/LLaVA/checkpoints/llava-vicuna-7b-v1.3-finetune-on-mic_sampled-lora \
          --num_train_epochs 10 \
          --per_device_train_batch_size 4 \
          --per_device_eval_batch_size 1 \
          --gradient_accumulation_steps 4 \
          --evaluation_strategy no \
          --save_strategy steps \
          --save_steps 90 \
          --save_total_limit 1 \
          --learning_rate 2e-5 \
          --weight_decay 0. \
          --warmup_ratio 0.03 \
          --lr_scheduler_type cosine \
          --logging_steps 1 \
          --tf32 True \
          --model_max_length 2048 \
          --gradient_checkpointing True \
          --dataloader_num_workers 4 \
          --lazy_preprocess True \
          --report_to wandb

Log:

Traceback (most recent call last):
  File "/home/wyxu/LLaVA/llava/train/train_mem.py", line 4, in <module>
    train(attn_implementation="flash_attention_2")
  File "/home/wyxu/LLaVA/llava/train/train.py", line 1037, in train
    trainer.train(resume_from_checkpoint=True)
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1708, in _inner_training_loop
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 402, in deepspeed_load_checkpoint
    load_path, _ = deepspeed_engine.load_checkpoint(
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2724, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2794, in _load_checkpoint
    self.load_module_state_dict(checkpoint=checkpoint,
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2587, in load_module_state_dict
    self.module.load_state_dict(
  File "/home/wyxu/miniconda3/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    Missing key(s) in state_dict: "base_model.model.model.embed_tokens.weight", "base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.0.mlp.gate_proj.base_layer.weight", "base_model.model.model.layers.0.mlp.up_proj.base_layer.weight"
......


It will check the folder given by --output_dir (/data/wyxu/LLaVA/checkpoints/llava-vicuna-7b-v1.3-finetune-on-mic_sampled-lora) for the latest checkpoint-xxxx and resume training from it.
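
For context, a minimal sketch of what that resume step resolves to (output_dir is copied from the command above, get_last_checkpoint is the standard transformers helper; the trainer call itself is only shown as a comment since it needs the full train.py setup):

from transformers.trainer_utils import get_last_checkpoint

# Sketch only: output_dir copied from the training command above.
output_dir = "/data/wyxu/LLaVA/checkpoints/llava-vicuna-7b-v1.3-finetune-on-mic_sampled-lora"

# This is the checkpoint HF Trainer resumes from when
# trainer.train(resume_from_checkpoint=True) is called: the newest
# checkpoint-<step> directory under output_dir (None if there is none).
print("latest checkpoint:", get_last_checkpoint(output_dir))

# LLaVA's train.py only passes resume_from_checkpoint=True when such a
# directory exists:
#   if list(pathlib.Path(output_dir).glob("checkpoint-*")):
#       trainer.train(resume_from_checkpoint=True)
#   else:
#       trainer.train()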

As for the mismatch of the state_dict:

pip install transformers==4.39.3
pip install accelerate==0.27.2

This is mentioned in some other issues, but I forget which one it was.

It might be this one: #1200

I fixed this bug by modifying site-packages/deepspeed/runtime/engine.py, line 2675: set load_module_strict=False.
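
If you prefer not to edit the installed DeepSpeed file directly, a monkey-patch placed near the top of llava/train/train.py should have the same effect; this is only a sketch of that same workaround (forcing load_module_strict=False for DeepSpeed checkpoint loads), not an official fix:

import functools
from deepspeed.runtime.engine import DeepSpeedEngine

_orig_load_checkpoint = DeepSpeedEngine.load_checkpoint

@functools.wraps(_orig_load_checkpoint)
def _lenient_load_checkpoint(self, load_dir, *args, **kwargs):
    # The LoRA checkpoint only holds the trainable adapter/projector weights,
    # so skip strict key matching instead of raising RuntimeError on the
    # missing frozen base-model keys. (transformers passes load_module_strict,
    # if at all, as a keyword argument, so overriding kwargs is enough here.)
    kwargs["load_module_strict"] = False
    return _orig_load_checkpoint(self, load_dir, *args, **kwargs)

DeepSpeedEngine.load_checkpoint = _lenient_load_checkpoint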

Great, so there is no need to change the transformers version, and you can avoid the potential troubles described in #1218 during inference.