artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs

Home Page: https://arxiv.org/abs/2305.14314


Could not reproduce the results listed in your paper using a single 3090 card.

LiZhangMing opened this issue

Details:
Here is the result from your paper:
[image: results table from the paper]
I used the following command to reproduce the result of the LLaMA 7B model on the Guanaco (OASST1) dataset:
CUDA_VISIBLE_DEVICES=2 sh scripts/finetune_guanaco_7b.sh
and the best result I obtained is:
[image: reproduced result]

A 1% difference is not a big deal, comrade. That could be noise.

For LLaMA 7B, I was only able to reproduce the Alpaca result (paper: 38.8); the other datasets came out lower: 32.7 for chip2 (paper: 34.5), 30.9 for longform (paper: 32.1), and 33.7 for self-instruct (paper: 36.4).

Does anyone have ideas about this? Thanks!

For the Alpaca dataset, how should the hyperparameters be set?

Simply follow the bash files in ./scripts/.
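
For example, the Guanaco 7B script invoked earlier in this thread (the other scripts in ./scripts/ follow the same pattern, so pick the one matching your dataset):

CUDA_VISIBLE_DEVICES=0 sh scripts/finetune_guanaco_7b.sh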

Actually, according to another issue in this project, the paper evaluates on the MMLU test set, while qlora.py reports performance on the MMLU dev set. Hence, you need to modify the Python file to add evaluation on the test set. After that, the results on alpaca and longform should be reproducible, which I've confirmed myself.
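
If your checkout of qlora.py already exposes MMLU-evaluation flags (both --do_mmlu_eval and --mmlu_split appear in the command quoted below), reporting test-set rather than dev-set accuracy may be as simple as passing the split explicitly. A minimal sketch, assuming those flags behave as their names suggest:

# Sketch: evaluate MMLU on the test split instead of the dev split.
# Verify both flags exist in your copy of qlora.py before relying on
# the reported numbers.
python qlora.py \
    --model_name_or_path huggyllama/llama-7b \
    --dataset alpaca \
    --do_train \
    --do_mmlu_eval \
    --mmlu_split test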

@Forence1999 Could you share how you reproduced it? I only got 32.1 with the original hyperparameters. Thanks!

python qlora.py \
    --model_name_or_path huggyllama/llama-7b \
    --use_auth \
    --output_dir /fly/results/qlora \
    --logging_steps 10 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 500 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --per_device_eval_batch_size 1 \
    --max_new_tokens 32 \
    --dataloader_num_workers 1 \
    --group_by_length \
    --logging_strategy steps \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --do_mmlu_eval \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_modules all \
    --double_quant \
    --quant_type nf4 \
    --bf16 \
    --bits 16 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type constant \
    --gradient_checkpointing \
    --dataset alpaca \
    --source_max_len 16 \
    --target_max_len 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --max_steps 1875 \
    --eval_steps 187 \
    --learning_rate 0.0002 \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --lora_dropout 0.1 \
    --weight_decay 0.0 \
    --seed 0 \
    --mmlu_split test
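
Two of those flags stand out against the paper's setup: --bits 16 loads the base model unquantized (the paper's QLoRA runs use a 4-bit NF4 base model with double quantization), and --source_max_len 16 truncates the instruction to 16 tokens, which suits OASST1 (where the dialogue lives in the target) but likely clips Alpaca prompts. A minimal sketch of a command closer to the paper-style quantized setup; the source/target lengths are assumptions based on qlora.py's defaults, so check them against your checkout:

# Sketch, not the authors' exact script: 4-bit NF4 base model with
# double quantization, and a source length long enough that Alpaca
# instructions are not cut off.
python qlora.py \
    --model_name_or_path huggyllama/llama-7b \
    --dataset alpaca \
    --bits 4 \
    --quant_type nf4 \
    --double_quant \
    --bf16 \
    --source_max_len 1024 \
    --target_max_len 256 \
    --do_train \
    --do_eval \
    --do_mmlu_eval \
    --mmlu_split test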

Hi @Edenzzzz, sorry, I can no longer share my scripts; it has been a long time since I used them. It also looks like quite a few parameters in your script have been modified from the originals.

Suggestions:

  1. Use the scripts provided by the author, with as few modifications as possible.
  2. Build a Docker image from the Dockerfile provided by the author. A consistent environment is critically important for reproducing exact results, and it will also save you a lot of time. If you don't want to build from scratch, you can simply pull my image (docker pull forence/open-instruct:v1); see the sketch after this list. Note that I built it with the author's Dockerfile as a reference, but did not follow it exactly.
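
A minimal sketch of pulling that image and running the repo inside it; the mount path, working directory, and GPU flag are assumptions to adapt to your machine:

docker pull forence/open-instruct:v1
# Run with GPU access, mounting a local qlora checkout into the
# container; /workspace/qlora is an assumed path, not something the
# image defines.
docker run --gpus all -it \
    -v "$(pwd)/qlora:/workspace/qlora" \
    -w /workspace/qlora \
    forence/open-instruct:v1 \
    bash scripts/finetune_guanaco_7b.sh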

Hope this helps a bit!