artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs

Home Page: https://arxiv.org/abs/2305.14314


Could not reproduce the results listed in your paper using a single 3090 card.

LiZhangMing opened this issue

Details:
Here is the result from your paper:
[image: results table from the paper]
I used the following command to reproduce the result of the LLaMA 7B model on the Guanaco (OASST1) dataset:
CUDA_VISIBLE_DEVICES=2 sh scripts/finetune_guanaco_7b.sh
and the best result I obtained is:
[image: reproduced result]

A 1% difference is not a big deal, comrade. That could be noise.

For LLaMA 7B, I was only able to reproduce the Alpaca result (paper: 38.8); the other datasets came out lower: 32.7 for chip2 (paper: 34.5), 30.9 for longform (paper: 32.1), and 33.7 for self-instruct (paper: 36.4).

Does anyone have ideas about this? Thanks!

For the Alpaca dataset, how should the hyperparameters be set?

Simply follow the bash files in ./scripts/.
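
For example, the Guanaco 7B script invoked earlier in this thread (the other scripts in ./scripts/ follow the same pattern, so pick the one matching your dataset):

CUDA_VISIBLE_DEVICES=0 sh scripts/finetune_guanaco_7b.sh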

Actually, according to another issue in this project, the paper evaluates on the MMLU test set, while qlora.py reports performance on the MMLU dev set. Hence, you need to modify the Python file to add evaluation on the test set. After that, the results on alpaca and longform should be reproducible, which I've confirmed myself.
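
If your checkout of qlora.py already exposes MMLU-evaluation flags (both --do_mmlu_eval and --mmlu_split appear in the command quoted below), reporting test-set rather than dev-set accuracy may be as simple as passing the split explicitly. A minimal sketch, assuming those flags behave as their names suggest:

# Sketch: evaluate MMLU on the test split instead of the dev split.
# Verify both flags exist in your copy of qlora.py before relying on
# the reported numbers.
python qlora.py \
    --model_name_or_path huggyllama/llama-7b \
    --dataset alpaca \
    --do_train \
    --do_mmlu_eval \
    --mmlu_split test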

@Forence1999 Could you share how you reproduced it? I only got 32.1 with the original hyperparameters. Thanks!

python qlora.py \
    --model_name_or_path huggyllama/llama-7b \
    --use_auth \
    --output_dir /fly/results/qlora \
    --logging_steps 10 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 500 \
    --save_total_limit 40 \
    --evaluation_strategy steps \
    --eval_dataset_size 1024 \
    --max_eval_samples 1000 \
    --per_device_eval_batch_size 1 \
    --max_new_tokens 32 \
    --dataloader_num_workers 1 \
    --group_by_length \
    --logging_strategy steps \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --do_mmlu_eval \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_modules all \
    --double_quant \
    --quant_type nf4 \
    --bf16 \
    --bits 16 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type constant \
    --gradient_checkpointing \
    --dataset alpaca \
    --source_max_len 16 \
    --target_max_len 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --max_steps 1875 \
    --eval_steps 187 \
    --learning_rate 0.0002 \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --lora_dropout 0.1 \
    --weight_decay 0.0 \
    --seed 0 \
    --mmlu_split test
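
Two of those flags stand out against the paper's setup: --bits 16 loads the base model unquantized (the paper's QLoRA runs use a 4-bit NF4 base model with double quantization), and --source_max_len 16 truncates the instruction to 16 tokens, which suits OASST1 (where the dialogue lives in the target) but likely clips Alpaca prompts. A minimal sketch of a command closer to the paper-style quantized setup; the source/target lengths are assumptions based on qlora.py's defaults, so check them against your checkout:

# Sketch, not the authors' exact script: 4-bit NF4 base model with
# double quantization, and a source length long enough that Alpaca
# instructions are not cut off.
python qlora.py \
    --model_name_or_path huggyllama/llama-7b \
    --dataset alpaca \
    --bits 4 \
    --quant_type nf4 \
    --double_quant \
    --bf16 \
    --source_max_len 1024 \
    --target_max_len 256 \
    --do_train \
    --do_eval \
    --do_mmlu_eval \
    --mmlu_split test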

Hi @Edenzzzz, sorry, I can no longer share my scripts; it has been a long time since I used them. It also looks like quite a few parameters in your script have been modified from the originals.

Suggestions:

  1. Use the scripts provided by the author, with as few modifications as possible.
  2. Build a Docker image from the Dockerfile provided by the author. A consistent environment is critically important for reproducing exact results, and it will also save you a lot of time. If you don't want to build from scratch, you can simply pull my image (docker pull forence/open-instruct:v1); see the sketch after this list. Note that I built it with the author's Dockerfile as a reference, but did not follow it exactly.
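
A minimal sketch of pulling that image and running the repo inside it; the mount path, working directory, and GPU flag are assumptions to adapt to your machine:

docker pull forence/open-instruct:v1
# Run with GPU access, mounting a local qlora checkout into the
# container; /workspace/qlora is an assumed path, not something the
# image defines.
docker run --gpus all -it \
    -v "$(pwd)/qlora:/workspace/qlora" \
    -w /workspace/qlora \
    forence/open-instruct:v1 \
    bash scripts/finetune_guanaco_7b.sh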

Hope this helps a bit!