arielnlee / Platypus

Code for fine-tuning Platypus fam LLMs using LoRA


OOM when training LLaMA-2-70B

0three opened this issue

After 4 steps, the OOM occurs.

Reducing batch_size to 8, the OOM still occurs.

Hardware: 8× A100 80GB.

Which library are you using: torchrun, accelerate, naive Python, etc.?

Thanks for your reply. I use torchrun, which is the default setting in fine-tuning.sh.

Got it! For the 70B model you'll need to use accelerate or some other library that takes advantage of model parallelism (torchrun does data parallelism). See the finetune.py section of our README for additional details: https://github.com/arielnlee/Platypus#fine-tuning-finetunepy
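
To make the difference concrete, here is a rough sketch of the two launch modes, assuming the alpaca-lora-style finetune.py this repo builds on; the exact flags live in fine-tuning.sh and the README, and the model name below is just an example.

```bash
# Data parallelism: torchrun spawns one process per GPU, and each rank tries
# to hold a full copy of the 70B weights on its own 80 GB A100 -- this is
# what runs out of memory.
torchrun --nproc_per_node 8 finetune.py --base_model meta-llama/Llama-2-70b-hf

# Model parallelism: a single process (WORLD_SIZE left at 1) falls back to
# device_map="auto", so the 70B layers are sharded across all visible GPUs
# instead of being replicated per rank.
python finetune.py --base_model meta-llama/Llama-2-70b-hf
```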

Thanks for the heads up. I will fix the fine-tuning.sh settings. Please let me know if you have any additional questions.

Haha, I can only fine-tune it with lora_rank 8 and cutoff length 512 for a simple fine-tuning run.

I'll try accelerate later. (It might cause a performance reduction, from my perspective.)

Thanks for your suggestions!

Sorry to hear that! I just used the python finetune.py command in the fine-tuning section (the alternative to torchrun) and it worked with lora_r 16 / micro batch size 1 / batch_size 32 on 4 A100 80GB GPUs, with cutoff length 4096. It took about 20 hours to run. Maybe try setting world_size=1 so you have model parallelism?
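
For reference, that configuration translates to roughly the command below. The flag names are assumed from the alpaca-lora-style script and the data path is a placeholder, so double-check them against fine-tuning.sh and the README before running.

```bash
# Single-process launch: WORLD_SIZE stays at 1, so the 70B model is sharded
# across the 4 visible A100 80GB GPUs (model parallelism) instead of being
# replicated on every rank.
python finetune.py \
    --base_model meta-llama/Llama-2-70b-hf \
    --data_path ./final_data.json \
    --output_dir ./llama2-platypus-70b-lora \
    --batch_size 32 \
    --micro_batch_size 1 \
    --cutoff_len 4096 \
    --lora_r 16
```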