arielnlee / Platypus

Code for fine-tuning Platypus fam LLMs using LoRA

Weird CUDA memory utilization

lwmlyy opened this issue

commented

Hi, I am using the plain python launch to LoRA-finetune Llama2-70b, and the training itself runs fine. But it seems a bit weird that the GPU memory utilization is quite low, less than 18 GB. Also, the training speed is relatively slow compared to the codebase in llama-recipes.

The command is:
[screenshot of the launch command]

The GPU status during training is:
[screenshot of GPU utilization during training]

Hi! Thanks for your interest. Have you tried accelerate? That worked for us! The python way also works, but is very slow. Definitely try accelerate, but if you don't want to, I'd at least switch to 4 A100 80GB GPUs.

commented

First run accelerate config to set up accelerate and then replace python finetune.py with accelerate launch finetune.py. If that doesn't work, I'll be happy to get you a script.

To clarify, python finetune.py will not run as quickly on 4 GPUs as on 8, but when we tried it the native python way, 8 GPUs seemed a bit of a waste, since, as you noticed, utilization isn't great.
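For reference, a minimal sketch of the two steps, assuming an alpaca-lora-style finetune.py; the flag names and values below (--base_model, --data_path, --output_dir) are illustrative assumptions, not copied from this repo:

```bash
# One-time interactive setup: pick multi-GPU and the number of processes (one per GPU).
accelerate config

# Then launch the same script through accelerate instead of plain python.
# Flag names/values are assumptions; match them to the actual finetune.py arguments.
accelerate launch finetune.py \
    --base_model meta-llama/Llama-2-70b-hf \
    --data_path ./data.json \
    --output_dir ./llama2-70b-platypus-lora
```

The only change is the launcher: accelerate launch spawns one worker process per GPU according to the saved config, while python finetune.py runs a single process, which is consistent with the low utilization reported above.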

commented

I just tried running the script with accelerate launch (8x A100 80GB), but it went CUDA OOM during model loading. Any advice?

The accelerate config is as follows:
[screenshot of the accelerate config]

The launch config is as follows:
[screenshot of the launch command]

Same problem. I solved it by reinstalling the python packages with the versions from requirements.txt; I think it is related to the peft package.
But after that I still get CUDA OOM when cutoff_len is bigger than 1024.
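
A sketch of that workaround, plus the usual knobs for the remaining OOM at longer sequence lengths; the flag names (--cutoff_len, --micro_batch_size, etc.) are assumed from an alpaca-lora-style finetune.py and the values are only illustrative:

```bash
# Reinstall the exact package versions the repo was tested against (peft in particular).
pip install -r requirements.txt --force-reinstall

# If OOM still appears with cutoff_len above 1024, lower the sequence length and/or
# the per-GPU batch size. Flag names/values here are assumptions, not the repo's defaults.
accelerate launch finetune.py \
    --base_model meta-llama/Llama-2-70b-hf \
    --data_path ./data.json \
    --output_dir ./llama2-70b-platypus-lora \
    --cutoff_len 1024 \
    --micro_batch_size 1
```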