Weird CUDA memory utilization
lwmlyy opened this issue · comments
Hi, I am using the python launch to LoRA-finetune Llama2-70b, and the training is going well. But it seems a bit weird that memory utilization is quite low, less than 18 GB. Also, training is relatively slow compared to the codebase in llama-recipes.
Hi! Thanks for your interest. Have you tried accelerate? That worked for us! The python way also works, but is very slow. Definitely try accelerate, but if you don’t want to I’d at least switch to 4 A100 80gb GPUs.
First run `accelerate config` to set up accelerate, and then replace `python finetune.py` with `accelerate launch finetune.py`. If that doesn't work, I'll be happy to get you a script.
To clarify, running `python finetune.py` will not run as quickly on 4 GPUs as on 8, but when we tried it the native python way, 8 GPUs seemed a bit of a waste since, as you noticed, utilization isn't great.
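For reference, `accelerate config` writes its answers to `~/.cache/huggingface/accelerate/default_config.yaml`, and a single-node multi-GPU setup might look roughly like this (the values below are illustrative for one 8-GPU machine, not taken from this issue):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0
num_machines: 1
num_processes: 8
mixed_precision: bf16
```

`num_processes` should match the number of GPUs you intend to use (8 here, or 4 if you follow the 4×A100 suggestion above).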
I just tried running the script with `accelerate launch` (8×A100-80GB), but it went CUDA OOM during model loading. Any advice?
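An OOM at load time is plausible if each rank materializes a full-precision copy of the weights before any sharding happens. A back-of-the-envelope sketch (the 70B parameter count and byte sizes are round-number assumptions, not measurements from this setup):

```python
# Rough GPU memory needed just to hold Llama-2-70B weights,
# assuming 70e9 parameters (real checkpoints differ slightly).
PARAMS = 70e9

def weight_gb(bytes_per_param: float) -> float:
    """Gigabytes required to store all weights at a given precision."""
    return PARAMS * bytes_per_param / 1e9

fp16 = weight_gb(2)    # 140 GB -> exceeds a single 80 GB A100
int4 = weight_gb(0.5)  # 35 GB  -> fits comfortably

print(f"fp16: {fp16:.0f} GB, 4-bit: {int4:.0f} GB")
```

So if every one of the 8 processes tries to hold its own fp16 copy during loading, 8×80 GB still OOMs at that step; sharded or quantized loading avoids materializing the full-precision copy per rank.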
Same problem. I solved it by reinstalling the Python packages with the versions in requirements.txt; I think it is related to the peft package.
But after that it still runs out of CUDA memory when cutoff_len is bigger than 1024.
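The sensitivity to `cutoff_len` is consistent with attention activations growing quadratically in sequence length. A rough sketch of just the attention score matrices (the head count, layer count, and batch size below are illustrative assumptions, not Llama-2-70B's exact config):

```python
def attn_scores_gb(seq_len: int, heads: int = 64, layers: int = 80,
                   batch: int = 1, bytes_per_el: int = 2) -> float:
    """Memory for the seq_len x seq_len attention score matrices alone,
    summed over layers, in GB (ignores other activations kept for backward)."""
    return batch * heads * layers * seq_len ** 2 * bytes_per_el / 1e9

# Doubling cutoff_len roughly quadruples this term:
print(attn_scores_gb(1024), attn_scores_gb(2048))
```

That quadratic term is one reason a run that fits at `cutoff_len=1024` can OOM at 2048 even though the weights themselves fit.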