artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs

Home Page: https://arxiv.org/abs/2305.14314

Multiple GPU inference

Zheng392 opened this issue

I run inference with the Llama 70B model on four 16GB V100 GPUs. I just call model.generate() to produce output, but I found that only one GPU is fully utilized at a time. Since the 70B model requires at least 40GB of VRAM to load, I can't do data parallelism. How can I make full use of all four GPUs to increase the speed?
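For reference, a minimal sketch of the setup described above, assuming the 70B model is loaded with transformers' 4-bit quantization (`BitsAndBytesConfig`) and `device_map="auto"`; the model id and generation parameters here are placeholders, not taken from the issue:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-70b-hf"  # placeholder model id

# 4-bit NF4 quantization; fp16 compute since V100s do not support bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate shard the layers across the 4 GPUs
)

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```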


The way "accelerate" works is by putting different network layers on different GPUs. When you input your data, it gets processed layer by layer, gpu by gpu.