artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs

Home Page: https://arxiv.org/abs/2305.14314

Multiple GPU inference

Zheng392 opened this issue

I run inference with the Llama 70B model on four 16GB V100 GPUs. I just call model.generate() to produce output, but I found that only one GPU is fully utilized at a time. Since the 70B model requires at least 40GB of VRAM to load, I can't do data parallelism. How can I make full use of all four GPUs to increase the speed?
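For reference, a minimal sketch of the setup described above, assuming the 70B model is loaded with transformers' 4-bit quantization (`BitsAndBytesConfig`) and `device_map="auto"`; the model id and generation parameters here are placeholders, not taken from the issue:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-70b-hf"  # placeholder model id

# 4-bit NF4 quantization; fp16 compute since V100s do not support bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate shard the layers across the 4 GPUs
)

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```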


The way "accelerate" works is by putting different network layers on different GPUs. When you input your data, it gets processed layer by layer, gpu by gpu.