OOM
phalexo opened this issue
I have 4 GPUs, with 12.2GB each. I see you are using accelerate and I can see model shards being loaded into 4 GPUs, but it still runs out of VRAM. Why?
A 6B model is pretty small; it should fit easily into 48GB+.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 0; 11.93 GiB total capacity; 11.16 GiB already allocated; 370.88 MiB free; 11.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
(FinGPT) developer@ai:~/PROJECTS/FinGPT$
Hi, phalexo. Maybe you should check your strategy for loading the model. Let's take this notebook as an example.
Usually, you can load the model in this way:
model = LlamaForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,
    device_map="cuda:0",
)
However, this code only loads the model onto a single GPU (the first one) in FP16/FP32/BF16. If you want to spread the model across multiple GPUs, change device_map to "auto". If that doesn't work, or the VRAM allocation ends up unbalanced, you can instead set device_map to "balanced". And if you want to use quantization, set load_in_8bit = True or load_in_4bit = True. So here is the recommended version for multiple GPUs with quantization:
model = LlamaForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,
    load_in_8bit=True,
    device_map="auto",
)