AI4Finance-Foundation / FinGPT

FinGPT: Open-Source Financial Large Language Models! Revolutionize 🔥 We release the trained model on HuggingFace.

Home Page: https://ai4finance.org

OOM

phalexo opened this issue · comments

I have 4 GPUs with 12.2 GB each. I see you are using accelerate, and I can see model shards being loaded onto the 4 GPUs, but it still runs out of VRAM. Why?

A 6B model is pretty small; it should fit easily into 48GB+.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 0; 11.93 GiB total capacity; 11.16 GiB already allocated; 370.88 MiB free; 11.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
(FinGPT) developer@ai:~/PROJECTS/FinGPT$
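The traceback itself suggests trying max_split_size_mb when reserved memory is much larger than allocated memory. A minimal sketch of applying that allocator setting, assuming the 128 MiB value is just an illustrative choice and not taken from this issue:

import os

# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation,
# so set it before importing torch or anything that initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the environment variable is in place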

Hi, phalexo. Maybe you should check your strategy for loading the model. Let's take this notebook as an example.

Usually, you can load the model in this way:

from transformers import LlamaForCausalLM

# Load the full-precision (FP16/FP32/BF16) model onto the first GPU only
model = LlamaForCausalLM.from_pretrained(
    base_model,              # path or Hugging Face model ID
    trust_remote_code=True,
    device_map="cuda:0",
)

However, this code loads the model onto a single GPU (the first one) in FP16/FP32/BF16. If you want to spread the model across multiple GPUs, change device_map to "auto". If that doesn't work, or the VRAM allocation is unbalanced, you can set device_map to "balanced" instead. If you also want quantization, set the hyperparameter load_in_8bit = True or load_in_4bit = True. So here is the recommended version for multiple GPUs with quantization:

model = LlamaForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,
    load_in_8bit=True,       # 8-bit quantization (requires the bitsandbytes package)
    device_map="auto",       # shard layers across all visible GPUs
)
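
If 8-bit still does not fit, a 4-bit load follows the same pattern. This is a minimal sketch; the BitsAndBytesConfig values (nf4 quantization, bfloat16 compute) are common choices and not taken from the FinGPT notebook:

import torch
from transformers import BitsAndBytesConfig, LlamaForCausalLM

# Illustrative 4-bit quantization settings (assumed, not from the notebook)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = LlamaForCausalLM.from_pretrained(
    base_model,                      # same path / model ID as above
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)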

For more details, you may check here and here.