OOM
phalexo opened this issue
I have 4 GPUs, with 12.2GB each. I see you are using accelerate and I can see model shards being loaded into 4 GPUs, but it still runs out of VRAM. Why?
A 6B model is pretty small; it should fit easily into 48GB+.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 508.00 MiB (GPU 0; 11.93 GiB total capacity; 11.16 GiB already allocated; 370.88 MiB free; 11.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
(FinGPT) developer@ai:~/PROJECTS/FinGPT$
Hi, phalexo. Maybe you should check your strategy for loading the model. Let's take this notebook as an example.
Usually, you can load the model in this way:
model = LlamaForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,
    device_map="cuda:0",
)
However, this code only loads the model onto a single GPU (the first one) in FP16/FP32/BF16. If you want to spread the model across multiple GPUs, change device_map to "auto". If that doesn't work, or the VRAM allocation ends up unbalanced, you can instead set device_map to "balanced". And if you want to use quantization, set load_in_8bit = True or load_in_4bit = True. So here is the recommended version for multiple GPUs with quantization:
model = LlamaForCausalLM.from_pretrained(
    base_model,
    trust_remote_code=True,
    load_in_8bit=True,
    device_map="auto",
)