mshumer / gpt-llm-trainer


Merge the model and store in Google Drive (Section)

KabaTubare opened this issue · comments

It always runs out of memory; please remedy this issue. This is the error I get constantly, and I am using Colab Pro with a V100, which I think should be enough for this project: 0/3 [02:11<?, ?it/s]

```
OutOfMemoryError                          Traceback (most recent call last)
in <cell line: 8>()
      6
      7 # Reload model in FP16 and merge it with LoRA weights
----> 8 base_model = AutoModelForCausalLM.from_pretrained(
      9     model_name,
     10     low_cpu_mem_usage=True,

4 frames
/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics)
    296     module._parameters[tensor_name] = param_cls(new_value, requires_grad=old_value.requires_grad)
    297 elif isinstance(value, torch.Tensor):
--> 298     new_value = value.to(device)
    299 else:
    300     new_value = torch.tensor(value, device=device)

OutOfMemoryError: CUDA out of memory. Tried to allocate 314.00 MiB (GPU 0; 15.77 GiB total capacity; 14.32 GiB already allocated; 2.12 MiB free; 14.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
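For context on why a 16 GB V100 can run out here: the failing cell reloads the full base model in FP16, which needs about 2 bytes per parameter before the LoRA weights are even merged. A back-of-the-envelope sketch (the 7B parameter count is an assumption about the base model being used):

```python
# Rough GPU memory estimate for the "reload model in FP16" step.
# Assumption: a 7B-parameter base model; adjust `params` for other sizes.
params = 7e9               # model parameters (assumed)
bytes_per_param_fp16 = 2   # FP16 stores 2 bytes per parameter

weights_gib = params * bytes_per_param_fp16 / 2**30
print(f"FP16 weights alone: {weights_gib:.1f} GiB")  # ~13.0 GiB

v100_gib = 15.77           # total capacity reported in the traceback
print(f"Headroom on the V100: {v100_gib - weights_gib:.1f} GiB")
```

With only a couple of GiB of headroom left for the CUDA context, activations, and the adapter weights, the 14.32 GiB "already allocated" figure in the traceback is consistent with this estimate, which is why freeing memory before the merge (or merging on CPU) matters.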

I have this problem too

Hey, I use a V100 and it works. Have you turned on the high-RAM runtime?

After training the model and saving it to the Google Drive folder, you should restart the runtime and run only the inference cells. That frees the RAM and storage needed to load the model.
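If it still runs out of memory after a restart, the error message's own suggestion, setting `max_split_size_mb`, can reduce allocator fragmentation. A minimal sketch, assuming the variable is set in the first cell after the restart, before torch initializes CUDA (the 128 MiB value is an assumed starting point, not a tuned one):

```python
import os

# Must be set before torch initializes CUDA, so run this in the first cell
# after restarting the runtime. 128 MiB is an assumed starting value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

This only helps when reserved memory is much larger than allocated memory (i.e., fragmentation); it cannot make a model fit that is simply larger than the GPU.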

This still does not work, but I figure the authors of this project, or someone else, will eventually get it right, since others are moving quickly to fill this turnkey LLM-training niche. The authors do not seem to realize that tech firms use this thread not only to find issues but also to see how they are resolved.