WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

CUDA out of memory during training

shubzk opened this issue

I am training the yolov7 model on a custom dataset on an Azure ML VM with 2 NVIDIA V100 GPUs, using the following command:

python train.py --img 3072 --batch 2 --epochs 10 --data {dataset.location}/data.yaml --weights 'best.pt' --device 0,1 --save_period 1

However, about 29% of the way through the first epoch, I get the following CUDA out-of-memory error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 264.00 MiB. GPU 0 has a total capacity of 15.77 GiB of which 173.12 MiB is free. Including non-PyTorch memory, this process has 15.60 GiB memory in use. Of the allocated memory 14.43 GiB is allocated by PyTorch, and 738.57 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Please help.
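
For what it's worth, the error message itself names one mitigation: when the "reserved by PyTorch but unallocated" figure is large, the allocator setting it suggests can be tried by re-running the same command with the environment variable set, e.g. (untested sketch; requires a PyTorch build recent enough to support expandable segments):

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python train.py --img 3072 --batch 2 --epochs 10 --data {dataset.location}/data.yaml --weights 'best.pt' --device 0,1 --save_period 1

With only ~738 MiB reserved-but-unallocated here, though, fragmentation is not the main problem; the run is simply short of memory at this resolution.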

@shubzk reduce the image size from 3072 to 1280.
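
Applied to the original command, that is a one-flag change; activation memory grows roughly with the square of the input side, so 1280 should need less than a fifth of what 3072 does (sketch, not tested):

python train.py --img 1280 --batch 2 --epochs 10 --data {dataset.location}/data.yaml --weights 'best.pt' --device 0,1 --save_period 1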

@dsbyprateekg Thank you. What I have done instead is increase my compute power by switching to an A100, since the images I am using become unusable below 3072.
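
A side note on the two-GPU setup: a plain python train.py launch with --device 0,1 typically runs DataParallel, while the multi-GPU example in the yolov7 README uses the distributed launcher. Adapted to two V100s, it might look like the following untested sketch (flag values carried over from the command above):

python -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --workers 8 --device 0,1 --sync-bn --img 3072 --batch 2 --epochs 10 --data {dataset.location}/data.yaml --weights 'best.pt' --save_period 1

This balances memory across the two cards better than DataParallel, but it will not make a 3072-pixel input fit if a single image already exhausts a 16 GB V100, so reducing --img or moving to a larger-memory GPU remains the practical fix.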