WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

CUDA out of memory during training

shubzk opened this issue

I am training the yolov7 model on a custom dataset on an Azure ML VM with 2 NVIDIA V100 GPUs, using the following command:

python train.py --img 3072 --batch 2 --epochs 10 --data {dataset.location}/data.yaml --weights 'best.pt' --device 0,1 --save_period 1

However, about 29% of the way through the first epoch, I get the following CUDA out-of-memory error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 264.00 MiB. GPU 0 has a total capacity of 15.77 GiB of which 173.12 MiB is free. Including non-PyTorch memory, this process has 15.60 GiB memory in use. Of the allocated memory 14.43 GiB is allocated by PyTorch, and 738.57 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Please help.
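
For what it's worth, the error message itself names one mitigation: when the "reserved by PyTorch but unallocated" figure is large, the allocator setting it suggests can be tried by re-running the same command with the environment variable set, e.g. (untested sketch; requires a PyTorch build recent enough to support expandable segments):

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python train.py --img 3072 --batch 2 --epochs 10 --data {dataset.location}/data.yaml --weights 'best.pt' --device 0,1 --save_period 1

With only ~738 MiB reserved-but-unallocated here, though, fragmentation is not the main problem; the run is simply short of memory at this resolution.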

@shubzk reduce the image size from 3072 to 1280.
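
Applied to the original command, that is a one-flag change; activation memory grows roughly with the square of the input side, so 1280 should need less than a fifth of what 3072 does (sketch, not tested):

python train.py --img 1280 --batch 2 --epochs 10 --data {dataset.location}/data.yaml --weights 'best.pt' --device 0,1 --save_period 1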

@dsbyprateekg Thank you. What I have done instead is increase my compute power by switching to an A100, since the images I am using become unusable below 3072.
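
A side note on the two-GPU setup: a plain python train.py launch with --device 0,1 typically runs DataParallel, while the multi-GPU example in the yolov7 README uses the distributed launcher. Adapted to two V100s, it might look like the following untested sketch (flag values carried over from the command above):

python -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --workers 8 --device 0,1 --sync-bn --img 3072 --batch 2 --epochs 10 --data {dataset.location}/data.yaml --weights 'best.pt' --save_period 1

This balances memory across the two cards better than DataParallel, but it will not make a 3072-pixel input fit if a single image already exhausts a 16 GB V100, so reducing --img or moving to a larger-memory GPU remains the practical fix.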