Duankaiwen / CenterNet

Code for our paper "CenterNet: Keypoint Triplets for Object Detection".

CUDA OOM on batch size 1 (batch norm)

avinashkaur93 opened this issue

Hi @Duankaiwen
I replicated your code and ran several experiments successfully on the COCO dataset using the following environment: PyTorch 1.0.0, CUDA 10.1.168, gcc 5.4.0.

In the same environment, with my own dataset, I get a CUDA OOM error (single GPU, batch size = 1). My input image size is the same, [511, 511]. Training runs for about 400 steps before it suddenly hits OOM. There is no steady increase in GPU memory, so it does not appear to be a memory leak either.
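For reference, a minimal sketch of how per-step GPU memory can be logged with PyTorch's built-in counters to tell a gradual leak apart from a sudden spike on one particular batch (the helper name `log_gpu_memory` and the 50-step interval are illustrative, not from the CenterNet code):

```python
# Illustrative helper (not part of the CenterNet code): log per-step GPU memory
# using PyTorch's built-in counters.
import torch

def log_gpu_memory(step, device=0):
    # memory_allocated: memory held by live tensors; max_memory_allocated: peak so far
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 3
    peak = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    print("step {}: allocated {:.2f} GiB, peak {:.2f} GiB".format(step, allocated, peak))

# e.g. inside the training loop:
# if step % 50 == 0:
#     log_gpu_memory(step)
```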

Here's the complete log trace and config:
log.txt

Last few lines of the log:

File "/mnt/dfs/avinashk/CenterNet/CenterNet-owndata-tensorboard/CenterNet/models/py_utils/utils.py", line 15, in forward
bn = self.bn(conv)
File "/home/avinashk/miniconda3/envs/CenterNet-PT10-TF/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/avinashk/miniconda3/envs/CenterNet-PT10-TF/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 76, in forward
exponential_average_factor, self.eps)
File "/home/avinashk/miniconda3/envs/CenterNet-PT10-TF/lib/python3.6/site-packages/torch/nn/functional.py", line 1623, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 3.71 GiB (GPU 0; 10.92 GiB total capacity; 7.31 GiB already allocated; 2.67 GiB free; 25.91 MiB cached)

What mainly confuses me is that the error occurs inside batch norm. I'm a TensorFlow user and fairly new to PyTorch, so any help would be appreciated.