deterministic-algorithms-lab / Cross-Lingual-Voice-Cloning

Tacotron 2 - PyTorch implementation with faster-than-realtime inference, modified to enable cross-lingual voice cloning.

RuntimeError: CUDA error: out of memory

60999 opened this issue · comments

Sorry to bother you. On a V100-SXM2 GPU with 32 GB, the following error always appears:

```
python train.py --output_directory=outdir/ --log_directory=logdir/ -c tacotron2_statedict.pt --warm_start
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Traceback (most recent call last):
  File "train.py", line 292, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 169, in train
    model = load_model(hparams)
  File "train.py", line 74, in load_model
    model = Tacotron2(hparams).cuda()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 265, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 199, in _apply
    param.data = fn(param.data)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 265, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
```

I modified hparams.py to set `batch_size=1`, but the error remains.

Is this expected? How can I fix it?
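
A minimal first check for this failure (a sketch, not from the repo; it assumes a PyTorch version new enough to expose `torch.cuda.mem_get_info` — older installs can get the same numbers from `nvidia-smi`): confirm how much of the 32 GB is actually free, since another process already holding the card produces exactly this error at model-load time.

```python
# Sketch only: report free vs. total memory on GPU 0.
# Requires a PyTorch build that provides torch.cuda.mem_get_info
# (on older installs, `nvidia-smi` shows the same information).
import torch

assert torch.cuda.is_available(), "no CUDA device visible to this process"
free_b, total_b = torch.cuda.mem_get_info(0)
print(f"GPU 0: {free_b / 1024**3:.1f} GiB free of {total_b / 1024**3:.1f} GiB")
```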

You can try reducing `hparams.mcn`, although that shouldn't be a problem with a 32 GB GPU; I am able to run with batch size 1 even on a Colab K80. In fact, changing `batch_size`, `mcn`, etc. won't help here, because you are failing to load the initial model onto your GPU, and that step is independent of batch size. Please let me know your `n_speakers`, since a distribution is created for each speaker, which could have led to the problem. @60999
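
A minimal sketch of that diagnosis, assuming the repo keeps the upstream Tacotron 2 entry points (`hparams.create_hparams` and `model.Tacotron2`, as seen in the traceback) and an `n_speakers` hyperparameter: build the model on the CPU first, which isolates the allocation that fails from `batch_size` entirely and shows how `n_speakers` enters the parameter count.

```python
# Sketch only (entry points assumed from the upstream Tacotron 2 layout):
# instantiate on CPU and estimate the footprint before ever calling .cuda(),
# since the traceback fails inside Tacotron2(hparams).cuda(), before any batch
# (and hence batch_size) is involved.
from hparams import create_hparams
from model import Tacotron2

hparams = create_hparams()
print("n_speakers:", hparams.n_speakers)   # per-speaker components scale with this

model = Tacotron2(hparams)                 # CPU only, no .cuda() yet
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters, "
      f"~{n_params * 4 / 1024**3:.2f} GiB in fp32 (weights only)")
```

If that figure is nowhere near 32 GB, the weights themselves are not the culprit, which again points at memory already held by another process on the card.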