astra-vision / CoMoGAN

CoMoGAN: continuous model-guided image-to-image translation. CVPR 2021 oral.

CUDA out of memory. Tried to allocate 90.00 MiB (GPU 3; 23.70 GiB total capacity; 22.06 GiB already allocated; 20.56 MiB free; 22.35 GiB reserved in total by PyTorch)

xxlbigbrother opened this issue · comments

commented

I'm training on an NVIDIA 3090 (24 GiB total), and it still runs out of memory. Could you help me figure out what the problem is?
Thanks!

Hi, can you give more details about your CUDA installation and setup? Have you tried running the code through Docker?

commented

> Hi, can you give more details about your CUDA installation and setup? Have you tried running the code through Docker?

My CUDA version is 11.3. I tried mixed-precision training by adding --mixed_precision, but it didn't work. I haven't run the code through Docker.
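
For reference, a minimal sketch of how native 16-bit precision is typically switched on in PyTorch Lightning 1.x; the repo's --mixed_precision flag presumably maps to something like this (the Trainer arguments below are illustrative, not CoMoGAN's actual train.py):

```python
# Minimal sketch (not CoMoGAN's exact entry point): in PyTorch Lightning 1.x,
# native automatic mixed precision is controlled by the Trainer's `precision` argument.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=1,          # number of GPUs to train on
    precision=16,    # native 16-bit AMP; the default is 32-bit
)
# trainer.fit(model, datamodule)  # model and data defined elsewhere
```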

OK, I advise you to try Docker to rule out any compatibility issues. Also, can you give more detail about the failure you had with mixed precision?

commented

> OK, I advise you to try Docker to rule out any compatibility issues. Also, can you give more detail about the failure you had with mixed precision?

Global seed set to 1
168780
148470
20310
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

| Name | Type | Params

0 | netG_A | AdaINGen | 15.0 M
1 | netG_B | AdaINGen | 15.0 M
2 | netDRB | DRB | 4.7 M
3 | netD_A | MsImageDis | 8.3 M
4 | netD_B | MsImageDis | 8.3 M
5 | netPhi_net | StyleEncoder | 2.8 M
6 | netPhi_net_A | StyleEncoder | 2.8 M
7 | reconCriterion | L1Loss | 0
8 | criterionPhysics | L1Loss | 0
9 | criterionIdt | L1Loss | 0
10 | instance_norm | InstanceNorm2d | 0
11 | vgg | Vgg16 | 14.7 M

56.9 M Trainable params
14.7 M Non-trainable params
71.6 M Total params
286.272 Total estimated model params size (MB)
Epoch 0: 0%| | 0/148470 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 58, in
start(ap.parse_args())
File "train.py", line 41, in start
trainer.fit(model, dataset)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
self._run(model)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
self.dispatch()
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
self.accelerator.start_training(self)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
return self.run_train()
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
self.train_loop.run_training_epoch()
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 442, in optimizer_step
using_lbfgs=is_lbfgs,
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 326, in optimizer_step
self.lightning_module, optimizer, opt_idx, lambda_closure, **kwargs
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 86, in pre_optimizer_step
lambda_closure()
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 733, in train_step_and_backward_closure
split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
training_step_output = self.trainer.accelerator.training_step(args)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
return self.training_type_plugin.training_step(*args)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 155, in training_step
return self.lightning_module.training_step(*args, **kwargs)
File "/home1/users/dinglihe/low_light_enhancement/CoMoGAN-main/networks/comomunit_model.py", line 396, in training_step
return self.training_step_G()
File "/home1/users/dinglihe/low_light_enhancement/CoMoGAN-main/networks/comomunit_model.py", line 299, in training_step_G
self.y_M = self.netG_B.decode(features_A_physics)
File "/home1/users/dinglihe/low_light_enhancement/CoMoGAN-main/networks/backbones/comomunit.py", line 75, in decode
return self.dec(features)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home1/users/dinglihe/low_light_enhancement/CoMoGAN-main/networks/backbones/comomunit.py", line 166, in forward
return self.model(x)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home1/users/dinglihe/low_light_enhancement/CoMoGAN-main/networks/backbones/comomunit.py", line 352, in forward
x = self.norm(x)
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home1/users/dinglihe/low_light_enhancement/CoMoGAN-main/networks/backbones/comomunit.py", line 521, in forward
x = x * self.gamma.view(*shape) + self.beta.view(*shape)
RuntimeError: CUDA out of memory. Tried to allocate 352.00 MiB (GPU 2; 23.70 GiB total capacity; 21.70 GiB already allocated; 2.56 MiB free; 22.36 GiB reserved in total by PyTorch)
Exception ignored in: <function tqdm.__del__ at 0x7fc804bc7560>
Traceback (most recent call last):
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/tqdm/std.py", line 1145, in __del__
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/tqdm/std.py", line 1299, in close
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/tqdm/std.py", line 1492, in display
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/tqdm/std.py", line 1148, in __str__
File "/home1/users/dinglihe/.conda/envs/dean/lib/python3.7/site-packages/tqdm/std.py", line 1450, in format_dict
TypeError: cannot unpack non-iterable NoneType object

Thanks! Also, can you tell me how long the training will take?
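
As an aside, a few standard PyTorch diagnostics (not CoMoGAN-specific) can help pin down where memory goes before an OOM like the one above:

```python
# Hedged sketch: stock PyTorch memory diagnostics, useful for sizing
# batch size / image resolution against a 24 GiB card.
import torch

device = torch.device("cuda:0")
print(torch.cuda.get_device_name(device))
# Detailed allocator report: allocated vs. reserved vs. free memory.
print(torch.cuda.memory_summary(device))
# Peak allocation so far; compare against total capacity.
print(f"peak allocated: {torch.cuda.max_memory_allocated(device) / 2**20:.0f} MiB")
```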

commented

> OK, I advise you to try Docker to rule out any compatibility issues. Also, can you give more detail about the failure you had with mixed precision?

I still haven't solved the out-of-memory problem. Have you tried running the code without Docker? Thanks!

Yes, the code was tested both with and without Docker, on NVIDIA Tesla V100 and 2080 Ti. But are you training to reproduce the Waymo day2timelapse experiment, or are you adapting the system to a new task?

commented

> Yes, the code was tested both with and without Docker, on NVIDIA Tesla V100 and 2080 Ti. But are you training to reproduce the Waymo day2timelapse experiment, or are you adapting the system to a new task?

Yes, I want to train CoMoGAN on the nuScenes dataset; nuScenes images are 1600×900. I tested the code on a 3090 and a 2080 Ti, but both failed with the insufficient-memory error. Maybe something is wrong with the dataloader? But the error report doesn't point to the data.

That's because you are training with full-resolution images. For Waymo, we downsampled the images by a factor of 4; I expect that will suit nuScenes as well.
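
For illustration, a factor-4 downsampling at load time could look like the sketch below; the file path and resampling filter are assumptions, not the repo's actual dataloader:

```python
# Sketch of factor-4 downsampling before training; nuScenes frames are
# 1600x900, so this yields 400x225 inputs. The path is hypothetical.
from PIL import Image

def load_downsampled(path, factor=4):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    # Bicubic resampling preserves detail reasonably well at 1/4 scale.
    return img.resize((w // factor, h // factor), Image.BICUBIC)

img = load_downsampled("nuscenes/samples/CAM_FRONT/frame.jpg")
print(img.size)  # (400, 225) for a 1600x900 input
```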

commented

> That's because you are training with full-resolution images. For Waymo, we downsampled the images by a factor of 4; I expect that will suit nuScenes as well.

It works!!! Ah, my mistake, sorry for taking up your time. Thank you so much!

Great! I'm closing the issue then.