edward3862 / LoFGAN-pytorch

LoFGAN: Fusing Local Representations for Few-shot Image Generation. (ICCV 2021)

CUDA out of memory.

kobeshegu opened this issue · comments

Hi, Edward.
Thanks a lot for your excellent work.
I met this problem when running the code on a machine with a single 3090 (20 GB), after training for 2000 iterations.
Any idea how to fix it?
I train the model with your recommended setting:
python train.py --conf configs/flower_lofgan.yaml --output_dir results/flower_lofgan --gpu 0
RuntimeError: CUDA out of memory. Tried to allocate 200.00 MiB (GPU 0; 23.70 GiB total capacity; 18.17 GiB already allocated; 120.56 MiB free; 22.21 GiB reserved in total by PyTorch)

To address this issue, I have tried the following in the training loop (see the sketch below):

  • torch.cuda.empty_cache()
  • del imgs, label
  • gc.collect()

However, none of them helped.
I also tried to detach() the loss items, but the same issue still appears after 2000 iterations.
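
For reference, here is a minimal sketch of this kind of loop-level cleanup in a generic PyTorch training loop; model, optimizer, and train_step are placeholder names, not LoFGAN's Trainer API. The key point is that only a Python float from .item() survives each iteration, so no computation graph is retained across iterations:

# Minimal sketch of loop-level cleanup (generic PyTorch; `model`, `optimizer`,
# and `train_step` are placeholders, not LoFGAN's Trainer).
import gc
import torch

def train_step(model, optimizer, imgs, label):
    imgs, label = imgs.cuda(), label.cuda()

    optimizer.zero_grad()
    loss = model(imgs, label)      # assume the model returns a scalar loss
    loss.backward()
    optimizer.step()

    loss_value = loss.item()       # .item() detaches the value to a Python float

    # explicit cleanup, as listed above; this only helps if no reference to
    # the graph (e.g. an undetached loss tensor) is kept across iterations
    del imgs, label, loss
    gc.collect()
    torch.cuda.empty_cache()
    return loss_value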

Hi @kobeshegu, sorry about that. The experiments in the paper were conducted on a V100 GPU, so I missed this problem.

Considering that snapshot_val_iter is set to 2000, I think a possible reason is that the model is evaluated every 2000 iterations. Maybe you can just skip the evaluation step by commenting out the following code in train.py.

#              if (iterations + 1) % config['snapshot_val_iter'] == 0:
#                  with torch.no_grad():
#                      imgs_test = imgs_test.cuda()
#                      fake_xs = []
#                      for i in range(config['num_generate']):
#                          fake_xs.append(trainer.generate(imgs_test).unsqueeze(1))
#                      fake_xs = torch.cat(fake_xs, dim=1)
#                      write_image(iterations, image_directory, imgs_test.detach(), fake_xs.detach())

Thanks for your reply. I don't think it is caused by the eval(), since you wrap that code in torch.no_grad(). Moreover, I got stuck at 1400 iterations when I moved trainer.cuda() into the loop and detached the loss items: loss_total = loss_adv_dis_real.detach() + loss_adv_dis_fake.detach() + loss_cls_dis.detach()
Changes in the training loop:

while True:
    with torch.autograd.set_detect_anomaly(True):
        imgs_test, _ = iter(test_dataloader).next()
        trainer = Trainer(config)
        iterations = trainer.resume(checkpoint_directory) if args.resume else 0
        for it, (imgs, label) in enumerate(train_dataloader):
            trainer.cuda()  # moved into the training loop
            trainer.update_lr(iterations, max_iter)
            imgs = imgs.cuda()
            label = label.cuda()

            trainer.zero_grad()
            trainer.dis_update(imgs, label)

            trainer.zero_grad()
            trainer.gen_update(imgs, label)

            # try:
            #     trainer.dis_update(imgs, label)
            #     trainer.gen_update(imgs, label)
            # except RuntimeError as exception:
            #     if "out of memory" in str(exception):
            #         print("WARNING: out of memory")
            #         if hasattr(torch.cuda, 'empty_cache'):
            #             torch.cuda.empty_cache()
            #     else:
            #         raise exception

            if (iterations + 1) % config['snapshot_log_iter'] == 0:
                end = time.time()
                print("Iteration: [%06d/%06d], time: %d, loss_adv_dis: %04f, loss_adv_gen: %04f"
                      % (iterations + 1, max_iter, end - start, trainer.loss_adv_dis, trainer.loss_adv_gen))
                write_loss(iterations, trainer, train_writer)

            # cleanup attempts listed above
            del imgs, label
            gc.collect()
            torch.cuda.empty_cache()

FYI, the NVIDIA 3090 GPU I used has 20 GB of memory. I don't know how big your V100 is, but I think my hardware should be enough to run the code, as I have run StyleGAN2 and FUNIT on it.
Looking forward to your reply.
Thanks again for your help.

Hi there, I use a 32 GB V100 and the memory cost for training is about 21 GB. I found that the huge memory cost occurs when applying the gradient penalty to the discriminator; when I simply removed it, the memory consumption plummeted. But for now, I have no idea how to fix it cleanly. I tried setting inplace=True for the activations, which saves about 600 MB, but I'm not sure whether that is enough for you. Otherwise, you may have to use a smaller batch size...
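
For reference, a gradient penalty in the WGAN-GP style is usually computed with a double-backward pass; the generic sketch below (with a placeholder discriminator dis, not necessarily the exact implementation in this repo) shows why it is expensive: create_graph=True forces PyTorch to keep the full activation graph alive for a second backward pass, roughly doubling the discriminator's activation memory.

# Generic WGAN-GP style gradient penalty (sketch; `dis` is a placeholder
# discriminator, not necessarily the exact code used in this repo).
import torch

def gradient_penalty(dis, real, fake, device="cuda"):
    # random interpolation between real and fake samples
    alpha = torch.rand(real.size(0), 1, 1, 1, device=device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    out = dis(interp)
    # create_graph=True keeps the whole forward graph alive so the penalty
    # can itself be backpropagated -- this is where the memory spike comes from
    grads = torch.autograd.grad(outputs=out, inputs=interp,
                                grad_outputs=torch.ones_like(out),
                                create_graph=True, retain_graph=True,
                                only_inputs=True)[0]
    grads = grads.view(grads.size(0), -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

The inplace=True suggestion above refers to activations such as nn.LeakyReLU(0.2, inplace=True), which overwrite their input instead of allocating a separate output tensor.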

Alright then, I'll try some other ways.
Thanks for your patience ^.^

Sorry, it's me again. Does the gradient penalty make a big difference to the performance? The error no longer shows up once I remove the gradient penalty. THX!

How did you deal with it? I have one GPU with 8 GB of memory, and I want to run this method. Can you give me some advice on whether it's possible? THX
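
A quick, generic way to check feasibility on an 8 GB card is to log PyTorch's peak memory over the first few iterations (standard torch.cuda utilities, independent of this repo):

# Sketch: measure peak GPU memory over the first few iterations to see
# how far the run is from fitting into 8 GB (standard PyTorch utilities).
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training iterations here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"peak GPU memory allocated: {peak_gb:.2f} GB")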