edward3862 / LoFGAN-pytorch

LoFGAN: Fusing Local Representations for Few-shot Image Generation. (ICCV 2021)

CUDA out of memory.

kobeshegu opened this issue · comments

Hi, Edward.
Thanks a lot for your excellent work.
I met this problem when running the code on a machine with a single 3090 (20 GB), after training for 2000 iterations.
Any idea how to fix it?
I train the model with your recommended setting:
python train.py --conf configs/flower_lofgan.yaml --output_dir results/flower_lofgan --gpu 0
RuntimeError: CUDA out of memory. Tried to allocate 200.00 MiB (GPU 0; 23.70 GiB total capacity; 18.17 GiB already allocated; 120.56 MiB free; 22.21 GiB reserved in total by PyTorch)

To address this issue, I have tried the following in the training loop (see the sketch below):

  • torch.cuda.empty_cache()
  • del imgs, label
  • gc.collect()

However, none of them helped.
I also tried to detach() the loss items, but the same issue still appears after 2000 iterations.
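
For reference, here is a minimal sketch of this kind of loop-level cleanup in a generic PyTorch training loop; model, optimizer, and train_step are placeholder names, not LoFGAN's Trainer API. The key point is that only a Python float from .item() survives each iteration, so no computation graph is retained across iterations:

# Minimal sketch of loop-level cleanup (generic PyTorch; `model`, `optimizer`,
# and `train_step` are placeholders, not LoFGAN's Trainer).
import gc
import torch

def train_step(model, optimizer, imgs, label):
    imgs, label = imgs.cuda(), label.cuda()

    optimizer.zero_grad()
    loss = model(imgs, label)      # assume the model returns a scalar loss
    loss.backward()
    optimizer.step()

    loss_value = loss.item()       # .item() detaches the value to a Python float

    # explicit cleanup, as listed above; this only helps if no reference to
    # the graph (e.g. an undetached loss tensor) is kept across iterations
    del imgs, label, loss
    gc.collect()
    torch.cuda.empty_cache()
    return loss_value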

Hi @kobeshegu, sorry about that. The experiments in the paper were conducted on a V100 GPU, so I missed this problem.

Considering that snapshot_val_iter is set to 2000, I think a possible reason is that the model is evaluated every 2000 iterations. Maybe you can just skip the evaluation step by commenting out the following code in train.py.

#              if (iterations + 1) % config['snapshot_val_iter'] == 0:
#                  with torch.no_grad():
#                      imgs_test = imgs_test.cuda()
#                      fake_xs = []
#                      for i in range(config['num_generate']):
#                          fake_xs.append(trainer.generate(imgs_test).unsqueeze(1))
#                      fake_xs = torch.cat(fake_xs, dim=1)
#                      write_image(iterations, image_directory, imgs_test.detach(), fake_xs.detach())

Thanks for your reply. I don't think it is caused by the eval(), since you wrap that code in torch.no_grad(). Moreover, I got stuck at 1400 iterations when I moved trainer.cuda() into the loop and detached the loss items: loss_total = loss_adv_dis_real.detach() + loss_adv_dis_fake.detach() + loss_cls_dis.detach()
Changes in the training loop:

while True:
    with torch.autograd.set_detect_anomaly(True):
        imgs_test, _ = iter(test_dataloader).next()
        trainer = Trainer(config)
        iterations = trainer.resume(checkpoint_directory) if args.resume else 0
        for it, (imgs, label) in enumerate(train_dataloader):
            trainer.cuda()  # moved into the training loop
            trainer.update_lr(iterations, max_iter)
            imgs = imgs.cuda()
            label = label.cuda()

            trainer.zero_grad()
            trainer.dis_update(imgs, label)

            trainer.zero_grad()
            trainer.gen_update(imgs, label)

            # try:
            #     trainer.dis_update(imgs, label)
            #     trainer.gen_update(imgs, label)
            # except RuntimeError as exception:
            #     if "out of memory" in str(exception):
            #         print("WARNING: out of memory")
            #         if hasattr(torch.cuda, 'empty_cache'):
            #             torch.cuda.empty_cache()
            #     else:
            #         raise exception

            if (iterations + 1) % config['snapshot_log_iter'] == 0:
                end = time.time()
                print("Iteration: [%06d/%06d], time: %d, loss_adv_dis: %04f, loss_adv_gen: %04f"
                      % (iterations + 1, max_iter, end - start, trainer.loss_adv_dis, trainer.loss_adv_gen))
                write_loss(iterations, trainer, train_writer)

            # cleanup attempts listed above
            del imgs, label
            gc.collect()
            torch.cuda.empty_cache()

FYI, the NVIDIA 3090 GPU I used has 20 GB of memory. I don't know how big your V100 is, but I think my hardware should be enough to run the code, as I have run StyleGAN2 and FUNIT on it.
Looking forward to your reply.
Thanks again for your help.

Hi there, I use a 32 GB V100 and the memory cost for training is about 21 GB. I found that the huge memory cost occurs when applying the gradient penalty to the discriminator; when I simply removed it, the memory consumption plummeted. But for now, I have no idea how to fix it cleanly. I tried setting inplace=True for the activations, which saves about 600 MB, but I'm not sure whether that is enough for you. Otherwise, you may have to use a smaller batch size...
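
For reference, a gradient penalty in the WGAN-GP style is usually computed with a double-backward pass; the generic sketch below (with a placeholder discriminator dis, not necessarily the exact implementation in this repo) shows why it is expensive: create_graph=True forces PyTorch to keep the full activation graph alive for a second backward pass, roughly doubling the discriminator's activation memory.

# Generic WGAN-GP style gradient penalty (sketch; `dis` is a placeholder
# discriminator, not necessarily the exact code used in this repo).
import torch

def gradient_penalty(dis, real, fake, device="cuda"):
    # random interpolation between real and fake samples
    alpha = torch.rand(real.size(0), 1, 1, 1, device=device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    out = dis(interp)
    # create_graph=True keeps the whole forward graph alive so the penalty
    # can itself be backpropagated -- this is where the memory spike comes from
    grads = torch.autograd.grad(outputs=out, inputs=interp,
                                grad_outputs=torch.ones_like(out),
                                create_graph=True, retain_graph=True,
                                only_inputs=True)[0]
    grads = grads.view(grads.size(0), -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

The inplace=True suggestion above refers to activations such as nn.LeakyReLU(0.2, inplace=True), which overwrite their input instead of allocating a separate output tensor.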

Alright then, I'll try some other ways.
Thanks for your patience ^.^

Sorry, it's me again. Does the gradient penalty make a big difference to the performance? The error no longer shows up once I remove the gradient penalty. THX!

How did you deal with it? I have one GPU with 8 GB of memory, and I want to run this method. Can you give me some advice on whether it's possible? THX
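
A quick, generic way to check feasibility on an 8 GB card is to log PyTorch's peak memory over the first few iterations (standard torch.cuda utilities, independent of this repo):

# Sketch: measure peak GPU memory over the first few iterations to see
# how far the run is from fitting into 8 GB (standard PyTorch utilities).
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training iterations here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"peak GPU memory allocated: {peak_gb:.2f} GB")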