xinntao / Real-ESRGAN

Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration.

CUDA out of memory

TornaxO7 opened this issue · comments

Thank you first of all for this awesome program!

I'm getting the following error message if I run your program:

Testing 0 place
/home/tornax/Apps/Real-ESRGAN/inference_realesrgan.py:84: UserWarning: The input image is large, try X2 model for better performance.
  warnings.warn('The input image is large, try X2 model for better performance.')
Error CUDA out of memory. Tried to allocate 1.20 GiB (GPU 0; 3.94 GiB total capacity; 1.54 GiB already allocated; 1.22 GiB free; 1.74 GiB reserved in total by PyTorch)
	Tile 1/1
Error local variable 'output_tile' referenced before assignment
If you encounter CUDA out of memory, try to set --tile with a smaller number.

I have 12GB of RAM and it's fine for me if it's using all of it. How can I set the allowed RAM usage?
I've tried to call it like this:

python inference_realesrgan.py --tile 12000000 --model_path experiments/pretrained_models/RealESRGAN_x4plus_anime_6B.pth --input inputs --outscale 0

but it doesn't work either (I get the same error message).

Decrease the --tile value; it sets the tile size in pixels, not a memory limit.

Try --tile 800, or smaller than 800.
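A rough sketch of why a smaller --tile helps: the image is processed one tile at a time, so peak activation memory scales with the tile area rather than the full image. The 64-channel feature size and 4-byte floats below are illustrative assumptions, not Real-ESRGAN's exact accounting:

```python
import math

def rough_tile_memory_mb(tile_size: int, channels: int = 64, bytes_per_val: int = 4) -> float:
    """Very rough footprint of one channels x tile x tile float32 feature map, in MB."""
    return tile_size * tile_size * channels * bytes_per_val / 1024**2

def num_tiles(width: int, height: int, tile_size: int) -> int:
    """How many tiles are needed to cover a width x height image."""
    return math.ceil(width / tile_size) * math.ceil(height / tile_size)

# A 4000x3000 input processed whole vs. in 400x400 tiles:
print(f"whole image: ~{rough_tile_memory_mb(4000):.0f} MB per feature map")
print(f"400px tiles: ~{rough_tile_memory_mb(400):.0f} MB per feature map, "
      f"{num_tiles(4000, 3000, 400)} tiles")
```

Setting --tile 12000000 makes the tile larger than any real image, which effectively disables tiling and brings back the original out-of-memory error.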

Ok, thank you!

Sorry to reopen here, let me know if I should open a new issue...

Having the same problem in Google Colab while trying to train the finetune_RealESRGANx4plus_400k_pairdata model. My HR images are 256x256. I've also tried 128x128 inputs using the crop-to-sub-images script, and tried lowering batch_size_per_gpu all the way down to 1 and num_worker_per_gpu down to 1, always with the same result: RuntimeError: CUDA out of memory.

Before I run the train command, the memory readout shows:

Gen RAM Free: 12.4 GB  |     Proc size: 1.1 GB
GPU RAM Free: 11441MB | Used: 0MB | Util   0% | Total     11441MB

!python realesrgan/train.py -opt options/finetune_realesrgan_x4plus_pairdata.yml --auto_resume --debug

Errors out right at the first epoch:

Version Information: 
	BasicSR: 1.3.4.6
	PyTorch: 1.9.0+cu111
	TorchVision: 0.10.0+cu111
INFO: 
  name: debug_finetune_RealESRGANx4plus_400k_pairdata
  model_type: RealESRGANModel
  scale: 1
  num_gpu: 1
  manual_seed: 0
  l1_gt_usm: True
  percep_gt_usm: True
  gan_gt_usm: False
  high_order_degradation: False
  datasets:[
    train:[
      name: myexperiment
      type: RealESRGANPairedDataset
      dataroot_gt: datasets/mydataset
      dataroot_lq: datasets/mydataset
      meta_info: datasets/mydataset/meta_info/meta_info_mydataset_pair.txt
      io_backend:[
        type: disk
      ]
      gt_size: 128
      use_hflip: True
      use_rot: False
      use_shuffle: True
      num_worker_per_gpu: 5
      batch_size_per_gpu: 4
      dataset_enlarge_ratio: 1
      prefetch_mode: None
      phase: train
      scale: 1
    ]
  ]
...
...
2021-10-30 06:25:51,531 INFO: Loading UNetDiscriminatorSN model from experiments/pretrained_models/RealESRGAN_x4plus_netD.pth, with param key: [params].
2021-10-30 06:25:51,544 INFO: Loss [L1Loss] is created.
2021-10-30 06:25:53,305 INFO: Loss [PerceptualLoss] is created.
2021-10-30 06:25:53,327 INFO: Loss [GANLoss] is created.
2021-10-30 06:25:53,353 INFO: Model [RealESRGANModel] is created.
2021-10-30 06:25:53,501 INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
  File "realesrgan/train.py", line 11, in <module>
    train_pipeline(root_path)
  File "/usr/local/lib/python3.7/dist-packages/basicsr/train.py", line 169, in train_pipeline
    model.optimize_parameters(current_iter)
  File "/content/gdrive/My Drive/colab-esrgan/Real-ESRGAN/realesrgan/models/realesrgan_model.py", line 193, in optimize_parameters
    self.output = self.net_g(self.lq)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/basicsr/archs/rrdbnet_arch.py", line 113, in forward
    body_feat = self.conv_body(self.body(feat))
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/basicsr/archs/rrdbnet_arch.py", line 60, in forward
    out = self.rdb2(out)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/basicsr/archs/rrdbnet_arch.py", line 36, in forward
    x4 = self.lrelu(self.conv4(torch.cat((x, x1, x2, x3), 1)))

RuntimeError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 11.17 GiB total capacity; 10.63 GiB already allocated; 25.81 MiB free; 10.65 GiB reserved in total by PyTorch)

Not sure what else to try. Is the model too large to train on a Colab GPU? I appreciate any help, and thanks for a great library and clear instructions on getting set up to train and use this model!

Change your batch size (batch_size_per_gpu) to 1.
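In options-file terms, that means editing the dataset block shown in the log above. This is a sketch of the relevant fragment only; the surrounding keys stay as in finetune_realesrgan_x4plus_pairdata.yml (the log above still shows batch_size_per_gpu: 4, so it is worth double-checking the file actually being passed to train.py):

```yaml
datasets:
  train:
    # ... other keys unchanged ...
    num_worker_per_gpu: 1
    batch_size_per_gpu: 1
```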