"CUDA out of memory. Tried to allocate 1.48 GiB" when trying to validate

Question

BuyMyMojo opened this issue 3 years ago

Training uses about 3 GB of VRAM. When validation starts, usage climbs to about 5.2 GB and then the run crashes.
My training data is 512x512 jpg files, one frame: https://imgur.com/a/MulGiTE

GPU: 2070 super
CPU: 5600x
cuda 11 installed
torch==1.9.1+cu111 torchvision==0.10.1+cu111

Complete PowerShell output:

export CUDA_VISIBLE_DEVICES=0
Path already exists. Rename it to [D:\Code\GitHub\BasicSR\experiments\debug_001_template_archived_210928-091440]
21-09-28 09:14:40.677 - INFO:   name: debug_001_template
  use_tb_logger: True
  model: srragan
  scale: 4
  gpu_ids: [0]
  use_amp: False
  use_swa: False
  datasets:[
    train:[
      name: DIV2K
      mode: LRHRC
      dataroot_HR: ..\..\train\hr
      dataroot_LR: ..\..\train\lr
      subset_file: None
      use_shuffle: True
      znorm: False
      n_workers: 6
      batch_size: 8
      virtual_batch_size: 8
      HR_size: 128
      image_channels: 3
      dataroot_kernels: ../training/kernels/results/
      lr_downscale: True
      lr_downscale_types: [1, 2, 777]
      use_flip: True
      use_rot: True
      hr_rrot: False
      lr_blur: False
      lr_blur_types: ['gaussian', 'clean', 'clean', 'clean']
      noise_data: ../noise_patches/normal/
      lr_noise: False
      lr_noise_types: ['gaussian', 'JPEG', 'clean', 'clean', 'clean', 'clean']
      lr_noise2: False
      lr_noise_types2: ['dither', 'dither', 'clean', 'clean']
      hr_noise: False
      hr_noise_types: ['gaussian', 'clean', 'clean', 'clean', 'clean']
      phase: train
      scale: 4
      data_type: img
    ]
    val:[
      name: val_set14_part
      mode: LRHROTF
      dataroot_HR: ..\..\val\hr
      dataroot_LR: ..\..\val\lr
      znorm: False
      lr_downscale: False
      lr_downscale_types: [1, 2]
      phase: val
      scale: 4
      data_type: img
    ]
  ]
  path:[
    strict: False
    root: D:\Code\GitHub\BasicSR
    pretrain_model_G: ..\experiments\pretrained_models\1xPSNR.pth
    experiments_root: D:\Code\GitHub\BasicSR\experiments\debug_001_template
    models: D:\Code\GitHub\BasicSR\experiments\debug_001_template\models
    training_state: D:\Code\GitHub\BasicSR\experiments\debug_001_template\training_state
    log: D:\Code\GitHub\BasicSR\experiments\debug_001_template
    val_images: D:\Code\GitHub\BasicSR\experiments\debug_001_template\val_images
  ]
  network_G:[
    strict: False
    which_model_G: RRDB_net
    norm_type: None
    mode: CNA
    nf: 64
    nb: 23
    nr: 3
    in_nc: 3
    out_nc: 3
    gc: 32
    group: 1
    convtype: Conv2D
    net_act: leakyrelu
    gaussian: True
    plus: False
    scale: 4
  ]
  network_D:[
    strict: True
    which_model_D: discriminator_vgg
    norm_type: batch
    act_type: leakyrelu
    mode: CNA
    nf: 64
    in_nc: 3
    nlayer: 3
    num_D: 3
  ]
  train:[
    lr_G: 0.0001
    weight_decay_G: 0
    beta1_G: 0.9
    lr_D: 0.0001
    weight_decay_D: 0
    beta1_D: 0.9
    lr_scheme: MultiStepLR
    lr_gamma: 0.5
    swa_start_iter: 375000
    swa_lr: 0.0001
    swa_anneal_epochs: 10
    swa_anneal_strategy: cos
    pixel_criterion: l1
    pixel_weight: 0.01
    feature_criterion: l1
    feature_weight: 1
    gan_type: vanilla
    gan_weight: 0.005
    manual_seed: 0
    niter: 500000.0
    val_freq: 8
    metrics: psnr,ssim,lpips
    overwrite_val_imgs: None
    val_comparison: None
    lr_decay_iter: 10
    lr_steps: [50000, 100000, 200000, 300000]
  ]
  logger:[
    print_freq: 2
    save_checkpoint_freq: 8
    overwrite_chkp: False
  ]
  is_train: True

21-09-28 09:14:40.678 - INFO: Random seed: 0
21-09-28 09:14:41.321 - INFO: Dataset [LRHRDataset - DIV2K] is created.
21-09-28 09:14:41.322 - INFO: Number of train images: 63,792, iters: 7,974
21-09-28 09:14:41.323 - INFO: Total epochs needed: 63 for iters 500,000
21-09-28 09:14:41.324 - INFO: Dataset [LRHRDataset - val_set14_part] is created.
21-09-28 09:14:41.324 - INFO: Number of val images in [val_set14_part]: 5
21-09-28 09:14:41.558 - INFO: AMP library available
21-09-28 09:14:42.583 - INFO: Initialization method [kaiming]
21-09-28 09:14:42.799 - INFO: Initialization method [kaiming]
21-09-28 09:14:42.891 - INFO: Loading pretrained model for G [..\experiments\pretrained_models\1xPSNR.pth] ...
21-09-28 09:14:43.753 - INFO: Network G structure: DataParallel - RRDBNet, with parameters: 16,697,987
21-09-28 09:14:43.754 - INFO: Network D structure: DataParallel - Discriminator_VGG, with parameters: 14,502,281
21-09-28 09:14:43.756 - INFO: Model [SRRaGANModel] is created.
21-09-28 09:14:43.757 - INFO: Start training from epoch: 0, iter: 0
21-09-28 09:14:52.560 - INFO: <epoch:  0, iter:       2, lr:1.000e-04, t:-1.0000s, td:3.0840s, eta:0.0000h> pix-l1: 1.6838e-03 fea-vgg19-l1: 1.5493e+00 l_g_gan: 6.9997e-03 l_d_real: 3.2938e-01 l_d_fake: 3.4658e-01 D_real: 5.9246e-01 D_fake: -4.6950e-01
21-09-28 09:14:53.462 - INFO: <epoch:  0, iter:       4, lr:1.000e-04, t:-1.0000s, td:0.0000s, eta:0.0000h> pix-l1: 2.4982e-03 fea-vgg19-l1: 1.7615e+00 l_g_gan: 1.9201e-02 l_d_real: 5.7227e-02 l_d_fake: 6.6596e-02 D_real: 1.0418e+00 D_fake: -2.7365e+00
21-09-28 09:14:54.274 - INFO: <epoch:  0, iter:       6, lr:1.000e-04, t:0.9020s, td:0.0000s, eta:125.2761h> pix-l1: 2.0472e-03 fea-vgg19-l1: 1.7822e+00 l_g_gan: 3.1084e-02 l_d_real: 6.8650e-03 l_d_fake: 3.2773e-03 D_real: 1.3466e+00 D_fake: -4.8651e+00
21-09-28 09:14:55.233 - INFO: <epoch:  0, iter:       8, lr:1.000e-04, t:0.8125s, td:0.0000s, eta:112.8456h> pix-l1: 2.5441e-03 fea-vgg19-l1: 1.4662e+00 l_g_gan: 2.8835e-02 l_d_real: 1.3615e-02 l_d_fake: 5.1962e-03 D_real: 1.5751e+00 D_fake: -4.1826e+00
21-09-28 09:14:55.669 - INFO: Models and training states saved.
Setting up Perceptual loss...
Loading model from: J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\modules\LPIPS\lpips_weights\v0.1\squeeze.pth
...[net-lin [squeeze]] initialized
...Done
Traceback (most recent call last):
  File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\train.py", line 416, in <module>
    main()
  File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\train.py", line 412, in main
    fit(model, opt, dataloaders, steps_states, data_params, loggers)
  File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\train.py", line 289, in fit
    model.test()  # run inference
  File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\SRRaGAN_model.py", line 387, in test
    self.forward(CEM_net=CEM_net)
  File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\SRRaGAN_model.py", line 254, in forward
    self.fake_H = self.netG(self.var_L)  # G(LR)
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\parallel\data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\modules\architectures\RRDBNet_arch.py", line 49, in forward
    x = self.model(x)
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
    input = module(input)
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\modules\architectures\block.py", line 195, in forward
    output = x + self.sub(x)
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
    input = module(input)
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\modules\architectures\RRDBNet_arch.py", line 93, in forward
    out = self.RDB3(out)
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\modules\architectures\RRDBNet_arch.py", line 159, in forward
    x5 = self.conv5(torch.cat((x, x1, x2, x3, x4), 1))
RuntimeError: CUDA out of memory. Tried to allocate 1.48 GiB (GPU 0; 8.00 GiB total capacity; 2.59 GiB already allocated; 332.74 MiB free; 5.41 GiB reserved in total by PyTorch)

victorca25 · Answer 1 · Thu Sep 30 2021 04:59:59 GMT+0800 (China Standard Time)

Hello! Your validation images are probably too big, larger than what can be handled with the available VRAM.
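As a rough sanity check, the failed 1.48 GiB allocation can be related to an input size. The sketch below assumes that the `torch.cat((x, x1, x2, x3, x4), 1)` call in the traceback joins one tensor with `nf` channels and four with `gc` channels each (64 and 32 in the logged config), and that activations are float32 (`use_amp: False`). The function names are made up for illustration, and the actual size of the validation images is not stated in the thread.

```python
# Rough size of the tensor built by torch.cat((x, x1, x2, x3, x4), 1).
# Assumptions (not confirmed by the thread): x has nf channels, x1..x4 have
# gc channels each, and activations are stored as float32.

BYTES_PER_FLOAT32 = 4
GIB = 1024 ** 3


def concat_bytes(width, height, nf=64, gc=32, batch=1):
    """Bytes needed for the concatenated tensor at a given network input size."""
    channels = nf + 4 * gc  # 64 + 4 * 32 = 192 with the logged config
    return batch * channels * height * width * BYTES_PER_FLOAT32


def pixels_for_allocation(gib, nf=64, gc=32):
    """Input pixel count at which the concatenated tensor would be `gib` GiB."""
    channels = nf + 4 * gc
    return gib * GIB / (channels * BYTES_PER_FLOAT32)


if __name__ == "__main__":
    # Pixel count implied by the 1.48 GiB request (about 2.07 million).
    print(round(pixels_for_allocation(1.48)))
    # For comparison: a 1920x1080 input and a 128x128 input, in GiB.
    print(concat_bytes(1920, 1080) / GIB)
    print(concat_bytes(128, 128) / GIB)
```

Under these assumptions the request corresponds to roughly 2 million input pixels, which is about the size of a single 1920x1080 frame, whereas a 128x128 input would need only about 12 MiB for the same tensor. This counts one tensor only; peak usage during the forward pass is higher because other activations are alive at the same time.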

Validation images can be few and small. They are only needed so you can visualize how the model is doing, and they don't change the end results.
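One way to get small validation pairs is to center-crop each LR image and take the matching region from its HR counterpart. The helper below is a hypothetical sketch, not part of traiNNer: it only computes the two crop boxes, assuming the HR image is exactly `scale` times the LR image in each dimension (`scale: 4` in the logged config). The 128-pixel limit is an arbitrary example value.

```python
def paired_center_crop_boxes(lr_width, lr_height, scale=4, max_lr_side=128):
    """Return (lr_box, hr_box) as (left, upper, right, lower) tuples.

    The LR box is a centered crop no larger than max_lr_side on each side.
    The HR box covers the same region, with every coordinate multiplied by scale.
    """
    crop_w = min(lr_width, max_lr_side)
    crop_h = min(lr_height, max_lr_side)
    left = (lr_width - crop_w) // 2
    upper = (lr_height - crop_h) // 2
    lr_box = (left, upper, left + crop_w, upper + crop_h)
    hr_box = tuple(v * scale for v in lr_box)
    return lr_box, hr_box


if __name__ == "__main__":
    # Example with an assumed 1920x1080 LR frame.
    lr_box, hr_box = paired_center_crop_boxes(1920, 1080)
    print(lr_box, hr_box)
```

The boxes use the `(left, upper, right, lower)` layout that Pillow's `Image.crop` accepts, so they can be passed straight to it when rewriting the files in the `val\lr` and `val\hr` folders.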

Owen Quinlan · Answer 2 · Thu Sep 30 2021 10:57:20 GMT+0800 (China Standard Time)

Alright cool thank you!