"CUDA out of memory. Tried to allocate 1.48 GiB" when trying to validate
BuyMyMojo opened this issue
Training uses about 3 GB of VRAM. When validation starts, usage rises to about 5.2 GB and then it crashes.
My training data is 512x512 JPG files. One frame: https://imgur.com/a/MulGiTE
GPU: RTX 2070 Super
CPU: Ryzen 5 5600X
CUDA 11 installed
torch==1.9.1+cu111 torchvision==0.10.1+cu111
Complete PowerShell output:
export CUDA_VISIBLE_DEVICES=0
Path already exists. Rename it to [D:\Code\GitHub\BasicSR\experiments\debug_001_template_archived_210928-091440]
21-09-28 09:14:40.677 - INFO: name: debug_001_template
use_tb_logger: True
model: srragan
scale: 4
gpu_ids: [0]
use_amp: False
use_swa: False
datasets:[
train:[
name: DIV2K
mode: LRHRC
dataroot_HR: ..\..\train\hr
dataroot_LR: ..\..\train\lr
subset_file: None
use_shuffle: True
znorm: False
n_workers: 6
batch_size: 8
virtual_batch_size: 8
HR_size: 128
image_channels: 3
dataroot_kernels: ../training/kernels/results/
lr_downscale: True
lr_downscale_types: [1, 2, 777]
use_flip: True
use_rot: True
hr_rrot: False
lr_blur: False
lr_blur_types: ['gaussian', 'clean', 'clean', 'clean']
noise_data: ../noise_patches/normal/
lr_noise: False
lr_noise_types: ['gaussian', 'JPEG', 'clean', 'clean', 'clean', 'clean']
lr_noise2: False
lr_noise_types2: ['dither', 'dither', 'clean', 'clean']
hr_noise: False
hr_noise_types: ['gaussian', 'clean', 'clean', 'clean', 'clean']
phase: train
scale: 4
data_type: img
]
val:[
name: val_set14_part
mode: LRHROTF
dataroot_HR: ..\..\val\hr
dataroot_LR: ..\..\val\lr
znorm: False
lr_downscale: False
lr_downscale_types: [1, 2]
phase: val
scale: 4
data_type: img
]
]
path:[
strict: False
root: D:\Code\GitHub\BasicSR
pretrain_model_G: ..\experiments\pretrained_models\1xPSNR.pth
experiments_root: D:\Code\GitHub\BasicSR\experiments\debug_001_template
models: D:\Code\GitHub\BasicSR\experiments\debug_001_template\models
training_state: D:\Code\GitHub\BasicSR\experiments\debug_001_template\training_state
log: D:\Code\GitHub\BasicSR\experiments\debug_001_template
val_images: D:\Code\GitHub\BasicSR\experiments\debug_001_template\val_images
]
network_G:[
strict: False
which_model_G: RRDB_net
norm_type: None
mode: CNA
nf: 64
nb: 23
nr: 3
in_nc: 3
out_nc: 3
gc: 32
group: 1
convtype: Conv2D
net_act: leakyrelu
gaussian: True
plus: False
scale: 4
]
network_D:[
strict: True
which_model_D: discriminator_vgg
norm_type: batch
act_type: leakyrelu
mode: CNA
nf: 64
in_nc: 3
nlayer: 3
num_D: 3
]
train:[
lr_G: 0.0001
weight_decay_G: 0
beta1_G: 0.9
lr_D: 0.0001
weight_decay_D: 0
beta1_D: 0.9
lr_scheme: MultiStepLR
lr_gamma: 0.5
swa_start_iter: 375000
swa_lr: 0.0001
swa_anneal_epochs: 10
swa_anneal_strategy: cos
pixel_criterion: l1
pixel_weight: 0.01
feature_criterion: l1
feature_weight: 1
gan_type: vanilla
gan_weight: 0.005
manual_seed: 0
niter: 500000.0
val_freq: 8
metrics: psnr,ssim,lpips
overwrite_val_imgs: None
val_comparison: None
lr_decay_iter: 10
lr_steps: [50000, 100000, 200000, 300000]
]
logger:[
print_freq: 2
save_checkpoint_freq: 8
overwrite_chkp: False
]
is_train: True
21-09-28 09:14:40.678 - INFO: Random seed: 0
21-09-28 09:14:41.321 - INFO: Dataset [LRHRDataset - DIV2K] is created.
21-09-28 09:14:41.322 - INFO: Number of train images: 63,792, iters: 7,974
21-09-28 09:14:41.323 - INFO: Total epochs needed: 63 for iters 500,000
21-09-28 09:14:41.324 - INFO: Dataset [LRHRDataset - val_set14_part] is created.
21-09-28 09:14:41.324 - INFO: Number of val images in [val_set14_part]: 5
21-09-28 09:14:41.558 - INFO: AMP library available
21-09-28 09:14:42.583 - INFO: Initialization method [kaiming]
21-09-28 09:14:42.799 - INFO: Initialization method [kaiming]
21-09-28 09:14:42.891 - INFO: Loading pretrained model for G [..\experiments\pretrained_models\1xPSNR.pth] ...
21-09-28 09:14:43.753 - INFO: Network G structure: DataParallel - RRDBNet, with parameters: 16,697,987
21-09-28 09:14:43.754 - INFO: Network D structure: DataParallel - Discriminator_VGG, with parameters: 14,502,281
21-09-28 09:14:43.756 - INFO: Model [SRRaGANModel] is created.
21-09-28 09:14:43.757 - INFO: Start training from epoch: 0, iter: 0
21-09-28 09:14:52.560 - INFO: <epoch: 0, iter: 2, lr:1.000e-04, t:-1.0000s, td:3.0840s, eta:0.0000h> pix-l1: 1.6838e-03 fea-vgg19-l1: 1.5493e+00 l_g_gan: 6.9997e-03 l_d_real: 3.2938e-01 l_d_fake: 3.4658e-01 D_real: 5.9246e-01 D_fake: -4.6950e-01
21-09-28 09:14:53.462 - INFO: <epoch: 0, iter: 4, lr:1.000e-04, t:-1.0000s, td:0.0000s, eta:0.0000h> pix-l1: 2.4982e-03 fea-vgg19-l1: 1.7615e+00 l_g_gan: 1.9201e-02 l_d_real: 5.7227e-02 l_d_fake: 6.6596e-02 D_real: 1.0418e+00 D_fake: -2.7365e+00
21-09-28 09:14:54.274 - INFO: <epoch: 0, iter: 6, lr:1.000e-04, t:0.9020s, td:0.0000s, eta:125.2761h> pix-l1: 2.0472e-03 fea-vgg19-l1: 1.7822e+00 l_g_gan: 3.1084e-02 l_d_real: 6.8650e-03 l_d_fake: 3.2773e-03 D_real: 1.3466e+00 D_fake: -4.8651e+00
21-09-28 09:14:55.233 - INFO: <epoch: 0, iter: 8, lr:1.000e-04, t:0.8125s, td:0.0000s, eta:112.8456h> pix-l1: 2.5441e-03 fea-vgg19-l1: 1.4662e+00 l_g_gan: 2.8835e-02 l_d_real: 1.3615e-02 l_d_fake: 5.1962e-03 D_real: 1.5751e+00 D_fake: -4.1826e+00
21-09-28 09:14:55.669 - INFO: Models and training states saved.
Setting up Perceptual loss...
Loading model from: J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\modules\LPIPS\lpips_weights\v0.1\squeeze.pth
...[net-lin [squeeze]] initialized
...Done
Traceback (most recent call last):
File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\train.py", line 416, in <module>
main()
File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\train.py", line 412, in main
fit(model, opt, dataloaders, steps_states, data_params, loggers)
File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\train.py", line 289, in fit
model.test() # run inference
File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\SRRaGAN_model.py", line 387, in test
self.forward(CEM_net=CEM_net)
File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\SRRaGAN_model.py", line 254, in forward
self.fake_H = self.netG(self.var_L) # G(LR)
File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Program Files\Python39\lib\site-packages\torch\nn\parallel\data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\modules\architectures\RRDBNet_arch.py", line 49, in forward
x = self.model(x)
File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
input = module(input)
File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\modules\architectures\block.py", line 195, in forward
output = x + self.sub(x)
File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
input = module(input)
File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\modules\architectures\RRDBNet_arch.py", line 93, in forward
out = self.RDB3(out)
File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "J:\Videos\ESRGAN\DATASET\traiNNer-2.0\codes\models\modules\architectures\RRDBNet_arch.py", line 159, in forward
x5 = self.conv5(torch.cat((x, x1, x2, x3, x4), 1))
RuntimeError: CUDA out of memory. Tried to allocate 1.48 GiB (GPU 0; 8.00 GiB total capacity; 2.59 GiB already allocated; 332.74 MiB free; 5.41 GiB reserved in total by PyTorch)
Hello! Your validation images are probably too big: running one through the generator needs more VRAM than the card has available.
Validation images can be few and small. They are only there so you can see how the model is doing, and they don't change the end result.
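For a rough sense of why image size matters here: the traceback fails on the `torch.cat((x, x1, x2, x3, x4), 1)` call that feeds `conv5`, and the size of that one tensor can be estimated from the logged config (`nf: 64`, `gc: 32`, `use_amp: False`). The sketch below assumes the usual ESRGAN dense-block layout (`x` has `nf` channels, `x1` to `x4` have `gc` channels each), float32 activations, and a validation batch of one image. The function name and the example resolutions are illustrative, not taken from the thread.

```python
def rdb_concat_bytes(height, width, nf=64, gc=32, batch=1, bytes_per_elem=4):
    """Estimate the size of the tensor built by torch.cat((x, x1, x2, x3, x4), 1).

    Assumption: x carries nf channels and x1..x4 carry gc channels each,
    so the concatenated tensor has nf + 4 * gc channels.
    bytes_per_elem=4 corresponds to float32 (use_amp: False in the log).
    """
    channels = nf + 4 * gc
    return batch * channels * height * width * bytes_per_elem


def to_gib(num_bytes):
    """Convert bytes to GiB, the unit PyTorch uses in its OOM message."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical input sizes (height, width) of the image fed to the generator.
    for h, w in [(128, 128), (512, 512), (1080, 1920)]:
        size = rdb_concat_bytes(h, w)
        print(f"{w}x{h}: {to_gib(size):.2f} GiB for one concatenated tensor")
```

Under those assumptions a 1920x1080 input gives about 1.48 GiB for this single tensor, which matches the figure in the error message, although the thread never states the actual validation image size. A 128x128 input needs only 12 MiB for the same tensor, so cropping or downsizing the validation images shrinks the requirement roughly in proportion to pixel count.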
Alright, cool, thank you!