JingyunLiang / MANet

Official PyTorch code for Mutual Affine Network for Spatially Variant Kernel Estimation in Blind Image Super-Resolution (MANet, ICCV2021)

Home Page: https://arxiv.org/abs/2108.05302

Training and OOM

hcleung3325 opened this issue · comments

Thanks for your code.
I tried to train the model with train_stage1.yml, but training runs out of CUDA memory (OOM).
I am using a 2080 Ti; I tried reducing the batch size from 16 to 2 and GT_size from 192 to 48.
However, training still goes OOM.
Is there anything I missed?
Thanks.

MANet training doesn't take much memory. Did you turn on cal_lr_psnr?

cal_lr_psnr: False # calculate lr psnr consumes huge memory

Thanks for the reply.
No, it stays set to False.


```yaml
#### general settings
name: 001_MANet_aniso_x4_TMO_40_stage1
use_tb_logger: true
model: blind
distortion: sr
scale: 4
gpu_ids: [1]
kernel_size: 21
code_length: 15
# train
sig_min: 0.7 # 0.7, 0.525, 0.35 for x4, x3, x2
sig_max: 10.0  # 10, 7.5, 5 for x4, x3, x2
train_noise: False
noise_high: 15
train_jpeg: False
jpeg_low: 70
# validation
sig: 1.6
sig1: 6 # 6, 5, 4 for x4, x3, x2
sig2: 1
theta: 0
rate_iso: 0 # 1 for iso, 0 for aniso
test_noise: False
noise: 15
test_jpeg: False
jpeg: 70
pca_path: ./pca_matrix_aniso21_15_x4.pth
cal_lr_psnr: False # calculate lr psnr consumes huge memory


#### datasets
datasets:
  train:
    name: TMO
    mode: GT
    dataroot_GT: ../datasets/HR
    dataroot_LQ: ~

    use_shuffle: true
    n_workers: 8
    batch_size: 4
    GT_size: 192
    LR_size: ~
    use_flip: true
    use_rot: true
    color: RGB
  val:
    name: Set5
    mode: GT
    dataroot_GT: ../../data
    dataroot_LQ: ~


#### network structures
network_G:
  which_model_G: MANet_s1
  in_nc: 3
  out_nc: ~
  nf: ~
  nb: ~
  gc: ~
  manet_nf: 128
  manet_nb: 1
  split: 2


#### path
path:
  pretrain_model_G: ~
  strict_load: true
  resume_state:  ~ #../experiments/001_MANet_aniso_x4_DIV2K_40_stage1/training_state/5000.state


#### training settings: learning rate scheme, loss
train:
  lr_G: !!float 2e-4
  lr_scheme: MultiStepLR
  beta1: 0.9
  beta2: 0.999
  niter: 300000
  warmup_iter: -1
  lr_steps: [100000, 150000, 200000, 250000]
  lr_gamma: 0.5
  restarts: ~
  restart_weights: ~
  eta_min: !!float 1e-7

  kernel_criterion: l1
  kernel_weight: 1.0

  manual_seed: 0
  val_freq: !!float 2e7


#### logger
logger:
  print_freq: 200
  save_checkpoint_freq: !!float 2e4
```
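For reference, the settings in this file that dominate GPU memory are batch_size, GT_size, manet_nf, and cal_lr_psnr. Below is a minimal sketch (not part of the MANet code; it assumes PyYAML is installed and that the options file sits at the path used in this thread) that loads the options file and prints those fields before launching training:

```python
# Illustrative helper (not from the MANet repo): inspect the memory-relevant
# fields of a stage-1 options file before starting training.
import yaml

def summarize_memory_options(opt_path="options/train/train_stage1.yml"):
    with open(opt_path, "r") as f:
        opt = yaml.safe_load(f)

    train_set = opt["datasets"]["train"]
    net = opt["network_G"]

    # batch_size * GT_size^2 drives activation memory; manet_nf scales feature maps.
    print("batch_size:", train_set["batch_size"])
    print("GT_size:", train_set["GT_size"])
    print("manet_nf:", net["manet_nf"])
    print("cal_lr_psnr:", opt.get("cal_lr_psnr", False))

if __name__ == "__main__":
    summarize_memory_options()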

It's strange because MANet is a tiny model and consumes little memory. Do you have any problems testing the model? Can you try to set manet_nf=32 in training?

Thanks for the reply.
I have tried manet_nf=32 and it still goes OOM.

Is the training started with
python train.py --opt options/train/train_stage1.yml?

I think it's a problem with your GPU. Can you train other models normally? Can you test MANet on your GPU?

My GPU is a 2080 Ti with only 11 GB. Do I need a GPU with more memory to train it?

I don't think so. A 2080 Ti should at least be enough when manet_nf=32. Can you monitor the GPU usage with watch -d -n 0.5 nvidia-smi when you start training the model?
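Besides watching nvidia-smi, PyTorch's own CUDA memory counters can show where the peak occurs. A small sketch (a hypothetical snippet to call inside a training loop, not code from this repo):

```python
# Illustrative snippet (not from the MANet repo): report peak CUDA memory
# after a training step to see how close the run gets to the 11 GB limit.
import torch

def report_peak_memory(tag=""):
    if not torch.cuda.is_available():
        return
    peak_alloc = torch.cuda.max_memory_allocated() / 1024 ** 2  # tensors actually allocated
    reserved = torch.cuda.memory_reserved() / 1024 ** 2         # memory held by the caching allocator
    print(f"[{tag}] peak allocated: {peak_alloc:.0f} MiB, reserved: {reserved:.0f} MiB")
    torch.cuda.reset_peak_memory_stats()                        # start fresh for the next step
```

Calling this after the first few iterations usually reveals whether the OOM comes from the forward/backward pass itself or from something else already holding the GPU (for example, another process visible in nvidia-smi).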

Thanks a lot. The problem is solved. I can run the training now.