JingyunLiang / SwinIR

SwinIR: Image Restoration Using Swin Transformer (official repository)

Home Page: https://arxiv.org/abs/2108.10257

Training time of SwinIR; impact of learning rate (fixing the lr to 1e-5 for x4 fine-tuning is slightly better)

shengkelong opened this issue · comments

In my opinion, the transformer costs a lot of memory, and the paper points out that although SwinIR has fewer parameters, it is much slower than RCAN. So I'm curious about the cost of training. Thank you.

Experiments are conducted on a machine with 8 Nvidia 2080 Ti GPUs. We use batch_size=32 and fewer total iterations to save time.

classical_sr_x2 (trained on DIV2K, patch size = 48x48) takes about 1.75 days to train for 500K iterations.

classical_sr_x4 (trained on DIV2K, patch size = 48x48) takes about 0.95 days to train for 250K iterations. Note that we fine-tune the x3/x4/x8 models from x2 and halve the learning rate and total training iterations to reduce training time.
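
In code, the fine-tuning setup looks roughly like this (a minimal sketch, not the training script we actually used; the stand-in network and the checkpoint path are placeholders, and the real model is built from models/network_swinir.py):

```python
# Minimal sketch of the x2 -> x4 fine-tuning recipe described above.
# The tiny stand-in network only keeps the example runnable; in practice the
# model is the SwinIR defined in models/network_swinir.py, and the checkpoint
# path below is a placeholder.
import torch
import torch.nn as nn

model_x4 = nn.Sequential(           # stand-in for the x4 SwinIR
    nn.Conv2d(3, 64, 3, padding=1),
    nn.Conv2d(64, 3, 3, padding=1),
)

# Initialize from the pretrained x2 weights. strict=False tolerates keys that
# exist in only one of the two models (the scale-specific upsampling layers);
# same-named parameters with different shapes would have to be dropped from
# the state dict first.
# state = torch.load("swinir_classical_sr_x2.pth", map_location="cpu")
# model_x4.load_state_dict(state, strict=False)

# Halve the learning rate and the total iterations (500K -> 250K) relative to
# the x2 run, which is where the training-time saving comes from.
optimizer = torch.optim.Adam(model_x4.parameters(), lr=1e-4)
total_iters = 250_000
```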

Thanks. So does that mean it is difficult to train on a single card because of memory, even with a batch size of 16?

For a medium-size SwinIR (used for classical image SR, around 12M parameters), we need about 24GB (2x12GB GPUs) for batch_size=16.
For a small SwinIR (used for lightweight image SR, around 900K parameters), we need about 12GB (1x12GB GPU) for batch_size=16.

@JingyunLiang
Hi, thanks for your work!
I found something strange. You use half the learning rate to fine-tune BIX3/BIX4/BIX8; why only halve the lr for fine-tuning?
If you use 1e-4 to train SwinIR-BIX2 and then use 1e-5 to fine-tune BIX4, you can get a much better result (even an unbelievably high PSNR/SSIM) than with half the lr (5e-5).
Is this kind of training trick cheating? (It still improves PSNR by 0.2 dB on Manga109 and by 0.0x dB on the other image SR benchmarks.)

I only use half the lr for fine-tuning (2e-4 may be too large, so we use 1e-4), together with half the training iterations, which saves half of the training time. Fine-tuning from x2 is a common practice, e.g., RCAN (ECCV 2018).

As for using 1e-4 to train SwinIR-BIX2, I have never tried it. Your observation is really surprising! In my experience, the learning rate doesn't have much impact as long as you decrease it gradually. Maybe Transformers have different characteristics from CNNs in learning-rate selection.

@Senwang98 Thank you for your report. Can you post more details here? Is it classical SR or lightweight SR? What are your PSNR values on the five benchmarks? Is your network architecture identical to ours (see models/network_swinir.py)?

I will try it and validate your finding~~ My results will be updated here.

@JingyunLiang
Thanks for your quick reply!
The result was found with a CNN-based network. I don't think my code is wrong, because I used the EDSR-pytorch repo to train the model.
Several days ago, I conducted an experiment on RCAN. I took RCAN-BIX2.pt (trained with lr 1e-4, halved every 200000 iterations), then used 1e-5 to fine-tune RCAN-BIX4. This time, I did not decay the learning rate every 200000 iterations; that is, I used 1e-5 for the whole training without changing the lr! (A rough sketch of the two schedules is at the end of this comment.)
I think you are an expert in this field, so do you think this kind of training trick is cheating?
If I decay the lr when training BIX4, the result is OK. If I use a much smaller lr and never change it, the final result is better (for RCAN, BIX4 performance on Manga109 improves from 31.22 to around 31.45).
Can you give me some suggestions? (I don't mean your SwinIR is wrong; I just want to explain this strange thing!)
Thanks again for your interesting work, and maybe you can use SwinIR-BIX2 to fine-tune BIX4 without changing the lr during training!
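
To make the comparison concrete, the two schedules I compared look roughly like this in plain PyTorch (the network is just a stand-in, not the actual RCAN, and the exact EDSR-pytorch settings are not reproduced here):

```python
# Rough sketch of the two RCAN fine-tuning schedules compared above.
# The network is a stand-in; only the learning-rate schedules matter here.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for RCAN-BIX4

# (a) Usual schedule: start at 1e-4 and halve the lr every 200000 iterations.
opt_a = torch.optim.Adam(model.parameters(), lr=1e-4)
sched_a = torch.optim.lr_scheduler.StepLR(opt_a, step_size=200_000, gamma=0.5)

# (b) The variant described above: a constant lr of 1e-5 for the whole
#     fine-tuning, with no scheduler at all.
opt_b = torch.optim.Adam(model.parameters(), lr=1e-5)
```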

It is possible for RCAN, as it is a very deep network and should have strong representation ability. A better training strategy or longer training time may help.

In my view, if all other settings are the same as the original RCAN (same datasets, same patch size, same training iterations, same optimizer, etc.), changing the lr could be a good trick, and I think it is fair. If it is useful for other CNNs as well, future works should adopt this strategy. However, we should point the lr strategy out in the paper and do some ablation studies when comparing with these older methods.

As for your suggestion of fine-tuning SwinIR-BIX4 with a fixed lr (1e-5), I will try it and keep you updated. Thank you.

@JingyunLiang
Yes, you are right; this training setting should be reported in the paper, and some study should also be done to support this trick.
I will test training other CNN-based models later. If it actually works, I will tell you.
Thanks for your reply!

  • Update: I compared three learning-rate strategies when fine-tuning x4 from the x2 classical SR model (values are PSNR/SSIM):

| Case | Init LR | LR milestones (total iters) | Set5 | Set14 | BSD100 | Urban100 | Manga109 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 (used in the paper) | 1e-4 | [125000, 200000, 225000, 237500] (250000) | 32.72/0.9021 | 28.94/0.7914 | 27.83/0.7459 | 27.07/0.8164 | 31.67/0.9226 |
| 2 | 1e-5 | none (250000) | 32.69/0.9018 | 28.96/0.7920 | 27.84/0.7463 | 27.07/0.8165 | 31.73/0.9227 |
| 3 | 1e-5 | none (500000) | 32.69/0.9020 | 28.96/0.7918 | 27.84/0.7462 | 27.08/0.8168 | 31.69/0.9228 |

  • Conclusion: The PSNR change ranges from -0.03 to +0.06 dB. The second lr strategy (fixing the lr to 1e-5 for x4 fine-tuning) is only slightly better.
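
For reference, the three cases correspond roughly to the following scheduler setups (a sketch; the model is a stand-in and only the lr schedule is shown):

```python
# The three learning-rate strategies from the table above, written out with
# torch.optim.lr_scheduler. The model is a stand-in for the x4 SwinIR.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for SwinIR x4

# Case 1 (paper): init lr 1e-4, halved at the listed milestones, 250K iters.
opt1 = torch.optim.Adam(model.parameters(), lr=1e-4)
sched1 = torch.optim.lr_scheduler.MultiStepLR(
    opt1, milestones=[125000, 200000, 225000, 237500], gamma=0.5
)

# Case 2: fixed lr 1e-5 for 250K iters (no scheduler).
opt2 = torch.optim.Adam(model.parameters(), lr=1e-5)

# Case 3: fixed lr 1e-5 for 500K iters (no scheduler, twice as long).
opt3 = torch.optim.Adam(model.parameters(), lr=1e-5)
```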

@JingyunLiang
OK, I will check. Maybe it is more useful for CNN-based models. (Although I think this strategy should not work, the results are really better in my repo, haha.)

Feel free to reopen this issue if you have more questions.