JingyunLiang / SwinIR

SwinIR: Image Restoration Using Swin Transformer (official repository)

Home Page: https://arxiv.org/abs/2108.10257

Training time of SwinIR; impact of learning rate (fixing the lr to 1e-5 for x4 fine-tuning is slightly better)

shengkelong opened this issue · comments

In my opinion, the transformer costs a lot of memory, and the paper points out that although SwinIR has fewer parameters, it is much slower than RCAN. So I'm curious about the cost of training. Thank you.

Experiments are conducted on a machine with 8 Nvidia 2080 Ti GPUs. We use batch_size=32 and fewer total iterations to save time.

classical_sr_x2 (trained on DIV2K, patch size = 48x48) takes about 1.75 days to train for 500K iterations.

classical_sr_x4 (trained on DIV2K, patch size = 48x48) takes about 0.95 days to train for 250K iterations. Note that we fine-tune the x3/x4/x8 models from x2 and halve the learning rate and total training iterations to reduce training time.
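
In code, the fine-tuning setup looks roughly like this (a minimal sketch, not the training script we actually used; the stand-in network and the checkpoint path are placeholders, and the real model is built from models/network_swinir.py):

```python
# Minimal sketch of the x2 -> x4 fine-tuning recipe described above.
# The tiny stand-in network only keeps the example runnable; in practice the
# model is the SwinIR defined in models/network_swinir.py, and the checkpoint
# path below is a placeholder.
import torch
import torch.nn as nn

model_x4 = nn.Sequential(           # stand-in for the x4 SwinIR
    nn.Conv2d(3, 64, 3, padding=1),
    nn.Conv2d(64, 3, 3, padding=1),
)

# Initialize from the pretrained x2 weights. strict=False tolerates keys that
# exist in only one of the two models (the scale-specific upsampling layers);
# same-named parameters with different shapes would have to be dropped from
# the state dict first.
# state = torch.load("swinir_classical_sr_x2.pth", map_location="cpu")
# model_x4.load_state_dict(state, strict=False)

# Halve the learning rate and the total iterations (500K -> 250K) relative to
# the x2 run, which is where the training-time saving comes from.
optimizer = torch.optim.Adam(model_x4.parameters(), lr=1e-4)
total_iters = 250_000
```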

Thanks. So does that mean it is difficult to train on a single card because of memory, even with a batch size of 16?

For a medium-size SwinIR (used for classical image SR, around 12M parameters), we need about 24GB (2x12GB GPUs) for batch_size=16.
For a small SwinIR (used for lightweight image SR, around 900K parameters), we need about 12GB (1x12GB GPU) for batch_size=16.

@JingyunLiang
Hi, thanks for your work!
I found something strange. You use half the learning rate to fine-tune BIX3/BIX4/BIX8; why only halve the lr for fine-tuning?
If you use 1e-4 to train SwinIR-BIX2 and then use 1e-5 to fine-tune BIX4, you can get a much better result (even an unbelievably high PSNR/SSIM) than with half the lr (5e-5).
Is this kind of training trick cheating? (It still improves PSNR by 0.2 dB on Manga109 and by 0.0x dB on the other image SR benchmarks.)

I only use half the lr for fine-tuning (2e-4 may be too large, so we use 1e-4), together with half the training iterations, which saves half of the training time. Fine-tuning from x2 is a common practice, e.g., RCAN (ECCV 2018).

As for using 1e-4 to train SwinIR-BIX2, I have never tried it. Your observation is really surprising! In my experience, the learning rate doesn't have much impact as long as you decrease it gradually. Maybe Transformers have different characteristics from CNNs in learning-rate selection.

@Senwang98 Thank you for your report. Can you post more details here? Is it classical SR or lightweight SR? What are your PSNR values on the five benchmarks? Is your network architecture identical to ours (see models/network_swinir.py)?

I will try it and validate your finding~~ My results will be updated here.

@JingyunLiang
Thanks for your quick reply!
The result was found with a CNN-based network. I don't think my code is wrong, because I used the EDSR-pytorch repo to train the model.
Several days ago, I conducted an experiment on RCAN. I took RCAN-BIX2.pt (trained with lr 1e-4, halved every 200000 iterations), then used 1e-5 to fine-tune RCAN-BIX4. This time, I did not decay the learning rate every 200000 iterations; that is, I used 1e-5 for the whole training without changing the lr! (A rough sketch of the two schedules is at the end of this comment.)
I think you are an expert in this field, so do you think this kind of training trick is cheating?
If I decay the lr when training BIX4, the result is OK. If I use a much smaller lr and never change it, the final result is better (for RCAN, BIX4 performance on Manga109 improves from 31.22 to around 31.45).
Can you give me some suggestions? (I don't mean your SwinIR is wrong; I just want to explain this strange thing!)
Thanks again for your interesting work, and maybe you can use SwinIR-BIX2 to fine-tune BIX4 without changing the lr during training!
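
To make the comparison concrete, the two schedules I compared look roughly like this in plain PyTorch (the network is just a stand-in, not the actual RCAN, and the exact EDSR-pytorch settings are not reproduced here):

```python
# Rough sketch of the two RCAN fine-tuning schedules compared above.
# The network is a stand-in; only the learning-rate schedules matter here.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for RCAN-BIX4

# (a) Usual schedule: start at 1e-4 and halve the lr every 200000 iterations.
opt_a = torch.optim.Adam(model.parameters(), lr=1e-4)
sched_a = torch.optim.lr_scheduler.StepLR(opt_a, step_size=200_000, gamma=0.5)

# (b) The variant described above: a constant lr of 1e-5 for the whole
#     fine-tuning, with no scheduler at all.
opt_b = torch.optim.Adam(model.parameters(), lr=1e-5)
```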

It is possible for RCAN, as it is a very deep network and should have strong representation ability. A better training strategy or longer training time may help.

In my view, if all other settings are the same as the original RCAN (same datasets, same patch size, same training iterations, same optimizer, etc.), changing the lr could be a good trick, and I think it is fair. If it is useful for other CNNs as well, future works should adopt this strategy. However, we should point the lr strategy out in the paper and do some ablation studies when comparing with these older methods.

As for your suggestion of fine-tuning SwinIR-BIX4 with a fixed lr (1e-5), I will try it and keep you updated. Thank you.

@JingyunLiang
Yes, you are right; this training setting should be reported in the paper, and some study should also be done to support this trick.
I will test training other CNN-based models later. If it actually works, I will tell you.
Thanks for your reply!

  • Update: I compared three learning-rate strategies when fine-tuning x4 from the x2 classical SR model (values are PSNR/SSIM):

| Case | Init LR | LR milestones (total iters) | Set5 | Set14 | BSD100 | Urban100 | Manga109 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 (used in the paper) | 1e-4 | [125000, 200000, 225000, 237500] (250000) | 32.72/0.9021 | 28.94/0.7914 | 27.83/0.7459 | 27.07/0.8164 | 31.67/0.9226 |
| 2 | 1e-5 | none (250000) | 32.69/0.9018 | 28.96/0.7920 | 27.84/0.7463 | 27.07/0.8165 | 31.73/0.9227 |
| 3 | 1e-5 | none (500000) | 32.69/0.9020 | 28.96/0.7918 | 27.84/0.7462 | 27.08/0.8168 | 31.69/0.9228 |

  • Conclusion: The PSNR change ranges from -0.03 to +0.06 dB. The second lr strategy (fixing the lr to 1e-5 for x4 fine-tuning) is only slightly better.
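
For reference, the three cases correspond roughly to the following scheduler setups (a sketch; the model is a stand-in and only the lr schedule is shown):

```python
# The three learning-rate strategies from the table above, written out with
# torch.optim.lr_scheduler. The model is a stand-in for the x4 SwinIR.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for SwinIR x4

# Case 1 (paper): init lr 1e-4, halved at the listed milestones, 250K iters.
opt1 = torch.optim.Adam(model.parameters(), lr=1e-4)
sched1 = torch.optim.lr_scheduler.MultiStepLR(
    opt1, milestones=[125000, 200000, 225000, 237500], gamma=0.5
)

# Case 2: fixed lr 1e-5 for 250K iters (no scheduler).
opt2 = torch.optim.Adam(model.parameters(), lr=1e-5)

# Case 3: fixed lr 1e-5 for 500K iters (no scheduler, twice as long).
opt3 = torch.optim.Adam(model.parameters(), lr=1e-5)
```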

@JingyunLiang
OK, I will check. Maybe it is more useful for CNN-based models. (Although I think this strategy should not work, the results are really better in my repo, haha.)

Feel free to reopen this issue if you have more questions.