About training code
Pexure opened this issue
Hi there. Thanks for your amazing work, but I have some questions about the training code.

- Do we need to modify `main_train_psnr.py` (KAIR) to set the number of training iterations to 500K? The original file runs for 1M epochs.
- I launched training with `python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 main_train_psnr.py --opt options/swinir/train_swinir_sr_classical.json --dist True` on 8 RTX 3090 GPUs, with the DIV2K train split (default X2) as the dataset. The estimated training time for 500K iterations is ~3.5 days (1 min / 100 iterations), much longer than your 1.8 days on 8 2080 Ti GPUs. Do you have any idea why?
1. Yes, 500K iterations is enough for SR.
2. No idea. Maybe you can increase `n_workers`. Or you can try the code here.
Thanks for your reply. I have found the reason: I'm new to SR and missed the data preparation step described in BasicSR. I think it would be better to make this clear in KAIR :)
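For other readers hitting the same slowdown: the data preparation step crops the large DIV2K images into small sub-images once, offline, so the data loader reads small files instead of decoding 2K-resolution images every iteration. A self-contained sketch of that idea (patch size and step here are illustrative values, not necessarily BasicSR's defaults):

```python
# Hedged sketch of sub-image extraction as done offline in BasicSR-style
# data preparation: crop each large training image into a regular grid of
# overlapping patches that are saved to disk and loaded during training.
import numpy as np

def extract_subimages(img, crop_size=480, step=240):
    """Return crop_size x crop_size patches taken on a regular grid."""
    h, w = img.shape[:2]
    patches = []
    for top in range(0, h - crop_size + 1, step):
        for left in range(0, w - crop_size + 1, step):
            patches.append(img[top:top + crop_size, left:left + crop_size])
    return patches
```

In the real pipeline each patch would be written out as its own image file; the loader then samples random patches cheaply instead of reading and cropping full-size images on the fly.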
Thanks for your advice.