JingyunLiang / SwinIR

SwinIR: Image Restoration Using Swin Transformer (official repository)

Home Page: https://arxiv.org/abs/2108.10257

Questions about the training efficiency

XiaoqiangZhou opened this issue · comments

Thanks for releasing the code of SwinIR, which is really great work for low-level vision tasks.

However, when I train the SwinIR model following the guidance provided in the repo, I find that the training efficiency is relatively low.

Specifically, the GPU utilization rate stays at 0 for a while from time to time (it runs for about 14 seconds and then idles for 14 seconds). When the GPU utilization is 0, the CPU utilization is also 0. It's worth noting that I use DDP training on 8 TITAN RTX GPUs with the default batch_size. I train the classical SR task on the DIV2K dataset at X2 scale. After half a day of training, the epoch, iteration, and PSNR on Set5 are about 1500, 42000, and 35.73 dB, respectively. At this rate it will take about 5 days to finish the 500k iterations, far exceeding the 2 days reported in the README.

Could you please help me figure out the reason for the low training efficiency?

It's strange. When I use DDP, the GPU utilization fluctuates but is always high (70%-100%). Can you try to train the model using one GPU and check the GPU utilization?

@JingyunLiang Thanks for your quick reply~

Following your instruction, I tried to train the model using one GPU to check the GPU utilization.
After changing the gpu_ids in the config file from [0,1,2,3,4,5,6,7] to [0] and changing the corresponding dataloader_batch_size from 32 to 4, I tried two ways to train the model, i.e., DDP and DP, on one TITAN RTX GPU.

When I use DP, by running python main_train_psnr.py --opt options/swinir/the_config_file.json, the GPU utilization stays around 50%. Maybe I can increase the batch_size to fully utilize the GPU under DP mode?

When I use DDP, by running python -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 main_train_psnr.py --opt options/swinir/the_config_file.json --dist True, the low-efficiency phenomenon still exists. So the problem may be caused by the DDP training process. If I keep using DDP mode, I may try adjusting other configurations such as dataloader_num_workers, e.g. by first timing the dataloader alone, as sketched below. Do you have any other suggestions?
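To check whether the stalls come from data loading rather than from DDP synchronization, here is a rough timing sketch I have in mind (the dummy dataset is only a placeholder; in practice you would plug in the real paired L/H patch dataset that the options file builds, so the dataset class and patch sizes below are illustrative assumptions, not code from the repo):

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

class DummyPatchDataset(Dataset):
    """Placeholder for the real paired L/H patch dataset built from the
    options file; replace it with that dataset to measure the real IO cost."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        # 48x48 LR / 96x96 HR random patches, roughly matching classical x2 SR
        return {"L": torch.rand(3, 48, 48), "H": torch.rand(3, 96, 96)}

def time_loader(dataset, num_workers, batch_size=4, n_batches=50):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        num_workers=num_workers, pin_memory=True, drop_last=True)
    it = iter(loader)
    next(it)                      # warm up the worker processes
    start = time.time()
    for _ in range(n_batches):
        next(it)                  # pure data-loading time, no model involved
    return (time.time() - start) / n_batches

if __name__ == "__main__":
    dataset = DummyPatchDataset()
    for nw in (2, 4, 8, 16):
        print(f"num_workers={nw}: {time_loader(dataset, nw):.3f} s/batch")
```

If the per-batch time stays high even with more workers on the real dataset, the bottleneck is probably disk IO from decoding the full-resolution DIV2K images rather than the DDP setup itself.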

I will try to solve this problem in the coming days and update my progress here. If there is no progress on the GPU utilization, I will close this issue.

Thanks.

By the way, I'm using torch==1.7.0.

@XiaoqiangZhou any update on this? I am also facing a similarly slow training time. With batch size 16 and 1000 iterations per epoch, it takes about 1000 seconds to run a single epoch. Any insights on this, @JingyunLiang?

I have the same problem; the GPU utilization is very low.

Has anyone solved this problem?

You can try cropping the large images in the training set into small sub-images beforehand, because a lot of IO time is spent reading the high-resolution images.

After using this method, the GPU utilization no longer dropped to 0 during my training. If problems remain, the CPU performance of the server is probably insufficient.
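In case it helps others, here is a minimal sketch of that cropping step (a one-off preprocessing pass; the directory names, tile size, and stride are illustrative assumptions, not settings from the repo). Afterwards you would point the training options at the folder of sub-images instead of the original HR folder:

```python
import os
from PIL import Image

def extract_subimages(src_dir, dst_dir, crop=480, step=240):
    """Split each large HR image into fixed-size overlapping tiles so the
    dataloader reads small files instead of decoding 2K-resolution PNGs
    on every iteration."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in sorted(os.listdir(src_dir)):
        if not name.lower().endswith((".png", ".jpg", ".jpeg")):
            continue
        img = Image.open(os.path.join(src_dir, name))
        w, h = img.size
        if w < crop or h < crop:
            continue  # skip images smaller than one tile
        base, ext = os.path.splitext(name)
        idx = 0
        for top in range(0, h - crop + 1, step):
            for left in range(0, w - crop + 1, step):
                tile = img.crop((left, top, left + crop, top + crop))
                idx += 1
                tile.save(os.path.join(dst_dir, f"{base}_s{idx:03d}{ext}"))

# Example call (paths are illustrative):
# extract_subimages("trainsets/DIV2K_train_HR", "trainsets/DIV2K_train_HR_sub")
```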
