Questions about the training efficiency
XiaoqiangZhou opened this issue
Thanks for releasing the code of SwinIR, which is a really great work for low-level vision tasks.
However, when I train the SwinIR model following the guidance provided in the repo, I find the training efficiency is rather low.
Specifically, the GPU utilization periodically drops to 0 for a while (roughly 14 seconds of running followed by 14 seconds of idling). When the GPU utilization is 0, the CPU utilization is also 0. It's worth noting that I use DDP training on 8 TITAN RTX GPUs with the default batch_size. I train the classical SR task on the DIV2K dataset at the X2 scale. After half a day of training, the epoch, iteration and PSNR on Set5 are about 1500, 42000 and 35.73dB, respectively. So it will take about 5 days to finish the 500k iterations, far exceeding the 2 days reported in the README.
Could you please help me figure out the reason for the low training efficiency?
It's strange. When I use DDP, the GPU utilization fluctuates but is always high (70%-100%). Can you try to train the model using one GPU and check the GPU utilization?
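For reference, here is one way to log utilization over time while training runs. This is a minimal sketch, not part of the repo; it assumes `nvidia-smi` is available on PATH and simply polls it once per second so you can see whether utilization periodically drops to 0:

```python
# Minimal GPU-utilization logger (a sketch, not from the SwinIR/KAIR repo).
# Assumes `nvidia-smi` is on PATH; prints one utilization value per GPU each second.
import subprocess
import time

def gpu_utilization():
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    # One line per GPU, e.g. "87\n0" -> [87, 0]
    return [int(x) for x in out.strip().splitlines()]

if __name__ == "__main__":
    while True:
        print(time.strftime("%H:%M:%S"), gpu_utilization())
        time.sleep(1)
```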
@JingyunLiang Thanks for your quick reply~
Following your suggestion, I tried training the model with one GPU to check the GPU utilization.
After changing `gpu_ids` in the config file from `[0,1,2,3,4,5,6,7]` to `[0]` and the corresponding `dataloader_batch_size` from 32 to 4, I tried two ways to train the model, i.e., DDP and DP, with a single Titan RTX GPU.
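For reference, the fields I touched look roughly like this (shown as a Python dict mirroring the JSON option file; the exact nesting under `datasets.train` is my assumption, so check your own `options/swinir/*.json`):

```python
# Sketch of the single-GPU settings; not a verbatim copy of the repo's option file.
opt_single_gpu = {
    "gpu_ids": [0],                       # was [0, 1, 2, 3, 4, 5, 6, 7]
    "datasets": {
        "train": {
            "dataloader_batch_size": 4,   # was 32, scaled down with the GPU count
            "dataloader_num_workers": 8,  # worth tuning if data loading is the bottleneck
        }
    },
}
```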
When I use DP, by running `python main_train_psnr.py --opt options/swinir/the_config_file.json`, the GPU utilization stays around 50%. Maybe I can try increasing the batch_size to better utilize the GPU under DP mode?
When I use DDP, by running `python -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 main_train_psnr.py --opt options/swinir/the_config_file.json --dist True`, the low-efficiency phenomenon still exists. So the problem is probably caused by the DDP training process. If I insist on using DDP mode, I may try adjusting some other configurations such as `dataloader_num_workers`. Do you have any other suggestions?
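In case it helps anyone debugging this, the usual knobs are on the PyTorch DataLoader itself. Below is a minimal sketch of settings that often smooth out periodic stalls; the dataset class is a dummy stand-in, not the repo's actual dataset:

```python
# Sketch of DataLoader settings that reduce per-epoch/per-batch loading stalls.
# The dataset below is a placeholder so the snippet runs on its own.
import torch
from torch.utils.data import DataLoader, Dataset

class DummySRDataset(Dataset):
    """Stand-in for the real SR dataset; returns random LR/HR pairs."""
    def __len__(self):
        return 1000
    def __getitem__(self, idx):
        return {"L": torch.randn(3, 48, 48), "H": torch.randn(3, 96, 96)}

train_loader = DataLoader(
    DummySRDataset(),
    batch_size=4,
    shuffle=True,             # with DDP, use a DistributedSampler and shuffle=False
    num_workers=8,            # more workers hide image decoding / IO latency
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs (PyTorch >= 1.7)
    prefetch_factor=4,        # batches prefetched per worker (PyTorch >= 1.7)
)
```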
I will try to solve this problem in the coming days and update my progress here. If there is no progress on the GPU utilization, I will close this issue.
Thanks.
By the way, I'm using `torch==1.7.0`.
@XiaoqiangZhou any update on this? I am also facing similarly slow training times. With batch size 16 and 1000 iterations per epoch, it takes about 1000 seconds to run a single epoch. Any insights on this, @JingyunLiang?
I have the same problem, GPU utilization is very low.
Has anyone solved this problem?
You can try cropping the large images in the training set into small sub-images beforehand, because a lot of I/O time is spent reading the high-resolution images.
After using this method, the GPU utilization never dropped to 0 during my training. If the problem still exists, the CPU in your server is probably the bottleneck.
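For anyone who wants to try this, below is a minimal sketch of the offline cropping step. The input/output paths, tile size and stride are assumptions; adapt them to your own data layout:

```python
# Crop large HR training images into overlapping sub-images once, offline,
# so each training iteration only reads a small file.
import os
from PIL import Image

SRC_DIR = "trainsets/DIV2K_train_HR"       # assumed input folder
DST_DIR = "trainsets/DIV2K_train_HR_sub"   # assumed output folder
PATCH, STEP = 480, 240                      # tile size and stride (overlapping tiles)

os.makedirs(DST_DIR, exist_ok=True)
for name in sorted(os.listdir(SRC_DIR)):
    img = Image.open(os.path.join(SRC_DIR, name))
    w, h = img.size
    base, ext = os.path.splitext(name)
    idx = 0
    for top in range(0, h - PATCH + 1, STEP):
        for left in range(0, w - PATCH + 1, STEP):
            idx += 1
            tile = img.crop((left, top, left + PATCH, top + PATCH))
            tile.save(os.path.join(DST_DIR, f"{base}_s{idx:03d}{ext}"))
```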