frank-xwang / RIDE-LongTailRecognition

[ICLR 2021 Spotlight] Code release for "Long-tailed Recognition by Routing Diverse Distribution-Aware Experts."

[GPU Utilization] DataLoader iteration speed quite low at the start of every epoch

CiaoHe opened this issue · comments

Hi there,
Thanks for your great work!
During training (ResNeXt-50 on ImageNet-LT) with 8 A100 GPUs (total batch size: 1024), I found that the dataloader stalls for around 30 seconds at the start of every epoch. Even after switching the code to DDP mode, it still stalls for about 30 seconds. I wonder whether there is a problem related to the ImageNetLTDataLoader?

Best,

Hi! We are not experiencing similar problems. Have you checked that you are using a sufficient number of workers? And, if you save checkpoints at every epoch, you might also want to check the speed of checkpoint saving. So far we have not found any bugs in our ImageNetLTDataLoader; if you find any, please feel free to let us know. Thanks!

Thanks for sharing! I use num_workers=12 with 8 GPUs. Checkpoint saving speed is all right. The larger the batch size, the worse the stall becomes. For now I have set the total batch size back to the default of 256, and the speed is acceptable. Anyway, thanks for your response!
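For readers hitting the same symptom: a per-epoch stall like this is often the cost of PyTorch's DataLoader tearing down and re-forking its worker processes between epochs. A sketch of the standard mitigation, using a toy `TensorDataset` as a stand-in for the actual ImageNet-LT dataset (the dataset, batch size, and worker count here are illustrative, not the repo's settings):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in dataset; in the issue above this would be ImageNet-LT.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

# persistent_workers=True keeps worker processes alive across epochs,
# avoiding the re-spawn cost that shows up as a stall at the start of
# every epoch. It requires num_workers > 0.
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=2,
    persistent_workers=True,
    pin_memory=True,   # faster host-to-GPU transfers when training on CUDA
    shuffle=True,
)

batches_per_epoch = []
for epoch in range(2):
    n = 0
    for images, labels in loader:
        n += 1   # training step would go here
    batches_per_epoch.append(n)
```

Note that `persistent_workers` was added in PyTorch 1.7; on older versions the worker re-spawn at each epoch boundary is unavoidable, and raising `num_workers` only shortens, not removes, the warm-up.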