frank-xwang / RIDE-LongTailRecognition

[ICLR 2021 Spotlight] Code release for "Long-tailed Recognition by Routing Diverse Distribution-Aware Experts."

[GPU Utilization] DataLoader iteration speed quite low at the start of every epoch

CiaoHe opened this issue · comments

Hi there,
Thanks for your great work!
During training (ResNeXt-50 on ImageNet-LT) with 8 A100 GPUs (total batch size: 1024), I found that the dataloader stalls for around 30 seconds at the start of every epoch. Even after switching the code to DDP mode, it still stalls for about 30 seconds. I wonder whether there is a problem related to the ImageNetLTDataLoader?

Best,

Hi! We are not experiencing similar problems. Have you checked that you are using a sufficient number of workers? And, if you save checkpoints at every epoch, you might also want to check the speed of checkpoint saving. So far we have not found any bugs in our ImageNetLTDataLoader; if you find any, please feel free to let us know. Thanks!

Thanks for sharing! I use num_workers=12 with 8 GPUs. Checkpoint saving speed is all right. The larger the batch size, the worse the stall becomes. For now I have set the total batch size back to the default of 256, and the speed is acceptable. Anyway, thanks for your response!
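For readers hitting the same symptom: a per-epoch stall like this is often the cost of PyTorch's DataLoader tearing down and re-forking its worker processes between epochs. A sketch of the standard mitigation, using a toy `TensorDataset` as a stand-in for the actual ImageNet-LT dataset (the dataset, batch size, and worker count here are illustrative, not the repo's settings):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in dataset; in the issue above this would be ImageNet-LT.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

# persistent_workers=True keeps worker processes alive across epochs,
# avoiding the re-spawn cost that shows up as a stall at the start of
# every epoch. It requires num_workers > 0.
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=2,
    persistent_workers=True,
    pin_memory=True,   # faster host-to-GPU transfers when training on CUDA
    shuffle=True,
)

batches_per_epoch = []
for epoch in range(2):
    n = 0
    for images, labels in loader:
        n += 1   # training step would go here
    batches_per_epoch.append(n)
```

Note that `persistent_workers` was added in PyTorch 1.7; on older versions the worker re-spawn at each epoch boundary is unavoidable, and raising `num_workers` only shortens, not removes, the warm-up.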