ViT training terminates without error messages
DianCh opened this issue
Hi! Thank you for releasing this wonderful work.
I've been trying out the ViT training code and ran into a problem. I launched the training script without modifications, but training fails after several epochs (each time in the middle of the 7th epoch) with the following message:
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 11410) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
The message does not show the underlying error, so it is not very informative. Setting NCCL_DEBUG=INFO did not reveal anything beyond the above. Have you encountered a similar issue? I'm using PyTorch 1.9.0 with CUDA 11.1. Could you please share the environment you used?
Thank you very much!
Hi, thanks for your interest! Sorry, I haven't encountered this issue.
Our experimental environment is PyTorch 1.7.1 with CUDA 11.0. I hope this helps.
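For the silent worker failure reported above, one possible way to surface the hidden traceback is to have each worker write its own uncaught exception to a per-rank log file. The snippet below is a minimal sketch using only the Python standard library. The helper name `log_worker_errors` and the `worker_logs` directory are hypothetical, and it assumes the launcher exports a `LOCAL_RANK` environment variable to each worker. It is not part of this repository's training script.

```python
import faulthandler
import functools
import os
import traceback


def log_worker_errors(log_dir="worker_logs"):
    """Hypothetical helper: log uncaught exceptions to a per-rank file."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Assumption: the distributed launcher sets LOCAL_RANK per worker.
            rank = os.environ.get("LOCAL_RANK", "0")
            os.makedirs(log_dir, exist_ok=True)
            path = os.path.join(log_dir, f"rank_{rank}.log")
            with open(path, "a") as log:
                # Also dump Python stacks on fatal signals (e.g. SIGSEGV).
                faulthandler.enable(file=log)
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    # Record the full traceback, then re-raise so the
                    # worker still exits with a non-zero code.
                    traceback.print_exc(file=log)
                    log.flush()
                    raise
                finally:
                    faulthandler.disable()
        return wrapper
    return decorator


@log_worker_errors()
def main():
    pass  # placeholder for the actual training entry point
```

After a failed run, the file for the failing rank (for example `worker_logs/rank_2.log` for `local_rank: 2`) should contain the Python traceback, if the failure was a Python exception. PyTorch's elastic package also provides a `record` decorator in `torch.distributed.elastic.multiprocessing.errors` that serves a similar purpose.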