ViT training terminates without error messages

Question

DianCh opened this issue a year ago

Hi! Thank you for releasing this wonderful work.

I've been trying out the ViT training code and have run into a problem. I launched the training script without modifications, but training fails after several epochs (each time in the middle of the 7th epoch) with the following message:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 11410) of binary: /opt/conda/bin/python           
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

The log does not show the underlying error, so it is not very informative. Setting NCCL_DEBUG=INFO didn't reveal any messages beyond those above. Have you encountered a similar issue? I'm using PyTorch 1.9.0 with CUDA 11.1. Could you please share the environment you used?

Thank you very much!

wuhaixu2016 · Answer 1 · Sun Jan 29 2023 21:57:59 GMT+0800 (China Standard Time)

Hi, thanks for your interest! Sorry, I haven't encountered this issue.
Our experimental environment is PyTorch 1.7.1 with CUDA 11.0. I hope this helps.
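
For anyone else hitting the same silent `exitcode: 1`: that exit code usually means a Python exception was raised inside the worker and the launcher only reported the failure. The sketch below is not from this thread or the repository. It is a hypothetical helper, `log_worker_errors`, that wraps a training entry point and writes the traceback to a per-rank file. It assumes the launcher sets the `LOCAL_RANK` environment variable, and the log file name is made up for illustration. It will not catch a worker that is killed by a signal (for example, by the out-of-memory killer).

```python
import functools
import os
import sys
import traceback


def log_worker_errors(log_dir="."):
    """Hypothetical decorator: save any uncaught exception to a per-rank log file."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Assumption: the launcher exports LOCAL_RANK; fall back to "0".
                rank = os.environ.get("LOCAL_RANK", "0")
                path = os.path.join(log_dir, f"worker_error_rank{rank}.log")
                with open(path, "w") as f:
                    traceback.print_exc(file=f)
                # Also print to stderr in case the launcher forwards it.
                traceback.print_exc(file=sys.stderr)
                # Re-raise so the process still exits with a non-zero code.
                raise

        return wrapper

    return decorator
```

Decorating the script's entry point (for example, `@log_worker_errors(log_dir="logs")` above a `main()` function, with the `logs` directory already created) should leave a traceback file for the failing rank. PyTorch 1.9 and later also provide a `record` decorator in `torch.distributed.elastic.multiprocessing.errors` that serves a similar purpose.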