thuml / Flowformer

Code release for "Flowformer: Linearizing Transformers with Conservation Flows" (ICML 2022), https://arxiv.org/pdf/2202.06258.pdf

ViT training terminates without error messages

DianCh opened this issue

Hi! Thank you for releasing this wonderful work.

I've been trying out the ViT training code but have run into an error. I launched the training script without modifications, and training fails after several epochs (each time in the middle of the 7th epoch) with the following message:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 11410) of binary: /opt/conda/bin/python           
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

The actual error is not shown, so the message is not very informative. Setting NCCL_DEBUG=INFO didn't reveal anything beyond the above. Have you encountered a similar issue? I'm using PyTorch 1.9.0 with CUDA 11.1. Could you please share the environment you used?

Thank you very much!
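
For anyone hitting the same silent failure: the launcher log above only reports that a worker exited with code 1, not why. A minimal sketch of one way to surface the underlying error is to wrap the training entry point so that each worker writes its own traceback to a file. This assumes the worker is raising a Python exception (rather than being killed by a signal) and that the launcher exports `LOCAL_RANK`; the names `run_with_traceback_log` and `worker_error_rank<N>.log` are hypothetical and not part of the Flowformer code.

```python
import os
import sys
import traceback


def run_with_traceback_log(main_fn, log_dir="."):
    """Run main_fn and record any uncaught exception in a per-rank file.

    Assumption: the launcher sets the LOCAL_RANK environment variable.
    It falls back to "0" so the wrapper also works in a single process.
    """
    rank = os.environ.get("LOCAL_RANK", "0")
    log_path = os.path.join(log_dir, f"worker_error_rank{rank}.log")
    try:
        return main_fn()
    except Exception:
        # Write the full traceback to a file that survives the worker's exit.
        with open(log_path, "w") as f:
            traceback.print_exc(file=f)
        # Also print it to stderr in case the launcher forwards worker output.
        traceback.print_exc(file=sys.stderr)
        # Re-raise so the process still exits with a non-zero code.
        raise


# Hypothetical usage inside a training script:
#     run_with_traceback_log(main)
```

PyTorch 1.9 and later also provide a `record` decorator in `torch.distributed.elastic.multiprocessing.errors` that serves a similar purpose; the wrapper above is just a dependency-free alternative.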

Hi, thanks for your interest! Sorry, I haven't encountered this issue.
Our experimental environment is PyTorch 1.7.1 with CUDA 11.0. Hope this helps.
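
Since the two setups in this thread differ (PyTorch 1.9.0 with CUDA 11.1 versus PyTorch 1.7.1 with CUDA 11.0), it may help to print the versions from inside the training environment before comparing. The following is a small sketch; the function name `collect_env` is hypothetical, and it degrades gracefully when PyTorch is not installed.

```python
import platform


def collect_env():
    """Return a dict describing the Python and (if installed) PyTorch setup."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()
    info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

PyTorch's own `python -m torch.utils.collect_env` prints a more detailed report and is usually what maintainers ask for in bug reports.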