Socket Timeout error when the training dataset is too large

Question

onair1314 opened this issue a year ago

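Training on 8 A100 GPUs with 2 million samples using the provided script runs fine, but once the dataset exceeds 2.3 million samples, the following error is raised at the "Running tokenizer on train dataset" step: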
This may indicate a possible application crash on rank 0 or a network set up issue.[7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:604 (most recent call first)
Has anyone else run into the same problem?

Solved: the DeepSpeed timeout parameter needs to be changed (increased).
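
For anyone who lands here later, a rough sketch of what raising the timeout could look like. The thread does not say which parameter was actually changed, so this is an assumption: it uses the `timeout` argument of `deepspeed.init_distributed`, which takes a `datetime.timedelta`. The likely cause is that the other ranks wait on the key-value store while rank 0 is still tokenizing, so the timeout has to be longer than the tokenization step. The 3-hour value and the helper names `make_pg_timeout` and `init_distributed_with_timeout` are hypothetical.

```python
from datetime import timedelta

# Hypothetical budget: choose something longer than the time rank 0
# needs to finish "Running tokenizer on train dataset".
TOKENIZE_BUDGET_HOURS = 3


def make_pg_timeout(hours: float) -> timedelta:
    """Build the process-group timeout as a timedelta."""
    if hours <= 0:
        raise ValueError("timeout must be positive")
    return timedelta(hours=hours)


def init_distributed_with_timeout(hours: float = TOKENIZE_BUDGET_HOURS) -> None:
    """Initialize the process group with a longer timeout (assumed fix)."""
    # Imported lazily so this file can be imported without DeepSpeed installed.
    import deepspeed

    # Call this before the model/engine is created, in place of the
    # default initialization, so that the longer timeout takes effect.
    deepspeed.init_distributed(dist_backend="nccl", timeout=make_pg_timeout(hours))
```

If the training script is built on the Hugging Face `Trainer` (an assumption, the thread does not say), the corresponding setting would be the `--ddp_timeout` training argument, given in seconds. Tokenizing the dataset once ahead of time and caching it should also avoid the long wait.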