Socket Timeout error when the training dataset is too large
onair1314 opened this issue
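Training on 8 A100 GPUs with 2 million samples using the provided script runs normally. Once the dataset exceeds 2.3 million samples, however, the following error is raised at the "Running tokenizer on train dataset" step: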
This may indicate a possible application crash on rank 0 or a network set up issue.[7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:604 (most recent call first)
Has anyone else run into the same problem?
Resolved. The fix is to change DeepSpeed's timeout parameter.
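For anyone hitting the same error: a likely explanation is that the other ranks wait on rank 0 while the dataset is tokenized, and with enough data that wait outlasts the process-group timeout (commonly 30 minutes by default). Below is a minimal sketch of raising it, assuming the training script calls `deepspeed.init_distributed` itself. The helper names `make_pg_timeout` and `init_distributed_with_timeout` and the 3-hour value are illustrative assumptions, not part of the original script.

```python
from datetime import timedelta


def make_pg_timeout(hours: float = 3.0) -> timedelta:
    """Build a process-group timeout; 3 hours is an arbitrary illustrative value."""
    if hours <= 0:
        raise ValueError("timeout must be positive")
    return timedelta(hours=hours)


def init_distributed_with_timeout(timeout: timedelta) -> None:
    """Initialize the distributed backend with a longer timeout.

    Assumes deepspeed is installed and the launcher has already set the
    usual rank / world-size environment variables.
    """
    import deepspeed  # imported lazily so make_pg_timeout works without it

    # `timeout` is forwarded to the underlying torch.distributed process group.
    deepspeed.init_distributed(dist_backend="nccl", timeout=timeout)


if __name__ == "__main__":
    timeout = make_pg_timeout(hours=3.0)
    print(f"Requested process-group timeout: {timeout}")
    # In the training script, call init_distributed_with_timeout(timeout)
    # before the tokenization step starts.
```

If the script is driven by the Hugging Face `Trainer` rather than calling DeepSpeed directly, the `ddp_timeout` field of `TrainingArguments` (in seconds) may be the relevant setting instead. Which of the two applies depends on how the script initializes the process group.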