SpongebBob / Finetune-ChatGLM2-6B

Full-parameter fine-tuning of ChatGLM2-6B, with efficient fine-tuning support for multi-turn dialogue.

Socket Timeout error when the training dataset is too large

onair1314 opened this issue

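Training on 8 A100 GPUs with 2 million samples using the provided script runs normally, but once the dataset exceeds 2.3 million samples, the following error is raised at the "Running tokenizer on train dataset" step: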
This may indicate a possible application crash on rank 0 or a network set up issue.[7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:604 (most recent call first)
Has anyone else run into the same problem?

Solved: the deepspeed timeout parameter needs to be modified.
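
The thread does not say which setting was changed, so the following is only a minimal sketch of one plausible way to do it. It assumes the timeout in question is the process-group timeout passed to `deepspeed.init_distributed`, and that the other ranks time out while waiting for rank 0 to finish tokenizing. The helper names `build_init_kwargs` and `init_with_long_timeout` and the 3-hour value are hypothetical and not taken from the repository.

```python
from datetime import timedelta


def build_init_kwargs(timeout_hours: float = 3.0) -> dict:
    """Keyword arguments for deepspeed.init_distributed with a longer timeout.

    Assumption: 3 hours is an illustrative value; pick something longer than
    the time rank 0 needs to tokenize the full dataset.
    """
    return {
        "dist_backend": "nccl",
        "timeout": timedelta(hours=timeout_hours),
    }


def init_with_long_timeout(timeout_hours: float = 3.0) -> None:
    # Imported lazily so build_init_kwargs can be used without deepspeed installed.
    import deepspeed

    # Must run before any other code initializes the distributed backend,
    # otherwise the default timeout is already in effect.
    deepspeed.init_distributed(**build_init_kwargs(timeout_hours))
```

If the training script builds its arguments through Hugging Face `TrainingArguments`, recent versions of transformers expose a `ddp_timeout` option (in seconds) that may serve the same purpose; whether it applies here depends on how the script initializes distributed training.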