Problem with the Blank Filling (Interactive) script
maojinyang opened this issue · comments
TUR1NG commented
After I run the Blank Filling (Interactive) script, the program hangs at torch.distributed.init_process_group
inside the initialize_distributed(args) method,
then exits with a timeout after 1800000 ms (30 minutes). The error message is:
Traceback (most recent call last):
  File "/Multilingual-GLM/generate_samples.py", line 165, in <module>
    main(args)
  File "/Multilingual-GLM/generate_samples.py", line 53, in main
    initialize_distributed(args)
  File "/Multilingual-GLM/SwissArmyTransformer/training/deepspeed_training.py", line 529, in initialize_distributed
    torch.distributed.init_process_group(
  File "/miniconda3/envs/GLM-130B/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/miniconda3/envs/GLM-130B/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/miniconda3/envs/GLM-130B/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 153, in _create_c10d_store
    tcp_store = TCPStore(hostname, port, world_size, False, timeout)
RuntimeError: connect() timed out. Original timeout was 1800000 ms.
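The traceback passes through _env_rendezvous_handler, which reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment, and the failure is a TCP connect() that never completes. Below is a minimal diagnostic sketch, not part of the repository, that could help narrow down a hang like this. It prints those variables and tests whether the master address accepts a TCP connection. The helper names can_reach_master and describe_rendezvous_env are hypothetical, and the 5-second timeout is an arbitrary choice.

```python
import os
import socket


def can_reach_master(addr: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to addr:port succeeds within timeout_s.

    Hypothetical helper: a quick reachability probe, not a PyTorch API.
    """
    try:
        with socket.create_connection((addr, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def describe_rendezvous_env(env=None) -> dict:
    """Collect the variables that the env:// rendezvous reads (None if unset)."""
    env = os.environ if env is None else env
    names = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
    return {name: env.get(name) for name in names}


if __name__ == "__main__":
    settings = describe_rendezvous_env()
    print(settings)
    if settings["MASTER_ADDR"] and settings["MASTER_PORT"]:
        reachable = can_reach_master(
            settings["MASTER_ADDR"], int(settings["MASTER_PORT"])
        )
        print("master reachable:", reachable)
```

If the variables are unset, or the address is not reachable while the launcher is running, that points toward the launch configuration or the network, though it does not rule out a version mismatch.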
Mengyang Sun commented
What is your development environment? Are the package versions in your pip environment all consistent?
TUR1NG commented
The development environment is a Linux server, and the torch version is 1.10+cu111.
Mengyang Sun commented
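Could you try torch 1.9? We have tested on multiple machines, and the code runs correctly with the configuration in the README. The torch and deepspeed versions are likely to be critical.
Also, this codebase is an old version of the original GLM code, and the newer GLM code may have fixed some issues. You can also load our model with the latest GLM code, but you would need to modify or add some configuration by hand (model_config, the tokenizer, etc.).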
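Since the torch and deepspeed versions are singled out as likely culprits, a quick way to compare an environment against the README is to print what is actually installed. This is a minimal sketch using only the standard library. The helper name installed_versions is hypothetical, and the package list is an assumption based on the two packages named in this thread.

```python
from importlib import metadata


def installed_versions(packages=("torch", "deepspeed")) -> dict:
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside the same conda environment used to launch the script avoids reading versions from a different interpreter.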
TUR1NG commented
Understood, thank you very much!