THUDM / Multilingual-GLM

The multilingual variant of GLM, a general language model trained with an autoregressive blank-infilling objective

Issue with the Blank Filling (Interactive) script

maojinyang opened this issue · comments

After launching the Blank Filling (Interactive) script, the program hangs when it reaches torch.distributed.init_process_group inside initialize_distributed(args), and then exits with a timeout after 1800000 ms. The error message is as follows:

Traceback (most recent call last):
  File "/Multilingual-GLM/generate_samples.py", line 165, in <module>
    main(args)
  File "/Multilingual-GLM/generate_samples.py", line 53, in main
    initialize_distributed(args)
  File "/Multilingual-GLM/SwissArmyTransformer/training/deepspeed_training.py", line 529, in initialize_distributed
    torch.distributed.init_process_group(
  File "/miniconda3/envs/GLM-130B/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/miniconda3/envs/GLM-130B/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/miniconda3/envs/GLM-130B/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 153, in _create_c10d_store
    tcp_store = TCPStore(hostname, port, world_size, False, timeout)
RuntimeError: connect() timed out. Original timeout was 1800000 ms.
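
For reference, the connect() timed out error comes from the TCPStore client that the env:// rendezvous creates inside init_process_group: it usually means the process cannot reach a store server at MASTER_ADDR:MASTER_PORT (unreachable host, blocked or wrong port), or that WORLD_SIZE is larger than the number of processes actually launched. Below is a minimal diagnostic sketch that exercises the same rendezvous outside the GLM scripts; the address, port, and single-process setup are assumptions, not part of the repo.

```python
# Minimal sanity check for the env:// rendezvous used by init_process_group.
# Hypothetical single-node, single-process setup; MASTER_ADDR/MASTER_PORT are
# assumptions and must point to a host/port reachable from every rank.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # must be reachable by all ranks
os.environ.setdefault("MASTER_PORT", "29500")      # must be a free, unblocked port
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")           # must equal the number of launched processes

# If this succeeds, the rendezvous environment on this machine is sane; if the
# full script still times out, check that MASTER_ADDR/MASTER_PORT are reachable
# from every rank and that WORLD_SIZE matches the processes actually started.
dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
print("initialized rank", dist.get_rank(), "of", dist.get_world_size())
dist.destroy_process_group()
```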

What is your development environment? And are all the package versions in your pip environment consistent?

The development environment is a Linux server, and the torch version is 1.10+cu111.

Could you try torch 1.9? We have tested on several machines, and the code runs fine with the configuration in the README. The torch and deepspeed versions are probably the key factors.
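
If it helps, here is a small sketch (not part of the repo) for printing the versions flagged as critical above, so they can be compared against the ones pinned in the README:

```python
# Print the torch / CUDA / deepspeed versions for comparison with the README.
import torch
import deepspeed

print("torch     :", torch.__version__)        # the maintainers suggest trying 1.9
print("cuda      :", torch.version.cuda)
print("deepspeed :", deepspeed.__version__)
```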

Also, this codebase is an older version of the original GLM code, and the newer GLM code may have fixed some of these issues. You can also load our model with the latest GLM code, but you will need to manually modify or add some configuration (model_config, tokenizer, etc.).

Understood, thank you very much!