TsinghuaAI / CPM-2-Pretrain

Code for CPM-2 Pre-Train

help: single-node tests pass on both machines, but multi-node parallel runs hit an environment problem

XiaoqingNLP opened this issue · comments

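Single-node tests run fine on each of my two machines, but when I run them together in multi-node parallel mode an environment problem appears: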

ip:   File "/path/to//src/pretrain_enc_dec.py", line 823, in <module>
ip:     main()
ip:   File "/path/to//src/pretrain_enc_dec.py", line 684, in main
ip:     model, optimizer, lr_scheduler = setup_model_and_optimizer(args, tokenizer.vocab_size)
ip:   File "/path/to//src/pretrain_enc_dec.py", line 157, in setup_model_and_optimizer
ip:     model, optimizer, _, lr_scheduler = deepspeed.initialize(
ip:   File "/path/to//lib/python3.8/site-packages/deepspeed/__init__.py", line 110, in initialize
ip:     engine = DeepSpeedEngine(args=args,
ip:   File "/path/to//lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 198, in __init__
ip:     util_ops = UtilsBuilder().load()
ip:   File "/path/to//lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 176, in load
ip:     return self.jit_load(verbose)
ip:   File "/path/to//lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 204, in jit_load
ip:     op_module = load(
ip:   File "/path/to//lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
ip:     return _jit_compile(
ip:   File "/path/to//lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1292, in _jit_compile
ip:     _write_ninja_file_and_build_library(
ip:   File "/path/to//lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1373, in _write_ninja_file_and_build_library
ip:     verify_ninja_availability()
ip:   File "/path/to//lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1429, in verify_ninja_availability
ip:     raise RuntimeError("Ninja is required to load C++ extensions")
ip: RuntimeError: Ninja is required to load C++ extensions

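Hello, the multi-node environment problem is now resolved. The "RuntimeError: Ninja is required to load C++ extensions" error was probably caused by the conda environment not being picked up, and adding a symlink fixed it. After that I ran into new problems:
ImportError: cannot import name 'helpers' from 'data' (/home/jovyan/xz_nlp/icode/CPM-2-Pretrain-master/src/data/__init__.py)
RuntimeError: Connection reset by peer
The helpers module used to build with make without any problem, but now it reports: helpers.cpp:26:10: fatal error: pybind11/pybind11.h: No such file or directory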

This is resolved now, thanks.
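
For anyone who hits the same two build errors, here is a minimal diagnostic sketch, not something from the CPM-2 repository. It assumes the cause suggested above: the Python process started on a remote node does not see the same conda environment as an interactive shell. The function name `check_build_prereqs` is hypothetical. It reports which interpreter is running, whether a `ninja` executable is on `PATH` (which `torch.utils.cpp_extension` needs for JIT builds), and whether the `pybind11` headers can be located.

```python
import importlib.util
import os
import shutil
import sys


def check_build_prereqs():
    """Report whether ninja and the pybind11 headers are visible to this interpreter."""
    report = {
        # Which interpreter is actually running on this node.
        "python": sys.executable,
        # Full path of the ninja executable found on PATH, or None if absent.
        "ninja_path": shutil.which("ninja"),
        "pybind11_include": None,
    }
    # Only import pybind11 if it is installed for this interpreter.
    if importlib.util.find_spec("pybind11") is not None:
        import pybind11

        include_dir = pybind11.get_include()
        header = os.path.join(include_dir, "pybind11", "pybind11.h")
        if os.path.isfile(header):
            report["pybind11_include"] = include_dir
    return report


if __name__ == "__main__":
    for key, value in check_build_prereqs().items():
        print(f"{key}: {value if value is not None else 'NOT FOUND'}")
```

Running the script on every node through the same non-interactive mechanism the launcher uses (rather than from a logged-in shell) should show whether a node resolves a different interpreter or is missing `ninja` or `pybind11`.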