BaguaSys / bagua

Bagua Speeds up PyTorch

Home Page: https://tutorials.baguasys.com/

issue:merge with wenet training code

CaRRotOne opened this issue · comments

I have successfully trained ResNet50 with bagua on 2*4 GPUs. Now I am trying to integrate bagua into the wenet training code, and I changed that code following the examples and tutorials. After the change, the program hangs without printing any message. Stepping through it in the debugger, I found that it hangs inside get_communicator() in bagua/torch_api/communication.py (see the trace below), in a call that appears to go into a Rust module.
I am not familiar with Rust. Could you give me some pointers on how to solve this?

> /opt/conda/lib/python3.7/site-packages/bagua/torch_api/communication.py(339)get_communicator()
    338         device_id=get_local_rank(),
--> 339         stream_ptr=pg.stream.cuda_stream,
    340         nccl_unique_id_str=nccl_unique_id,

2022-03-15 06:19:07,718 DEBUG Using selector: EpollSelector
ipdb> s
> /opt/conda/lib/python3.7/site-packages/bagua/torch_api/communication.py(340)get_communicator()
    339         stream_ptr=pg.stream.cuda_stream,
--> 340         nccl_unique_id_str=nccl_unique_id,
    341     )

2022-03-15 06:19:09,976 DEBUG Using selector: EpollSelector
ipdb> pp nccl_unique_id
'AgDQScCooVoAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA='
2022-03-15 06:19:21,795 DEBUG Using selector: EpollSelector
ipdb> s
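
One way to get more out of the trace above is to look inside the printed nccl_unique_id. The sketch below decodes the first 12 base64 characters of that value and reads them as an IPv4 socket address (address family, port, IP). This layout is an assumption based on how some NCCL versions store their bootstrap address at the start of the unique ID; it is not a documented API and other versions may differ. The helper name peek_bootstrap_address is hypothetical and is not part of bagua or NCCL.

```python
import base64
import socket
import struct


def peek_bootstrap_address(unique_id_b64_prefix: str):
    """Interpret the leading bytes of a base64 NCCL unique ID as a sockaddr_in.

    ASSUMPTION: the ID starts with a Linux sockaddr_in (2-byte family in host
    little-endian order, 2-byte port and 4-byte IPv4 address in network order).
    """
    raw = base64.b64decode(unique_id_b64_prefix)
    (family,) = struct.unpack_from("<H", raw, 0)  # address family, little-endian
    (port,) = struct.unpack_from("!H", raw, 2)    # port, network byte order
    ip = socket.inet_ntoa(raw[4:8])               # IPv4 address, 4 raw bytes
    return family, ip, port


# First 12 characters of the nccl_unique_id printed in the ipdb session above.
prefix = "AgDQScCooVoA"
family, ip, port = peek_bootstrap_address(prefix)
print("looks like IPv4:", family == socket.AF_INET)
print("address:", ip, "port:", port)
```

If the decoded family is AF_INET, the address and port may be where the other ranks try to connect during communicator setup. Checking that this address is reachable from every node (correct network interface, no firewall in between) is one thing to rule out when the program hangs at this step.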