issue: Merge with WeNet training code
CaRRotOne opened this issue
I have successfully trained ResNet50 Alg on 2*4 GPUs with bagua. Now I am trying to merge bagua into the training code of wenet, and I changed the training code following the examples and tutorials. After the change, the program hangs without printing any error or message. I stepped through the code with ipdb and found that it hangs in get_communicator() in bagua/torch_api/communication.py (trace below). It seems to call into a Rust module.
I am not familiar with Rust. Please give me some pointers for solving this problem.
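One way to see where each rank is stuck without single-stepping is a watchdog that dumps the Python stack of every thread when nothing has happened for a while. The sketch below uses only the standard-library `faulthandler` module; the helper names `enable_hang_dump` and `disable_hang_dump` and the 60 second default are my own choices for illustration, not part of bagua or wenet.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=60.0, stream=sys.stderr):
    """Dump the stack of all threads every `timeout_s` seconds.

    `stream` must be a real file (it needs a file descriptor) and must
    stay open for as long as the watchdog is armed.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disable_hang_dump():
    """Stop the periodic dumps, e.g. once initialization has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_dump()` at the top of the training script on every rank, and `disable_hang_dump()` after the first successful step, should show which Python frame each process is blocked in. It only shows Python frames, so a hang inside native code appears as the last Python line that called into it.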
> /opt/conda/lib/python3.7/site-packages/bagua/torch_api/communication.py(339)get_communicator()
338 device_id=get_local_rank(),
--> 339 stream_ptr=pg.stream.cuda_stream,
340 nccl_unique_id_str=nccl_unique_id,
2022-03-15 06:19:07,718 DEBUG Using selector: EpollSelector
ipdb> s
> /opt/conda/lib/python3.7/site-packages/bagua/torch_api/communication.py(340)get_communicator()
339 stream_ptr=pg.stream.cuda_stream,
--> 340 nccl_unique_id_str=nccl_unique_id,
341 )
2022-03-15 06:19:09,976 DEBUG Using selector: EpollSelector
ipdb> pp nccl_unique_id
'AgDQScCooVoAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA='
2022-03-15 06:19:21,795 DEBUG Using selector: EpollSelector
ipdb> s
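The `nccl_unique_id` printed above is a base64 string, so it can be decoded to look at the raw bytes. The sketch below assumes that the decoded ID starts with a Linux `sockaddr_in` (2-byte little-endian address family, 2-byte big-endian port, 4-byte IPv4 address) holding the address of the rank that created the ID. That layout is an NCCL implementation detail and may differ between versions, so treat the result as a hint only. The function name `peek_nccl_root_address` is made up for this example.

```python
import base64
import socket
import struct


def peek_nccl_root_address(unique_id_b64):
    """Interpret the first 8 decoded bytes as a sockaddr_in.

    ASSUMPTION: the ID starts with a sockaddr_in; this is not a
    documented NCCL or bagua guarantee.
    """
    raw = base64.b64decode(unique_id_b64)
    if len(raw) < 8:
        raise ValueError("need at least 8 decoded bytes")
    (family,) = struct.unpack("<H", raw[0:2])  # little-endian host order (x86)
    (port,) = struct.unpack("!H", raw[2:4])    # network byte order
    ip = socket.inet_ntoa(raw[4:8])            # dotted-quad IPv4 string
    return family, port, ip


# The first 12 characters of the ID printed above decode to 9 bytes,
# which is enough for the 8-byte header; the full string works as well.
family, port, ip = peek_nccl_root_address("AgDQScCooVoA")
print(family == socket.AF_INET, ip, port)
```

If the assumption holds, this ID points at 192.168.161.90, port 53321. It may then be worth checking that every other node can reach that address, since an unreachable interface would make the communicator setup wait without printing anything.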