srun with Intel MPI hangs on multi-node jobs
vyscenkoh opened this issue · comments
Describe the bug
This is the last output line before it hangs forever:
[0] MPI_Startup(): libfabric provider: verbs;ofi_rxm
No error is reported, and no traffic flows between the nodes.
To Reproduce
I have a fresh environment with Intel MPI 2021.11, Libfabric 1.18.1-ipmi, Slurm 21.08.8-2, and a RoCEv2 network.
Intel MPI and Open MPI with mpirun, single and multi node: OK.
Open MPI with srun, single and multi node: OK.
Intel MPI with srun, single node: OK.
Intel MPI with srun, multi node: hangs.
Environment:
Rocky Linux 8.6
Sorry for the late response.
If you set FI_LOG_LEVEL=warn, I would expect to see some warning messages about connection failures. There may be something wrong in the network setup that prevents rdma-cm from working properly.
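A minimal sketch of how that could look for the failing multi-node case. Only FI_LOG_LEVEL=warn comes from the suggestion above; the node and task counts and the binary name ./my_mpi_app are placeholders, so substitute the launch line that actually hangs.

```shell
# Ask libfabric to log warnings, as suggested above.
export FI_LOG_LEVEL=warn

# Placeholder launch line (assumption): 2 nodes, 1 task per node,
# and a hypothetical binary ./my_mpi_app. Replace with the real job.
if command -v srun >/dev/null 2>&1; then
  # srun passes the caller's environment to the tasks by default,
  # so FI_LOG_LEVEL reaches the ranks on every node.
  srun -N 2 --ntasks-per-node=1 ./my_mpi_app
else
  echo "srun not found; run this from a Slurm login node or inside an allocation"
fi
```

Any libfabric warnings should then show up in the job's stderr next to the MPI_Startup() line.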
@vyscenkoh any update on this? Do you still see the issue or can this be closed?