ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

Srun with Intel MPI hangs on multi-node runs.

vyscenkoh opened this issue

Describe the bug
This is the last output line before the job hangs indefinitely:
[0] MPI_Startup(): libfabric provider: verbs;ofi_rxm
No error is reported, and no traffic flows between the nodes.

To Reproduce
I have a fresh environment with Intel MPI 2021.11, libfabric 1.18.1-ipmi, Slurm 21.08.8-2, and a RoCEv2 network.
Intel MPI and Open MPI using mpirun, single or multi node: ok.
Open MPI using srun, single or multi node: ok.
Intel MPI using srun, single node: ok.
Intel MPI using srun, multi node: not ok.

Environment:
Rocky Linux 8.6

Sorry for the late response.

If you set FI_LOG_LEVEL=warn I would expect to see some warning messages about connection failure. There may be something wrong in the network setup that prevented rdma-cm from working properly.
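
For reference, a minimal sketch of how the suggestion above could be tried. `FI_LOG_LEVEL` is the libfabric logging variable mentioned above. The application path `./mpi_app`, the log file name `fi_warn.log`, and the node and task counts are placeholders that are not from this report, so adjust them to the actual job.

```shell
# Raise libfabric logging so that connection warnings are printed.
export FI_LOG_LEVEL=warn

# Hypothetical application path; replace with the real MPI binary.
APP=${APP:-./mpi_app}

# Launch on two nodes only if srun and the binary are actually available.
# The node/task counts and the log file name are placeholders.
if command -v srun >/dev/null 2>&1 && [ -x "$APP" ]; then
    srun --export=ALL -N 2 --ntasks-per-node=1 "$APP" 2>&1 | tee fi_warn.log
else
    echo "srun or $APP not found; skipping launch"
fi
```

If the hang is caused by a connection problem, the warnings should show up shortly after the `MPI_Startup(): libfabric provider` line.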

@vyscenkoh any update on this? Do you still see the issue or can this be closed?