libfabric + intel MPI over fi_mlx with multiple IB cards on 4OAM PVC
paboyle opened this issue
I'm running on a cluster (Dawn@Cambridge) with 4-OAM PVC nodes and four Mellanox cards, which appear as mlx5_0 ... mlx5_3 in ibstat.
Intel MPI runs with performance commensurate with using only one of the mlx5 HDR 200 cards:
(200 Gbit/s x send + receive) = 50 GB/s bidirectional.
I expect nearer 200GB/s bidirectional out the node when running 8 MPI tasks per node.
With I_MPI_DEBUG=5 set, it displays the provider info:
MPI startup(): libfabric version: 1.18.1-impi
MPI startup(): libfabric provider: mlx
This works and is "slow".
I understand that provider fi_mlx uses UCX underneath.
To get multi-rail working (one rail per MPI rank), I tried running through a wrapper
script that uses $SLURM_LOCALID to set $UCX_NET_DEVICES:
#!/bin/bash
# Pick one HCA per local MPI rank
mellanox_cards=(0 1 2 3)
mellanox=mlx5_${mellanox_cards[$SLURM_LOCALID]}
export UCX_NET_DEVICES=$mellanox
exec "$@"
But this results in:
select.c:627 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
Any ideas about how to use multiple adapters under Intel MPI with the fi_mlx provider?
Is this the right direction?
Is there something I should do differently?
First, the IB port number needs to be included in the net device specification, e.g., UCX_NET_DEVICES=mlx5_0:1.
Second, this may or may not make a difference, because by default UCX auto-selects the device.
Third, UCX supports multi-rail (two rails by default); you can try UCX_MAX_RNDV_RAILS=4 to see if that makes any difference.
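Putting these points together, a revised wrapper might look like the sketch below. This is only a sketch under assumptions from this thread: the device names mlx5_0 ... mlx5_3 and port 1 are specific to the reported cluster, and the rank-to-card mapping is the simple one from the original script.

```shell
#!/bin/bash
# Sketch: pin each local MPI rank to one HCA, including the IB port
# number (":1"), which the original script omitted.
mellanox_cards=(0 1 2 3)
export UCX_NET_DEVICES=mlx5_${mellanox_cards[$SLURM_LOCALID]}:1
exec "$@"
```

Alternatively, leave UCX_NET_DEVICES unset so UCX can auto-select devices, and set UCX_MAX_RNDV_RAILS=4 to let the rendezvous protocol stripe across all four rails instead of the default two.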