libfabric + intel MPI over fi_mlx with multiple IB cards on 4OAM PVC
paboyle opened this issue
I'm running on a cluster (Dawn@Cambridge) with 4-OAM PVC nodes and four Mellanox cards, which appear as mlx5_0 ... mlx5_3 in ibstat.
Intel MPI runs with performance commensurate with using only one of the mlx5 HDR 200 cards:
(200 Gbit/s x send + receive) = 50 GB/s bidirectional.
I expect nearer 200GB/s bidirectional out the node when running 8 MPI tasks per node.
With I_MPI_DEBUG=5 set, it displays the provider info:
MPI startup(): libfabric version: 1.18.1-impi
MPI startup(): libfabric provider: mlx
This works and is "slow".
I understand that provider fi_mlx uses UCX underneath.
To get multi-rail working (one rail per MPI rank), I tried running through a wrapper
script that uses $SLURM_LOCALID to set $UCX_NET_DEVICES:
#!/bin/bash
# Pick one HCA per local MPI rank
mellanox_cards=(0 1 2 3)
mellanox=mlx5_${mellanox_cards[$SLURM_LOCALID]}
export UCX_NET_DEVICES=$mellanox
exec "$@"
But this results in:
select.c:627 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
Any ideas about how to use multiple adapters under Intel MPI with the fi_mlx provider?
Is this the right direction?
Is there something I should do differently?
First, the IB port number needs to be included in the net device specification, e.g., UCX_NET_DEVICES=mlx5_0:1.
Second, this may or may not make a difference, because by default UCX auto-selects the device.
Third, UCX supports multi-rail (two rails by default); you can try UCX_MAX_RNDV_RAILS=4 to see if that makes any difference.
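Putting these points together, a revised wrapper might look like the sketch below. This is only a sketch under assumptions from this thread: the device names mlx5_0 ... mlx5_3 and port 1 are specific to the reported cluster, and the rank-to-card mapping is the simple one from the original script.

```shell
#!/bin/bash
# Sketch: pin each local MPI rank to one HCA, including the IB port
# number (":1"), which the original script omitted.
mellanox_cards=(0 1 2 3)
export UCX_NET_DEVICES=mlx5_${mellanox_cards[$SLURM_LOCALID]}:1
exec "$@"
```

Alternatively, leave UCX_NET_DEVICES unset so UCX can auto-select devices, and set UCX_MAX_RNDV_RAILS=4 to let the rendezvous protocol stripe across all four rails instead of the default two.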