Multi-GPU Support with External Pinning
frobnitzem opened this issue
David M. Rogers commented
In my HPC environment, srun pins MPI ranks to specific cores and GPUs (by setting ROCR_VISIBLE_DEVICES). However, this conflicts with rccl-tests, which tries to select GPUs manually based on the MPI rank.
I have fixed this in my own build (frobnitzem@5b347ee) by always running the step gpuid = gpuid % args->localNumDevices, regardless of whether args->enable_multiranks is true or not.
I suggest adopting this change and reverting the update d16d1fb, which throws an error in this case instead.
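
For illustration, the wrap-around step can be sketched in isolation as below. This is a minimal sketch, not the actual rccl-tests code: the helper name wrapGpuId is hypothetical, and only the expression gpuid % localNumDevices is taken from the change described above. It assumes gpuid is a non-negative index derived from the MPI rank and that localNumDevices is the number of devices visible to the process.

```cpp
#include <cassert>

// Hypothetical helper (not part of rccl-tests): map a rank-derived GPU index
// onto the devices that are actually visible to this process.
// Assumption: gpuid is non-negative and at least one device is visible.
int wrapGpuId(int gpuid, int localNumDevices) {
    assert(gpuid >= 0 && "gpuid is assumed to be a non-negative index");
    assert(localNumDevices > 0 && "at least one visible device is required");
    // Wrap the index so it always falls in [0, localNumDevices).
    return gpuid % localNumDevices;
}
```

If the external pinning leaves each rank with a single visible device, localNumDevices is 1 and the wrapped index is 0 for every rank, so each process uses the one GPU it was given instead of indexing past it.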