Topology Discovery Regression
willgleich opened this issue · comments
We recently upgraded the following package in our containers (eks) running on p4de from 1.5.0 to 1.7.3. https://github.com/aws/aws-ofi-nccl
Unfortunately, after the upgrade, it seems the ability of the package to discover the underlying system/topology has degraded.
When we run the same workload with 1.5.0 of the plugin it is able to correctly infer the topology file as /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4de-24xl-topo.xml
However, after the upgrade it is required we set the env var NCCL_TOPO_FILE, is there any way to have this automatically infer?
Hello. If I understand the problem correctly, when you use 1.5.0, the plugin automatically sets NCCL_TOPO_FILE
to the correct topology file. With version 1.7.3, the plugin does not automatically set NCCL_TOPO_FILE
, and you have to do it manually.
Version 1.7.3 should not have regressed in this feature. Can you please check whether you compiled the plugin with AWS optimizations? These optimizations will automatically be enabled if the configure script detects an AWS instance, or you can add --enable-platform-aws
as a flag to configure (recommended when running on AWS). If these optimizations are enabled, you should see the following log line when running the plugin with the environment variable NCCL_DEBUG=Info
set:
NCCL INFO NET/OFI Configuring AWS-specific options
I can confirm that adding the--enable-platform-aws
fixes the issue with the NCCL INFO NET/OFI Configuring AWS-specific options
as you suggested.
I still suspect there is a regression here - in 1.5.0 we didn't have to set this flag for the topology discovery to automatically happen.