aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.

Topology Discovery Regression

willgleich opened this issue

We recently upgraded the aws-ofi-nccl plugin (https://github.com/aws/aws-ofi-nccl) in our containers (EKS) running on p4de from 1.5.0 to 1.7.3.

Unfortunately, after the upgrade, the plugin's ability to discover the underlying system topology seems to have degraded.

When we run the same workload with 1.5.0 of the plugin it is able to correctly infer the topology file as /opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4de-24xl-topo.xml

However, after the upgrade we have to set the env var NCCL_TOPO_FILE ourselves. Is there any way to have the plugin infer it automatically?
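For reference, here is a minimal sketch of the manual workaround in Python, assuming the topology file lives at the path quoted above. The helper name `ensure_topo_file` is hypothetical and is not part of the plugin or NCCL. It only sets NCCL_TOPO_FILE when the variable is unset and the file actually exists.

```python
import os

# Path reported in this thread for p4de; other instance types will differ.
P4DE_TOPO = "/opt/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4de-24xl-topo.xml"


def ensure_topo_file(env, topo_path=P4DE_TOPO):
    """Hypothetical helper: set NCCL_TOPO_FILE in `env` if it is unset.

    Returns True if the variable is set on return, False if the
    topology file was not found and nothing was changed.
    """
    if env.get("NCCL_TOPO_FILE"):
        return True  # respect an explicit setting
    if not os.path.isfile(topo_path):
        return False  # do not point NCCL at a missing file
    env["NCCL_TOPO_FILE"] = topo_path
    return True
```

Calling `ensure_topo_file(os.environ)` would need to happen before NCCL is initialized in the process, since the variable is read at startup.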

Hello. If I understand the problem correctly, when you use 1.5.0, the plugin automatically sets NCCL_TOPO_FILE to the correct topology file. With version 1.7.3, the plugin does not automatically set NCCL_TOPO_FILE, and you have to do it manually.

Version 1.7.3 should not have regressed in this feature. Can you please check whether you compiled the plugin with AWS optimizations? These optimizations will automatically be enabled if the configure script detects an AWS instance, or you can add --enable-platform-aws as a flag to configure (recommended when running on AWS). If these optimizations are enabled, you should see the following log line when running the plugin with the environment variable NCCL_DEBUG=Info set:

NCCL INFO NET/OFI Configuring AWS-specific options
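A quick way to check captured output for that line is sketched below in Python. The function name `aws_optimizations_enabled` is hypothetical, and the sketch assumes the job's NCCL debug output has already been collected into a string (for example, from a log file).

```python
# Substring of the log line quoted above; the host/rank prefix varies per job.
AWS_MARKER = "NET/OFI Configuring AWS-specific options"


def aws_optimizations_enabled(log_text):
    """Return True if any line of the captured NCCL log contains the marker."""
    return any(AWS_MARKER in line for line in log_text.splitlines())
```

If this returns False for a run with NCCL debug logging enabled, the plugin was most likely built without the AWS platform optimizations.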

I can confirm that adding the --enable-platform-aws flag fixes the issue, and that the NCCL INFO NET/OFI Configuring AWS-specific options line now appears as you suggested.

I still suspect there is a regression here: in 1.5.0 we didn't have to set this flag for the topology discovery to happen automatically.