aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.

Plugin fails if compiled against Libfabric 1.18 but run against Libfabric 1.17 or older.

nvcastet opened this issue

Hello,

We are using a fresh deployment of the Ubuntu 22.04 AMI on p4d.24xlarge instances, installing the CUDA stack on bare metal, followed by EFA using these commands:

curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
sudo ./aws-efa-installer/efa_installer.sh -y

We then run a container image that was built with these commands in its Dockerfile:

...
RUN cd /root \
    && curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
    && tar -xf /root/aws-efa-installer-latest.tar.gz \
    && cd aws-efa-installer \
    && apt-get update \
    && apt-get install -y libhwloc-dev \
    && ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify \
    && apt-get install -y libfabric-bin \
    && rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl \
    && cd /opt/aws-ofi-nccl \
    && git checkout v1.7.1-aws \
    && ./autogen.sh \
    && ./configure --prefix=/opt/aws-ofi-nccl/ \
       --with-libfabric=/opt/amazon/efa/ \
       --with-cuda=/usr/local/cuda \
    && make && make install
...

We get this error running the NCCL tests between 2 containers (one on each instance):

configure_sendrecv_inorder:213 NCCL WARN NET/OFI Couldn't set FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES. RC: -92, ERROR: Protocol not available
nccl_net_ofi_init:1163 NCCL WARN NET/OFI aws-ofi-nccl initialization failed

Running fi_info -p efa -t FI_EP_RDM we get:

provider: efa
    fabric: EFA-fe80::8d3:d9ff:fe9a:6059
    domain: rdmap197s0-rdm
    version: 111.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::8d3:d9ff:fe9a:6059
    domain: rdmap197s0-dgrm
    version: 111.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD

^^ We see 2 provider sections per EFA adapter (this example is for 1 adapter per instance).

WORKAROUND:
To work around this issue, we need to reinstall EFA in the containers with:

./efa_installer.sh -y -k --uninstall
./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify

Then the NCCL tests will work fine with EFA and for some reason fi_info -p efa -t FI_EP_RDM will return a single provider section per EFA adapter:

provider: efa
    fabric: EFA-fe80::8d3:d9ff:fe9a:6059
    domain: rdmap197s0-rdm
    version: 111.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA

Do you know why 2 libfabric provider sections show up in the broken scenario and why we need to re-install EFA on the containers after running them?

What distro are you using for the container image? I think there's a series of issues all stacked on top of each other, and being able to duplicate in house would be helpful.

As an aside, please don't clone the aws-ofi-nccl master branch for production workloads. We provide release tarballs or, if you really want to git clone, tag releases. We do incremental development on master and do not guarantee that it is always in a stable state.

Thank you @bwbarrett for the quick response.
We are using our public NGC container image nvcr.io/nvidia/dgl:23.07-py3, which uses Ubuntu 22.04. So both the container and the AMI are Ubuntu 22.04.

As an aside, please don't clone the aws-ofi-nccl master branch for production workloads

We are using the v1.7.1-aws branch, see the git checkout after the git clone.

I said that poorly. Please don't use the HEAD of a branch; we do have tarball releases and tag releases. We do not guarantee that any branch is in a stable state (other than at tagged points). We obviously try to maintain stability, but make no guarantees or support statements for random HEAD commits.

In fact, v1.7.1-aws is a tag and not a branch, so it is not moving. But I get what you are saying.

Apologies; I have not had enough coffee this morning!

And back to your real problem; I think I know what is happening; need to run some tests before I reply to that part, so I don't look like an idiot (again).

Ok, there are a couple things happening here.

First, it looks like your container image has the Ubuntu 22.04-provided Libfabric pre-installed. Ubuntu 22.04 ships with Libfabric 1.11.0, which includes EFA support as well as a number of utility providers. When you run fi_info -p efa -t FI_EP_RDM against Ubuntu's build of Libfabric 1.11.0, it finds two ways to satisfy those constraints. First, the EFA provider itself provides an RDM endpoint type. Second, the RXD utility provider layered over EFA's DGRAM endpoint also ends up with an RDM endpoint type. Utility providers always have lower priority than native providers, so the NCCL plugin would only use the native EFA provider, and the multiple responses to the fi_info search are a harmless oddity of Libfabric. The EFA installer package, however, does not build the RXD utility provider, mostly to avoid this confusion.
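To see at a glance whether a utility provider such as RXD is being reported, the provider names can be pulled out of the fi_info output. This is only a minimal sketch; list_providers is a hypothetical helper name, and it assumes the "provider:" line format shown in the output earlier in this thread.

```shell
#!/bin/sh
# Print the provider name from each "provider:" line of fi_info output (stdin).
# list_providers is a hypothetical helper, not part of Libfabric.
list_providers() {
    awk '$1 == "provider:" { print $2 }'
}

# Usage on an instance with EFA (not run here):
# fi_info -p efa -t FI_EP_RDM | list_providers
# A name containing ";ofi_rxd" indicates the RXD utility provider layered over EFA.
```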

On to the two bugs you're tripping over. First, when you run the EFA installer the first time, it appears that the system Libfabric is already installed. You almost certainly end up with both packages installed. If you ran a package search, you'd likely see something like:

dpkg -l '*libfabric*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                Version      Architecture Description
+++-===================-============-============-==========================================================
ii  libfabric-aws-bin   1.18.1       amd64        Diagnosis programs for the libfabric communication library
ii  libfabric-aws-dev   1.18.1       amd64        Development files for libfabric1
ii  libfabric-bin       1.11.0-3     amd64        Diagnosis programs for the libfabric communication library
ii  libfabric-dev:amd64 1.11.0-3     amd64        Development files for libfabric1
ii  libfabric1:amd64    1.11.0-3     amd64        libfabric communication library
ii  libfabric1-aws      1.18.1       amd64        libfabric communication library

There's a bug in the EFA installer (I've filed a ticket so that we try to fix it in future releases, but I think there are some rdma-core dependency issues with fixing it): when you run the EFA installer to uninstall the packages, it removes both the system and the AWS-specific Libfabric installs. Then, when you run the EFA installer to install the second time, the system Libfabric isn't there, so everything uses the AWS-specific Libfabric. You could also get to the same place by running apt remove libfabric before you run the EFA installer the first time and skip all the uninstall business.

When you built the plugin, it found the EFA installer version of Libfabric (i.e., Libfabric 1.18.1), which includes support for options to check certain behaviors of EFA. For reasons I can't entirely duplicate (but I'm not using a container, just a normal Ubuntu 22.04 instance), the plugin is picking up the system Libfabric (the 1.11 version) when you run your NCCL application, and that version doesn't support those features. We should handle that case just fine, but there is a bug in the plugin and we error out instead of handling it correctly. So that's the second bug you found with this scenario.
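One way to check which Libfabric the plugin resolves at run time is to inspect the loader's view with ldd, in the same environment as the NCCL job. The sketch below is an illustration, not a confirmed diagnostic procedure: resolved_libfabric is a hypothetical helper, and the plugin path /opt/aws-ofi-nccl/lib/libnccl-net.so is an assumption derived from the --prefix used in the Dockerfile above.

```shell
#!/bin/sh
# Print the path the dynamic loader resolves for libfabric.
# Reads `ldd` output on stdin; resolved_libfabric is a hypothetical helper.
resolved_libfabric() {
    awk '/libfabric\.so/ && $2 == "=>" { print $3; exit }'
}

# Usage (assumed plugin location, not run here):
# ldd /opt/aws-ofi-nccl/lib/libnccl-net.so | resolved_libfabric
# A path under /opt/amazon/efa/ suggests the EFA installer build is in use;
# any other path suggests the system libfabric1 package is being picked up.
```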

I'm going to try to get the plugin bug fixed as part of the 1.7.2 release. In the short term, I think a simpler workaround than the one you have is to just remove the system libfabric packages before running the EFA installer, and you should avoid the plugin bug entirely (because the feature we're using will be implemented). Alternatively, you could just use the system-installed Libfabric and skip the EFA installer, but that would result in lower performance and that's probably not what you want.
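A minimal sketch of that workaround, assuming the EFA installer's packages are the ones with "-aws" in their names (as in the dpkg listing earlier in this thread). system_libfabric_pkgs is a hypothetical helper that only selects package names; the actual removal is left commented out because it modifies the system.

```shell
#!/bin/sh
# List installed libfabric packages that do NOT come from the EFA installer.
# Assumption: EFA installer packages contain "-aws" in their name.
# Reads `dpkg -l` style output on stdin; prints one package name per line.
system_libfabric_pkgs() {
    awk '$1 == "ii" && $2 ~ /^libfabric/ && $2 !~ /-aws/ {
        sub(/:.*/, "", $2)   # drop an architecture suffix such as ":amd64"
        print $2
    }'
}

# Usage before running the EFA installer (not run here; -r skips apt-get
# entirely when nothing matched):
# dpkg -l '*libfabric*' | system_libfabric_pkgs | xargs -r apt-get remove -y
```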

Perfect, thanks a lot for the debugging, Brian! That makes sense. Yes, we will remove the system libfabric before installing it in the Dockerfile.

In fact, the system libfabric was not in the original Ubuntu 22.04 base image. In the Dockerfile we run:

RUN cd /root \
    && curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
    && tar -xf /root/aws-efa-installer-latest.tar.gz \
    && cd aws-efa-installer \
    && apt-get update \
    && apt-get install -y libhwloc-dev \
    && ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify \
    && apt-get install -y libfabric-bin \
    && rm -rf /var/lib/apt/lists/*

Notice the line apt-get install -y libfabric-bin after the EFA installer.
I added it to get the fi_info command, but it installed the system libfabric1 as a dependency.
I did not realize that fi_info was already there at /opt/amazon/efa/bin/fi_info, just not in PATH.

Ah, yes, that would do it.

The installer creates a file /etc/profile.d/efa.sh that will add /opt/amazon/efa/bin to your path on the next shell login. For your installer script, a . /etc/profile.d/efa.sh after you run the installer should get you the Libfabric (and MPI) utilities in your path without starting a new shell (assuming you're using an sh-derivative, of course).
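The sourcing step could be wrapped with a small existence check so that a missing file is reported rather than silently ignored. This is a sketch only; load_efa_env is a hypothetical helper name, and the default path is the one mentioned above.

```shell
#!/bin/sh
# Source the EFA installer's profile snippet into the current shell if present.
# load_efa_env is a hypothetical helper; an alternative path may be passed as $1.
load_efa_env() {
    profile="${1:-/etc/profile.d/efa.sh}"
    if [ -r "$profile" ]; then
        . "$profile"
    else
        echo "EFA profile not found: $profile" >&2
        return 1
    fi
}

# Usage after running the installer (not run here):
# load_efa_env && command -v fi_info
```

Note that in a Dockerfile each RUN instruction starts a fresh shell, so the effect of sourcing the file lasts only until the end of that instruction.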

I'm going to leave this issue open until we fix the Libfabric version bug, just so we don't forget.

I didn't get this fixed in time for the 1.7.2 release, but am definitely trying to get it in 1.7.3.

Patches are now in the v1.7.x-aws branch; next step is release.