Plugin fails if compiled against Libfabric 1.18 but run against Libfabric 1.17 or older.
nvcastet opened this issue · comments
Hello,
We are using a fresh deployment of the Ubuntu 22.04 AMI on p4d.24xlarge instances, with the CUDA stack installed on bare metal followed by EFA, using these commands:
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
sudo ./aws-efa-installer/efa_installer.sh -y
We then run a container image that was built with these commands in its Dockerfile:
...
RUN cd /root \
&& curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
&& tar -xf /root/aws-efa-installer-latest.tar.gz \
&& cd aws-efa-installer \
&& apt-get update \
&& apt-get install -y libhwloc-dev \
&& ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify \
&& apt-get install -y libfabric-bin \
&& rm -rf /var/lib/apt/lists/*
RUN git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl \
&& cd /opt/aws-ofi-nccl \
&& git checkout v1.7.1-aws \
&& ./autogen.sh \
&& ./configure --prefix=/opt/aws-ofi-nccl/ \
--with-libfabric=/opt/amazon/efa/ \
--with-cuda=/usr/local/cuda \
&& make && make install
...
We get this error running the NCCL tests between 2 containers (one on each instance):
configure_sendrecv_inorder:213 NCCL WARN NET/OFI Couldn't set FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES. RC: -92, ERROR: Protocol not available
nccl_net_ofi_init:1163 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
Running fi_info -p efa -t FI_EP_RDM, we get:
provider: efa
fabric: EFA-fe80::8d3:d9ff:fe9a:6059
domain: rdmap197s0-rdm
version: 111.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::8d3:d9ff:fe9a:6059
domain: rdmap197s0-dgrm
version: 111.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
^^ We see 2 provider sections per EFA adapter (this example is for 1 adapter per instance).
WORKAROUND:
To solve this issue, we need to reinstall EFA in the containers with:
./efa_installer.sh -y -k --uninstall
./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify
Then the NCCL tests will work fine with EFA and for some reason fi_info -p efa -t FI_EP_RDM will return a single provider section per EFA adapter:
provider: efa
fabric: EFA-fe80::8d3:d9ff:fe9a:6059
domain: rdmap197s0-rdm
version: 111.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
Do you know why 2 libfabric provider sections show up in the broken scenario and why we need to re-install EFA on the containers after running them?
What distro are you using for the container image? I think there's a series of issues all stacked on top of each other, and being able to duplicate in house would be helpful.
As an aside, please don't clone the aws-ofi-nccl master branch for production workloads. We provide release tarballs or, if you really want to git clone, tag releases. We do incremental development on master and do not guarantee that it is always in a stable state.
Thank you @bwbarrett for the quick response.
We are using our public NGC container image nvcr.io/nvidia/dgl:23.07-py3, which uses Ubuntu 22.04. So both container and AMI are Ubuntu 22.04.
As an aside, please don't clone the aws-ofi-nccl master branch for production workloads
We are using the v1.7.1-aws branch, see the git checkout after the git clone.
I said that poorly. Please don't use the HEAD of a branch; we do have tarball releases and tag releases. We do not guarantee that any branch is in a stable state (other than at tagged points). We obviously try to maintain stability, but make no guarantees or support statements for random HEAD commits.
In fact v1.7.1-aws is a tag and not a branch, so it is not moving. But I get what you are saying.
Apologies; I have not had enough coffee this morning!
And back to your real problem; I think I know what is happening; need to run some tests before I reply to that part, so I don't look like an idiot (again).
Ok, there are a couple things happening here.
First, it looks like your container image has the Ubuntu 22.04-provided Libfabric pre-installed. Ubuntu 22.04 ships with Libfabric 1.11.0, which includes EFA support, as well as a number of utility providers. When you run fi_info -p efa -t FI_EP_RDM using Ubuntu's build of Libfabric 1.11.0, it finds two ways to satisfy those constraints. First, the EFA provider itself provides an RDM endpoint type. Second, the RXD utility provider layered over EFA's DGRAM endpoint ends up with an RDM endpoint type. Utility providers always have lower priority than native providers, so the NCCL plugin would only use the native EFA provider, and the multiple responses to the fi_info search are a harmless oddity of Libfabric. However, the EFA installer package does not build the RXD utility provider, mostly to avoid this confusion.
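If it helps to check this from a script, here is a rough sketch that summarizes fi_info output. It relies only on the "provider:" lines shown earlier in this thread and on Libfabric's convention of naming layered providers as "core;utility" (for example efa;ofi_rxd). The function names count_providers and layered_providers are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Libfabric.

# Count the "provider:" entries in fi_info output read from stdin.
count_providers() {
    grep -c '^[[:space:]]*provider:'
}

# Print only the layered providers, i.e. names of the form "core;utility".
layered_providers() {
    sed -n 's/^[[:space:]]*provider:[[:space:]]*\([^[:space:]]*;[^[:space:]]*\)[[:space:]]*$/\1/p'
}

# Typical use inside the container:
#   fi_info -p efa -t FI_EP_RDM | count_providers
#   fi_info -p efa -t FI_EP_RDM | layered_providers
```

Fed the two-section output from the broken scenario, count_providers reports 2 and layered_providers prints efa;ofi_rxd. A non-empty result from layered_providers suggests that a Libfabric build with utility providers (such as the distro package) is the one being picked up.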
On to the two bugs you're tripping over. First, when you run the EFA installer the first time, it appears that the system Libfabric is already installed. You almost certainly end up with both packages installed. If you ran a package search, you'd likely see something like:
dpkg -l '*libfabric*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===================-============-============-==========================================================
ii libfabric-aws-bin 1.18.1 amd64 Diagnosis programs for the libfabric communication library
ii libfabric-aws-dev 1.18.1 amd64 Development files for libfabric1
ii libfabric-bin 1.11.0-3 amd64 Diagnosis programs for the libfabric communication library
ii libfabric-dev:amd64 1.11.0-3 amd64 Development files for libfabric1
ii libfabric1:amd64 1.11.0-3 amd64 libfabric communication library
ii libfabric1-aws 1.18.1 amd64 libfabric communication library
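A small sketch for spotting this mixed state automatically, assuming the package naming seen in the listing above (EFA installer packages carry "-aws" in the name, distro packages do not). The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes the "-aws" naming shown in the dpkg listing above.

# Read `dpkg -l` output on stdin; print installed libfabric packages
# as "name version".
installed_libfabric() {
    awk '$1 == "ii" && $2 ~ /^libfabric/ { print $2, $3 }'
}

# Succeed (exit 0) only if both an AWS build and a distro build of
# libfabric are installed at the same time.
has_mixed_libfabric() {
    awk '$1 == "ii" && $2 ~ /^libfabric/ {
             if ($2 ~ /-aws/) aws = 1; else distro = 1
         }
         END { exit !(aws && distro) }'
}

# Typical use:
#   dpkg -l '*libfabric*' | has_mixed_libfabric && echo "both libfabric builds installed"
```

Running this at the end of an image build would flag the case where a later apt-get install pulls the distro libfabric back in.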
There's a bug in the EFA installer (I've filed a ticket so that we try to fix it in future releases, but I think there are some rdma-core dependency issues with fixing it): when you run the EFA installer to uninstall the packages, it removes both the system and AWS-specific libfabric installs. Then when you run the EFA installer to install the second time, the system Libfabric isn't there, so everything uses the AWS-specific libfabric. You could also get to the same place by running apt remove libfabric before you run the EFA installer the first time and skip all the uninstall business.
When you built the plugin, it found the EFA installer version of Libfabric (i.e., Libfabric 1.18.1), which includes support for options to check certain behaviors of EFA. For reasons I can't entirely duplicate (but I'm not using a container, just a normal Ubuntu 22.04 instance), the plugin is picking up the system Libfabric (the 1.11 version) when you run your NCCL application, and that version doesn't support those features. We should handle that case just fine, but there is a bug in the plugin and we error instead of handling that case correctly. So that's the second bug you found with this scenario.
I'm going to try to get the plugin bug fixed as part of the 1.7.2 release. In the short term, I think a simpler workaround than the one you have is to just remove the system libfabric packages before installing the EFA installer, and you should avoid the plugin bug entirely (because the feature we're using will be implemented). Alternatively, you could just use the system-installed Libfabric and skip the EFA installer, but that would result in lower performance and that's probably not what you want.
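To confirm which Libfabric the plugin resolves at run time, something like the following sketch could be used. It parses ldd output and checks whether libfabric resolves under /opt/amazon/efa/, the prefix passed to --with-libfabric in the Dockerfile above. The function names are hypothetical, and the plugin library path in the usage comment is an assumption based on the --prefix used in that build.

```shell
#!/bin/sh
# Sketch only: helper names and the plugin path below are assumptions.

# Read `ldd` output on stdin and print the path that libfabric resolves to.
resolved_libfabric() {
    awk '$1 ~ /^libfabric\.so/ && $2 == "=>" { print $3; exit }'
}

# Succeed (exit 0) only if libfabric resolves under the EFA installer prefix.
uses_efa_libfabric() {
    case "$(resolved_libfabric)" in
        /opt/amazon/efa/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use (library path assumed from --prefix=/opt/aws-ofi-nccl/):
#   ldd /opt/aws-ofi-nccl/lib/libnccl-net.so | uses_efa_libfabric \
#       || echo "plugin is not resolving the EFA installer's libfabric"
```

ldd reflects the environment it runs in, so this is most useful when run inside the container with the same LD_LIBRARY_PATH as the NCCL job.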
Perfect, thanks a lot for the debugging, Brian! That makes sense. Yes, we will remove the system libfabric before installing it in the Dockerfile.
In fact, system libfabric was not in the original Ubuntu 22.04 base image.
In the Dockerfile we run:
RUN cd /root \
&& curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
&& tar -xf /root/aws-efa-installer-latest.tar.gz \
&& cd aws-efa-installer \
&& apt-get update \
&& apt-get install -y libhwloc-dev \
&& ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify \
&& apt-get install -y libfabric-bin \
&& rm -rf /var/lib/apt/lists/*
Notice the line apt-get install -y libfabric-bin after the EFA installer.
I added that to be able to get the fi_info command, but it installed the system libfabric1 as a dependency.
I did not realize that fi_info was already there at /opt/amazon/efa/bin/fi_info, just not in PATH.
ah, yes, that would do it.
The installer creates a file /etc/profile.d/efa.sh that will add /opt/amazon/efa/bin to your path on the next shell login. For your installer script, a . /etc/profile.d/efa.sh after you run the installer should get you the Libfabric (and MPI) utilities in your path without starting a new shell (assuming you're using an sh-derivative, of course).
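A minimal sketch of that step for an install script, with a guard in case the profile file is missing. The function name load_efa_env is hypothetical; the default path is the one mentioned above.

```shell
#!/bin/sh
# Sketch only: load_efa_env is a hypothetical helper.

# Source the EFA profile script into the current shell if it is readable;
# otherwise report the problem and return non-zero.
load_efa_env() {
    profile="${1:-/etc/profile.d/efa.sh}"
    if [ -r "$profile" ]; then
        . "$profile"
    else
        echo "EFA profile script not found: $profile" >&2
        return 1
    fi
}

# Typical use after running the installer:
#   load_efa_env && command -v fi_info
```

In a Dockerfile, each RUN starts a fresh shell, so the sourcing only affects commands in the same RUN step.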
I'm going to leave this issue open until we fix the Libfabric version bug, just so we don't forget.
I didn't get this fixed in time for the 1.7.2 release, but am definitely trying to get it in 1.7.3.
Patches are now in the v1.7.x-aws branch; next step is release.