aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.

Mellanox and EFA in Docker Image

mvpatel2000 opened this issue · comments

I'm attempting to assemble a single Docker image that supports both EFA and Mellanox, since we split workloads between different clouds and it's easy to use the wrong image on the wrong cloud. I currently have something like this:

#####################################
# Install EFA and AWS-OFI-NCCL plugin
#####################################

ARG EFA_INSTALLER_VERSION=latest
ARG AWS_OFI_NCCL_VERSION=v1.5.0-aws

ENV LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:$LD_LIBRARY_PATH
ENV PATH=/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:$PATH

RUN if [ -n "$CUDA_VERSION" ] ; then \
        cd /tmp && \
        curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
        tar -xf /tmp/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
        cd aws-efa-installer && \
        apt-get update && \
        ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify && \
        rm -rf /tmp/aws-efa-installer* ; \
    fi

RUN if [ -n "$CUDA_VERSION" ] ; then \
        git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl && \
        cd /opt/aws-ofi-nccl && \
        git checkout ${AWS_OFI_NCCL_VERSION} && \
        ./autogen.sh && \
        ./configure --prefix=/opt/aws-ofi-nccl/install \
            --with-libfabric=/opt/amazon/efa/ \
            --with-cuda=/usr/local/cuda \
            --disable-tests && \
        make && make install ; \
    fi

###################################
# Mellanox OFED driver installation
###################################

ARG MOFED_VERSION

RUN if [ -n "$MOFED_VERSION" ] ; then \
        mkdir -p /tmp/mofed && \
        wget -nv -P /tmp/mofed http://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VERSION}/MLNX_OFED_LINUX-${MOFED_VERSION}-ubuntu20.04-x86_64.tgz && \
        tar -zxvf /tmp/mofed/MLNX_OFED_LINUX-${MOFED_VERSION}-ubuntu20.04-x86_64.tgz -C /tmp/mofed && \
        /tmp/mofed/MLNX_OFED_LINUX-${MOFED_VERSION}-ubuntu20.04-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force && \
        rm -rf /tmp/mofed ; \
    fi

and I comment out either the Mellanox part or the EFA part depending on which image I want to build. When I attempt to build both at the same time, the Mellanox installation appears to wipe out something from EFA, resulting in EFA not working. If I install EFA after Mellanox, I encounter the following error:

The following packages have unmet dependencies:
   libibmad5-dbg : Depends: libibmad5 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
   libibnetdisc-dev : Depends: libibnetdisc5 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
   libibnetdisc5-dbg : Depends: libibnetdisc5 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
   libibumad3-dbg : Depends: libibumad3 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
   librdmacm1-dbg : Depends: librdmacm1 (= 43.0-1) but 55mlnx37-1.55103 is to be installed

I would love to get some insight into

  1. is this possible?
  2. is there any documentation / guidance on how to do it?

@mvpatel2000 We have followed up internally to get back an answer for you. Thanks for your patience.

Thanks! I appreciate the help :)

Unfortunately, it will not be easy to get what you're looking for. It's possible, but would require you to build some packages yourself, and likely skip at least one of the two installer scripts. The core issue is that operating systems are relatively slow to update the rdma-core package in their distributions, and there are new features both AWS and Mellanox are releasing all the time, which require updating rdma-core. The solution is that both of us ship an rdma-core in our installers, which creates a conflict.

The EFA installer currently ships rdma-core v43. Other than applying Ubuntu's packaging scripts, it is unmodified from the official upstream, and it includes all the providers (i.e., drivers) for all NICs that are upstreamed. The EFA package names correspond to the package names Ubuntu uses when it packages rdma-core.

The Mellanox installer currently ships rdma-core v40, although it appears to include Mellanox-specific patches. Unfortunately, it does not include all the upstream providers; it only installs the mlx5 provider. Additionally (but generally not importantly), the Mellanox packaging seems to conform to the upstream dpkg files rather than to how Ubuntu packages things.

So the difference in package naming is why you get the weird conflict above (and part of why you likely can't use at least one of the installers). But you have another problem with rdma-core: the EFA installer package includes the mlx5 provider (the one you need for modern IB) but does not include whatever patches Mellanox adds that aren't upstreamed. The Mellanox installer package includes the Mellanox patches, but not the efa provider. So neither build gets you entirely what you want on both platforms (although it is likely that if you are only using NCCL, you are using a subset of the Mellanox stack small enough that you aren't using anything that is patched, but I'm not going to be able to answer that question).
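One way to check which of the two situations a given image is in is to look at which libibverbs providers actually got installed. The following is a minimal sketch, not something from the thread: the function names are hypothetical, and it assumes providers are installed as `lib<name>-rdmav<N>.so` under `/usr/lib/x86_64-linux-gnu/libibverbs`, which is the usual Ubuntu layout but may differ for a vendor-packaged rdma-core.

```shell
#!/bin/sh
# Sketch: list the libibverbs provider names present in a provider directory.
# ASSUMPTION: providers are shipped as lib<name>-rdmav<N>.so in the directory
# below (typical Ubuntu layout); pass a different directory if yours differs.

list_verbs_providers() {
    dir="${1:-/usr/lib/x86_64-linux-gnu/libibverbs}"
    for f in "$dir"/lib*-rdmav*.so; do
        [ -e "$f" ] || continue     # glob matched nothing
        name="${f##*/}"             # drop the directory part
        name="${name#lib}"          # drop the "lib" prefix
        name="${name%%-rdmav*}"     # drop the "-rdmav<N>.so" suffix
        echo "$name"
    done
}

# has_provider NAME [DIR]: succeeds if the named provider is present.
has_provider() {
    list_verbs_providers "$2" | grep -qx "$1"
}
```

In an image meant to run on both clouds, `has_provider efa && has_provider mlx5` should succeed. This only shows that the provider libraries are present; it says nothing about whether the Mellanox-specific patches are included.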

I can see two options for solving the rdma-core problem:

  1. Use the EFA installer rdma-core. You may miss out on a new feature of the Mellanox cards. I wouldn't recommend this if you're running really advanced users of the Mellanox stack like Open MPI, MVAPICH, or UCX, but for NCCL it should be fine. And obviously it will work on EFA.
  2. The source artifacts for MOFED do include the EFA provider, so you could build from source and end up with an rdma-core that worked with EFA and included all the Mellanox extensions on IB. Of course, support on EFA with that MOFED version would be difficult and certainly isn't something AWS tests on a regular basis.

Once you have a working MOFED, I think it's a matter of installing the other packages manually from the installer tarballs. That's a bit of a pain, but should be relatively straightforward. The EFA Libfabric and Open MPI installs will end up in /opt/amazon and the Mellanox packages in /opt/mellanox, and everything should be happy. But, like I said, the installers will go back to clobbering each other over rdma-core, so that has to be avoided.
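Installing "the other packages" by hand means skipping every .deb that comes from rdma-core, so the one copy of rdma-core you chose is left alone. Below is a rough sketch of such a filter. The function names are hypothetical, and the package list is an assumption based on the names in the error above plus Ubuntu's usual rdma-core binary packages; check it against the actual contents of each installer tarball before relying on it.

```shell
#!/bin/sh
# Sketch: drop rdma-core-derived .deb files from a list of package paths.
# ASSUMPTION: this list mirrors Ubuntu's rdma-core binary package names;
# a vendor tarball may name or split its packages differently.
RDMA_CORE_PKGS="rdma-core ibverbs-providers ibverbs-utils libibverbs1 \
libibverbs-dev librdmacm1 librdmacm-dev rdmacm-utils libibumad3 \
libibumad-dev libibmad5 libibmad-dev libibnetdisc5 libibnetdisc-dev \
infiniband-diags ibacm srptools"

# is_rdma_core_deb PATH: succeeds if the .deb belongs to rdma-core.
is_rdma_core_deb() {
    pkg="${1##*/}"       # file name without directory
    pkg="${pkg%%_*}"     # Debian naming: <name>_<version>_<arch>.deb
    pkg="${pkg%-dbg}"    # treat foo-dbg the same as foo
    for p in $RDMA_CORE_PKGS; do
        [ "$pkg" = "$p" ] && return 0
    done
    return 1
}

# Reads .deb paths on stdin, prints only those not derived from rdma-core.
filter_non_rdma_core() {
    while IFS= read -r deb; do
        is_rdma_core_deb "$deb" || echo "$deb"
    done
}
```

With an installer tarball unpacked somewhere, something like `find <unpacked-dir> -name '*.deb' | filter_non_rdma_core` would give the candidate list to review and then pass to `apt-get install`. The filter works purely on file names, so it will not catch packages that depend on a specific rdma-core version.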

@bwbarrett thank you so much for the detailed answer and explanation -- this is incredibly helpful.

Given all this context, I think the best way to proceed is probably to build separate images for EFA and Mellanox, and come back to merging them later given the challenges. So, I will close this issue in the meantime.

When I have some time, I would like to try to merge them as it solves some maintenance issues for us. Given your explanation, I will try to proceed by installing EFA (and keeping rdma-core unpatched) and then try to layer on top what I need for Mellanox. I will then benchmark NCCL and verify it doesn't cause any performance degradations. I'll share whatever findings I get in this thread later if you or anyone else looking at this in the future finds any need for it :)

Separately, I want to quickly thank you and the rest of the people who have helped me on this repo over GitHub issues. It's been invaluable in digging into some of the esoteric details here, and I really appreciate it!

No problem at all; glad that all made some sense :). We'd love feedback if you find an approach that works in the long term.