aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.

WARNING: unrecognized options: --with-nccl when attempting to install

mvpatel2000 opened this issue

I am attempting to build off an existing Dockerfile to add EFA multinode support for some runs I would like to do on AWS. My dockerfile can be found here. I am looking at this guide and this example dockerfile.

I would like to avoid rebuilding PyTorch from source, so I am instead trying to clone the same version of NCCL and then build the aws-ofi-nccl plugin against that version of NCCL. While doing this, I am hitting the following warning:

configure: WARNING: unrecognized options: --with-nccl

on the following command:

./configure --prefix=/opt/aws-ofi-nccl/install --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda --with-nccl=/opt/nccl/build --with-mpi=/opt/amazon/openmpi/

It appears the latest commit on the aws branch no longer has this flag. Are there updated instructions on how to install the plugin?

Hi @mvpatel2000,

A recent change eliminates the dependency on NCCL, so you can omit that option.

The following should work:
./configure --prefix=/opt/aws-ofi-nccl/install --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda --with-mpi=/opt/amazon/openmpi/
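
For build scripts that must work across plugin versions with and without the flag, one option is to probe the configure script before passing --with-nccl. The following is a minimal sketch, not part of the plugin: configure_supports is a hypothetical helper, and it assumes the option appears in the output of ./configure --help when it is supported.

```shell
# Hypothetical helper: succeed if the configure script's --help output
# mentions the given option name.
# $1: path to a configure script, $2: option name such as --with-nccl
configure_supports() {
    help_text=$("$1" --help 2>/dev/null) || return 1
    case "$help_text" in
        *"$2"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (paths taken from this thread):
#   nccl_flag=""
#   if configure_supports ./configure --with-nccl; then
#       nccl_flag="--with-nccl=/opt/nccl/build"
#   fi
#   ./configure --prefix=/opt/aws-ofi-nccl/install $nccl_flag
```

Because the check is a plain substring match, an option with a longer name that starts with the same text would also match; that is acceptable for a quick probe but worth keeping in mind.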

@ryanamazon thank you for the prompt response! Would you mind giving updated instructions in that case since it seems like the guides are outdated? In order to run multinode, I am attempting to do:

RUN if [ -n "$CUDA_VERSION" ] ; then \
        cd /tmp && \
        curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz && \
        tar -xf /tmp/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
        cd aws-efa-installer && \
        apt-get update && \
        ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify && \
        rm -rf /tmp/aws-efa-installer* ; \
    fi
    
RUN if [ -n "$CUDA_VERSION" ] ; then \
        git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl && \
        cd /opt/aws-ofi-nccl && \
        git checkout ${AWS_OFI_NCCL_VERSION} && \
        ./autogen.sh && \
        ./configure --prefix=/opt/aws-ofi-nccl/install \
            --with-libfabric=/opt/amazon/efa/ \
            --with-cuda=/usr/local/cuda \
            # --with-nccl=/opt/nccl/build \
            --with-mpi=/opt/amazon/openmpi/ \
            --prefix=/usr/local && \
        make && \
        make install ; \
    fi

which would install EFA and the plugin. However, I am still hitting errors with multinode. For example, when running an all_reduce I am getting:

Traceback (most recent call last):
  File "/examples/scratch/test.py", line 7, in <module>
    dist.all_reduce(a)
  File "/composer/composer/utils/dist.py", line 211, in all_reduce
    dist.all_reduce(tensor, op=reduce_op)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/distributed_c10d.py", line 1320, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
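
The error text points at the NCCL warnings, which NCCL prints when the NCCL_DEBUG environment variable is set (for example NCCL_DEBUG=INFO). A separate quick check is whether the plugin library is on the loader search path at all. The sketch below assumes the plugin is installed as libnccl-net.so and only walks a colon-separated path such as LD_LIBRARY_PATH; it does not consult the ldconfig cache. find_on_path is a hypothetical helper name.

```shell
# Hypothetical helper: print the first directory on a colon-separated
# search path that contains the named file; fail if none does.
# The body runs in a subshell so the IFS change does not leak out.
# $1: file name, $2: colon-separated list of directories
find_on_path() (
    name="$1"
    IFS=:
    for dir in $2; do
        if [ -n "$dir" ] && [ -e "$dir/$name" ]; then
            printf '%s\n' "$dir"
            exit 0
        fi
    done
    exit 1
)

# Example use:
#   find_on_path libnccl-net.so "$LD_LIBRARY_PATH" \
#       || echo "plugin not found on LD_LIBRARY_PATH" >&2
```

If the helper reports nothing, the install prefix passed to configure and the directories listed in LD_LIBRARY_PATH are worth comparing.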

@ryanamazon also, the aws branch says "NOTE: This is an experimental branch specifically targeted for testing on AWS. Therefore, This branch is not supported." What is the recommended path here, since the guides all say to use the aws branch? Should I be using master instead?

@ryanamazon In particular, when I install EFA with:

RUN if [ -n "$CUDA_VERSION" ] ; then \
        cd /tmp && \
        curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz && \
        tar -xf /tmp/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
        cd aws-efa-installer && \
        apt-get update && \
        ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify && \
        rm -rf /tmp/aws-efa-installer* ; \
    fi

I only see efa_installed_packages in /opt/amazon, where I would expect to see /opt/amazon/efa/ (for libfabric). Can you please provide any guidance on this? I am struggling to find any documentation here.
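
A check like the one described above can be made to fail the image build early instead of surfacing later as a runtime error. This is a minimal sketch under the assumption that the installer is expected to create /opt/amazon/efa and /opt/amazon/openmpi, the paths used elsewhere in this thread; require_dirs is a hypothetical helper.

```shell
# Hypothetical helper: report every argument that is not an existing
# directory and fail if at least one is missing.
require_dirs() {
    missing=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing directory: $dir" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use right after the installer step:
#   require_dirs /opt/amazon/efa /opt/amazon/openmpi || exit 1
```

On an instance with an EFA device attached, running fi_info -p efa from the libfabric install is a further check that the EFA provider is visible; during a Docker build no device is present, so only the directory check applies there.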

I was able to install the aws-ofi-nccl plugin with the following Dockerfile:

FROM nvcr.io/nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04

ARG EFA_INSTALLER_VERSION=latest
ARG AWS_OFI_NCCL_VERSION=aws
ARG NCCL_TESTS_VERSION=v2.0.0

RUN apt-get update -y
RUN apt-get purge -y libmlx5-1 ibverbs-utils libibverbs-dev libibverbs1
RUN apt-get install -y --allow-unauthenticated \
    git \
    gcc \
    yum-utils \
    vim \
    kmod \
    openssh-client \
    openssh-server \
    build-essential \
    curl \
    autoconf \
    libtool \
    gdb \
    automake \
    python3-distutils \
    cmake \
    && rm -rf /var/lib/apt/lists/*



ENV HOME /tmp

ENV LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:$LD_LIBRARY_PATH
ENV PATH=/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:$PATH

RUN curl https://bootstrap.pypa.io/pip/3.6/get-pip.py -o /tmp/get-pip.py && python3 /tmp/get-pip.py
RUN pip3 install awscli

# Download the same installer version that the tar step below unpacks
RUN cd /tmp && \
        curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
        tar -xf /tmp/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
        cd aws-efa-installer && \
        apt-get update && \
        ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify && \
        rm -rf /tmp/aws-efa-installer*



####################################################
### Install AWS-OFI-NCCL plugin
RUN git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl \
    && cd /opt/aws-ofi-nccl \
    && git checkout ${AWS_OFI_NCCL_VERSION} \
    && ./autogen.sh \
    && ./configure --prefix=/opt/aws-ofi-nccl/install \
       --with-libfabric=/opt/amazon/efa/ \
       --with-cuda=/usr/local/cuda \
       --disable-tests \
    && make && make install

When I do, I get the following in /opt/amazon:

root@043a223188e8:/opt/amazon# ls
efa  efa_installed_packages  openmpi

> @ryanamazon also, the aws branch says "NOTE: This is an experimental branch specifically targeted for testing on AWS. Therefore, This branch is not supported." What is the recommended path here, since the guides all say to use the aws branch? Should I be using master instead?

You should use the aws branch on AWS hardware.

Thanks! I will try to reproduce what you gave and see if there's something else I did wrong. Really appreciate the help.

@ryanamazon I see you have libmlx5-1 in the purge list -- does this mean it is not possible to have a single image which has both Mellanox and EFA support?

@mvpatel2000,

> @ryanamazon I see you have libmlx5-1 in the purge list -- does this mean it is not possible to have a single image which has both Mellanox and EFA support?

Unfortunately, that's not my area of expertise. Since you are installing libfabric with EFA support from aws-efa-installer, the dependencies on the built-in Ubuntu rdma-core are removed. If you need both, you should be able to build rdma-core and libfabric separately with support for both Mellanox and EFA. I'm not sure whether that support is included in efa-installer by default.

Closing out this issue -- it seems to be working! @ryanamazon thank you so much for the help, I really really appreciate it. The documentation is unfortunately sorely lacking, and your guidance was incredibly helpful in getting this sorted out :)