WARNING: unrecognized options: --with-nccl when attempting to install
mvpatel2000 opened this issue
I am attempting to build off an existing Dockerfile to add EFA multinode support for some runs I would like to do on AWS. My dockerfile can be found here. I am looking at this guide and this example dockerfile.
I would like to avoid rebuilding pytorch from source, so I am instead trying to clone the same version of nccl and then build the aws-ofi-nccl plugin pointing to that version of nccl. While doing this, I am hitting the following error:
configure: WARNING: unrecognized options: --with-nccl
on the following command:
./configure --prefix=/opt/aws-ofi-nccl/install --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda --with-nccl=/opt/nccl/build --with-mpi=/opt/amazon/openmpi/
It appears the latest commit on the aws branch no longer has this flag. Are there updated instructions on how to install this?
Hi @mvpatel2000,
A recent change eliminates the dependency on NCCL, so you can omit that option.
The following should work:
./configure --prefix=/opt/aws-ofi-nccl/install --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda --with-mpi=/opt/amazon/openmpi/
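For a build script that has to work against both older and newer checkouts of the plugin, one option is to ask the configure script itself whether it still advertises the flag. The following is a minimal sketch, not something taken from the plugin's documentation: the helper names configure_supports and build_configure_args are hypothetical, and the paths are the ones used in the commands above.

```shell
# Hypothetical helper: succeed if the configure script lists the option in --help.
configure_supports() {
    # $1: path to the configure script, $2: option name such as --with-nccl
    sh "$1" --help 2>/dev/null | grep -qF -e "$2"
}

# Hypothetical helper: print the configure arguments from this thread,
# adding --with-nccl only when the checked-out configure still accepts it.
build_configure_args() {
    # $1: path to the configure script
    args="--prefix=/opt/aws-ofi-nccl/install --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda --with-mpi=/opt/amazon/openmpi/"
    if configure_supports "$1" "--with-nccl"; then
        # Assumption: NCCL was built under /opt/nccl/build, as in the original command.
        args="$args --with-nccl=/opt/nccl/build"
    fi
    printf '%s\n' "$args"
}
```

From the plugin checkout this could be used as `./configure $(build_configure_args ./configure)`. The word splitting is intentional here, and it only works because none of the paths contain spaces.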
@ryanamazon thank you for the prompt response! Would you mind giving updated instructions in that case since it seems like the guides are outdated? In order to run multinode, I am attempting to do:
RUN if [ -n "$CUDA_VERSION" ] ; then \
cd /tmp && \
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz && \
tar -xf /tmp/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
cd aws-efa-installer && \
apt-get update && \
./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify && \
rm -rf /tmp/aws-efa-installer* ; \
fi
RUN if [ -n "$CUDA_VERSION" ] ; then \
git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl && \
cd /opt/aws-ofi-nccl && \
git checkout ${AWS_OFI_NCCL_VERSION} && \
./autogen.sh && \
./configure --prefix=/opt/aws-ofi-nccl/install \
--with-libfabric=/opt/amazon/efa/ \
--with-cuda=/usr/local/cuda \
# --with-nccl=/opt/nccl/build \
--with-mpi=/opt/amazon/openmpi/ \
--prefix=/usr/local && \
make && \
make install ; \
fi
which would install EFA and the plugin. However, I am still hitting errors with multinode. For example, when running an all_reduce, I am getting:
Traceback (most recent call last):
File "/examples/scratch/test.py", line 7, in <module>
dist.all_reduce(a)
File "/composer/composer/utils/dist.py", line 211, in all_reduce
dist.all_reduce(tensor, op=reduce_op)
File "/usr/local/lib/python3.9/dist-packages/torch/distributed/distributed_c10d.py", line 1320, in all_reduce
work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
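The error text itself suggests checking the NCCL warnings. A minimal sketch of how that might be done before re-running the failing job is shown below. It assumes the job is launched from the same shell or container environment, and that fi_info (the query tool that ships with libfabric) would be found under /opt/amazon/efa/bin if the EFA installer placed libfabric there.

```shell
# Ask NCCL to log its initialization details and warnings.
export NCCL_DEBUG=INFO

# Check whether libfabric reports an EFA provider at all.
if command -v fi_info >/dev/null 2>&1; then
    fi_info -p efa || echo "libfabric did not report an EFA provider"
else
    # Assumption: libfabric from the EFA installer lives under /opt/amazon/efa.
    echo "fi_info not found on PATH; is /opt/amazon/efa/bin on PATH?"
fi
```

With NCCL_DEBUG=INFO set, the NCCL log should show which network transport or plugin was selected. That helps separate a plugin that was never loaded from a failure in the fabric itself.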
@ryanamazon also, the aws branch says "NOTE: This is an experimental branch specifically targeted for testing on AWS. Therefore, This branch is not supported." What is the recommended path here, since the guides all say to use the aws branch? Should I be using master instead?
@ryanamazon In particular, when I install
RUN if [ -n "$CUDA_VERSION" ] ; then \
cd /tmp && \
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz && \
tar -xf /tmp/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
cd aws-efa-installer && \
apt-get update && \
./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify && \
rm -rf /tmp/aws-efa-installer* ; \
fi
I only see efa_installed_packages in /opt/amazon, where I would expect to see /opt/amazon/efa/ (for libfabric). Can you please provide any guidance on this? I am struggling to find any documentation here.
I was able to install the aws-ofi-plugin:
FROM nvcr.io/nvidia/cuda:11.2.0-cudnn8-devel-ubuntu18.04
ARG EFA_INSTALLER_VERSION=latest
ARG AWS_OFI_NCCL_VERSION=aws
ARG NCCL_TESTS_VERSION=v2.0.0
RUN apt-get update -y
RUN apt-get purge -y libmlx5-1 ibverbs-utils libibverbs-dev libibverbs1
RUN apt-get install -y --allow-unauthenticated \
git \
gcc \
yum-utils \
vim \
kmod \
openssh-client \
openssh-server \
build-essential \
curl \
autoconf \
libtool \
gdb \
automake \
python3-distutils \
cmake \
&& rm -rf /var/lib/apt/lists/*
ENV HOME /tmp
ENV LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:$LD_LIBRARY_PATH
ENV PATH=/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:$PATH
RUN curl https://bootstrap.pypa.io/pip/3.6/get-pip.py -o /tmp/get-pip.py && python3 /tmp/get-pip.py
RUN pip3 install awscli
RUN cd /tmp && \
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
tar -xf /tmp/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
cd aws-efa-installer && \
apt-get update && \
./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify && \
rm -rf /tmp/aws-efa-installer*
####################################################
### Install AWS-OFI-NCCL plugin
RUN git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl \
&& cd /opt/aws-ofi-nccl \
&& git checkout ${AWS_OFI_NCCL_VERSION} \
&& ./autogen.sh \
&& ./configure --prefix=/opt/aws-ofi-nccl/install \
--with-libfabric=/opt/amazon/efa/ \
--with-cuda=/usr/local/cuda \
--disable-tests \
&& make && make install
When I do, I get the following in /opt/amazon:
root@043a223188e8:/opt/amazon# ls
efa efa_installed_packages openmpi
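To catch a partial install (for example, only efa_installed_packages being present) early in a build, the layout can be checked right after the installer runs. This is a minimal sketch, assuming the directory names shown in the listing above; the function name missing_subdirs is hypothetical.

```shell
# Hypothetical helper: print every expected subdirectory that is missing.
missing_subdirs() {
    # $1: root directory; remaining arguments: expected subdirectory names
    root=$1
    shift
    for name in "$@"; do
        [ -d "$root/$name" ] || printf '%s\n' "$root/$name"
    done
}
```

For example, `missing_subdirs /opt/amazon efa openmpi` prints nothing when both directories exist, and prints the missing paths otherwise, so a build step can fail when the output is non-empty.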
You should use the aws branch on AWS hardware.
Thanks! I will try to reproduce what you gave and see if there's something else I did wrong. Really appreciate the help
@ryanamazon I see you have libmlx5-1 -- does this mean it is not possible to have a single image which has both Mellanox and EFA support?
Unfortunately, that's not my area of expertise. However, since you are installing libfabric with EFA support from aws-efa-installer, the dependencies on the built-in Ubuntu rdma-core are removed. If you need both, you should be able to build rdma-core and libfabric separately with support for both Mellanox and EFA. I'm not sure whether that support is included in efa-installer by default.
Closing out this issue -- it seems to be working! @ryanamazon thank you so much for the help, I really really appreciate it. The documentation is unfortunately sorely lacking, and your guidance was incredibly helpful in getting this sorted out :)