aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.

EFA is not enabled on P4DN after upgrading to NCCL v2.10.3 and PyTorch v1.10 (master)

chaoyanghe opened this issue

On two P4DN nodes, I upgraded NCCL to v2.10.3 and PyTorch to v1.10. However, EFA is not enabled, even though all related components (PyTorch, aws-ofi-nccl, nccl-tests) load the same NCCL library (located at /usr/lib/x86_64-linux-gnu).

The error logs from aws-ofi-nccl/tests and nccl-tests are as follows.

/opt/amazon/openmpi/bin/mpirun \
         -n 16 -N 8 --hostfile /job/hostfile \
         -x LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/lib:/deepspeed/$USER/aws-ofi-nccl/install/lib:/home/$USER/aws-ofi-nccl:$LD_LIBRARY_PATH \
         -x FI_PROVIDER="efa" --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
         /home/deepspeed/aws-ofi-nccl/tests/nccl_message_transfer
10.3.35.83: + /opt/amazon/openmpi/bin/mpirun -n 16 -N 8 --hostfile /job/hostfile -x LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/lib:/deepspeed/deepspeed/aws-ofi-nccl/install/lib:/home/deepspeed/aws-ofi-nccl:/usr/local/cuda-11.0/lib64:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/usr/lib:/usr/local/lib:/usr/local/cuda-11.0/lib64:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/usr/lib:/usr/local/lib::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64 -x FI_PROVIDER=efa --mca btl_tcp_if_exclude lo,docker0 --bind-to none /home/deepspeed/aws-ofi-nccl/tests/nccl_message_transfer
10.3.35.83: Warning: Permanently added '[10.3.60.130]:2022' (RSA) to the list of known hosts.
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: --------------------------------------------------------------------------
10.3.35.83: Primary job  terminated normally, but 1 process returned
10.3.35.83: a non-zero exit code. Per user-direction, the job has been aborted.
10.3.35.83: --------------------------------------------------------------------------
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: --------------------------------------------------------------------------
10.3.35.83: mpirun detected that one or more processes exited with non-zero status, thus causing
10.3.35.83: the job to be terminated. The first process to do so was:
10.3.35.83: 
10.3.35.83:   Process name: [[43564,1],7]
10.3.35.83:   Exit code:    2
10.3.35.83: --------------------------------------------------------------------------
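
The launch command above assembles LD_LIBRARY_PATH from a long list of hand-written directories, which makes a wrong entry easy to overlook. As a minimal sketch, the helper below lists the entries of a colon-separated path list that do not exist as directories on the current node. The function name missing_path_entries is hypothetical and not part of any tool mentioned in this thread, and the check only concerns directory existence; it says nothing about whether EFA itself is usable.

```shell
# Hypothetical helper: print every non-empty entry of a colon-separated
# path list (such as LD_LIBRARY_PATH) that is not an existing directory.
# The body runs in a subshell so the IFS and globbing changes stay local.
missing_path_entries() (
    IFS=':'
    set -f    # do not expand wildcards that may appear in an entry
    for entry in $1; do
        if [ -n "$entry" ] && [ ! -d "$entry" ]; then
            printf '%s\n' "$entry"
        fi
    done
)

# Example usage on a node (prints nothing when every entry exists):
#   missing_path_entries "$LD_LIBRARY_PATH"
```

Running it inside the container on each node, with the same value that is passed to mpirun via -x, shows whether the plugin and EFA library directories are actually present where the loader will look.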
cd /fsx/hchaoyan/home/m5/nccl-tests && \
sudo rm -rf build && \
make MPI=1 MPI_HOME=/opt/amazon/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu

echo "running P4DN test"
cd /fsx/hchaoyan/home/m5/nccl-tests
$(which mpirun) -allow-run-as-root --mca plm_rsh_no_tree_spawn 1 \
-x FI_PROVIDER="efa" \
-x NCCL_SOCKET_IFNAME=eth \
-x FI_EFA_USE_DEVICE_RDMA=1 \
-x RDMAV_FORK_SAFE=1 \
-x LD_LIBRARY_PATH=/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/home/$USER/aws-ofi-nccl/install/lib:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=INFO \
-x NCCL_MIN_NCHANNELS=8 \
-x NCCL_ALGO=Ring \
-x OMP_NUM_THREADS=8 \
-x NCCL_NSOCKS_PERTHREAD=8 \
-x NCCL_SOCKET_NTHREADS=8 \
-bind-to none \
-n 16 -N 8 \
--mca pml ^cm \
--hostfile /job/hostfile \
-mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 \
./build/all_reduce_perf -b 2G -e 2G -g 1 -n 30

10.3.35.83: #   Rank 15 Pid     42 on ip-10-3-60-130 device  7 [0xa0] A100-SXM4-40GB
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83:
10.3.35.83: ip-10-3-35-83:362:362 [0] find_ofi_provider:543 NCCL WARN NET/OFI Couldn't find any optimal provider
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO NET/IB : No device found.
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO NET/Socket : Using [0]eth0:10.3.35.83<0> [1]eth1:10.3.42.59<0> [2]eth2:10.3.48.121<0> [3]eth3:10.3.49.25<0>
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO Using network Socket
10.3.35.83: NCCL version 2.10.3+cuda11.0
10.3.35.83: ip-10-3-35-83:371:371 [7] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
10.3.35.83: ip-10-3-35-83:371:371 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
10.3.35.83: ip-10-3-35-83:371:371 [7] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: ip-10-3-35-83:369:369 [6] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
10.3.35.83: ip-10-3-35-83:367:367 [5] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
10.3.35.83: ip-10-3-35-83:369:369 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
10.3.35.83: ip-10-3-35-83:369:369 [6] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: ip-10-3-35-83:367:367 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
10.3.35.83: ip-10-3-35-83:367:367 [5] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: ip-10-3-35-83:365:365 [3] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
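
With NCCL_DEBUG=INFO, NCCL reports the transport it picked in a line of the form "NCCL INFO Using network <name>". In the log above that name is Socket, so the run fell back to TCP sockets instead of the OFI plugin. A minimal sketch for pulling that name out of a saved log follows; the function name nccl_selected_network is hypothetical, and the only assumption about the log format is the "Using network" line shown above.

```shell
# Hypothetical helper: read an NCCL debug log on stdin and print the
# distinct transport names reported by "NCCL INFO Using network <name>".
nccl_selected_network() {
    sed -n 's/.*NCCL INFO Using network \(.*\)$/\1/p' | sort -u
}

# Example usage, assuming the mpirun output was saved to nccl.log:
#   nccl_selected_network < nccl.log
```

If the output is Socket on a node where EFA is expected, the plugin was loaded but did not find a usable provider, which matches the "Couldn't find any optimal provider" warning in the log.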

I checked the NCCL library used by PyTorch, aws-ofi-nccl, and nccl-tests. They all use the same library in /usr/lib/x86_64-linux-gnu.

10.3.35.83: + ldd /tmp/pytorch/build/lib/libtorch_cuda.so
10.3.35.83:     linux-vdso.so.1 (0x00007fffff849000)
10.3.35.83:     libcudart.so.11.0 => /usr/local/cuda-11.0/lib64/libcudart.so.11.0 (0x00007eff6f08e000)
10.3.35.83:     libc10_cuda.so => /tmp/pytorch/build/lib/libc10_cuda.so (0x00007eff6ee3b000)
10.3.35.83:     libcusparse.so.11 => /usr/local/cuda-11.0/lib64/libcusparse.so.11 (0x00007eff652f6000)
10.3.35.83:     libcurand.so.10 => /usr/local/cuda-11.0/lib64/libcurand.so.10 (0x00007eff6078a000)
10.3.35.83:     libcusolver.so.10 => /usr/local/cuda-11.0/lib64/libcusolver.so.10 (0x00007eff3f96b000)
10.3.35.83:     libnccl.so.2 => /usr/lib/x86_64-linux-gnu/libnccl.so.2 (0x00007eff34e06000)
10.3.35.83:     libnvToolsExt.so.1 => /usr/local/cuda-11.0/lib64/libnvToolsExt.so.1 (0x00007eff34bfd000)
10.3.35.83:     libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007eff349de000)
10.3.35.83:     librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007eff347d6000)
10.3.35.83:     libc10.so => /tmp/pytorch/build/lib/libc10.so (0x00007eff34546000)
10.3.35.83:     libtorch_cpu.so => /tmp/pytorch/build/lib/libtorch_cpu.so (0x00007eff2af76000)
10.3.35.83:     libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007eff2ad72000)
10.3.35.83:     libcufft.so.10 => /usr/local/cuda-11.0/lib64/libcufft.so.10 (0x00007eff20eae000)
10.3.35.83:     libcublas.so.11 => /usr/local/cuda-11.0/lib64/libcublas.so.11 (0x00007eff1b05e000)
10.3.35.83:     libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007eff1acd5000)
10.3.35.83:     libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007eff1a937000)
10.3.35.83:     libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007eff1a71f000)
10.3.35.83:     libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007eff1a32e000)
10.3.35.83:     /lib64/ld-linux-x86-64.so.2 (0x00007eff92b55000)
10.3.35.83:     libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007eff1a0ff000)
10.3.35.83:     libcublasLt.so.11 => /usr/local/cuda-11.0/lib64/libcublasLt.so.11 (0x00007eff0ef70000)

10.3.35.83: + ldd //home/deepspeed/aws-ofi-nccl/install/lib/libnccl-net.so
10.3.35.83:     linux-vdso.so.1 (0x00007ffe06d81000)
10.3.35.83:     libcudart.so.11.0 => /usr/local/cuda-11.0/lib64/libcudart.so.11.0 (0x00007fbfc51b5000)
10.3.35.83:     libfabric.so.1 => /opt/amazon/efa/lib/libfabric.so.1 (0x00007fbfc4ec9000)
10.3.35.83:     libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbfc4ad8000)
10.3.35.83:     /lib64/ld-linux-x86-64.so.2 (0x00007fbfc563d000)
10.3.35.83:     libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbfc48d4000)
10.3.35.83:     libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbfc46b5000)
10.3.35.83:     librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbfc44ad000)
10.3.35.83:     libefa.so.1 => /usr/lib/x86_64-linux-gnu/libefa.so.1 (0x00007fbfc42a6000)
10.3.35.83:     libibverbs.so.1 => /usr/lib/x86_64-linux-gnu/libibverbs.so.1 (0x00007fbfc4088000)
10.3.35.83:     libnl-route-3.so.200 => /usr/lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x00007fbfc3e13000)
10.3.35.83:     libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x00007fbfc3bf3000)
10.3.35.83:     libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbfc3855000)

10.3.35.83: + ldd ./build/all_reduce_perf
10.3.35.83:     linux-vdso.so.1 (0x00007ffe3359d000)
10.3.35.83:     libcudart.so.11.0 => /usr/local/cuda-11.0/lib64/libcudart.so.11.0 (0x00007f2f97be5000)
10.3.35.83:     libmpi.so.40 => not found
10.3.35.83:     libnccl.so.2 => /usr/lib/x86_64-linux-gnu/libnccl.so.2 (0x00007f2f8d080000)
10.3.35.83:     libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f2f8ce61000)
10.3.35.83:     libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f2f8cad8000)
10.3.35.83:     libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2f8c6e7000)
10.3.35.83:     /lib64/ld-linux-x86-64.so.2 (0x00007f2f981f0000)
10.3.35.83:     libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f2f8c4e3000)
10.3.35.83:     librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f2f8c2db000)
10.3.35.83:     libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f2f8bf3d000)
10.3.35.83:     libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f2f8bd25000)
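
The ldd listings above can be checked mechanically rather than by eye. As a minimal sketch, the two helpers below read ldd output on stdin, strip an optional "host: " prefix like the one in these logs, and print either the path a given library resolves to or the libraries reported as "not found". The function names resolved_lib and missing_libs are hypothetical.

```shell
# Strip an optional "10.3.35.83: "-style host prefix from each line.
strip_host_prefix() {
    sed 's/^[0-9.]*: *//'
}

# Hypothetical helper: print the path that library $1 resolves to.
resolved_lib() {
    strip_host_prefix | awk -v lib="$1" '$1 == lib && $2 == "=>" && $3 != "not" { print $3 }'
}

# Hypothetical helper: print every library that ldd reports as "not found".
missing_libs() {
    strip_host_prefix | awk '$2 == "=>" && $3 == "not" && $4 == "found" { print $1 }'
}

# Example usage:
#   ldd ./build/all_reduce_perf | resolved_lib libnccl.so.2
#   ldd ./build/all_reduce_perf | missing_libs
```

Applied to the all_reduce_perf listing above, missing_libs would report libmpi.so.40, which is unresolved in the environment where ldd was run.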

The Dockerfile I used:

FROM nvidia/cuda:11.0-devel-ubuntu18.04

##############################################################################
# Temporary Installation Directory
##############################################################################
ENV STAGE_DIR=/tmp
RUN mkdir -p ${STAGE_DIR}

##############################################################################
# Installation/Basic Utilities
##############################################################################
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common build-essential autotools-dev \
        nfs-common pdsh \
        cmake g++ gcc \
        curl wget vim tmux emacs less unzip \
        htop iftop iotop ca-certificates openssh-client openssh-server \
        rsync iputils-ping net-tools sudo \
        llvm-9-dev

##############################################################################
# Installation Latest Git
##############################################################################
RUN add-apt-repository ppa:git-core/ppa -y && \
    apt-get update && \
    apt-get install -y git && \
    git --version


##############################################################################
# OPENMPI
##############################################################################
ENV OPENMPI_BASEVERSION=4.0
ENV OPENMPI_VERSION=${OPENMPI_BASEVERSION}.1
RUN cd ${STAGE_DIR} && \
    wget -q -O - https://download.open-mpi.org/release/open-mpi/v${OPENMPI_BASEVERSION}/openmpi-${OPENMPI_VERSION}.tar.gz | tar xzf - && \
    cd openmpi-${OPENMPI_VERSION} && \
    ./configure --prefix=/usr/local/openmpi-${OPENMPI_VERSION} && \
    make -j"$(nproc)" install && \
    ln -s /usr/local/openmpi-${OPENMPI_VERSION} /usr/local/mpi && \
    # Sanity check:
    test -f /usr/local/mpi/bin/mpic++ && \
    cd ${STAGE_DIR} && \
    rm -r ${STAGE_DIR}/openmpi-${OPENMPI_VERSION}
ENV PATH=/usr/local/mpi/bin:${PATH} \
    LD_LIBRARY_PATH=/usr/local/lib:/usr/local/mpi/lib:/usr/local/mpi/lib64:${LD_LIBRARY_PATH}
# Create a wrapper for OpenMPI to allow running as root by default
RUN mv /usr/local/mpi/bin/mpirun /usr/local/mpi/bin/mpirun.real && \
    echo '#!/bin/bash' > /usr/local/mpi/bin/mpirun && \
    echo 'mpirun.real --allow-run-as-root --prefix /usr/local/mpi "$@"' >> /usr/local/mpi/bin/mpirun && \
    chmod a+x /usr/local/mpi/bin/mpirun

##############################################################################
# Python
##############################################################################
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHON_VERSION=3
RUN apt-get install -y python3 python3-dev && \
    rm -f /usr/bin/python && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    curl -O https://bootstrap.pypa.io/get-pip.py && \
        python get-pip.py && \
        rm get-pip.py && \
    pip install --upgrade pip && \
    # Print python and pip version
    python -V && pip -V
RUN pip install pyyaml
RUN pip install ipython

RUN apt-get update && \
    apt-get install -y vim git tmux wget curl autoconf libtool apt-utils
##############################################################################
# EFA Setup
##############################################################################
RUN cd ${STAGE_DIR} && curl -O  https://efa-installer.amazonaws.com/aws-efa-installer-1.12.3.tar.gz && tar -xf aws-efa-installer-1.12.3.tar.gz && cd aws-efa-installer && sudo ./efa_installer.sh -y -d -g  --skip-kmod --skip-limit-conf --no-verify

##############################################################################
# NCCL 2.10.3 Upgrade
##############################################################################
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub && add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /" && apt update && apt install -y --allow-change-held-packages libnccl2=2.10.3-1+cuda11.0 libnccl-dev=2.10.3-1+cuda11.0
ENV NCCL_VERSION=2.10.3

##############################################################################
# TensorFlow
##############################################################################
ENV TENSORFLOW_VERSION=1.15.2
RUN pip install tensorflow-gpu==${TENSORFLOW_VERSION}

##############################################################################
# Some Packages
##############################################################################
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        libsndfile-dev \
        libcupti-dev \
        libjpeg-dev \
        libpng-dev \
        screen
RUN pip install psutil \
                yappi \
                cffi \
                ipdb \
                pandas \
                matplotlib \
                py3nvml \
                pyarrow \
                graphviz \
                astor \
                boto3 \
                tqdm \
                sentencepiece \
                msgpack \
                requests \
                pandas \
                sphinx \
                sphinx_rtd_theme \
                scipy \
                numpy \
                sklearn \
                scikit-learn \
                nvidia-ml-py3 \
                mpi4py \
                cupy-cuda100

##############################################################################
## SSH daemon port inside container cannot conflict with host OS port
###############################################################################
#ENV SSH_PORT=2222
#RUN cat /etc/ssh/sshd_config > ${STAGE_DIR}/sshd_config && \
#    sed "0,/^#Port 22/s//Port ${SSH_PORT}/" ${STAGE_DIR}/sshd_config > /etc/ssh/sshd_config

##############################################################################
# PyTorch
##############################################################################
RUN sudo pip3 install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses h5py
RUN cd ${STAGE_DIR} && git clone https://github.com/pytorch/pytorch.git && cd pytorch && git submodule sync && git submodule update --init --recursive && sudo USE_SYSTEM_NCCL=1 TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0" python3 setup.py install
##############################################################################
# PyYAML build issue
# https://stackoverflow.com/a/53926898
##############################################################################
RUN rm -rf /usr/lib/python3/dist-packages/yaml && \
    rm -rf /usr/lib/python3/dist-packages/PyYAML-*

##############################################################################
## Add deepspeed user
###############################################################################
# Add a deepspeed user with user id 8877
#RUN useradd --create-home --uid 8877 deepspeed
RUN useradd --create-home --uid 1000 --shell /bin/bash deepspeed
RUN usermod -aG sudo deepspeed
RUN echo "deepspeed ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
# # Change to non-root privilege
USER deepspeed


##############################################################################
## Install Z3
##############################################################################
RUN cd ${STAGE_DIR} && git clone https://github.com/Z3Prover/z3.git &&  cd z3 && python scripts/mk_make.py && cd build && make -j$(nproc) && sudo make install

##############################################################################
## Install capstone
##############################################################################
RUN cd ${STAGE_DIR} && git clone https://github.com/aquynh/capstone.git  && cd  capstone && sudo make -j$(nproc) && sudo ./make.sh install

##############################################################################
## Install boost
##############################################################################
RUN cd ${STAGE_DIR} && wget https://boostorg.jfrog.io/artifactory/main/release/1.76.0/source/boost_1_76_0.tar.gz && tar -xzvf boost_1_76_0.tar.gz && cd boost_1_76_0 && sudo ./bootstrap.sh && sudo ./b2 install

##############################################################################
## Install Triton
##############################################################################
RUN cd ${STAGE_DIR} && git clone https://github.com/JonathanSalwan/Triton.git && cd Triton && mkdir build && cd build && cmake .. && sudo make -j$(nproc) install && cd ../../

##############################################################################
## Install ICU
##############################################################################
RUN cd ${STAGE_DIR} && wget http://github.com/unicode-org/icu/releases/download/release-67-1/icu4c-67_1-src.tgz && tar -xvzf icu4c-67_1-src.tgz && cd icu/source && ./configure --prefix=/usr && sudo make -j$(nproc) && sudo make install

##############################################################################
## Install custom Apex
##############################################################################
RUN cd ${STAGE_DIR} && git clone  https://github.com/szhengac/apex.git && cd apex && git checkout lans && sudo TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0" pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

##############################################################################
## Install TensorboardX
##############################################################################
RUN sudo pip3 install tensorboardX

##############################################################################
# DeepSpeed
##############################################################################
RUN git clone https://github.com/microsoft/DeepSpeed.git ${STAGE_DIR}/DeepSpeed
##Commit in which the AMI has been built
RUN cd /tmp/DeepSpeed && git checkout 7435b2f10af773b0204e77c3549b2b7df9a7a65b
RUN rm ${STAGE_DIR}/DeepSpeed/requirements/requirements-sparse_attn.txt && touch ${STAGE_DIR}/DeepSpeed/requirements/requirements-sparse_attn.txt
## We have to do this otherwise bdist install Python 1.7
RUN rm ${STAGE_DIR}/DeepSpeed/requirements/requirements.txt && touch ${STAGE_DIR}/DeepSpeed/requirements/requirements.txt
ENV DS_BUILD_OPS=1
ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0"
RUN cd ${STAGE_DIR}/DeepSpeed && sudo DS_BUILD_OPS=1 TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0" pip install -v ./

# Extra installation
RUN sudo pip3 install sentencepiece
RUN sudo pip3 install pytorch-ignite
RUN sudo pip3 install pytest-cov

# Batch Multi Node
ENV USER deepspeed
ENV HOME /home/$USER
RUN echo $HOME
RUN sudo pip install supervisor awscli

##############################################################################
# SSH Setup
##############################################################################
ENV SSHDIR $HOME/.ssh
RUN mkdir -p ${SSHDIR} \
&& touch ${SSHDIR}/sshd_config \
&& ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
&& cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
&& cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
&& echo "       IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
&& echo "       StrictHostKeyChecking no" >> ${SSHDIR}/config \
&& echo "       UserKnownHostsFile /dev/null" >> ${SSHDIR}/config \
&& echo "       Port 2022" >> ${SSHDIR}/config \
&& echo 'Port 2022' >> ${SSHDIR}/sshd_config \
&& echo 'UsePrivilegeSeparation no' >> ${SSHDIR}/sshd_config \
&& echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
&& echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
&& chmod -R 600 ${SSHDIR}/* \
&& chown -R ${USER}:${USER} ${SSHDIR}/
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

RUN sudo apt install -y iproute2


EXPOSE 22


USER deepspeed

##############################################################################
# Supervisor container startup
##############################################################################
ADD conf/supervisord/supervisord.conf /etc/supervisor/supervisord.conf
ADD supervised-scripts/mpi-run.sh supervised-scripts/mpi-run.sh
RUN sudo chmod 755 supervised-scripts/mpi-run.sh

##############################################################################
# Entry Point Script
##############################################################################
ADD batch-runtime-scripts/entry-point.sh batch-runtime-scripts/entry-point.sh
RUN sudo chmod 0755 batch-runtime-scripts/entry-point.sh
CMD /batch-runtime-scripts/entry-point.sh

##############################################################################
# Install AWS-OFI-NCCL plugin
##############################################################################
RUN git clone https://github.com/aws/aws-ofi-nccl.git $HOME/aws-ofi-nccl \
    && cd $HOME/aws-ofi-nccl \
    && git checkout aws  \
    && ./autogen.sh \
    && ./configure --prefix=$HOME/aws-ofi-nccl/install \
       --with-libfabric=/opt/amazon/efa/ \
       --with-cuda=/usr/local/cuda \
       --with-mpi=/opt/amazon/openmpi/ \
       --with-nccl=/usr/lib/x86_64-linux-gnu \
    && make -j$(nproc) && make install

We spent some time debugging this offline, and the issue was caused by the GPUDirect configuration in Docker. Since this is specific to a particular use case, we are closing this public ticket and tracking it internally.
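
The comment above does not say which GPUDirect setting was wrong, so the following is only a generic sanity check and not the fix that was applied. It assumes that EFA devices passed into a container show up as /dev/infiniband/uverbs* entries, and simply counts them. The helper name count_uverbs_devices is hypothetical.

```shell
# Hypothetical helper: count uverbs* device entries in a directory.
# $1 is the directory to inspect and defaults to /dev/infiniband.
# Assumption: EFA devices exposed to the container appear there.
count_uverbs_devices() {
    dir="${1:-/dev/infiniband}"
    count=0
    for dev in "$dir"/uverbs*; do
        # An unmatched glob stays literal, so confirm the entry exists.
        if [ -e "$dev" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Example usage inside the container:
#   count_uverbs_devices
```

A count of 0 inside the container, on a host where the same check finds devices, would suggest that the devices were not passed through when the container was started.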