aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as the network provider while running NCCL applications.

try horovod: create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory

handar423 opened this issue · comments

Hello, I'm trying to test Horovod with EFA + NCCL, but it gets stuck when running on multiple nodes. I think the main error is: create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory.

[1,3]<stdout>:ip-172-31-6-189:153:691 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[1,5]<stdout>:ip-172-31-6-189:155:696 [5] NCCL INFO Ring 01 : 5[5] -> 6[6] via P2P/IPC
[1,3]<stdout>:ip-172-31-6-189:153:691 [3] NCCL INFO Ring 01 : 3 -> 10 [send] via NET/AWS Libfabric/1
[1,6]<stdout>:ip-172-31-6-189:156:695 [6] NCCL INFO Ring 01 : 6[6] -> 4[4] via P2P/IPC
[1,7]<stdout>:ip-172-31-6-189:157:687 [7] NCCL INFO Ring 01 : 7[7] -> 3[3] via P2P/IPC
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO Ring 01 : 4[4] -> 7[7] via P2P/IPC
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO Ring 02 : 13 -> 4 [receive] via NET/AWS Libfabric/2
[1,6]<stdout>:ip-172-31-6-189:156:695 [6] NCCL INFO Ring 02 : 6[6] -> 7[7] via P2P/IPC
[1,7]<stdout>:ip-172-31-6-189:157:687 [7] NCCL INFO Ring 02 : 7[7] -> 5[5] via P2P/IPC
[1,4]<stdout>:
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO include/net.h:21 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO transport/net.cc:334 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:340 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:650 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:815 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:951 -> 2
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO Ring 01 : 12[4] -> 15[7] via P2P/IPC
[1,14]<stdout>:ip-172-31-3-127:43:574 [6] NCCL INFO Ring 01 : 14[6] -> 12[4] via P2P/IPC
[1,11]<stdout>:ip-172-31-3-127:40:580 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[1,11]<stdout>:ip-172-31-3-127:40:580 [3] NCCL INFO Ring 01 : 11 -> 2 [send] via NET/AWS Libfabric/1
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO Ring 02 : 5 -> 12 [receive] via NET/AWS Libfabric/2
[1,13]<stdout>:ip-172-31-3-127:42:576 [5] NCCL INFO Ring 01 : 13[5] -> 14[6] via P2P/IPC
[1,12]<stdout>:
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO include/net.h:21 -> 2
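RC: -12 maps to -ENOMEM. One thing worth ruling out (a sketch, not a confirmed cause): EFA pins memory for its queues, so a low locked-memory ulimit on any node can make endpoint creation fail with exactly this error. A quick check:

```shell
# RC -12 is -ENOMEM; EFA pins (locks) memory for its queues, so a low
# "max locked memory" ulimit can make fi_enable/endpoint creation fail.
check_memlock() {
    lim=$(ulimit -l)
    if [ "$lim" = "unlimited" ]; then
        echo "memlock: unlimited (ok)"
    else
        echo "memlock: ${lim} KiB - consider raising it (ulimit -l / limits.conf) on every node"
    fi
}
check_memlock
```

Run this on each node; mpirun/horovodrun sessions can end up with a different (lower) limit than an interactive shell, so it is worth checking inside the launched environment too.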

Some information that may be helpful:

I am using EFA 1.5.1, and fi_info -p efa works:

provider: efa
    fabric: EFA-fe80::4a:78ff:fef8:fa03
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::4a:78ff:fef8:fa03
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::4a:78ff:fef8:fa03
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD

 EFA installer version: 1.5.1
# Debug packages installed: yes
# Packages installed:
ibacm_25.0-1_amd64 ibverbs-providers_25.0-1_amd64 ibverbs-utils_25.0-1_amd64 infiniband-diags_25.0-1_amd64 libfabric-bin_1.8.0amzn1.0_amd64 libfabric-dev_1.8.0amzn1.0_amd64 libfabric1_1.8.0amzn1.0_amd64 libibmad-dev_25.0-1_amd64 libibmad5_25.0-1_amd64 libibnetdisc-dev_25.0-1_amd64 libibnetdisc5_25.0-1_amd64 libibumad-dev_25.0-1_amd64 libibumad3_25.0-1_amd64 libibverbs-dev_25.0-1_amd64 libibverbs1_25.0-1_amd64 librdmacm-dev_25.0-1_amd64 librdmacm1_25.0-1_amd64 openmpi_3.1.4-2_amd64 rdma-core_25.0-1_amd64 rdmacm-utils_25.0-1_amd64 libfabric1-dbg_1.8.0amzn1.0_amd64 libibmad5-dbg_25.0-1_amd64 libibnetdisc5-dbg_25.0-1_amd64 libibumad3-dbg_25.0-1_amd64 libibverbs1-dbg_25.0-1_amd64 librdmacm1-dbg_25.0-1_amd64

I also tested the nccl-tests all_reduce_perf, and it works as well. To run it:

curl http://169.254.169.254/latest/meta-data/local-ipv4 >> my-hosts &&
/opt/amazon/openmpi/bin/mpirun \
    -x FI_PROVIDER=efa \
    -x FI_EFA_TX_MIN_CREDITS=64 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_TREE_THRESHOLD=0 \
    --hostfile my-hosts -n 8 -N 8 \
    --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
    /opt/build/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

I get:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.5.1_nccl-efa-test.log

As for Horovod, my command is:

NCCL_DEBUG=INFO \
HOROVOD_NUM_NCCL_STREAMS=4 \
horovodrun -np 16 -H localhost:8,172.31.3.127:8 \
    --mpi-args="-x PATH -x LD_LIBRARY_PATH -x FI_PROVIDER=efa -x FI_EFA_TX_MIN_CREDITS=64 -x NCCL_TREE_THRESHOLD=0" \
    python3 /home/cluster/distributed-training/test_scripts/pytorch_synthetic_benchmark.py --model resnet101 --batch-size 32 |& grep -v "Read -1"

The complete log:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.5.1_horovod-test.log

PS: In fact, I would prefer to use EFA 1.8.3 (to keep the same test environment), but I get a different error with that version:

[1,0]<stderr>:terminate called after throwing an instance of 'std::system_error'
[1,0]<stderr>:  what():  Resource deadlock avoided
[1,0]<stderr>:[ip-172-31-6-189:00789] *** Process received signal ***
[1,0]<stderr>:[ip-172-31-6-189:00789] Signal: Aborted (6)
[1,0]<stderr>:[ip-172-31-6-189:00789] Signal code:  (-6)
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 0] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f17f5b34f20]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 1] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f17f5b34e97]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 2] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f17f5b36801]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 3] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7f17f0d40957]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 4] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ab6)[0x7f17f0d46ab6]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 5] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x91b19)[0x7f17f0d45b19]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 6] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a8)[0x7f17f0d46488]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 7] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10613)[0x7f17f0aac613]
[1,0]<stderr>:[ip-172-31-6-189:00789] [1,0]<stderr>:[ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x2b1)[0x7f17f0aacb71]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 9] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x37)[0x7f17f0d46d17]
[1,0]<stderr>:[ip-172-31-6-189:00789] [10] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8ea19)[0x7f17f0d42a19]
[1,0]<stderr>:[ip-172-31-6-189:00789] [11] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd8dc)[0x7f17f0d718dc]
[1,0]<stderr>:[ip-172-31-6-189:00789] [12] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common18HorovodGlobalStateD1Ev+0xaa8)[0x7f17c47176b8]
[1,0]<stderr>:[ip-172-31-6-189:00789] [13] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x43041)[0x7f17f5b39041]
[1,0]<stderr>:[ip-172-31-6-189:00789] [14] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x4313a)[0x7f17f5b3913a]
[1,0]<stderr>:[ip-172-31-6-189:00789] [15] /opt/amazon/efa/lib/libfabric.so.1(+0x5ebbf)[0x7f16dbd7ebbf]
[1,0]<stderr>:[ip-172-31-6-189:00789] [1,0]<stderr>:[16] /opt/amazon/efa/lib/libfabric.so.1(+0xf3f2)[0x7f16dbd2f3f2]
[1,0]<stderr>:[ip-172-31-6-189:00789] [17] /opt/amazon/efa/lib/libfabric.so.1(fi_getinfo+0x45d)[0x7f16dbd2fa9d]
[1,0]<stderr>:[ip-172-31-6-189:00789] [18] /usr/local/lib/libnccl-net.so(+0x2045)[0x7f16e41cc045]
[1,0]<stderr>:[ip-172-31-6-189:00789] [1,0]<stderr>:[19] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0xf1065)[0x7f17c47a2065]
[1,0]<stderr>:[ip-172-31-6-189:00789] [20] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0xf1b4f)[0x7f17c47a2b4f]
[1,0]<stderr>:[ip-172-31-6-189:00789] [21] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x245)[0x7f17c47633a5]
[1,0]<stderr>:[ip-172-31-6-189:00789] [22] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x54)[0x7f17c47634d4]
[1,0]<stderr>:[ip-172-31-6-189:00789] [23] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x71)[0x7f17c472e061]
[1,0]<stderr>:[ip-172-31-6-189:00789] [24] /usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0xa1)[0x7f17c472e371]
[1,0]<stderr>:[ip-172-31-6-189:00789] [25] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0x5e440)[0x7f17c470f440]
[1,0]<stderr>:[ip-172-31-6-189:00789] [26] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(+0xaecf)[0x7f17f2000ecf]
[1,0]<stderr>:[ip-172-31-6-189:00789] [27] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f17f58de6db]
[1,0]<stderr>:[ip-172-31-6-189:00789] [28] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f17f5c1788f]
[1,0]<stderr>:[ip-172-31-6-189:00789] *** End of error message ***

The complete log:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.8.3_horovod-test.log

The nccl-tests all_reduce_perf run also works with 1.8.3:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.8.3_nccl-efa-test.log

Output of fi_info -p efa and cat /opt/amazon/efa_installed_packages:

provider: efa
    fabric: EFA-fe80::4a:78ff:fef8:fa03
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::4a:78ff:fef8:fa03
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::4a:78ff:fef8:fa03
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD

SHM transfer will be disabled because of ptrace protection.
To enable SHM transfer, please refer to the man page fi_efa.7 for more information.
Also note that turning off ptrace protection has security implications. If you cannot
turn it off, you can suppress this message by setting FI_EFA_ENABLE_SHM_TRANSFER=0

 EFA installer version: 1.8.3
# Debug packages installed: yes
# Packages installed:
ibacm_25.0-1_amd64 ibverbs-providers_25.0-1_amd64 ibverbs-utils_25.0-1_amd64 infiniband-diags_25.0-1_amd64 libfabric-aws-bin_1.9.0amzn1.1_amd64 libfabric-aws-dev_1.9.0amzn1.1_amd64 libfabric1-aws_1.9.0amzn1.1_amd64 libibmad-dev_25.0-1_amd64 libibmad5_25.0-1_amd64 libibnetdisc-dev_25.0-1_amd64 libibnetdisc5_25.0-1_amd64 libibumad-dev_25.0-1_amd64 libibumad3_25.0-1_amd64 libibverbs-dev_25.0-1_amd64 libibverbs1_25.0-1_amd64 librdmacm-dev_25.0-1_amd64 librdmacm1_25.0-1_amd64 openmpi40-aws_4.0.2-1_amd64 rdma-core_25.0-1_amd64 rdmacm-utils_25.0-1_amd64 libfabric1-aws-dbg_1.9.0amzn1.1_amd64 libibmad5-dbg_25.0-1_amd64 libibnetdisc5-dbg_25.0-1_amd64 libibumad3-dbg_25.0-1_amd64 libibverbs1-dbg_25.0-1_amd64 librdmacm1-dbg_25.0-1_amd64
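The "SHM transfer will be disabled" notice above refers to Yama ptrace protection; the current kernel setting can be inspected like this (0 means classic ptrace permissions, so the SHM path is usable):

```shell
# Yama's ptrace_scope sysctl is the "ptrace protection" the libfabric
# message above refers to; 0 means classic permissions (SHM path usable).
ptrace_scope() {
    if [ -r /proc/sys/kernel/yama/ptrace_scope ]; then
        cat /proc/sys/kernel/yama/ptrace_scope
    else
        echo "no-yama"
    fi
}
echo "ptrace_scope: $(ptrace_scope)"
```

As the message says, disabling ptrace protection has security implications; the alternative is to leave it on and set FI_EFA_ENABLE_SHM_TRANSFER=0 to silence the warning.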

Could someone please point me in the right direction? Thank you very much!

@handar423 I see that you ran nccl-tests on a single node only. Have you tried running it on a multi-node setup? That will help us verify whether EFA is working between the 2 instances. Which AMI are you using to set this up?
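For reference, a two-node Open MPI hostfile looks like this (the IPs below are the two instances from your logs; substitute your own):

```shell
# Two-node hostfile so all_reduce_perf actually exercises EFA between
# instances instead of staying on P2P/IPC within one node.
cat > my-hosts <<'EOF'
172.31.6.189 slots=8
172.31.3.127 slots=8
EOF
wc -l < my-hosts   # prints the number of hosts (2)
```

Then rerun the same mpirun command with --hostfile my-hosts -n 16 -N 8 so all 16 GPUs across both nodes participate.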

I tried to set up this benchmark and could run it successfully. Here are my detailed steps and outputs for your reference:

Installation instructions:

  1. Use Deep Learning AMI (Amazon Linux 2) Version 32.0 (it comes with NCCL, the aws-ofi-nccl plugin, and EFA installer v1.9.3 pre-installed, along with the TensorFlow, PyTorch, and MXNet frameworks).

  2. Install nccl-tests for testing EFA setup

$> git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests
$> make MPI=1 NCCL_HOME=/usr/local/cuda-10.0/ MPI_HOME=/opt/amazon/openmpi/ CUDA_HOME=/usr/local/cuda-10.0/
  3. Set up passwordless SSH between all nodes in the cluster

  4. Run nccl-tests across the 2 nodes

Run command:

$> /opt/amazon/openmpi/bin/mpirun \
     -x FI_EFA_TX_MIN_CREDITS=64 \
     -x NCCL_DEBUG=INFO -x NCCL_TREE_THRESHOLD=0 \
     --hostfile ~/hosts -n 16 -N 8 \
     --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
     -x RDMAV_FORK_SAFE=1 \
     -x LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda/efa/lib:$LD_LIBRARY_PATH \
     ~/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

Performance Output: (Complete output is present here: https://gist.github.com/rashikakheria/f870dc9407216a03cf6c34f335d7c6f1)

#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum    171.9    0.00    0.00  2e-07    170.9    0.00    0.00  1e-07
          16             4   float     sum    172.2    0.00    0.00  1e-07    171.5    0.00    0.00  1e-07
          32             8   float     sum    169.6    0.00    0.00  1e-07    172.5    0.00    0.00  1e-07
          64            16   float     sum    171.5    0.00    0.00  1e-07    171.8    0.00    0.00  1e-07
         128            32   float     sum    170.8    0.00    0.00  1e-07    170.4    0.00    0.00  1e-07
         256            64   float     sum    169.6    0.00    0.00  1e-07    169.4    0.00    0.00  1e-07
         512           128   float     sum    168.6    0.00    0.01  1e-07    171.0    0.00    0.01  1e-07
        1024           256   float     sum    171.3    0.01    0.01  2e-07    169.1    0.01    0.01  2e-07
        2048           512   float     sum    171.1    0.01    0.02  2e-07    170.4    0.01    0.02  2e-07
        4096          1024   float     sum    172.0    0.02    0.04  5e-07    172.3    0.02    0.04  5e-07
        8192          2048   float     sum    175.0    0.05    0.09  5e-07    175.0    0.05    0.09  5e-07
       16384          4096   float     sum    181.3    0.09    0.17  5e-07    180.6    0.09    0.17  5e-07
       32768          8192   float     sum    194.6    0.17    0.32  5e-07    194.8    0.17    0.32  5e-07
       65536         16384   float     sum    208.0    0.32    0.59  5e-07    207.5    0.32    0.59  5e-07
      131072         32768   float     sum    228.8    0.57    1.07  5e-07    227.0    0.58    1.08  5e-07
      262144         65536   float     sum    285.2    0.92    1.72  5e-07    284.2    0.92    1.73  5e-07
      524288        131072   float     sum    351.1    1.49    2.80  5e-07    349.5    1.50    2.81  5e-07
     1048576        262144   float     sum    531.1    1.97    3.70  5e-07    532.3    1.97    3.69  5e-07
     2097152        524288   float     sum    949.6    2.21    4.14  5e-07    956.0    2.19    4.11  5e-07
     4194304       1048576   float     sum   1777.2    2.36    4.43  5e-07   1743.0    2.41    4.51  5e-07
     8388608       2097152   float     sum   2198.5    3.82    7.15  5e-07   2223.4    3.77    7.07  5e-07
    16777216       4194304   float     sum   4807.5    3.49    6.54  5e-07   4822.2    3.48    6.52  5e-07
    33554432       8388608   float     sum   9802.3    3.42    6.42  5e-07   9706.8    3.46    6.48  5e-07
    67108864      16777216   float     sum    17647    3.80    7.13  5e-07    17696    3.79    7.11  5e-07
   134217728      33554432   float     sum    34570    3.88    7.28  5e-07    34471    3.89    7.30  5e-07
   268435456      67108864   float     sum    63162    4.25    7.97  5e-07    63065    4.26    7.98  5e-07
   536870912     134217728   float     sum   126035    4.26    7.99  5e-07   126101    4.26    7.98  5e-07
  1073741824     268435456   float     sum   250594    4.28    8.03  5e-07   250473    4.29    8.04  5e-07
  5. Install Horovod in the PyTorch environment
# Go to pytorch + python 3.6 conda environment
$> source activate pytorch_p36

# Install horovod
$> HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
Collecting horovod
.....
Successfully built horovod
Installing collected packages: horovod
Successfully installed horovod-0.19.5
  6. Run the Horovod PyTorch benchmark
$> git clone https://github.com/horovod/horovod.git

Run command:

$> HOROVOD_NUM_NCCL_STREAMS=4 horovodrun \
    -np 16 -H 10.0.137.167:8,10.0.143.56:8 \
    --mpi-args="-x FI_EFA_TX_MIN_CREDITS=64 -x NCCL_DEBUG=INFO -x NCCL_TREE_THRESHOLD=0 -x RDMAV_FORK_SAFE=1 -x LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda/efa/lib:$LD_LIBRARY_PATH" \
    python3 /home/ec2-user/horovod/examples/pytorch_synthetic_benchmark.py
[1,0]<stdout>:Model: resnet50
[1,0]<stdout>:Batch size: 32
[1,0]<stdout>:Number of GPUs: 16
[1,0]<stdout>:Running warmup...
[1,0]<stdout>:Running benchmark...
[1,0]<stdout>:Iter #0: 99.2 img/sec per GPU
[1,0]<stdout>:Iter #1: 98.9 img/sec per GPU
[1,0]<stdout>:Iter #2: 98.6 img/sec per GPU
[1,0]<stdout>:Iter #3: 98.7 img/sec per GPU
[1,0]<stdout>:Iter #4: 98.9 img/sec per GPU
[1,0]<stdout>:Iter #5: 98.8 img/sec per GPU
[1,0]<stdout>:Iter #6: 98.5 img/sec per GPU
[1,0]<stdout>:Iter #7: 98.8 img/sec per GPU
[1,0]<stdout>:Iter #8: 98.8 img/sec per GPU
[1,0]<stdout>:Iter #9: 89.7 img/sec per GPU
[1,0]<stdout>:Img/sec per GPU: 97.9 +-5.3
[1,0]<stdout>:Total img/sec on 16 GPU(s): 1566.4 +-85.5

Thank you very much! I followed your steps and succeeded on Amazon Linux 2.

Great that it worked for you. You can follow similar instructions with an Ubuntu-based AMI. Closing this issue.