try horovod: create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory
handar423 opened this issue
Hello, I'm trying to test Horovod with EFA + NCCL, but it gets stuck when running on multiple nodes. I think the main error is: create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory.
[1,3]<stdout>:ip-172-31-6-189:153:691 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[1,5]<stdout>:ip-172-31-6-189:155:696 [5] NCCL INFO Ring 01 : 5[5] -> 6[6] via P2P/IPC
[1,3]<stdout>:ip-172-31-6-189:153:691 [3] NCCL INFO Ring 01 : 3 -> 10 [send] via NET/AWS Libfabric/1
[1,6]<stdout>:ip-172-31-6-189:156:695 [6] NCCL INFO Ring 01 : 6[6] -> 4[4] via P2P/IPC
[1,7]<stdout>:ip-172-31-6-189:157:687 [7] NCCL INFO Ring 01 : 7[7] -> 3[3] via P2P/IPC
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO Ring 01 : 4[4] -> 7[7] via P2P/IPC
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO Ring 02 : 13 -> 4 [receive] via NET/AWS Libfabric/2
[1,6]<stdout>:ip-172-31-6-189:156:695 [6] NCCL INFO Ring 02 : 6[6] -> 7[7] via P2P/IPC
[1,7]<stdout>:ip-172-31-6-189:157:687 [7] NCCL INFO Ring 02 : 7[7] -> 5[5] via P2P/IPC
[1,4]<stdout>:
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO include/net.h:21 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO transport/net.cc:334 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:340 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:650 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:815 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:951 -> 2
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO Ring 01 : 12[4] -> 15[7] via P2P/IPC
[1,14]<stdout>:ip-172-31-3-127:43:574 [6] NCCL INFO Ring 01 : 14[6] -> 12[4] via P2P/IPC
[1,11]<stdout>:ip-172-31-3-127:40:580 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[1,11]<stdout>:ip-172-31-3-127:40:580 [3] NCCL INFO Ring 01 : 11 -> 2 [send] via NET/AWS Libfabric/1
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO Ring 02 : 5 -> 12 [receive] via NET/AWS Libfabric/2
[1,13]<stdout>:ip-172-31-3-127:42:576 [5] NCCL INFO Ring 01 : 13[5] -> 14[6] via P2P/IPC
[1,12]<stdout>:
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO include/net.h:21 -> 2
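Since endpoint creation fails with ENOMEM, one thing worth checking first (this is an assumption about a common cause, not something confirmed by the logs) is the resource limits on every node:

```shell
# RDMA endpoints pin (lock) memory, so a low memlock limit can surface
# as "Cannot allocate memory" when enabling an endpoint. On each node:
ulimit -l   # locked-memory limit; ideally "unlimited"
ulimit -n   # open-file limit; each endpoint/queue pair consumes fds

# If memlock is low, raising it in /etc/security/limits.conf may help:
#   * soft memlock unlimited
#   * hard memlock unlimited
```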
Some information that may be helpful:
I am using EFA installer 1.5.1, and fi_info -p efa works:
provider: efa
fabric: EFA-fe80::4a:78ff:fef8:fa03
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::4a:78ff:fef8:fa03
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::4a:78ff:fef8:fa03
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
EFA installer version: 1.5.1
# Debug packages installed: yes
# Packages installed:
ibacm_25.0-1_amd64 ibverbs-providers_25.0-1_amd64 ibverbs-utils_25.0-1_amd64 infiniband-diags_25.0-1_amd64 libfabric-bin_1.8.0amzn1.0_amd64 libfabric-dev_1.8.0amzn1.0_amd64 libfabric1_1.8.0amzn1.0_amd64 libibmad-dev_25.0-1_amd64 libibmad5_25.0-1_amd64 libibnetdisc-dev_25.0-1_amd64 libibnetdisc5_25.0-1_amd64 libibumad-dev_25.0-1_amd64 libibumad3_25.0-1_amd64 libibverbs-dev_25.0-1_amd64 libibverbs1_25.0-1_amd64 librdmacm-dev_25.0-1_amd64 librdmacm1_25.0-1_amd64 openmpi_3.1.4-2_amd64 rdma-core_25.0-1_amd64 rdmacm-utils_25.0-1_amd64 libfabric1-dbg_1.8.0amzn1.0_amd64 libibmad5-dbg_25.0-1_amd64 libibnetdisc5-dbg_25.0-1_amd64 libibumad3-dbg_25.0-1_amd64 libibverbs1-dbg_25.0-1_amd64 librdmacm1-dbg_25.0-1_amd64
I also tested the nccl-tests all_reduce_perf benchmark, and it works as well. To run it:
curl http://169.254.169.254/latest/meta-data/local-ipv4 >> my-hosts && \
/opt/amazon/openmpi/bin/mpirun \
    -x FI_PROVIDER=efa \
    -x FI_EFA_TX_MIN_CREDITS=64 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_TREE_THRESHOLD=0 \
    --hostfile my-hosts -n 8 -N 8 \
    --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
    /opt/build/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
I get the following output:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.5.1_nccl-efa-test.log
For Horovod, my command is:
NCCL_DEBUG=INFO \
HOROVOD_NUM_NCCL_STREAMS=4 \
horovodrun -np 16 -H localhost:8,172.31.3.127:8 \
    --mpi-args="-x PATH -x LD_LIBRARY_PATH -x FI_PROVIDER=efa -x FI_EFA_TX_MIN_CREDITS=64 -x NCCL_TREE_THRESHOLD=0" \
    python3 /home/cluster/distributed-training/test_scripts/pytorch_synthetic_benchmark.py --model resnet101 --batch-size 32 |& grep -v "Read -1"
The complete log:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.5.1_horovod-test.log
PS: In fact, I would prefer to use EFA 1.8.3 (to keep the same test environment), but I get more errors with that version:
[1,0]<stderr>:terminate called after throwing an instance of 'std::system_error'
[1,0]<stderr>: what(): Resource deadlock avoided
[1,0]<stderr>:[ip-172-31-6-189:00789] *** Process received signal ***
[1,0]<stderr>:[ip-172-31-6-189:00789] Signal: Aborted (6)
[1,0]<stderr>:[ip-172-31-6-189:00789] Signal code: (-6)
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 0] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f17f5b34f20]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 1] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f17f5b34e97]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 2] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f17f5b36801]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 3] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7f17f0d40957]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 4] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ab6)[0x7f17f0d46ab6]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 5] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x91b19)[0x7f17f0d45b19]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 6] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a8)[0x7f17f0d46488]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 7] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10613)[0x7f17f0aac613]
[1,0]<stderr>:[ip-172-31-6-189:00789] [1,0]<stderr>:[ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x2b1)[0x7f17f0aacb71]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 9] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x37)[0x7f17f0d46d17]
[1,0]<stderr>:[ip-172-31-6-189:00789] [10] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8ea19)[0x7f17f0d42a19]
[1,0]<stderr>:[ip-172-31-6-189:00789] [11] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd8dc)[0x7f17f0d718dc]
[1,0]<stderr>:[ip-172-31-6-189:00789] [12] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common18HorovodGlobalStateD1Ev+0xaa8)[0x7f17c47176b8]
[1,0]<stderr>:[ip-172-31-6-189:00789] [13] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x43041)[0x7f17f5b39041]
[1,0]<stderr>:[ip-172-31-6-189:00789] [14] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x4313a)[0x7f17f5b3913a]
[1,0]<stderr>:[ip-172-31-6-189:00789] [15] /opt/amazon/efa/lib/libfabric.so.1(+0x5ebbf)[0x7f16dbd7ebbf]
[1,0]<stderr>:[ip-172-31-6-189:00789] [1,0]<stderr>:[16] /opt/amazon/efa/lib/libfabric.so.1(+0xf3f2)[0x7f16dbd2f3f2]
[1,0]<stderr>:[ip-172-31-6-189:00789] [17] /opt/amazon/efa/lib/libfabric.so.1(fi_getinfo+0x45d)[0x7f16dbd2fa9d]
[1,0]<stderr>:[ip-172-31-6-189:00789] [18] /usr/local/lib/libnccl-net.so(+0x2045)[0x7f16e41cc045]
[1,0]<stderr>:[ip-172-31-6-189:00789] [1,0]<stderr>:[19] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0xf1065)[0x7f17c47a2065]
[1,0]<stderr>:[ip-172-31-6-189:00789] [20] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0xf1b4f)[0x7f17c47a2b4f]
[1,0]<stderr>:[ip-172-31-6-189:00789] [21] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x245)[0x7f17c47633a5]
[1,0]<stderr>:[ip-172-31-6-189:00789] [22] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x54)[0x7f17c47634d4]
[1,0]<stderr>:[ip-172-31-6-189:00789] [23] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x71)[0x7f17c472e061]
[1,0]<stderr>:[ip-172-31-6-189:00789] [24] /usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0xa1)[0x7f17c472e371]
[1,0]<stderr>:[ip-172-31-6-189:00789] [25] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0x5e440)[0x7f17c470f440]
[1,0]<stderr>:[ip-172-31-6-189:00789] [26] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(+0xaecf)[0x7f17f2000ecf]
[1,0]<stderr>:[ip-172-31-6-189:00789] [27] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f17f58de6db]
[1,0]<stderr>:[ip-172-31-6-189:00789] [28] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f17f5c1788f]
[1,0]<stderr>:[ip-172-31-6-189:00789] *** End of error message ***
The complete log:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.8.3_horovod-test.log
The nccl all_reduce_perf test also works on this version:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.8.3_nccl-efa-test.log
Output of fi_info -p efa and cat /opt/amazon/efa_installed_packages:
provider: efa
fabric: EFA-fe80::4a:78ff:fef8:fa03
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::4a:78ff:fef8:fa03
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::4a:78ff:fef8:fa03
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
SHM transfer will be disabled because of ptrace protection.
To enable SHM transfer, please refer to the man page fi_efa.7 for more information.
Also note that turning off ptrace protection has security implications. If you cannot
turn it off, you can suppress this message by setting FI_EFA_ENABLE_SHM_TRANSFER=0
EFA installer version: 1.8.3
# Debug packages installed: yes
# Packages installed:
ibacm_25.0-1_amd64 ibverbs-providers_25.0-1_amd64 ibverbs-utils_25.0-1_amd64 infiniband-diags_25.0-1_amd64 libfabric-aws-bin_1.9.0amzn1.1_amd64 libfabric-aws-dev_1.9.0amzn1.1_amd64 libfabric1-aws_1.9.0amzn1.1_amd64 libibmad-dev_25.0-1_amd64 libibmad5_25.0-1_amd64 libibnetdisc-dev_25.0-1_amd64 libibnetdisc5_25.0-1_amd64 libibumad-dev_25.0-1_amd64 libibumad3_25.0-1_amd64 libibverbs-dev_25.0-1_amd64 libibverbs1_25.0-1_amd64 librdmacm-dev_25.0-1_amd64 librdmacm1_25.0-1_amd64 openmpi40-aws_4.0.2-1_amd64 rdma-core_25.0-1_amd64 rdmacm-utils_25.0-1_amd64 libfabric1-aws-dbg_1.9.0amzn1.1_amd64 libibmad5-dbg_25.0-1_amd64 libibnetdisc5-dbg_25.0-1_amd64 libibumad3-dbg_25.0-1_amd64 libibverbs1-dbg_25.0-1_amd64 librdmacm1-dbg_25.0-1_amd64
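(As an aside, the SHM warning in the fi_info output above comes from kernel ptrace protection. A quick way to inspect it, assuming a Yama-enabled kernel:)

```shell
# Yama ptrace scope: 0 = unrestricted (SHM transfer possible),
# 1 or higher = restricted (libfabric disables SHM transfer)
cat /proc/sys/kernel/yama/ptrace_scope

# To suppress the warning without touching ptrace settings, as the
# message itself suggests:
export FI_EFA_ENABLE_SHM_TRANSFER=0
```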
Could someone please point me in the right direction? Thank you very much!
@handar423 I see that you ran nccl-tests only on a single node. Have you tried running it on a multi-node setup? That will help us verify whether EFA is working between the two instances. Which AMI are you using to set this up?
I tried to set up this benchmark and was able to run it successfully. Here are my detailed steps and outputs for your reference:
Installation instructions:
- Use the Deep Learning AMI (Amazon Linux 2) Version 32.0. It comes with NCCL, the aws-ofi-nccl plugin, and EFA installer v1.9.3 pre-installed, along with the TensorFlow, PyTorch, and MXNet frameworks.
- Install nccl-tests for testing the EFA setup:
$> git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests
$> make MPI=1 NCCL_HOME=/usr/local/cuda-10.0/ MPI_HOME=/opt/amazon/openmpi/ CUDA_HOME=/usr/local/cuda-10.0/
- Set up a passwordless SSH connection between all nodes in the cluster.
- Run nccl-tests across 2 nodes.
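The passwordless-SSH step can be done along these lines (a sketch; the user and host are placeholders for your own nodes):

```shell
# Generate a key on the primary node if one does not already exist
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

# Copy the public key to every other node (placeholder address)
ssh-copy-id ec2-user@10.0.143.56

# Verify that non-interactive login works
ssh -o BatchMode=yes ec2-user@10.0.143.56 hostname
```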
Run command:
$> /opt/amazon/openmpi/bin/mpirun \
-x FI_EFA_TX_MIN_CREDITS=64 \
-x NCCL_DEBUG=INFO -x NCCL_TREE_THRESHOLD=0 \
--hostfile ~/hosts -n 16 -N 8 \
--mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
-x RDMAV_FORK_SAFE=1 \
-x LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda/efa/lib:$LD_LIBRARY_PATH \
~/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
Performance Output: (Complete output is present here: https://gist.github.com/rashikakheria/f870dc9407216a03cf6c34f335d7c6f1)
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 171.9 0.00 0.00 2e-07 170.9 0.00 0.00 1e-07
16 4 float sum 172.2 0.00 0.00 1e-07 171.5 0.00 0.00 1e-07
32 8 float sum 169.6 0.00 0.00 1e-07 172.5 0.00 0.00 1e-07
64 16 float sum 171.5 0.00 0.00 1e-07 171.8 0.00 0.00 1e-07
128 32 float sum 170.8 0.00 0.00 1e-07 170.4 0.00 0.00 1e-07
256 64 float sum 169.6 0.00 0.00 1e-07 169.4 0.00 0.00 1e-07
512 128 float sum 168.6 0.00 0.01 1e-07 171.0 0.00 0.01 1e-07
1024 256 float sum 171.3 0.01 0.01 2e-07 169.1 0.01 0.01 2e-07
2048 512 float sum 171.1 0.01 0.02 2e-07 170.4 0.01 0.02 2e-07
4096 1024 float sum 172.0 0.02 0.04 5e-07 172.3 0.02 0.04 5e-07
8192 2048 float sum 175.0 0.05 0.09 5e-07 175.0 0.05 0.09 5e-07
16384 4096 float sum 181.3 0.09 0.17 5e-07 180.6 0.09 0.17 5e-07
32768 8192 float sum 194.6 0.17 0.32 5e-07 194.8 0.17 0.32 5e-07
65536 16384 float sum 208.0 0.32 0.59 5e-07 207.5 0.32 0.59 5e-07
131072 32768 float sum 228.8 0.57 1.07 5e-07 227.0 0.58 1.08 5e-07
262144 65536 float sum 285.2 0.92 1.72 5e-07 284.2 0.92 1.73 5e-07
524288 131072 float sum 351.1 1.49 2.80 5e-07 349.5 1.50 2.81 5e-07
1048576 262144 float sum 531.1 1.97 3.70 5e-07 532.3 1.97 3.69 5e-07
2097152 524288 float sum 949.6 2.21 4.14 5e-07 956.0 2.19 4.11 5e-07
4194304 1048576 float sum 1777.2 2.36 4.43 5e-07 1743.0 2.41 4.51 5e-07
8388608 2097152 float sum 2198.5 3.82 7.15 5e-07 2223.4 3.77 7.07 5e-07
16777216 4194304 float sum 4807.5 3.49 6.54 5e-07 4822.2 3.48 6.52 5e-07
33554432 8388608 float sum 9802.3 3.42 6.42 5e-07 9706.8 3.46 6.48 5e-07
67108864 16777216 float sum 17647 3.80 7.13 5e-07 17696 3.79 7.11 5e-07
134217728 33554432 float sum 34570 3.88 7.28 5e-07 34471 3.89 7.30 5e-07
268435456 67108864 float sum 63162 4.25 7.97 5e-07 63065 4.26 7.98 5e-07
536870912 134217728 float sum 126035 4.26 7.99 5e-07 126101 4.26 7.98 5e-07
1073741824 268435456 float sum 250594 4.28 8.03 5e-07 250473 4.29 8.04 5e-07
- Install Horovod in the PyTorch environment:
# Activate the PyTorch + Python 3.6 conda environment
$> source activate pytorch_p36
# Install Horovod with NCCL GPU operations
$> HOROVOD_GPU_OPERATIONS=NCCL pip install horovod
Collecting horovod
.....
Successfully built horovod
Installing collected packages: horovod
Successfully installed horovod-0.19.5
- Run the Horovod PyTorch benchmark:
$> git clone https://github.com/horovod/horovod.git
Run command:
$> HOROVOD_NUM_NCCL_STREAMS=4 horovodrun \
-np 16 -H 10.0.137.167:8,10.0.143.56:8 \
--mpi-args="-x FI_EFA_TX_MIN_CREDITS=64 -x NCCL_DEBUG=INFO -x NCCL_TREE_THRESHOLD=0 -x RDMAV_FORK_SAFE=1 -x LD_LIBRARY_PATH=/opt/amazon/openmpi/lib64:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/cuda/efa/lib:$LD_LIBRARY_PATH" \
python3 /home/ec2-user/horovod/examples/pytorch_synthetic_benchmark.py
[1,0]<stdout>:Model: resnet50
[1,0]<stdout>:Batch size: 32
[1,0]<stdout>:Number of GPUs: 16
[1,0]<stdout>:Running warmup...
[1,0]<stdout>:Running benchmark...
[1,0]<stdout>:Iter #0: 99.2 img/sec per GPU
[1,0]<stdout>:Iter #1: 98.9 img/sec per GPU
[1,0]<stdout>:Iter #2: 98.6 img/sec per GPU
[1,0]<stdout>:Iter #3: 98.7 img/sec per GPU
[1,0]<stdout>:Iter #4: 98.9 img/sec per GPU
[1,0]<stdout>:Iter #5: 98.8 img/sec per GPU
[1,0]<stdout>:Iter #6: 98.5 img/sec per GPU
[1,0]<stdout>:Iter #7: 98.8 img/sec per GPU
[1,0]<stdout>:Iter #8: 98.8 img/sec per GPU
[1,0]<stdout>:Iter #9: 89.7 img/sec per GPU
[1,0]<stdout>:Img/sec per GPU: 97.9 +-5.3
[1,0]<stdout>:Total img/sec on 16 GPU(s): 1566.4 +-85.5
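As a sanity check on the numbers above, the reported cluster total is just the per-GPU rate multiplied by the GPU count:

```shell
# ~97.9 img/sec per GPU across 16 GPUs gives the reported total
python3 -c 'print(97.9 * 16)'   # prints 1566.4
```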
Thank you very much! I followed your steps and succeeded on Amazon Linux 2.
Great, glad it worked for you. You can follow similar instructions with an Ubuntu-based AMI. Closing this issue.