[7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
eric-haibin-lin opened this issue · comments
I'm using Horovod with EFA, and the multi-node job hangs with the following output:
...
[1,26]<stdout>:ip-172-32-36-209:8716:9558 [2] NCCL INFO Ring 00 : 26[2] -> 30[6] via P2P/IPC
[1,25]<stdout>:ip-172-32-36-209:8715:9546 [1] NCCL INFO Ring 00 : 25[1] -> 27[3] via P2P/IPC
[1,29]<stdout>:ip-172-32-36-209:8719:9555 [5] NCCL INFO Ring 00 : 29[5] -> 31[7] via P2P/IPC
[1,28]<stdout>:ip-172-32-36-209:8718:9563 [4] NCCL INFO Ring 00 : 28[4] -> 29[5] via P2P/IPC
[1,27]<stdout>:ip-172-32-36-209:8717:9550 [3] NCCL INFO Ring 00 : 27[3] -> 26[2] via P2P/IPC
[1,7]<stdout>:ip-172-32-38-1:21471:22316 [7] NCCL INFO Ring 00 : 7 -> 8 [send] via NET/AWS Libfabric/0
[1,15]<stdout>:ip-172-32-34-121:70560:71394 [7] NCCL INFO Ring 00 : 15 -> 16 [send] via NET/AWS Libfabric/0
[1,30]<stdout>:ip-172-32-36-209:8720:9567 [6] NCCL INFO Ring 00 : 30[6] -> 28[4] via P2P/IPC
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO Ring 00 : 23 -> 24 [send] via NET/AWS Libfabric/0
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO NET/OFI No NIC info for dev 0
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO include/net.h:24 -> 2
[1,23]<stdout>:
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO include/net.h:27 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO transport/net.cc:357 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:668 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:814 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:950 -> 2
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO Ring 00 : 31 -> 0 [send] via NET/AWS Libfabric/0
...
ubuntu@ip-172-32-38-1:~$ fi_info -p efa
provider: efa
fabric: EFA-fe80::7e:9eff:fed3:c48a
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::7e:9eff:fed3:c48a
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::7e:9eff:fed3:c48a
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
Does your security group include a rule for All Traffic for itself Inbound & Outbound ?
In addition to what David asked, could you provide the following information as well?
- Complete log of your run.
- EFA installer version. You can find this using
cat /opt/amazon/efa_installed_packages
- Which commit of aws-ofi-nccl are you using to run the tests? Is that the latest commit on the aws branch?
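For anyone gathering the information requested above, a minimal sketch that prints it as one report could look like this. The function name collect_efa_diagnostics and the default checkout path $HOME/drivers/aws-ofi-nccl are assumptions for illustration (the path is guessed from the install prefix used in the mpirun command in this thread); only the summary file path and fi_info -p efa come from the thread itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the EFA installer summary, the libfabric EFA
# provider info, and the aws-ofi-nccl commit in one report.
collect_efa_diagnostics() {
    # Assumed location of the plugin source checkout; pass your own as $1.
    local plugin_src="${1:-$HOME/drivers/aws-ofi-nccl}"
    local summary=/opt/amazon/efa_installed_packages

    echo "== EFA installer summary =="
    if [ -r "$summary" ]; then
        cat "$summary"
    else
        echo "not found: $summary"
    fi

    echo "== libfabric EFA provider =="
    if command -v fi_info >/dev/null 2>&1; then
        fi_info -p efa || echo "fi_info did not report an efa provider"
    else
        echo "fi_info not on PATH"
    fi

    echo "== aws-ofi-nccl commit =="
    if git -C "$plugin_src" rev-parse --git-dir >/dev/null 2>&1; then
        # Commit hash followed by any branch or tag names pointing at it.
        git -C "$plugin_src" log -1 --format='%H %D'
    else
        echo "not a git checkout: $plugin_src"
    fi
}
```

Running collect_efa_diagnostics on each node and attaching the output alongside the complete run log should cover the three items in the list.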
Thanks for the quick reply.
> Does your security group include a rule for All Traffic for itself Inbound & Outbound ?
Yes.
> Complete log of your run.
> EFA installer version. You can find this using
# EFA installer version: 1.4.1
# Debug packages installed: no
# Packages installed:
efa_1.3.0-1.amzn1_amd64 libfabric1_1.8.0amzn1.0_amd64 libfabric-bin_1.8.0amzn1.0_amd64 libfabric-dev_1.8.0amzn1.0_amd64 openmpi_3.1.4-2_amd64
I was using Horovod with my model training code. I tried to compile nccl-tests, but the linker complained about some missing .so files.
Could you confirm which commit of aws-ofi-nccl you are using? Please use the latest aws branch of the plugin when running on EC2 infrastructure.
Also, does host "ip-172-32-36-209" have EFA installed?
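One way to answer the "is EFA installed on that host" question for every node at once is to run fi_info -p efa over ssh for each entry in the hostfile. This is only a sketch: the function names hostfile_hosts and check_efa_hosts are made up for illustration, and it assumes passwordless ssh between the nodes and an Open MPI style hostfile whose first field on each line is the host name.

```shell
#!/usr/bin/env bash
# List host names from an Open MPI style hostfile (first field of each line),
# skipping blank lines and comment lines.
hostfile_hosts() {
    awk '!/^[[:space:]]*(#|$)/ { print $1 }' "$1"
}

# Report, per host, whether libfabric exposes the efa provider there.
# Assumes passwordless ssh; a host that cannot be reached is reported
# the same way as a host without the provider.
check_efa_hosts() {
    local host
    for host in $(hostfile_hosts "$1"); do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" \
               fi_info -p efa >/dev/null 2>&1; then
            echo "$host: efa provider found"
        else
            echo "$host: efa provider NOT found (or host unreachable)"
        fi
    done
}
```

For example, check_efa_hosts "$HOME/hosts" would walk the same hostfile that is passed to mpirun.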
I got the same error while running nccl-tests:
~/anaconda3/bin/mpirun \
-x FI_PROVIDER="efa" \
-x FI_EFA_TX_MIN_CREDITS=64 \
-x LD_LIBRARY_PATH=$HOME/drivers/aws-ofi-nccl/install/lib/:$HOME/drivers/nccl/build/lib:/usr/local/cuda-10.0/lib64:/opt/amazon/efa/lib64:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=WARN -x NCCL_TREE_THRESHOLD=0 --hostfile $HOME/hosts -n 16 -N 8 \
--mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
$HOME/drivers/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
Running on a single node is fine. The error only occurs when I run on multiple nodes.
I am actually following the quip doc "Running nccl-tests on AWS EC2" and got the following error:
[ec2-user@ip-172-31-10-20 nccl-tests]$ ./run-test.sh | tee test.log
# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
# Rank 0 Pid 23389 on ip-172-31-10-20 device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 1 Pid 23390 on ip-172-31-10-20 device 1 [0x00] Tesla V100-SXM2-32GB
# Rank 2 Pid 23391 on ip-172-31-10-20 device 2 [0x00] Tesla V100-SXM2-32GB
# Rank 3 Pid 23392 on ip-172-31-10-20 device 3 [0x00] Tesla V100-SXM2-32GB
# Rank 4 Pid 23393 on ip-172-31-10-20 device 4 [0x00] Tesla V100-SXM2-32GB
# Rank 5 Pid 23394 on ip-172-31-10-20 device 5 [0x00] Tesla V100-SXM2-32GB
# Rank 6 Pid 23395 on ip-172-31-10-20 device 6 [0x00] Tesla V100-SXM2-32GB
# Rank 7 Pid 23396 on ip-172-31-10-20 device 7 [0x00] Tesla V100-SXM2-32GB
# Rank 8 Pid 10508 on ip-172-31-1-59 device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 9 Pid 10509 on ip-172-31-1-59 device 1 [0x00] Tesla V100-SXM2-32GB
# Rank 10 Pid 10510 on ip-172-31-1-59 device 2 [0x00] Tesla V100-SXM2-32GB
# Rank 11 Pid 10511 on ip-172-31-1-59 device 3 [0x00] Tesla V100-SXM2-32GB
# Rank 12 Pid 10512 on ip-172-31-1-59 device 4 [0x00] Tesla V100-SXM2-32GB
# Rank 13 Pid 10513 on ip-172-31-1-59 device 5 [0x00] Tesla V100-SXM2-32GB
# Rank 14 Pid 10514 on ip-172-31-1-59 device 6 [0x00] Tesla V100-SXM2-32GB
# Rank 15 Pid 10515 on ip-172-31-1-59 device 7 [0x00] Tesla V100-SXM2-32GB
NCCL version 2.4.6+cuda10.0
ip-172-31-1-59:10508:10581 [0] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'
ip-172-31-1-59:10515:10584 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[42896,1],8]
Exit code: 3
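With 16 or 32 ranks interleaved in one log, it can help to see which hosts actually emitted an NCCL WARN line. A rough filter, assuming log lines shaped like the ones in this thread (an optional mpirun tag such as [1,23]<stdout>: followed by host:pid:tid), might be the following. The function name nccl_warn_hosts is hypothetical.

```shell
#!/usr/bin/env bash
# Read an NCCL log on stdin and print "<host> <count>" for every host that
# emitted at least one "NCCL WARN" line, sorted by host name.
nccl_warn_hosts() {
    grep 'NCCL WARN' \
      | sed -E 's/^\[[0-9]+,[0-9]+\]<std(out|err)>://' \
      | awk -F: '{ count[$1]++ } END { for (h in count) print h, count[h] }' \
      | sort
}
```

Usage would be something like nccl_warn_hosts < test.log, where test.log is the file written by tee in the run above.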
This would happen if you are using the master branch. Please use the aws branch when working with EFA.
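To confirm which branch a local aws-ofi-nccl checkout is on before rebuilding, something along these lines should work. It is a sketch under assumptions: the function names plugin_branch and require_aws_branch are invented for illustration, and the checkout path is whatever directory you cloned the plugin into.

```shell
#!/usr/bin/env bash
# Print the branch currently checked out in the given git working tree.
plugin_branch() {
    git -C "$1" rev-parse --abbrev-ref HEAD
}

# Succeed only if the checkout is on the aws branch; otherwise explain why.
require_aws_branch() {
    local src="$1" branch
    branch="$(plugin_branch "$src" 2>/dev/null)" || {
        echo "not a git checkout: $src"
        return 2
    }
    if [ "$branch" = "aws" ]; then
        echo "ok: $src is on the aws branch"
    else
        echo "$src is on '$branch'; switch with: git -C $src checkout aws"
        return 1
    fi
}
```

After switching branches the plugin has to be rebuilt and reinstalled (see its README for the build steps), so that the library found through LD_LIBRARY_PATH is the one built from the aws branch.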
Please re-open if you see the issue again.