[7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument


eric-haibin-lin opened this issue 5 years ago

I'm using Horovod with EFA, and the multi-node job hangs with the following output:

...
[1,26]<stdout>:ip-172-32-36-209:8716:9558 [2] NCCL INFO Ring 00 : 26[2] -> 30[6] via P2P/IPC
[1,25]<stdout>:ip-172-32-36-209:8715:9546 [1] NCCL INFO Ring 00 : 25[1] -> 27[3] via P2P/IPC
[1,29]<stdout>:ip-172-32-36-209:8719:9555 [5] NCCL INFO Ring 00 : 29[5] -> 31[7] via P2P/IPC
[1,28]<stdout>:ip-172-32-36-209:8718:9563 [4] NCCL INFO Ring 00 : 28[4] -> 29[5] via P2P/IPC
[1,27]<stdout>:ip-172-32-36-209:8717:9550 [3] NCCL INFO Ring 00 : 27[3] -> 26[2] via P2P/IPC
[1,7]<stdout>:ip-172-32-38-1:21471:22316 [7] NCCL INFO Ring 00 : 7 -> 8 [send] via NET/AWS Libfabric/0
[1,15]<stdout>:ip-172-32-34-121:70560:71394 [7] NCCL INFO Ring 00 : 15 -> 16 [send] via NET/AWS Libfabric/0
[1,30]<stdout>:ip-172-32-36-209:8720:9567 [6] NCCL INFO Ring 00 : 30[6] -> 28[4] via P2P/IPC
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO Ring 00 : 23 -> 24 [send] via NET/AWS Libfabric/0
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO NET/OFI No NIC info for dev 0
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO include/net.h:24 -> 2
[1,23]<stdout>:
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO include/net.h:27 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO transport/net.cc:357 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:668 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:814 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:950 -> 2
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO Ring 00 : 31 -> 0 [send] via NET/AWS Libfabric/0
...

ubuntu@ip-172-32-38-1:~$ fi_info -p efa
provider: efa
    fabric: EFA-fe80::7e:9eff:fed3:c48a
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::7e:9eff:fed3:c48a
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::7e:9eff:fed3:c48a
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD
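
For readers comparing their own `fi_info -p efa` output with the listing above, here is a minimal Python sketch that groups the output into one record per `provider:` line and lists the domains where the plain `efa` provider offers an `FI_EP_RDM` endpoint. The function names are hypothetical, and using the presence of an `FI_EP_RDM` domain as the check is an assumption for illustration, not something stated in this thread.

```python
def parse_fi_info(text):
    """Group `fi_info` output into one dict per 'provider:' record."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only: fabric names contain more colons.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "provider":
            records.append({"provider": value})
        elif records:
            records[-1][key] = value
    return records


def rdm_domains(records, provider="efa"):
    """Return domains where `provider` itself exposes an FI_EP_RDM endpoint."""
    return [
        r.get("domain")
        for r in records
        if r.get("provider") == provider and r.get("type") == "FI_EP_RDM"
    ]
```

Feeding it the output of `fi_info -p efa` (for example via `subprocess.run(["fi_info", "-p", "efa"], capture_output=True, text=True).stdout`) gives a quick per-host check that can be repeated on every node in the job.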

David Addison · Answer 1 · Sat Sep 14 2019 00:54:38 GMT+0800 (China Standard Time)

Does your security group include a rule that allows all traffic to and from the security group itself, both inbound and outbound?
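
One way to check this programmatically is sketched below in Python. It inspects a single security group dictionary shaped like an entry of the `SecurityGroups` list returned by the EC2 DescribeSecurityGroups API (for example through boto3). The function name is hypothetical, and the sketch assumes that "all traffic" is represented by `IpProtocol` equal to `"-1"` and that a self-referencing rule lists the group's own ID under `UserIdGroupPairs`.

```python
def allows_all_traffic_from_self(group):
    """True if `group` has self-referencing all-traffic rules in both directions."""
    group_id = group["GroupId"]

    def has_self_rule(permissions):
        for perm in permissions:
            # "-1" is how the EC2 API encodes "all protocols".
            if perm.get("IpProtocol") != "-1":
                continue
            for pair in perm.get("UserIdGroupPairs", []):
                if pair.get("GroupId") == group_id:
                    return True
        return False

    inbound_ok = has_self_rule(group.get("IpPermissions", []))
    outbound_ok = has_self_rule(group.get("IpPermissionsEgress", []))
    return inbound_ok and outbound_ok
```

With boto3 installed and credentials configured, the input could come from `boto3.client("ec2").describe_security_groups(GroupIds=[...])["SecurityGroups"][0]`, where the group ID is whatever group the instances use.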

Rashika Kheria · Answer 2 · Sat Sep 14 2019 01:34:40 GMT+0800 (China Standard Time)

In addition to what David asked, could you provide the following information as well?

Complete log of your run.
EFA installer version. You can find this by running:

cat /opt/amazon/efa_installed_packages

Which commit of aws-ofi-nccl are you using to run the tests? Is it the latest commit on the aws branch?

Haibin Lin · Answer 3 · Sat Sep 14 2019 03:07:54 GMT+0800 (China Standard Time)

Thanks for the quick reply.

Does your security group include a rule that allows all traffic to and from the security group itself, both inbound and outbound?

Yes.

Complete log of your run.

efa.log

EFA installer version. You can find this using

# EFA installer version: 1.4.1
# Debug packages installed: no
# Packages installed:
efa_1.3.0-1.amzn1_amd64 libfabric1_1.8.0amzn1.0_amd64 libfabric-bin_1.8.0amzn1.0_amd64 libfabric-dev_1.8.0amzn1.0_amd64 openmpi_3.1.4-2_amd64

I was using Horovod with my model training code. I tried to compile nccl-tests, but the linker complained about some missing .so files.
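
When comparing installations across several nodes, the contents of `/opt/amazon/efa_installed_packages` can be reduced to a small structure. The Python sketch below assumes the file looks like the listing above (comment lines starting with `#`, then whitespace-separated package names); other installer versions may use a different layout, and the function name is hypothetical.

```python
def parse_efa_installed_packages(text):
    """Extract the installer version and package list from the file contents."""
    info = {"installer_version": None, "packages": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            body = line.lstrip("#").strip()
            if body.startswith("EFA installer version:"):
                info["installer_version"] = body.split(":", 1)[1].strip()
            continue
        # Non-comment lines hold one or more package names.
        info["packages"].extend(line.split())
    return info
```

Running this on each host and diffing the results is a quick way to spot a node with a different installer version or a missing libfabric package.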

Rashika Kheria · Answer 4 · Sat Sep 14 2019 07:12:45 GMT+0800 (China Standard Time)

Could you confirm the commit that you are using for aws-ofi-nccl? Please use the latest aws branch of the plugin when running on EC2 infrastructure.

Also, does host "ip-172-32-36-209" have EFA installed?

Lin Yuan · Answer 5 · Sat Sep 28 2019 13:54:34 GMT+0800 (China Standard Time)

I got the same error while running nccl-tests:

~/anaconda3/bin/mpirun \
        -x FI_PROVIDER="efa" \
        -x FI_EFA_TX_MIN_CREDITS=64 \
        -x LD_LIBRARY_PATH=$HOME/drivers/aws-ofi-nccl/install/lib/:$HOME/drivers/nccl/build/lib:/usr/local/cuda-10.0/lib64:/opt/amazon/efa/lib64:$LD_LIBRARY_PATH \
        -x NCCL_DEBUG=WARN -x NCCL_TREE_THRESHOLD=0 --hostfile $HOME/hosts -n 16 -N 8 \
        --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
        $HOME/drivers/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

Running on a single node works fine. The error only occurs when I run on multiple nodes.
I am following the quip doc "Running nccl-tests on AWS EC2" and got the following error:

[ec2-user@ip-172-31-10-20 nccl-tests]$ ./run-test.sh  | tee test.log
# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
#   Rank  0 Pid  23389 on ip-172-31-10-20 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid  23390 on ip-172-31-10-20 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid  23391 on ip-172-31-10-20 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid  23392 on ip-172-31-10-20 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank  4 Pid  23393 on ip-172-31-10-20 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank  5 Pid  23394 on ip-172-31-10-20 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank  6 Pid  23395 on ip-172-31-10-20 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank  7 Pid  23396 on ip-172-31-10-20 device  7 [0x00] Tesla V100-SXM2-32GB
#   Rank  8 Pid  10508 on ip-172-31-1-59 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  9 Pid  10509 on ip-172-31-1-59 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank 10 Pid  10510 on ip-172-31-1-59 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank 11 Pid  10511 on ip-172-31-1-59 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank 12 Pid  10512 on ip-172-31-1-59 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank 13 Pid  10513 on ip-172-31-1-59 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank 14 Pid  10514 on ip-172-31-1-59 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank 15 Pid  10515 on ip-172-31-1-59 device  7 [0x00] Tesla V100-SXM2-32GB
NCCL version 2.4.6+cuda10.0

ip-172-31-1-59:10508:10581 [0] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'

ip-172-31-1-59:10515:10584 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[42896,1],8]
  Exit code:    3
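
To see at a glance which hosts report the failure, the `NCCL WARN` lines in logs like the ones in this thread can be extracted with a short Python sketch. The regular expression is written against the line format shown here (with or without the `[1,23]<stdout>:` prefix that mpirun adds) and is an assumption about that format, not part of NCCL or the plugin. Mapping `RC: -22` to `EINVAL` assumes the return code is a negated errno value, which matches the "Invalid argument" text in the log.

```python
import errno
import re

# host:pid:tid [gpu] function:line NCCL WARN message [RC: n, ERROR: text]
WARN_RE = re.compile(
    r"(?P<host>[\w.-]+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<gpu>\d+)\] "
    r"(?P<func>\w+):(?P<line>\d+) NCCL WARN (?P<msg>.*?)"
    r"(?: RC: (?P<rc>-?\d+), ERROR: (?P<err>.*))?$"
)


def find_nccl_warnings(log_text):
    """Return one dict per NCCL WARN line found in `log_text`."""
    results = []
    for line in log_text.splitlines():
        m = WARN_RE.search(line)
        if m is None:
            continue
        rc = int(m.group("rc")) if m.group("rc") is not None else None
        results.append({
            "host": m.group("host"),
            "gpu": int(m.group("gpu")),
            "function": m.group("func"),
            "message": m.group("msg"),
            "rc": rc,
            # Assumption: rc is a negated errno, so -22 maps to EINVAL.
            "errno_name": errno.errorcode.get(-rc) if rc else None,
        })
    return results
```

Collecting the `host` field from the results shows whether the warning comes from every node or only from specific ones.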

Rashika Kheria · Answer 6 · Wed Oct 02 2019 02:02:00 GMT+0800 (China Standard Time)

This happens if you are using the master branch. Please use the aws branch when working with EFA.
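
A quick way to confirm which branch a local aws-ofi-nccl checkout is on, without invoking git, is to read its `.git/HEAD` file. The Python sketch below assumes an ordinary clone where `.git` is a directory (not a worktree or submodule); the function name and any path passed to it are hypothetical.

```python
from pathlib import Path


def checked_out_branch(repo_dir):
    """Return the checked-out branch name, or None for a detached HEAD."""
    head = (Path(repo_dir) / ".git" / "HEAD").read_text().strip()
    prefix = "ref: refs/heads/"
    if head.startswith(prefix):
        return head[len(prefix):]
    # HEAD holds a raw commit hash when no branch is checked out.
    return None
```

For example, `checked_out_branch(Path.home() / "drivers" / "aws-ofi-nccl")` would be expected to return `"aws"` for a checkout of the recommended branch, assuming the plugin source lives at that location.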

Rashika Kheria · Answer 7 · Sat Jan 11 2020 04:10:40 GMT+0800 (China Standard Time)

Please re-open if you see the issue again.