aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.

[7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument

eric-haibin-lin opened this issue

I'm using Horovod with EFA, and the multi-node job hangs with the following output:

...
[1,26]<stdout>:ip-172-32-36-209:8716:9558 [2] NCCL INFO Ring 00 : 26[2] -> 30[6] via P2P/IPC
[1,25]<stdout>:ip-172-32-36-209:8715:9546 [1] NCCL INFO Ring 00 : 25[1] -> 27[3] via P2P/IPC
[1,29]<stdout>:ip-172-32-36-209:8719:9555 [5] NCCL INFO Ring 00 : 29[5] -> 31[7] via P2P/IPC
[1,28]<stdout>:ip-172-32-36-209:8718:9563 [4] NCCL INFO Ring 00 : 28[4] -> 29[5] via P2P/IPC
[1,27]<stdout>:ip-172-32-36-209:8717:9550 [3] NCCL INFO Ring 00 : 27[3] -> 26[2] via P2P/IPC
[1,7]<stdout>:ip-172-32-38-1:21471:22316 [7] NCCL INFO Ring 00 : 7 -> 8 [send] via NET/AWS Libfabric/0
[1,15]<stdout>:ip-172-32-34-121:70560:71394 [7] NCCL INFO Ring 00 : 15 -> 16 [send] via NET/AWS Libfabric/0
[1,30]<stdout>:ip-172-32-36-209:8720:9567 [6] NCCL INFO Ring 00 : 30[6] -> 28[4] via P2P/IPC
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO Ring 00 : 23 -> 24 [send] via NET/AWS Libfabric/0
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO NET/OFI No NIC info for dev 0
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO include/net.h:24 -> 2
[1,23]<stdout>:
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO include/net.h:27 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO transport/net.cc:357 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:668 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:814 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:950 -> 2
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO Ring 00 : 31 -> 0 [send] via NET/AWS Libfabric/0
...
ubuntu@ip-172-32-38-1:~$ fi_info -p efa
provider: efa
    fabric: EFA-fe80::7e:9eff:fed3:c48a
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::7e:9eff:fed3:c48a
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::7e:9eff:fed3:c48a
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD

Does your security group include a self-referencing rule that allows all traffic, both inbound and outbound?

In addition to what David asked, could you provide the following information as well?

  1. Complete log of your run.
  2. EFA installer version. You can find this using
     cat /opt/amazon/efa_installed_packages
  3. Which commit of aws-ofi-nccl are you using to run the tests? Is it the latest commit on the aws branch?
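
The items requested above can be collected in one pass. The following is a minimal sketch, assuming the installed-packages file starts with a header line of the form `# EFA installer version: X.Y.Z` (as in the output posted later in this thread) and that the plugin was built from a git checkout. The function names and the checkout path are hypothetical.

```shell
#!/bin/sh
# Print the EFA installer version recorded in the installed-packages file.
# Assumption: the file contains a line "# EFA installer version: X.Y.Z".
efa_installer_version() {
    file="${1:-/opt/amazon/efa_installed_packages}"
    sed -n 's/^# EFA installer version: *//p' "$file"
}

# Print the branch name and short commit hash of a plugin checkout.
# The path is supplied by the caller; no default location is assumed.
plugin_revision() {
    repo="$1"
    git -C "$repo" rev-parse --abbrev-ref HEAD
    git -C "$repo" rev-parse --short HEAD
}

# Example (paths are illustrative only):
#   efa_installer_version
#   plugin_revision "$HOME/drivers/aws-ofi-nccl"
```

Running both functions on every node, not just the launch node, helps catch hosts that were set up differently.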

Thanks for the quick reply.

Does your security group include a self-referencing rule that allows all traffic, both inbound and outbound?

Yes.

Complete log of your run.

efa.log
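
With a full log such as efa.log, the failing ranks are easier to spot once the mpirun rank prefix is stripped and only the warnings are kept. A minimal sketch, assuming the `[job,rank]<stdout>:` prefix format seen in the excerpt above; the function names are hypothetical.

```shell
#!/bin/sh
# Print only the NCCL WARN lines from a log, without the
# "[job,rank]<stdout>:" prefix that mpirun adds to each line.
nccl_warnings() {
    grep 'NCCL WARN' "$1" | sed 's/^\[[0-9]*,[0-9]*]<[a-z]*>://'
}

# Count warnings per host. After the prefix is stripped, each line
# starts with "hostname:pid:tid", so the first ':'-separated field
# is the host name.
warn_hosts() {
    nccl_warnings "$1" | cut -d: -f1 | sort | uniq -c
}

# Example:
#   nccl_warnings efa.log
#   warn_hosts efa.log
```

If the warnings cluster on one host, that host is a good first place to compare installed packages and plugin builds.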

EFA installer version. You can find this using

# EFA installer version: 1.4.1
# Debug packages installed: no
# Packages installed:
efa_1.3.0-1.amzn1_amd64 libfabric1_1.8.0amzn1.0_amd64 libfabric-bin_1.8.0amzn1.0_amd64 libfabric-dev_1.8.0amzn1.0_amd64 openmpi_3.1.4-2_amd64

I was using Horovod with my model training code. I tried to compile nccl-tests, but the linker complains about some missing .so files.

Could you confirm the commit that you are using for aws-ofi-nccl? Please use the latest aws branch of the plugin when running on EC2 infrastructure.

Also, does host "ip-172-32-36-209" have EFA installed?
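
One way to answer that question for every node at once is to run `fi_info -p efa` on each host and check that the efa provider is listed. A minimal sketch, assuming passwordless ssh, a hostfile whose first word on each line is the host name, and `fi_info` on the remote PATH; the function names are hypothetical.

```shell
#!/bin/sh
# Succeed if the fi_info output on stdin lists the efa provider.
lists_efa() {
    grep -q '^provider: efa'
}

# Report, for each host in the hostfile, whether fi_info sees efa.
# "ssh -n" keeps ssh from consuming the hostfile on stdin.
check_hosts() {
    while read -r host rest; do
        [ -n "$host" ] || continue
        if ssh -n "$host" fi_info -p efa 2>/dev/null | lists_efa; then
            echo "$host: efa provider found"
        else
            echo "$host: efa provider NOT found"
        fi
    done < "$1"
}

# Example (hostfile path is illustrative only):
#   check_hosts "$HOME/hosts"
```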

I got the same error while running nccl-tests:

~/anaconda3/bin/mpirun \
        -x FI_PROVIDER="efa" \
        -x FI_EFA_TX_MIN_CREDITS=64 \
        -x LD_LIBRARY_PATH=$HOME/drivers/aws-ofi-nccl/install/lib/:$HOME/drivers/nccl/build/lib:/usr/local/cuda-10.0/lib64:/opt/amazon/efa/lib64:$LD_LIBRARY_PATH \
        -x NCCL_DEBUG=WARN -x NCCL_TREE_THRESHOLD=0 --hostfile $HOME/hosts -n 16 -N 8 \
        --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
        $HOME/drivers/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

Running on a single node works fine. The error occurs only when I run on multiple nodes.
I am following the Quip doc "Running nccl-tests on AWS EC2" and got the following error:

[ec2-user@ip-172-31-10-20 nccl-tests]$ ./run-test.sh  | tee test.log
# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
#   Rank  0 Pid  23389 on ip-172-31-10-20 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid  23390 on ip-172-31-10-20 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid  23391 on ip-172-31-10-20 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid  23392 on ip-172-31-10-20 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank  4 Pid  23393 on ip-172-31-10-20 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank  5 Pid  23394 on ip-172-31-10-20 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank  6 Pid  23395 on ip-172-31-10-20 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank  7 Pid  23396 on ip-172-31-10-20 device  7 [0x00] Tesla V100-SXM2-32GB
#   Rank  8 Pid  10508 on ip-172-31-1-59 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  9 Pid  10509 on ip-172-31-1-59 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank 10 Pid  10510 on ip-172-31-1-59 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank 11 Pid  10511 on ip-172-31-1-59 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank 12 Pid  10512 on ip-172-31-1-59 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank 13 Pid  10513 on ip-172-31-1-59 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank 14 Pid  10514 on ip-172-31-1-59 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank 15 Pid  10515 on ip-172-31-1-59 device  7 [0x00] Tesla V100-SXM2-32GB
NCCL version 2.4.6+cuda10.0

ip-172-31-1-59:10508:10581 [0] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'

ip-172-31-1-59:10515:10584 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[42896,1],8]
  Exit code:    3

This happens if you are using the master branch. Please use the aws branch when working with EFA.
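
A launch script can guard against this by checking the plugin source tree before starting the job. A minimal sketch, assuming the plugin was built from a git checkout; the function name and the path in the example are hypothetical. It inspects the source tree only, so the plugin still has to be rebuilt and reinstalled after switching branches.

```shell
#!/bin/sh
# Succeed only if the git checkout at $1 is currently on branch $2.
# Fails if the path is not a git repository or HEAD is detached.
on_branch() {
    repo="$1"
    want="$2"
    have=$(git -C "$repo" symbolic-ref --short HEAD 2>/dev/null) || return 1
    [ "$have" = "$want" ]
}

# Example (path is illustrative only):
#   if ! on_branch "$HOME/drivers/aws-ofi-nccl" aws; then
#       echo "aws-ofi-nccl is not on the aws branch" >&2
#       exit 1
#   fi
```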

Please re-open if you see the issue again.