NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

The variable NCCL_IB_ADDR_RANGE did not work properly after being configured

riverzhang opened this issue

Some software versions:
nccl-tests: 2.13.9
openmpi: 4.1.5
rdma ofed: 23.10-1.1.9.0
nvidia-driver: 535.104.12-1
cuda: 11.4.4-1
nccl: 2.21.5-1

Command
mpirun --allow-run-as-root -bind-to none -map-by ppr:4:node -np 8 -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH -H xxxxx:4,xxxxx:4 -x NCCL_NVLS_ENABLE=0 -x NCCL_IB_HCA=mlx5_0,mlx5_1 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_ADDR_RANGE=192.168.64.0/24 -x NCCL_IB_ADDR_FAMILY=AF_INET -x NCCL_IB_ROCE_VERSION_NUM=2 -x NCCL_DEBUG=INFO -x NCCL_IB_TC=160 -mca btl_tcp_if_include eth0 ./build/all_reduce_perf -b 256M -e 4G -f 2 -g 1

error log:
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO NCCL_IB_ADDR_FAMILY set by environment to AF_INET
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO NCCL_IB_ROCE_VERSION_NUM set by environment to 2.
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO NCCL_IB_ADDR_RANGE set by environment to 192.168.64.0/24
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO transport/net_ib.cc:282 -> 2
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO transport/net_ib.cc:305 -> 2
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO transport/net_ib.cc:1047 -> 2
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO transport/net.cc:687 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO transport/net.cc:306 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO transport.cc:165 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO init.cc:1263 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO init.cc:1548 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO group.cc:64 -> 2 [Async thread]
busybox2-68df6c586-ntvlv:11537:11537 [3] NCCL INFO group.cc:418 -> 2
busybox2-68df6c586-ntvlv:11537:11537 [3] NCCL INFO group.cc:95 -> 2
busybox2-68df6c586-ntvlv: Test NCCL failure common.cu:961 'unhandled system error (run with NCCL_DEBUG=INFO for details) / '
.. busybox2-68df6c586-ntvlv pid 11537: Test failure common.cu:844

@sjeaugey Hello, I'm looking into this NCCL problem. Similar problems have been reported (for example #890), and I've tried the suggestions there, but they haven't worked.

@riverzhang that looks like a problem with RoCE version detection. The code retrieves the RoCE version by reading it from /sys/class/infiniband/<device>/ports/<port_num>/gid_attrs/types/<gid_index>. The open (or read) call is failing for the above file and returning ncclSystemError (error code 2). Could you check if the path exists?
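
One way to check this is a small script that walks the gid_attrs/types directory for a device and tries to read every GID index, reporting any open or read failures. This is only a rough diagnostic sketch, not what NCCL itself does: the helper name read_gid_types is made up for illustration, port 1 is an assumption, and the device names are taken from the NCCL_IB_HCA setting in the command above.

```python
import os


def read_gid_types(device, port, sysfs_root="/sys/class/infiniband"):
    """Return {gid_index: type string or 'ERROR: ...'} for one HCA port.

    Hypothetical helper for illustration; an empty dict means the
    gid_attrs/types directory does not exist for that device and port.
    """
    types_dir = os.path.join(
        sysfs_root, device, "ports", str(port), "gid_attrs", "types"
    )
    results = {}
    if not os.path.isdir(types_dir):
        return results
    # Entries are named after the GID index (0, 1, 2, ...).
    indexes = sorted(int(n) for n in os.listdir(types_dir) if n.isdigit())
    for index in indexes:
        path = os.path.join(types_dir, str(index))
        try:
            with open(path) as f:
                results[index] = f.read().strip()
        except OSError as exc:
            # Record the failure instead of aborting, so every index is checked.
            results[index] = f"ERROR: {exc}"
    return results


if __name__ == "__main__":
    # Device names from NCCL_IB_HCA above; port 1 is an assumption.
    for dev in ("mlx5_0", "mlx5_1"):
        entries = read_gid_types(dev, 1)
        if not entries:
            print(f"{dev}: gid_attrs/types not found")
        for index, value in entries.items():
            print(f"{dev} port 1 gid {index}: {value}")
```

Running it inside the same container as the failing rank should show whether the path is missing there or whether individual GID indexes fail to read.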

Could you apply this patch and rerun your tests?
0001-net_ib-add-warn-debug-for-RoCE-version-detection.patch