The variable NCCL_IB_ADDR_RANGE did not work properly after being configured
riverzhang opened this issue
Some software versions:
nccl test : 2.13.9
openmpi: 4.1.5
rdma ofed: 23.10-1.1.9.0
nvidia-driver: 535.104.12-1
cuda: 11.4.4-1
nccl: 2.21.5-1
Command
mpirun --allow-run-as-root -bind-to none -map-by ppr:4:node -np 8 -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH -H xxxxx:4,xxxxx:4 -x NCCL_NVLS_ENABLE=0 -x NCCL_IB_HCA=mlx5_0,mlx5_1 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_ADDR_RANGE=192.168.64.0/24 -x NCCL_IB_ADDR_FAMILY=AF_INET -x NCCL_IB_ROCE_VERSION_NUM=2 -x NCCL_DEBUG=INFO -x NCCL_IB_TC=160 -mca btl_tcp_if_include eth0 ./build/all_reduce_perf -b 256M -e 4G -f 2 -g 1
error log:
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO NCCL_IB_ADDR_FAMILY set by environment to AF_INET
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO NCCL_IB_ROCE_VERSION_NUM set by environment to 2.
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO NCCL_IB_ADDR_RANGE set by environment to 192.168.64.0/24
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO transport/net_ib.cc:282 -> 2
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO transport/net_ib.cc:305 -> 2
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO transport/net_ib.cc:1047 -> 2
busybox2-68df6c586-ntvlv:11537:11571 [3] NCCL INFO transport/net.cc:687 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO transport/net.cc:306 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO transport.cc:165 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO init.cc:1263 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO init.cc:1548 -> 2
busybox2-68df6c586-ntvlv:11537:11560 [3] NCCL INFO group.cc:64 -> 2 [Async thread]
busybox2-68df6c586-ntvlv:11537:11537 [3] NCCL INFO group.cc:418 -> 2
busybox2-68df6c586-ntvlv:11537:11537 [3] NCCL INFO group.cc:95 -> 2
busybox2-68df6c586-ntvlv: Test NCCL failure common.cu:961 'unhandled system error (run with NCCL_DEBUG=INFO for details) / '
.. busybox2-68df6c586-ntvlv pid 11537: Test failure common.cu:844
@riverzhang that looks like a problem with RoCE version detection. The code retrieves the RoCE version by reading it from /sys/class/infiniband/<device>/ports/<port_num>/gid_attrs/types/<gid_index>. The open (or read) call is failing for the above file and returning ncclSystemError (error code 2). Could you check if the path exists?
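One way to run that check on each node is a minimal Python sketch like the one below. It walks the gid_attrs/types directory for one device and port and reports, per GID index, either the type string or the OS error from open/read. The function name gid_types and the root parameter are hypothetical helpers for illustration and are not part of NCCL. mlx5_0 comes from the NCCL_IB_HCA setting in the command above; port 1 is an assumption.

```python
import os

SYSFS_ROOT = "/sys/class/infiniband"


def gid_types(device, port, root=SYSFS_ROOT):
    """Return a list of (gid_index, type_string, error) for one device port.

    type_string is the file content (for example "RoCE v2") when the read
    succeeds; error is the OS error text when open or read fails.
    A missing types directory raises FileNotFoundError from os.listdir.
    """
    types_dir = os.path.join(
        root, device, "ports", str(port), "gid_attrs", "types"
    )
    # Keep only the numeric entries, which are the GID indexes.
    names = [n for n in os.listdir(types_dir) if n.isdigit()]
    rows = []
    for name in sorted(names, key=int):
        path = os.path.join(types_dir, name)
        try:
            with open(path) as f:
                rows.append((int(name), f.read().strip(), None))
        except OSError as exc:
            # Record the failure instead of stopping, so every index is shown.
            rows.append((int(name), None, exc.strerror))
    return rows


# Example usage (device from NCCL_IB_HCA above; port 1 is an assumption):
# for index, gid_type, err in gid_types("mlx5_0", 1):
#     print(index, gid_type if err is None else "ERROR: " + err)
```

If the call raises FileNotFoundError, the path itself is missing on that node. If the directory exists but some indexes report an error, compare those indexes with the GID index NCCL ends up selecting for the 192.168.64.0/24 range.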
Could you apply this patch and rerun your tests?
0001-net_ib-add-warn-debug-for-RoCE-version-detection.patch