ROCm / rccl-tests

RCCL Performance Benchmark Tests

Test NCCL failure common.cu:1285 : internal error

Eliasj42 opened this issue

Hi, when I run the example from the README, ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4, I get this error:

# nThreads: 1 nGpus: 4 nRanks: 1 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid 117738 on sharbox-ultra device  0 [0000:63:00.0] AMD Instinct MI210
#   Rank  1 Pid 117738 on sharbox-ultra device  1 [0000:43:00.0] AMD Instinct MI210
#   Rank  2 Pid 117738 on sharbox-ultra device  2 [0000:30:00.0] AMD Instinct MI210
#   Rank  3 Pid 117738 on sharbox-ultra device  3 [0000:03:00.0] AMD Instinct MI210
sharbox-ultra: Test NCCL failure common.cu:1285 'internal error - please report this issue to the NCCL developers'
 .. sharbox-ultra pid 117738: Test failure common.cu:1161

When I ran the tests from my RCCL build, I got this error output:

[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from AllReduce
[ RUN      ] AllReduce.OutOfPlace
[ INFO     ] Calling PIPE_READ to Child 0

sharbox-ultra:204802:204802 [0] /home/elias/rccl/build/release/hipify/src/init.cc:125 NCCL WARN Missing "amd_iommu=on" from kernel command line which can lead to system instablity or hang!

sharbox-ultra:204802:204802 [0] /home/elias/rccl/build/release/hipify/src/init.cc:127 NCCL WARN Missing "iommu=pt" from kernel command line which can lead to system instablity or hang!

sharbox-ultra:204802:204802 [0] /home/elias/rccl/build/release/hipify/src/init.cc:132 NCCL WARN Missing "HSA_FORCE_FINE_GRAIN_PCIE=1" from environment which can lead to low RCCL performance, system instablity or hang!
[ INFO     ] Got PIPE_READ 128 from Child 0
[ INFO     ] Calling PIPE_READ to Child 0
[ INFO     ] Got PIPE_READ 4 from Child 0
[ INFO     ] Calling PIPE_READ to Child 0
RCCL version 2.18.3+hip5.5 develop:6ecf771+

sharbox-ultra:204802:204836 [1] /home/elias/rccl/build/release/hipify/src/transport/p2p.cc:220 NCCL WARN hipIpcGetMemHandle failed : invalid argument

sharbox-ultra:204802:204836 [1] /home/elias/rccl/build/release/hipify/src/transport/p2p.cc:222 NCCL WARN Cuda failure 'invalid argument'

sharbox-ultra:204802:204836 [1] /home/elias/rccl/build/release/hipify/src/proxy.cc:1524 NCCL WARN [Proxy Service 1] Failed to execute operation Setup from rank 1, retcode 1

sharbox-ultra:204802:204837 [0] /home/elias/rccl/build/release/hipify/src/transport/p2p.cc:220 NCCL WARN hipIpcGetMemHandle failed : invalid argument

sharbox-ultra:204802:204837 [0] /home/elias/rccl/build/release/hipify/src/transport/p2p.cc:222 NCCL WARN Cuda failure 'invalid argument'

sharbox-ultra:204802:204837 [0] /home/elias/rccl/build/release/hipify/src/proxy.cc:1524 NCCL WARN [Proxy Service 0] Failed to execute operation Setup from rank 0, retcode 1

sharbox-ultra:204802:204835 [1] /home/elias/rccl/build/release/hipify/src/misc/socket.cc:57 NCCL WARN socketProgress: Connection closed by remote peer sharbox-ultra<48295>

sharbox-ultra:204802:204835 [1] /home/elias/rccl/build/release/hipify/src/proxy.cc:1148 NCCL WARN Socket recv failed while polling for opId=0x7fe95066b760

sharbox-ultra:204802:204834 [0] /home/elias/rccl/build/release/hipify/src/misc/socket.cc:57 NCCL WARN socketProgress: Connection closed by remote peer sharbox-ultra<35867>

sharbox-ultra:204802:204834 [0] /home/elias/rccl/build/release/hipify/src/proxy.cc:1148 NCCL WARN Socket recv failed while polling for opId=0x7fe958288240
[ ERROR    ] Child process 0 fails NCCL call ncclGroupEnd with code 3
[ ERROR    ] Child 0 failed on command [INIT_COMMS]:
[ INFO     ] Got PIPE_READ 4 from Child 0
[ ERROR    ] Child 0 reports failure
/home/elias/rccl/test/common/TestBed.cpp:178: Failure
Expected equality of these values:
  response
    Which is: 1
  TEST_SUCCESS
    Which is: 0
[  FAILED  ] AllReduce.OutOfPlace (665 ms)

Do you have any idea what could be causing this failure?
The hipIpcGetMemHandle "invalid argument" failures look like the root of the issue, but I'm unsure what to do about them.
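
Separately from the hipIpcGetMemHandle failure, the test log starts with three NCCL WARN lines about missing settings: the kernel command-line parameters amd_iommu=on and iommu=pt, and the environment variable HSA_FORCE_FINE_GRAIN_PCIE=1. Below is a minimal sketch for checking them on a Linux host. The function names are hypothetical and not part of RCCL. It assumes the kernel command line is readable at /proc/cmdline. Whether these settings have anything to do with the invalid argument error is not something the log establishes.

```python
import os

# Settings named in the three NCCL WARN lines at the top of the test log.
REQUIRED_KERNEL_PARAMS = ("amd_iommu=on", "iommu=pt")
REQUIRED_ENV = {"HSA_FORCE_FINE_GRAIN_PCIE": "1"}


def missing_settings(cmdline, env):
    """Return the warned-about settings that are absent (hypothetical helper)."""
    # Split into whole tokens so "iommu=pt" is not matched inside "amd_iommu=...".
    tokens = cmdline.split()
    missing = [p for p in REQUIRED_KERNEL_PARAMS if p not in tokens]
    missing += [f"{k}={v}" for k, v in REQUIRED_ENV.items() if env.get(k) != v]
    return missing


def read_cmdline(path="/proc/cmdline"):
    """Read the kernel command line; return an empty string if unreadable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


if __name__ == "__main__":
    for item in missing_settings(read_cmdline(), os.environ):
        print("missing:", item)
```

Run in the same shell that launches the tests, this prints one line per missing setting and nothing if all three are present.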