ROCm / rccl-tests

RCCL Performance Benchmark Tests

rccl-tests gets stuck on gfx1100

Frozenmad opened this issue

I'm testing RCCL connectivity between two gfx1100 devices. rocm-bandwidth-test runs fine, but rccl-tests gets stuck.
The output is:

# nThreads: 1 nGpus: 2 nRanks: 1 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#   Rank  0 Pid  61927 on  deepspeed device  0 [0000:83:00.0] Radeon RX 7900 XTX
#   Rank  1 Pid  61927 on  deepspeed device  1 [0000:03:00.0] Radeon RX 7900 XTX
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

No further output appears and the program never proceeds.
The same problem occurs with the PyTorch distributed package. The following code also gets stuck:

# main.py
import os
import torch
import torch.distributed as dist

dist.init_process_group('nccl')
rank = int(os.getenv('LOCAL_RANK', 0))
world_size = int(os.getenv('WORLD_SIZE', 0))

torch.cuda.set_device(rank)
a = torch.ones(1, device='cuda')  # example tensor; any CUDA tensor works
dist.broadcast(a, 0)  # in-place broadcast from rank 0

Launched with:

torchrun --nnodes 1 --nproc-per-node 2 main.py

Does rccl support gfx1100?
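
For anyone trying to capture more detail on a hang like this, one option is to run the reproducer under a wall-clock timeout with NCCL_DEBUG=INFO set, so the run ends with a log instead of blocking forever. The sketch below is only an illustration and not a fix: the helper name run_with_timeout and the 120-second limit are assumptions of mine, and it assumes RCCL honors the NCCL_DEBUG environment variable the way NCCL does.

```python
import os
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=120, extra_env=None):
    """Run cmd with extra environment variables and a wall-clock limit.

    Returns (finished, returncode, output). finished is False and
    returncode is None when the command exceeded timeout_s.
    run_with_timeout is a hypothetical helper, not part of RCCL or PyTorch.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so debug logs are kept
            text=True,
            timeout=timeout_s,
        )
        return True, proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # Output captured before the timeout may arrive as bytes or str.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return False, None, out


# Example use with the reproducer above (not executed here):
# finished, code, log = run_with_timeout(
#     ["torchrun", "--nnodes", "1", "--nproc-per-node", "2", "main.py"],
#     timeout_s=120,
#     extra_env={"NCCL_DEBUG": "INFO"},
# )
# if not finished:
#     print("timed out; last output:", log[-2000:], file=sys.stderr)
```

Note that subprocess.run only kills the process it started, so worker processes spawned by torchrun may need to be cleaned up separately after a timeout.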

Hi, how did you solve this problem? I've run into the same issue and would appreciate a response. Thank you.

Hey @nusislam, any idea what the fix for this issue was? The RCCL ticket linked here doesn't seem to have any info on how to solve it. The OPX team is currently running into this same issue, where we are stuck in ncclGroupEnd.

Thanks in advance for any help you can offer!