rccl-tests gets stuck on gfx1100

Question

Frozenmad opened this issue 9 months ago

I'm testing RCCL connectivity between two gfx1100 devices. rocm-bandwidth-test runs fine, but rccl-tests gets stuck.
The output is:

# nThreads: 1 nGpus: 2 nRanks: 1 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#   Rank  0 Pid  61927 on  deepspeed device  0 [0000:83:00.0] Radeon RX 7900 XTX
#   Rank  1 Pid  61927 on  deepspeed device  1 [0000:03:00.0] Radeon RX 7900 XTX
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

No further output is printed and the program never moves on.
The same problem occurs with the PyTorch distributed package. The following code gets stuck too:

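# main.py
import os
import torch
import torch.distributed as dist

# RCCL is exposed through the 'nccl' backend in ROCm builds of PyTorch
dist.init_process_group('nccl')
rank = int(os.getenv('LOCAL_RANK', 0))
world_size = int(os.getenv('WORLD_SIZE', 0))

torch.cuda.set_device(rank)
# Tensor to broadcast from rank 0; broadcast works in place on `a`
a = torch.full((1,), float(rank), device='cuda')
dist.broadcast(a, 0)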

torchrun --nnodes 1 --nproc-per-node 2 main.py
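
When reporting a hang like this, it can help to bound the run with a timeout and capture the communication library's debug log. Below is a minimal sketch, assuming the script above is saved as main.py and torchrun is on PATH. The names run_with_timeout, REPRO_CMD, and DEBUG_ENV are hypothetical helpers introduced for illustration, and it assumes RCCL honors the NCCL_DEBUG=INFO logging variable inherited from NCCL.

```python
import os
import subprocess

# Command from the report above (assumed to be run from the script's directory).
REPRO_CMD = ["torchrun", "--nnodes", "1", "--nproc-per-node", "2", "main.py"]

# Assumption: RCCL reads the same NCCL_DEBUG variable that NCCL does.
DEBUG_ENV = {"NCCL_DEBUG": "INFO"}


def run_with_timeout(cmd, timeout_s, extra_env=None):
    """Run cmd and return (status, output).

    status is 'ok' (exit code 0), 'failed' (non-zero exit code),
    or 'hung' (still running after timeout_s seconds).
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    try:
        proc = subprocess.run(
            cmd,
            env=env,
            timeout=timeout_s,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
        )
    except subprocess.TimeoutExpired as exc:
        # Only the launched process is killed here; any worker processes
        # it spawned may still need to be cleaned up by hand.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return "hung", partial
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout
```

Calling run_with_timeout(REPRO_CMD, 120, DEBUG_ENV) should return a "hung" status along with whatever log output was produced before the timeout, which can then be attached to the issue. The 120-second limit is an arbitrary choice.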

Does rccl support gfx1100?

Leo shan · Answer 1 · Tue Dec 26 2023 11:37:23 GMT+0800 (China Standard Time)

Hi, how did you solve this problem? I've run into the same issue and would appreciate a response. Thank you.

Thomas Huber · Answer 2 · Fri Jul 19 2024 05:26:57 GMT+0800 (China Standard Time)

Hey @nusislam, any idea what the fix for this issue was? The RCCL ticket linked here doesn't seem to have any info on how to solve it. The OPX team is currently running into this same issue, where we are stuck in ncclGroupEnd.

Thanks in advance for any help you can offer!