[Issue]: Torch segfaults when GPUs with three different architectures are available
IMbackK opened this issue
Problem Description
Hi,
I have a system with an RX 6800, an MI50, and several MI100s.
Since ROCm 6.0 there is a segfault in ExtractFatBinaryUsingCOMGR whenever more than two different GPU architectures are visible to PyTorch and any operation involving the GPUs is performed (torch.cuda.is_available(), or placing any tensor on any GPU).
Both the official torch-2.3.0.dev20240207+rocm6.0-cp311-cp311-linux_x86_64 wheel and building torch from source exhibit this problem.
Restricting visibility to a maximum of any two architectures using ROCR_VISIBLE_DEVICES works around this issue.
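For reference, the workaround described above can be applied from the shell before launching Python. This is a sketch; the device indices used here are examples and will differ per system (check the enumeration order reported by rocminfo):

```shell
# Expose at most two distinct GPU architectures to the HIP runtime.
# The indices 0,1 are placeholders; pick indices so that no more than
# two different gfx architectures remain visible (see rocminfo output).
export ROCR_VISIBLE_DEVICES=0,1

# With only two architectures visible, GPU initialization no longer segfaults:
python -c "import torch; print(torch.cuda.is_available())"
```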
Below is a backtrace of the issue:
hip::FatBinaryInfo::ExtractFatBinaryUsingCOMGR (this=0x55555b582d50, devices=std::vector of length 3, capacity 4 = {...}) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_fatbin.cpp:230
230 query_list_array[isa_idx].isa = isa_it->first.c_str();
(gdb) bt
#0 hip::FatBinaryInfo::ExtractFatBinaryUsingCOMGR (this=0x55555b582d50, devices=std::vector of length 3, capacity 4 = {...}) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_fatbin.cpp:230
#1 0x00007fff7f2cefae in hip::StatCO::digestFatBinary (this=this@entry=0x55555575a1a8, data=0x7fffe13b3000, programs=@0x55555981a240: 0x55555b582d50)
at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_code_object.cpp:752
#2 0x00007fff7f2cfaba in PlatformState::digestFatBinary (programs=@0x55555981a240: 0x55555b582d50, data=<optimized out>, this=<optimized out>)
at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_platform.cpp:829
#3 PlatformState::init (this=0x55555575a0a0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_platform.cpp:670
#4 hip::init (status=0x7fffffffc807) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_context.cpp:76
#5 0x00007ffff76ae6af in __pthread_once_slow (once_control=0x7fff80911e78 <g_ihipInitialized>, init_routine=0x7fff46ae0230 <std::__once_proxy()>) at pthread_once.c:116
#6 0x00007fff7f2f0313 in __gthread_once (__func=<optimized out>, __once=<optimized out>) at /usr/include/c++/13.2.1/x86_64-pc-linux-gnu/bits/gthr-default.h:700
#7 std::call_once<void (&)(bool*), bool*> (__once=<optimized out>, __f=<optimized out>) at /usr/include/c++/13.2.1/mutex:907
#8 hipGetDeviceCount (count=0x7fffffffcea0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_device_runtime.cpp:637
#9 0x00007ffff517bb8a in c10::hip::device_count_ensure_non_zero() () at /usr/lib/libc10_hip.so
#10 0x00007fff821c0f2a in ??? () at /usr/lib/libtorch_hip.so
#11 0x00007fffee7bc984 in ??? () at /usr/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#12 0x00007ffff79de349 in cfunction_vectorcall_NOARGS (func=0x7ffff6c984a0, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at ./Include/cpython/methodobject.h:52
#13 0x00007ffff79f2987 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7ffff6c984a0, tstate=0x7ffff7d89378 <_PyRuntime+166328>)
at ./Include/internal/pycore_call.h:92
#14 PyObject_Vectorcall (callable=0x7ffff6c984a0, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:299
#15 0x00007ffff79e4c23 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4760
#16 0x00007ffff7a9c484 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7f7d020, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at ./Include/internal/pycore_ceval.h:73
#17 _PyEval_Vector
(tstate=tstate@entry=0x7ffff7d89378 <_PyRuntime+166328>, func=func@entry=0x7ffff6ec7ce0, locals=locals@entry=0x7ffff71f31c0, args=args@entry=0x0, argcount=argcount@entry=0, kwnames=kwnames@entry=0x0)
at Python/ceval.c:6425
#18 0x00007ffff7a9be6c in PyEval_EvalCode (co=0x7ffc65a35b30, globals=<optimized out>, locals=0x7ffff71f31c0) at Python/ceval.c:1140
#19 0x00007ffff7ab9fc3 in run_eval_code_obj (tstate=tstate@entry=0x7ffff7d89378 <_PyRuntime+166328>, co=co@entry=0x7ffc65a35b30, globals=globals@entry=0x7ffff71f31c0, locals=locals@entry=0x7ffff71f31c0)
at Python/pythonrun.c:1710
#20 0x00007ffff7ab63ea in run_mod
(mod=mod@entry=0x5555555cb710, filename=filename@entry=0x7ffff71f3330, globals=0x7ffff71f31c0, locals=0x7ffff71f31c0, flags=flags@entry=0x7fffffffd4f8, arena=arena@entry=0x7ffff711b750)
at Python/pythonrun.c:1731
#21 0x00007ffff79be02a in PyRun_InteractiveOneObjectEx (fp=fp@entry=0x7ffff77f68e0 <_IO_2_1_stdin_>, filename=filename@entry=0x7ffff71f3330, flags=flags@entry=0x7fffffffd4f8) at Python/pythonrun.c:261
#22 0x00007ffff79be2f2 in _PyRun_InteractiveLoopObject (fp=0x7ffff77f68e0 <_IO_2_1_stdin_>, filename=0x7ffff71f3330, flags=0x7fffffffd4f8) at Python/pythonrun.c:138
#23 0x00007ffff793556b in _PyRun_AnyFileObject (fp=0x7ffff77f68e0 <_IO_2_1_stdin_>, filename=0x7ffff71f3330, closeit=0, flags=0x7fffffffd4f8) at Python/pythonrun.c:73
#24 0x00007ffff79be417 in PyRun_AnyFileExFlags (fp=0x7ffff77f68e0 <_IO_2_1_stdin_>, filename=<optimized out>, closeit=0, flags=0x7fffffffd4f8) at Python/pythonrun.c:105
#25 0x00007ffff792fc28 in pymain_run_stdin (config=0x7ffff7d6f3c0 <_PyRuntime+59904>) at Modules/main.c:509
#26 pymain_run_python (exitcode=0x7fffffffd4f0) at Modules/main.c:604
#27 Py_RunMain () at Modules/main.c:680
#28 0x00007ffff7a8e79b in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:734
#29 0x00007ffff7643cd0 in __libc_start_call_main (main=main@entry=0x555555555120 <main>, argc=argc@entry=1, argv=argv@entry=0x7fffffffd748) at ../sysdeps/nptl/libc_start_call_main.h:58
#30 0x00007ffff7643d8a in __libc_start_main_impl (main=0x555555555120 <main>, argc=1, argv=0x7fffffffd748, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd738)
at ../csu/libc-start.c:360
#31 0x0000555555555045 in _start ()
Some more info on the crashed environment:
(gdb) p *query_list_array
$7 = {isa = 0x55555b584160 "amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-", size = 0, offset = 0}
(gdb) p isa_idx
$8 = <optimized out>
(gdb) p devices
$9 = std::vector of length 3, capacity 4 = {0x55555b581220, 0x55555b5814a0, 0x55555b5818a0}
(gdb) p devices[0]
(gdb hangs)
Operating System
Arch Linux
CPU
AMD EPYC 7502P
GPU
AMD Instinct MI100, AMD Radeon Pro W6800, AMD Radeon Pro VII
ROCm Version
ROCm 6.0.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
An internal ticket has been created to track the investigation.
I can also confirm that this problem is reproducible with just rocBLAS.
This is a known internal issue which has been fixed. The fix should be available in the upcoming ROCm 6.1 release.
The fix has been applied and should be available in the upcoming ROCm 6.1 release.
Could we have the patch in question, to apply to our own builds of ROCm?