ROCm / ROCm

AMD ROCm™ Software - GitHub Home

Home Page:https://rocm.docs.amd.com

Rocm 6.0: amdgpu.ids: No such file or directory

Loungagna opened this issue · comments

Problem Description

With PyTorch 2.3.0.dev20240312+rocm6.0, only the laptop's integrated graphics card shows up:
`>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.device(0)
<torch.cuda.device object at 0x7f0ff2000830>
>>> torch.cuda.get_device_name(0)
'AMD Radeon Graphics'`

On the same machine, with the same Linux install and the same user, PyTorch 2.2.1+rocm5.7 finds the proper GPU:
`>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.device(0)
<torch.cuda.device object at 0x7f913c3b2360>
>>> torch.cuda.get_device_name(0)
'AMD Radeon RX 7700S'`

Operating System

OS: NAME="Fedora Linux" VERSION="40 (Workstation Edition Prerelease)"

CPU

CPU: model name : AMD Ryzen 7 7840HS w/ Radeon 780M Graphics

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

import torch;

Notice the error message, printed twice:
amdgpu.ids: No such file or directory
amdgpu.ids: No such file or directory

torch.cuda.device(0);
torch.cuda.get_device_name(0);

expected:
'AMD Radeon RX 7700S'

actual (ROCm 6.0 reports):
'AMD Radeon Graphics'

That is the integrated (motherboard) graphics, not the expensive GPU added to the Framework laptop.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded

HSA System Attributes

Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents


Agent 1


Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5137
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32021008(0x1e89a10) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32021008(0x1e89a10) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32021008(0x1e89a10) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx1102
Uuid: GPU-XX
Marketing Name: AMD Radeon RX 7700S
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 2048(0x800) KB
Chip ID: 29824(0x7480)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2208
BDFID: 768
Internal Node ID: 1
Compute Unit: 32
SIMDs per CU: 2
Shader Engines: 2
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 550
SDMA engine uCode:: 16
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1102
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

Additional Information

The GPU not recognized by 6.0 is #0 AMD Radeon RX 7700S.

Hi @Loungagna, thanks for filing the ticket. Can you please share your laptop brand/model? Thanks.

Framework 16 from Framework Computers

@Loungagna, can you please confirm libdrm-amdgpu-common is installed? Thanks.

.local/lib/python3.12/site-packages/torch/lib is where libdrm_amdgpu.so is

@Loungagna, can you please run "pip show torch"? It will show the path to site-packages (for you it should be something like "Location: .local/lib/python3.12/site-packages/"). From that path, find torch/share/libdrm/amdgpu.ids. If the file exists, can you find AMD Radeon RX 7700S in it? Does it match your device ID? If you cannot find amdgpu.ids in that location, can you run "find .local/lib/python3.12/site-packages/torch -name amdgpu.ids"? Thanks.
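
The check requested above can also be scripted. This is a minimal sketch, not an official tool: the helper names `bundled_ids_path` and `find_device` are hypothetical, and the parsing assumes the usual `device_id, revision_id, product_name` layout of `amdgpu.ids`. It locates the file bundled with the installed torch package without importing torch, then lists entries whose product name contains a given string.

```python
import importlib.util
from pathlib import Path


def bundled_ids_path():
    """Return the amdgpu.ids path inside the installed torch package, or None."""
    spec = importlib.util.find_spec("torch")  # locates torch without importing it
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent / "share" / "libdrm" / "amdgpu.ids"


def find_device(ids_file, needle):
    """Yield (device_id, revision_id, product_name) for entries matching needle."""
    with open(ids_file, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            # Assumed layout: device_id, revision_id, product_name
            parts = [p.strip() for p in line.split(",", 2)]
            if len(parts) == 3 and needle.lower() in parts[2].lower():
                yield tuple(parts)


if __name__ == "__main__":
    path = bundled_ids_path()
    if path is None or not path.is_file():
        print("amdgpu.ids not found in the torch package:", path)
    else:
        for entry in find_device(path, "RX 7700S"):
            print(path, entry)
```

If the layout assumption holds, the device ID to compare against is the Chip ID from the rocminfo output above, 29824 (0x7480).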

❯ pip show torch
Name: torch
Version: 2.2.1+rocm5.7
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/loungagna/.local/lib/python3.12/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: torchaudio, torchvision

and there's the expected amdgpu.ids:
lib/python3.12/site-packages via 🐍 v3.12.2
❯ ll torch/share/libdrm/amd*
Permissions Size User Date Modified Name
.rw-r--r--@ 17k loungagna 12 Mar 11:09 torch/share/libdrm/amdgpu.ids

That's your 2.2.1+rocm5.7 case, which you have no problem with. Can you do the same for 2.3.0.dev20240312+rocm6.0?

Here is for the 6.0:

`Projects/learning-python/pytorch via 🐍 v3.12.2
❯ source .venv/bin/activate

Projects/learning-python/pytorch via 🐍 v3.12.1 (.venv)
❯ pip show torch
Name: torch
Version: 2.3.0.dev20240312+rocm6.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/loungagna/Documents/Projects/learning-python/pytorch/.venv/lib/python3.12/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: torchaudio, torchvision

Projects/learning-python/pytorch via 🐍 v3.12.1 (.venv)
❯ cd .venv/lib/python3.12/site-packages

lib/python3.12/site-packages via 🐍 v3.12.1 (.venv)
❯ ll torch/share/libdrm/amd*
Permissions Size User Date Modified Name
.rw-r--r--@ 19k loungagna 12 Mar 15:02 torch/share/libdrm/amdgpu.ids`

What do you get if you run print(torch.__file__) from Python? Can you also run "grep 'RX 7700S' /opt/amdgpu/share/libdrm/amdgpu.ids", if the file exists? For me, if I delete /opt/amdgpu/share/libdrm/amdgpu.ids, the device name is taken from Python's site-packages/torch/share/libdrm/amdgpu.ids. If I delete both, I can reproduce the problem:

import torch
amdgpu.ids: No such file or directory
torch.cuda.get_device_name(0)
'AMD Radeon Graphics'
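
To see which of the two files mentioned above is present on a given machine, a small check along these lines may help. It is only a sketch: the candidate order mirrors the behaviour described in this comment (system file first, then the copy bundled with torch), not a documented libdrm search order, and `candidate_ids_files` and `existing_ids_files` are hypothetical helpers.

```python
import importlib.util
from pathlib import Path


def candidate_ids_files():
    """Candidate amdgpu.ids locations, in the order observed in this thread."""
    candidates = [Path("/opt/amdgpu/share/libdrm/amdgpu.ids")]
    spec = importlib.util.find_spec("torch")  # locates torch without importing it
    if spec is not None and spec.origin is not None:
        candidates.append(
            Path(spec.origin).parent / "share" / "libdrm" / "amdgpu.ids"
        )
    return candidates


def existing_ids_files(candidates):
    """Return only the candidates that exist as regular files."""
    return [p for p in candidates if Path(p).is_file()]


if __name__ == "__main__":
    for path in candidate_ids_files():
        print("found  " if path.is_file() else "missing", path)
```

If every candidate is reported missing, that would be consistent with the "amdgpu.ids: No such file or directory" message and the generic 'AMD Radeon Graphics' name.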

I have no /opt/amdgpu directory. I have /opt/rocm, which points to /etc/alternatives/rocm, and a /opt/rocm-6.0.2 directory.

Using the defective python venv, I have:
`Python 3.12.1 | packaged by Anaconda, Inc. | (main, Jan 19 2024, 15:51:05) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
amdgpu.ids: No such file or directory
amdgpu.ids: No such file or directory
>>> print(torch.file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/loungagna/Documents/Projects/learning-python/pytorch/.venv/lib/python3.12/site-packages/torch/__init__.py", line 2006, in __getattr__
    raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
AttributeError: module 'torch' has no attribute 'file'
`

and with the non-defective Python version, I have:
`❯ python
Python 3.12.2 (main, Feb 21 2024, 00:00:00) [GCC 14.0.1 20240217 (Red Hat 14.0.1-0)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/loungagna/.local/lib/python3.12/site-packages/torch/__init__.py", line 1938, in __getattr__
    raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
AttributeError: module 'torch' has no attribute 'file'
`

So, it seems you don't have the user-mode driver installed. It's Red Hat, right? Try running "rpm -qa | grep libdrm" to see if it's installed. If it were installed, it would take precedence. Torch seems to have its own user-mode components to interact with the kernel driver, but in your defective Python environment they don't seem to work properly. Is it just device identification that is failing, or is everything else failing too, like creating a tensor on the GPU?
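
One way to answer that last question is a small smoke test such as the sketch below. It assumes device index 0 is the GPU in question; `gpu_smoke_test` is a hypothetical helper name, and the function only reports what it observes rather than diagnosing anything.

```python
def gpu_smoke_test():
    """Try a small computation on GPU 0 and report what happened."""
    try:
        import torch
    except ImportError:
        return {"torch": None}  # torch is not installed in this environment

    result = {"torch": torch.__version__, "available": torch.cuda.is_available()}
    if not result["available"]:
        return result

    result["name"] = torch.cuda.get_device_name(0)
    x = torch.ones(1024, device="cuda:0")       # allocate on the GPU
    result["sum"] = float((x * 2).sum().cpu())  # compute there, copy back to host
    return result


if __name__ == "__main__":
    print(gpu_smoke_test())
```

If the device works and only the name lookup fails, the result should contain a "sum" of 2048.0 alongside the generic device name.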

So, it seems there is no issue AMD can help to resolve in this case. Should we close the ticket?

I will open a bug report. My understanding is that 6.0 does not work under Fedora 40 and python 3.12.2, whereas 5.7 and python 3.11.1 work fine.

I have just built Python 3.12 on a new RHEL system with rocm-6.1.0 installed, then installed PyTorch (pip3.12 install --pre --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0/). torch.cuda.get_device_name(0) gives me the correct product name. Maybe you can try a clean install first.

@Loungagna, any update with @vstempen's suggestion of doing a clean install first?

Closing the ticket. @Loungagna, please re-open if you still see the issue with a clean install. Thanks.