Is AMD Radeon Vega 8 supported?
dmfrey opened this issue
I couldn't find any information regarding this.
When I deploy both the device plugin and labeler to my cluster, my nodes get labeled like:
Labels: beta.amd.com/gpu.device-id.1638=1
beta.amd.com/gpu.family.RV=1
beta.amd.com/gpu.vram.1G=1
What I don't see, however, are labels like amd.com/gpu.
Please see the ROCm system requirements here:
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html
@dmfrey It's working for me on Vega7 (Ryzen 4650G), so it should work on Vega8. But I'm just using it for hardware video encoding, nothing else.
Regarding the labels, it seems to me that you're mixing up node labels and resource limits/requests. The labels ensure that your pod lands on a node with the right kind of GPU, in case you have a large cluster with many different nodes and kinds of GPUs. The resources section, however, is what actually gets a GPU scheduled/mapped into your Pod. In my case, I'm running a single-node cluster for personal use, so I don't need node labels; all my pods run on the same node. But I did need the resources section in my Pod definition.
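For example, a minimal Pod spec might look like this (pod name and image are placeholders; the nodeSelector is optional and uses one of the labels from above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test            # placeholder name
spec:
  # Optional: only needed on a mixed cluster, to pin the pod to
  # nodes carrying a specific label from the node labeller.
  nodeSelector:
    beta.amd.com/gpu.family.RV: "1"
  containers:
    - name: app
      image: ubuntu:22.04   # placeholder image
      resources:
        limits:
          amd.com/gpu: 1    # this is what actually maps a GPU into the pod
```

The resource limit is the part that makes the kubelet hand a GPU device to the container; the label selector only constrains which nodes are candidates.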
@nlflint This is a home lab as well, 3 nodes, each identical, so labels probably aren't needed either.
These are the boxes I'm running on: https://www.geekompc.com/geekom-a5-mini-pc/
My case is just like yours. I wish to use these for transcoding, specifically with the Tdarr app from the *arr stack, and for Plex transcoding. However, when I put a resource request for the GPU on the pods, the deployment fails: it complains that no nodes are available with that resource.
@dmfrey If you do a kubectl describe nodes, do you see amd.com/gpu in the Capacity and Allocatable sections on each of your nodes, like I see on my single node?
They say beta.amd.com/gpu.
Show me the logs of one of your amdgpu plugin daemonset pods (you should have 3 running). Here's mine for an example:
➜ nathan@rodan ~ kubectl logs amdgpu-device-plugin-daemonset-6ks4c -n kube-system
I0123 02:50:09.718359 1 main.go:305] AMD GPU device plugin for Kubernetes
I0123 02:50:09.718409 1 main.go:305] ./k8s-device-plugin version v1.25.2.6-5-g4503704
I0123 02:50:09.718412 1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I0123 02:50:09.718417 1 manager.go:42] Starting device plugin manager
I0123 02:50:09.718422 1 manager.go:46] Registering for system signal notifications
I0123 02:50:09.718589 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I0123 02:50:09.718652 1 manager.go:60] Starting Discovery on new plugins
I0123 02:50:09.718663 1 manager.go:66] Handling incoming signals
I0123 02:50:09.718679 1 manager.go:71] Received new list of plugins: [gpu]
I0123 02:50:09.718706 1 manager.go:110] Adding a new plugin "gpu"
I0123 02:50:09.718726 1 plugin.go:64] gpu: Starting plugin server
I0123 02:50:09.718733 1 plugin.go:94] gpu: Starting the DPI gRPC server
I0123 02:50:09.719399 1 plugin.go:112] gpu: Serving requests...
I0123 02:50:19.720886 1 plugin.go:128] gpu: Registering the DPI with Kubelet
I0123 02:50:19.722865 1 plugin.go:140] gpu: Registration for endpoint amd.com_gpu
I0123 02:50:19.727721 1 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0
I0123 02:50:19.727787 1 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:07:00.0
I0123 02:50:19.756473 1 main.go:149] Watching GPU with bus ID: 0000:01:00.0 NUMA Node: []
E0123 02:50:19.756497 1 main.go:151] No NUMA node found with bus ID: 0000:01:00.0
I0123 02:50:19.756505 1 main.go:149] Watching GPU with bus ID: 0000:07:00.0 NUMA Node: []
E0123 02:50:19.756507 1 main.go:151] No NUMA node found with bus ID: 0000:07:00.0
I0124 04:55:58.585795 1 main.go:224] Allocating device ID: 0000:07:00.0
I0124 04:55:58.585832 1 main.go:224] Allocating device ID: 0000:01:00.0
I0124 05:08:10.915814 1 main.go:224] Allocating device ID: 0000:01:00.0
I0124 05:08:10.915850 1 main.go:224] Allocating device ID: 0000:07:00.0
I0124 05:14:42.111098 1 main.go:224] Allocating device ID: 0000:07:00.0
I0124 05:14:42.111143 1 main.go:224] Allocating device ID: 0000:01:00.0
I have two GPUs on this machine, a Vega7 iGPU and an RX550 PCIe card, which is why there are 2 devices found.
Notice the line in the middle: plugin.go:140] gpu: Registration for endpoint amd.com_gpu. The name of the resource is amd.com/gpu. What does yours say?
Here are the logs for the daemonset. There are 3 nodes. The logs are very light.
amd-device-plugin-6bv75 I0127 18:29:58.101908 1 main.go:305] AMD GPU device plugin for Kubernetes
amd-device-plugin-6bv75 I0127 18:29:58.101961 1 main.go:305] ./k8s-device-plugin version v1.25.2.7-0-g4503704
amd-device-plugin-6bv75 I0127 18:29:58.101965 1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
amd-device-plugin-6bv75 I0127 18:29:58.101971 1 manager.go:42] Starting device plugin manager
amd-device-plugin-6bv75 I0127 18:29:58.101980 1 manager.go:46] Registering for system signal notifications
amd-device-plugin-6bv75 I0127 18:29:58.102101 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
amd-device-plugin-6bv75 I0127 18:29:58.102184 1 manager.go:60] Starting Discovery on new plugins
amd-device-plugin-6bv75 I0127 18:29:58.102196 1 manager.go:66] Handling incoming signals
amd-device-plugin-gpxj8 I0127 18:29:52.707491 1 main.go:305] AMD GPU device plugin for Kubernetes
amd-device-plugin-gpxj8 I0127 18:29:52.707545 1 main.go:305] ./k8s-device-plugin version v1.25.2.7-0-g4503704
amd-device-plugin-gpxj8 I0127 18:29:52.707549 1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
amd-device-plugin-gpxj8 I0127 18:29:52.707556 1 manager.go:42] Starting device plugin manager
amd-device-plugin-gpxj8 I0127 18:29:52.707564 1 manager.go:46] Registering for system signal notifications
amd-device-plugin-gpxj8 I0127 18:29:52.707650 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
amd-device-plugin-gpxj8 I0127 18:29:52.707695 1 manager.go:60] Starting Discovery on new plugins
amd-device-plugin-gpxj8 I0127 18:29:52.707700 1 manager.go:66] Handling incoming signals
amd-device-plugin-m29lm I0127 18:29:55.484546 1 main.go:305] AMD GPU device plugin for Kubernetes
amd-device-plugin-m29lm I0127 18:29:55.484601 1 main.go:305] ./k8s-device-plugin version v1.25.2.7-0-g4503704
amd-device-plugin-m29lm I0127 18:29:55.484605 1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
amd-device-plugin-m29lm I0127 18:29:55.484612 1 manager.go:42] Starting device plugin manager
amd-device-plugin-m29lm I0127 18:29:55.484620 1 manager.go:46] Registering for system signal notifications
amd-device-plugin-m29lm I0127 18:29:55.484706 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
amd-device-plugin-m29lm I0127 18:29:55.484762 1 manager.go:60] Starting Discovery on new plugins
amd-device-plugin-m29lm I0127 18:29:55.484769 1 manager.go:66] Handling incoming signals
I never see it registering the gpu plugin.
Interesting, it's waiting for a "signal". Comparing to my logs, the next steps should be:
...
1 manager.go:71] Received new list of plugins: [gpu]
1 manager.go:110] Adding a new plugin "gpu"
...
It looks like the main.go from this ROCm plugin (here) hands off to some k8s plugin manager code. I don't know how that stuff works, or why it's not receiving a signal for your gpu plugin.
I deploy these with FluxCD. My config is here:
https://github.com/dmfrey/home-gitops/tree/main/kubernetes/apps/system/amd-device-plugin/app
I took a deeper look at that code I linked, and it is just the constructor (I don't know Go). It appears the .Run() at the bottom is where the plugin manager starts (here).
Based on this code comment, maybe you don't have the ROCm kernel/driver installed on each of your nodes?
go func() {
    // /sys/class/kfd only exists if ROCm kernel/driver is installed
    var path = "/sys/class/kfd"
    if _, err := os.Stat(path); err == nil {
        l.ResUpdateChan <- []string{"gpu"}
    }
}()
manager.Run()
Does this file/symlink exist on your nodes? ls -la /sys/class/kfd
It exists on my node:
➜ nathan@rodan ~ ls -la /sys/class/kfd
total 0
drwxr-xr-x 2 root root 0 Jan 27 16:56 .
drwxr-xr-x 81 root root 0 Jan 27 16:56 ..
lrwxrwxrwx 1 root root 0 Jan 27 16:56 kfd -> ../../devices/virtual/kfd/kfd
I'm not sure how I got that ROCm module. I don't remember installing it. I'm just running Ubuntu 22.04 server edition.
Maybe I have the wrong extension loaded. Talos Linux has these extension modules, one of which is amdgpu-firmware. I'm guessing that is probably different from the ROCm package.
ls -la /sys/class/kfd
this does not exist on my nodes.
Based on my reading of the pre-reqs links, it's unclear exactly which modules and software you need, but it does say:
ROCm kernel or latest AMD GPU Linux driver
You should be able to confirm whether the amdgpu kernel module is loaded. The lspci -v command shows verbose details of your PCIe devices, including the kernel module associated with each device. Here's mine (I just upgraded my GPU yesterday):
➜ nathan@rodan ~ lspci -v
<skipping lots of other devices...>
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c1) (prog-if 00 [VGA controller])
Subsystem: Sapphire Technology Limited Navi 23 [Radeon RX 6600/6600 XT/6600M]
Flags: bus master, fast devsel, latency 0, IRQ 74
Memory at 7c00000000 (64-bit, prefetchable) [size=8G]
Memory at 7e00000000 (64-bit, prefetchable) [size=256M]
I/O ports at f000 [size=256]
Memory at fcc00000 (32-bit, non-prefetchable) [size=1M]
Expansion ROM at fcd00000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: amdgpu
Kernel modules: amdgpu
Notice the kernel driver in use: amdgpu. When you run the command, does it show an AMD VGA device with that module too?
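If lspci isn't available on your nodes, the same information is exposed in sysfs. A minimal sketch, assuming a POSIX shell on the node (the check_path helper is made up for illustration):

```shell
# Hypothetical helper for illustration: report whether a sysfs path exists.
check_path() {
  if [ -e "$1" ]; then
    echo "$1: present"
  else
    echo "$1: missing"
  fi
}

# /sys/module/amdgpu exists when the amdgpu kernel module is loaded.
check_path /sys/module/amdgpu
# /sys/class/kfd exists only when the ROCm/KFD interface is available.
check_path /sys/class/kfd
```

The first path tells you the graphics driver is loaded; the second is the compute interface the device plugin actually checks for, per the code comment above.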
On a side note: I found an MR for Talos where someone wanted ROCm support too: siderolabs/extensions#39. The MR looks abandoned, but recent comments link to another MR for the AMD firmware that got merged. You have the firmware extension, so I wonder why it's not working. I suggest following up with the Talos community; maybe they have a Discord or something like that. You could point to that MR and maybe they can help you figure out why ROCm isn't working.
I couldn't find a pod with lspci installed, so I started a busybox pod. Its output is very limited, but amdgpu is loaded.
05:00.0 Class 0300: 1002:1638 amdgpu
I'll reach out on the Talos Slack channel. Thanks for all your help.
I've posted to the Slack support channel and created an issue to track it in their GitHub: