ROCm / k8s-device-plugin

Kubernetes (k8s) device plugin to enable registration of AMD GPU to a container cluster

Is AMD Radeon Vega 8 supported?

dmfrey opened this issue

I couldn't find any information regarding this.

When I deploy both the device plugin and labeler to my cluster, my nodes get labeled like:

Labels:             beta.amd.com/gpu.device-id.1638=1                                                                                                                                                                                                                                               
                    beta.amd.com/gpu.family.RV=1                                                                                                                                                                                                                                                    
                    beta.amd.com/gpu.vram.1G=1                                                                                                                                                                                                                                                      

What I don't see, however, are labels like amd.com/gpu.

@dmfrey It's working for me on Vega7 (Ryzen 4650G), so it should work on Vega8. But I'm just using it for hardware video encoding, nothing else.

Regarding the labels, it looks like you're mixing up node labels with resource limits/requests. The labels let you make sure a pod runs on a node with the right kind of GPU, in case you have a large cluster with many different nodes and kinds of GPUs. The resources section, on the other hand, is what actually gets a GPU scheduled/mapped into your pod. In my case, I'm running a single-node cluster for personal use, so I don't need node labels; all my pods run on the same node. But I did need the resources section in my Pod definition.
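To make the distinction concrete, here is a minimal sketch of a Pod manifest that asks for the GPU through the resources section rather than through labels. It is written in Go only so it can print the JSON; the pod name and image are hypothetical placeholders, and the only name taken from this thread is the resource amd.com/gpu.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gpuPodManifest builds a minimal Pod manifest as a generic map. Its single
// container sets a limit of `count` on the extended resource amd.com/gpu.
// The name and image arguments are placeholders chosen by the caller.
func gpuPodManifest(name, image string, count int) map[string]any {
	return map[string]any{
		"apiVersion": "v1",
		"kind":       "Pod",
		"metadata":   map[string]any{"name": name},
		"spec": map[string]any{
			"containers": []any{
				map[string]any{
					"name":  name,
					"image": image,
					"resources": map[string]any{
						// Extended resources are requested via limits.
						"limits": map[string]any{
							"amd.com/gpu": fmt.Sprint(count),
						},
					},
				},
			},
		},
	}
}

func main() {
	// Hypothetical pod name and image, for illustration only.
	manifest := gpuPodManifest("transcode-test", "example.invalid/transcoder:latest", 1)
	out, err := json.MarshalIndent(manifest, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The printed JSON could be piped to kubectl apply -f - once a real image is filled in. The pod only schedules if some node actually advertises amd.com/gpu, which is a separate matter from whatever labels the labeler has set.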

@nlflint This is a home lab as well, 3 nodes, each identical, so labels probably aren't needed either.

These are the boxes I'm running on: https://www.geekompc.com/geekom-a5-mini-pc/

My case is just like yours: I want to use these for transcoding, specifically with the TDARR app of the STARR apps and PLEX transcoding. However, if I put a request on the pods to have that GPU available, they fail to deploy to the nodes. It complains that no nodes are available with that resource.

@dmfrey If you run kubectl describe nodes, do you see amd.com/gpu in the capacity and allocatable sections on each of your nodes? Like I see on my single node?
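The same check can be scripted. Below is a rough sketch that reads a Node object as JSON and reports what it advertises for a given resource name. The sample JSON in main is made up and trimmed to the two fields that matter; real input would come from kubectl get node <node-name> -o json.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// node mirrors only the parts of a Kubernetes Node object used here.
type node struct {
	Status struct {
		Capacity    map[string]string `json:"capacity"`
		Allocatable map[string]string `json:"allocatable"`
	} `json:"status"`
}

// resourceCounts returns the capacity and allocatable values that a node
// advertises for the named resource. An empty string means the node does
// not advertise that resource at all.
func resourceCounts(nodeJSON []byte, resource string) (capacity, allocatable string, err error) {
	var n node
	if err = json.Unmarshal(nodeJSON, &n); err != nil {
		return "", "", err
	}
	return n.Status.Capacity[resource], n.Status.Allocatable[resource], nil
}

func main() {
	// Made-up sample shaped like `kubectl get node -o json` output.
	const sample = `{
  "status": {
    "capacity":    {"cpu": "8", "amd.com/gpu": "1"},
    "allocatable": {"cpu": "8", "amd.com/gpu": "1"}
  }
}`
	capacity, allocatable, err := resourceCounts([]byte(sample), "amd.com/gpu")
	if err != nil {
		panic(err)
	}
	if capacity == "" {
		fmt.Println("amd.com/gpu is not advertised by this node")
		return
	}
	fmt.Printf("amd.com/gpu capacity=%s allocatable=%s\n", capacity, allocatable)
}
```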

[image: screenshot of the node's capacity and allocatable sections]

They say beta.amd.com/gpu

Show me the logs of one of your amdgpu plugin daemonset pods (you should have 3 running). Here's mine for an example:

➜ nathan@rodan  ~  kubectl logs amdgpu-device-plugin-daemonset-6ks4c -n kube-system 
I0123 02:50:09.718359       1 main.go:305] AMD GPU device plugin for Kubernetes
I0123 02:50:09.718409       1 main.go:305] ./k8s-device-plugin version v1.25.2.6-5-g4503704
I0123 02:50:09.718412       1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I0123 02:50:09.718417       1 manager.go:42] Starting device plugin manager
I0123 02:50:09.718422       1 manager.go:46] Registering for system signal notifications
I0123 02:50:09.718589       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I0123 02:50:09.718652       1 manager.go:60] Starting Discovery on new plugins
I0123 02:50:09.718663       1 manager.go:66] Handling incoming signals
I0123 02:50:09.718679       1 manager.go:71] Received new list of plugins: [gpu]
I0123 02:50:09.718706       1 manager.go:110] Adding a new plugin "gpu"
I0123 02:50:09.718726       1 plugin.go:64] gpu: Starting plugin server
I0123 02:50:09.718733       1 plugin.go:94] gpu: Starting the DPI gRPC server
I0123 02:50:09.719399       1 plugin.go:112] gpu: Serving requests...
I0123 02:50:19.720886       1 plugin.go:128] gpu: Registering the DPI with Kubelet
I0123 02:50:19.722865       1 plugin.go:140] gpu: Registration for endpoint amd.com_gpu
I0123 02:50:19.727721       1 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0
I0123 02:50:19.727787       1 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:07:00.0
I0123 02:50:19.756473       1 main.go:149] Watching GPU with bus ID: 0000:01:00.0 NUMA Node: []
E0123 02:50:19.756497       1 main.go:151] No NUMA node found with bus ID: 0000:01:00.0
I0123 02:50:19.756505       1 main.go:149] Watching GPU with bus ID: 0000:07:00.0 NUMA Node: []
E0123 02:50:19.756507       1 main.go:151] No NUMA node found with bus ID: 0000:07:00.0
I0124 04:55:58.585795       1 main.go:224] Allocating device ID: 0000:07:00.0
I0124 04:55:58.585832       1 main.go:224] Allocating device ID: 0000:01:00.0
I0124 05:08:10.915814       1 main.go:224] Allocating device ID: 0000:01:00.0
I0124 05:08:10.915850       1 main.go:224] Allocating device ID: 0000:07:00.0
I0124 05:14:42.111098       1 main.go:224] Allocating device ID: 0000:07:00.0
I0124 05:14:42.111143       1 main.go:224] Allocating device ID: 0000:01:00.0

I have two GPUs on this machine, a Vega7 iGPU and an RX550 PCIe card, which is why there are 2 devices found.

Notice the line in the middle shows: 1 plugin.go:140] gpu: Registration for endpoint amd.com_gpu. The name of the resource is amd.com/gpu. What does yours say?
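If you have several plugin pods to go through, a small filter can pull that line out of each log. This is only a sketch: it assumes the registration message always contains the text "Registration for endpoint " exactly as in my log above, which I have not confirmed against the plugin source.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// marker is the text that precedes the endpoint name in the log line
// quoted in this thread. Treat it as an assumption about the log format.
const marker = "Registration for endpoint "

// registeredEndpoints returns every endpoint name announced in the given
// plugin log text, in the order the lines appear.
func registeredEndpoints(logs string) []string {
	var found []string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, marker); i >= 0 {
			found = append(found, strings.TrimSpace(line[i+len(marker):]))
		}
	}
	return found
}

func main() {
	// One line from a log where registration happened.
	working := "I0123 02:50:19.722865       1 plugin.go:140] gpu: Registration for endpoint amd.com_gpu"
	// One line from a log that stops before any plugin is added.
	stalled := "I0127 18:29:58.102196       1 manager.go:66] Handling incoming signals"

	fmt.Println("working:", registeredEndpoints(working))
	fmt.Println("stalled:", registeredEndpoints(stalled))
}
```

An empty result for a pod means its plugin never got as far as registering with the kubelet, so the node cannot advertise the resource.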

Here are the logs for the daemonset. There are 3 nodes. The logs are very light.

amd-device-plugin-6bv75 I0127 18:29:58.101908       1 main.go:305] AMD GPU device plugin for Kubernetes                                                                                                                                                                                             
amd-device-plugin-6bv75 I0127 18:29:58.101961       1 main.go:305] ./k8s-device-plugin version v1.25.2.7-0-g4503704                                                                                                                                                                                 
amd-device-plugin-6bv75 I0127 18:29:58.101965       1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800                                                                                                                                        
amd-device-plugin-6bv75 I0127 18:29:58.101971       1 manager.go:42] Starting device plugin manager                                                                                                                                                                                                 
amd-device-plugin-6bv75 I0127 18:29:58.101980       1 manager.go:46] Registering for system signal notifications                                                                                                                                                                                    
amd-device-plugin-6bv75 I0127 18:29:58.102101       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory                                                                                                                                                 
amd-device-plugin-6bv75 I0127 18:29:58.102184       1 manager.go:60] Starting Discovery on new plugins                                                                                                                                                                                              
amd-device-plugin-6bv75 I0127 18:29:58.102196       1 manager.go:66] Handling incoming signals                                                                                                                                                                                                      

amd-device-plugin-gpxj8 I0127 18:29:52.707491       1 main.go:305] AMD GPU device plugin for Kubernetes                                                                                                                                                                                             
amd-device-plugin-gpxj8 I0127 18:29:52.707545       1 main.go:305] ./k8s-device-plugin version v1.25.2.7-0-g4503704                                                                                                                                                                                 
amd-device-plugin-gpxj8 I0127 18:29:52.707549       1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800                                                                                                                                        
amd-device-plugin-gpxj8 I0127 18:29:52.707556       1 manager.go:42] Starting device plugin manager                                                                                                                                                                                                 
amd-device-plugin-gpxj8 I0127 18:29:52.707564       1 manager.go:46] Registering for system signal notifications                                                                                                                                                                                    
amd-device-plugin-gpxj8 I0127 18:29:52.707650       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory                                                                                                                                                 
amd-device-plugin-gpxj8 I0127 18:29:52.707695       1 manager.go:60] Starting Discovery on new plugins                                                                                                                                                                                              
amd-device-plugin-gpxj8 I0127 18:29:52.707700       1 manager.go:66] Handling incoming signals                                                                                                                                                                                                      

amd-device-plugin-m29lm I0127 18:29:55.484546       1 main.go:305] AMD GPU device plugin for Kubernetes                                                                                                                                                                                             
amd-device-plugin-m29lm I0127 18:29:55.484601       1 main.go:305] ./k8s-device-plugin version v1.25.2.7-0-g4503704                                                                                                                                                                                 
amd-device-plugin-m29lm I0127 18:29:55.484605       1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800                                                                                                                                        
amd-device-plugin-m29lm I0127 18:29:55.484612       1 manager.go:42] Starting device plugin manager                                                                                                                                                                                                 
amd-device-plugin-m29lm I0127 18:29:55.484620       1 manager.go:46] Registering for system signal notifications                                                                                                                                                                                    
amd-device-plugin-m29lm I0127 18:29:55.484706       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory                                                                                                                                                 
amd-device-plugin-m29lm I0127 18:29:55.484762       1 manager.go:60] Starting Discovery on new plugins                                                                                                                                                                                              
amd-device-plugin-m29lm I0127 18:29:55.484769       1 manager.go:66] Handling incoming signals                                                                                                                                                                                                      

I never see it registering the gpu plugin

Interesting, it's waiting for a "signal". Comparing to my logs, the next steps should be:

...
1 manager.go:71] Received new list of plugins: [gpu]
1 manager.go:110] Adding a new plugin "gpu"
...

It looks like main.go in this ROCm plugin (here) enters some k8s plugin manager code. I don't know how that stuff works, or why it's not receiving a signal for your gpu plugin.

I took a deeper look at the code I linked, and it is just the constructor (I don't know Go). It appears the manager.Run() call at the bottom is where the plugin manager starts (here).

Based on this code comment, maybe you don't have the ROCm kernel/driver installed on each of your nodes?

go func() {
	// /sys/class/kfd only exists if ROCm kernel/driver is installed
	var path = "/sys/class/kfd"
	if _, err := os.Stat(path); err == nil {
		l.ResUpdateChan <- []string{"gpu"}
	}
}()
manager.Run()

Does this path exist on your nodes? ls -la /sys/class/kfd
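For reference, the condition in that goroutine can be reproduced as a standalone program. This is a sketch of the same os.Stat test, not the plugin's own code, and the pathExists helper is a name made up for the example. It has to run somewhere that sees the node's /sys to be meaningful.

```go
package main

import (
	"fmt"
	"os"
)

// pathExists reports whether os.Stat succeeds on path. This is the same
// condition the quoted goroutine uses before announcing the "gpu" plugin.
func pathExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	const kfd = "/sys/class/kfd"
	if pathExists(kfd) {
		fmt.Println(kfd, "exists, so the plugin would announce the gpu resource")
	} else {
		fmt.Println(kfd, "is missing, so the plugin would keep waiting")
	}
}
```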

It exists on my node:

➜ nathan@rodan  ~  ls -la /sys/class/kfd
total 0
drwxr-xr-x  2 root root 0 Jan 27 16:56 .
drwxr-xr-x 81 root root 0 Jan 27 16:56 ..
lrwxrwxrwx  1 root root 0 Jan 27 16:56 kfd -> ../../devices/virtual/kfd/kfd

I'm not sure how I got that ROCm module. I don't remember installing it. I'm just running Ubuntu 22.04 server edition.

Maybe I have the wrong extension loaded. Talos Linux has these extension modules, one of which is amdgpu-firmware. I'm guessing that is probably different than the rocm package.

https://github.com/siderolabs/extensions?tab=readme-ov-file

ls -la /sys/class/kfd

This does not exist on my nodes.

Based on my reading of the pre-reqs links, it's unclear exactly which modules and software you need, but it does say:

ROCm kernel or latest AMD GPU Linux driver

You should be able to confirm if the AMDGPU kernel module is loaded. The lspci -v command shows verbose details of your PCIe devices and includes the kernel module associated with each device. Here's mine (Just upgraded my GPU yesterday):

➜ nathan@rodan  ~  lspci -v

<skipping lots of other devices...>

03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c1) (prog-if 00 [VGA controller])
        Subsystem: Sapphire Technology Limited Navi 23 [Radeon RX 6600/6600 XT/6600M]
        Flags: bus master, fast devsel, latency 0, IRQ 74
        Memory at 7c00000000 (64-bit, prefetchable) [size=8G]
        Memory at 7e00000000 (64-bit, prefetchable) [size=256M]
        I/O ports at f000 [size=256]
        Memory at fcc00000 (32-bit, non-prefetchable) [size=1M]
        Expansion ROM at fcd00000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

Notice the kernel driver: amdgpu. When you run the command, does it show an AMD VGA device with that module too?

On a side note: I found an MR for Talos where someone wanted ROCm support too: siderolabs/extensions#39. The MR looks abandoned, but recent comments link to another MR for the AMD firmware that got merged. You have the firmware extension, so I wonder why it's not working. I suggest following up with the Talos community; maybe they have a Discord or something like that. You could point to that MR and maybe they can help you figure out why ROCm isn't working.

I couldn't find a pod with lspci loaded on it, so I started a busybox pod. Its output is very limited, but amdgpu is loaded.

05:00.0 Class 0300: 1002:1638 amdgpu
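That one line carries more than it seems to. Here is a rough sketch that splits it into fields, assuming the busybox layout is always "address Class class: vendor:device driver" as in the line above (I only have this single sample to go on). Vendor 1002 is AMD/ATI and class 03xx is a display controller.

```go
package main

import (
	"fmt"
	"strings"
)

// pciDevice holds the fields of one busybox lspci line.
type pciDevice struct {
	Address, Class, Vendor, Device, Driver string
}

// parseBusyboxLspci parses a line such as
// "05:00.0 Class 0300: 1002:1638 amdgpu". The layout is assumed from the
// single sample in this thread. Driver stays empty if no driver is listed.
func parseBusyboxLspci(line string) (pciDevice, error) {
	f := strings.Fields(line)
	if len(f) < 4 || f[1] != "Class" {
		return pciDevice{}, fmt.Errorf("unrecognized lspci line: %q", line)
	}
	vendor, device, ok := strings.Cut(f[3], ":")
	if !ok {
		return pciDevice{}, fmt.Errorf("missing vendor:device pair in %q", line)
	}
	d := pciDevice{
		Address: f[0],
		Class:   strings.TrimSuffix(f[2], ":"),
		Vendor:  vendor,
		Device:  device,
	}
	if len(f) > 4 {
		d.Driver = f[4]
	}
	return d, nil
}

// isAMDDisplay reports whether the device is an AMD/ATI (vendor 1002)
// display controller (PCI class 03xx).
func isAMDDisplay(d pciDevice) bool {
	return d.Vendor == "1002" && strings.HasPrefix(d.Class, "03")
}

func main() {
	d, err := parseBusyboxLspci("05:00.0 Class 0300: 1002:1638 amdgpu")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v amdDisplay=%v\n", d, isAMDDisplay(d))
}
```

The device ID 1638 matches the beta.amd.com/gpu.device-id.1638 label from the first post, so the labeler and the amdgpu driver are looking at the same GPU. The driver being bound still says nothing about whether /sys/class/kfd exists.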

I'll reach out on the Talos Slack channel. Thanks for all your help.

I've posted to the Slack support channel and created an issue to track it in their GitHub:

siderolabs/extensions#307