Is AMD Radeon Vega 8 supported?
dmfrey opened this issue
I couldn't find any information regarding this.
When I deploy both the device plugin and labeler to my cluster, my nodes get labeled like:
Labels: beta.amd.com/gpu.device-id.1638=1
beta.amd.com/gpu.family.RV=1
beta.amd.com/gpu.vram.1G=1
What I don't see, however, are labels like amd.com/gpu.
Please see the ROCm system requirements here:
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html
@dmfrey It's working for me on Vega7 (Ryzen 4650G), so it should work on Vega8. But I'm just using it for hardware video encoding, nothing else.
Regarding the labels, it seems to me that you're mixing up node labels and resource limits/requests. The labels ensure that your pod lands on a node with the right kind of GPU, in case you have a large cluster with many different nodes and kinds of GPUs. The resources section, however, is what actually gets a GPU scheduled/mapped into your Pod. In my case, I'm running a single-node cluster for personal use, so I don't need node labels; all my pods run on the same node. But I did need the resources section in my Pod definition.
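For example, a minimal Pod spec might look like this (pod name and image are placeholders; the nodeSelector is optional and uses one of the labels from above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test            # placeholder name
spec:
  # Optional: only needed on a mixed cluster, to pin the pod to
  # nodes carrying a specific label from the node labeller.
  nodeSelector:
    beta.amd.com/gpu.family.RV: "1"
  containers:
    - name: app
      image: ubuntu:22.04   # placeholder image
      resources:
        limits:
          amd.com/gpu: 1    # this is what actually maps a GPU into the pod
```

The resource limit is the part that makes the kubelet hand a GPU device to the container; the label selector only constrains which nodes are candidates.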
@nlflint This is a home lab as well, 3 nodes, each identical, so labels probably aren't needed either.
These are the boxes I'm running on: https://www.geekompc.com/geekom-a5-mini-pc/
My case is just like yours. I wish to use these for transcoding, specifically with the Tdarr app from the *arr stack, and for Plex transcoding. However, when I put a resource request for the GPU on the pods, the deployment fails: it complains that no nodes are available with that resource.
@dmfrey If you do a kubectl describe nodes, do you see amd.com/gpu in the Capacity and Allocatable sections on each of your nodes, like I see on my single node?
They say beta.amd.com/gpu.
Show me the logs of one of your amdgpu plugin daemonset pods (you should have 3 running). Here's mine for an example:
➜ nathan@rodan ~ kubectl logs amdgpu-device-plugin-daemonset-6ks4c -n kube-system
I0123 02:50:09.718359 1 main.go:305] AMD GPU device plugin for Kubernetes
I0123 02:50:09.718409 1 main.go:305] ./k8s-device-plugin version v1.25.2.6-5-g4503704
I0123 02:50:09.718412 1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I0123 02:50:09.718417 1 manager.go:42] Starting device plugin manager
I0123 02:50:09.718422 1 manager.go:46] Registering for system signal notifications
I0123 02:50:09.718589 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I0123 02:50:09.718652 1 manager.go:60] Starting Discovery on new plugins
I0123 02:50:09.718663 1 manager.go:66] Handling incoming signals
I0123 02:50:09.718679 1 manager.go:71] Received new list of plugins: [gpu]
I0123 02:50:09.718706 1 manager.go:110] Adding a new plugin "gpu"
I0123 02:50:09.718726 1 plugin.go:64] gpu: Starting plugin server
I0123 02:50:09.718733 1 plugin.go:94] gpu: Starting the DPI gRPC server
I0123 02:50:09.719399 1 plugin.go:112] gpu: Serving requests...
I0123 02:50:19.720886 1 plugin.go:128] gpu: Registering the DPI with Kubelet
I0123 02:50:19.722865 1 plugin.go:140] gpu: Registration for endpoint amd.com_gpu
I0123 02:50:19.727721 1 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:01:00.0
I0123 02:50:19.727787 1 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:07:00.0
I0123 02:50:19.756473 1 main.go:149] Watching GPU with bus ID: 0000:01:00.0 NUMA Node: []
E0123 02:50:19.756497 1 main.go:151] No NUMA node found with bus ID: 0000:01:00.0
I0123 02:50:19.756505 1 main.go:149] Watching GPU with bus ID: 0000:07:00.0 NUMA Node: []
E0123 02:50:19.756507 1 main.go:151] No NUMA node found with bus ID: 0000:07:00.0
I0124 04:55:58.585795 1 main.go:224] Allocating device ID: 0000:07:00.0
I0124 04:55:58.585832 1 main.go:224] Allocating device ID: 0000:01:00.0
I0124 05:08:10.915814 1 main.go:224] Allocating device ID: 0000:01:00.0
I0124 05:08:10.915850 1 main.go:224] Allocating device ID: 0000:07:00.0
I0124 05:14:42.111098 1 main.go:224] Allocating device ID: 0000:07:00.0
I0124 05:14:42.111143 1 main.go:224] Allocating device ID: 0000:01:00.0
I have two GPUs on this machine, a Vega7 iGPU and an RX550 PCIe card, which is why there are 2 devices found.
Notice the line in the middle: plugin.go:140] gpu: Registration for endpoint amd.com_gpu. The name of the resource is amd.com/gpu. What does yours say?
Here are the logs for the daemonset. There are 3 nodes. The logs are very light.
amd-device-plugin-6bv75 I0127 18:29:58.101908 1 main.go:305] AMD GPU device plugin for Kubernetes
amd-device-plugin-6bv75 I0127 18:29:58.101961 1 main.go:305] ./k8s-device-plugin version v1.25.2.7-0-g4503704
amd-device-plugin-6bv75 I0127 18:29:58.101965 1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
amd-device-plugin-6bv75 I0127 18:29:58.101971 1 manager.go:42] Starting device plugin manager
amd-device-plugin-6bv75 I0127 18:29:58.101980 1 manager.go:46] Registering for system signal notifications
amd-device-plugin-6bv75 I0127 18:29:58.102101 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
amd-device-plugin-6bv75 I0127 18:29:58.102184 1 manager.go:60] Starting Discovery on new plugins
amd-device-plugin-6bv75 I0127 18:29:58.102196 1 manager.go:66] Handling incoming signals
amd-device-plugin-gpxj8 I0127 18:29:52.707491 1 main.go:305] AMD GPU device plugin for Kubernetes
amd-device-plugin-gpxj8 I0127 18:29:52.707545 1 main.go:305] ./k8s-device-plugin version v1.25.2.7-0-g4503704
amd-device-plugin-gpxj8 I0127 18:29:52.707549 1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
amd-device-plugin-gpxj8 I0127 18:29:52.707556 1 manager.go:42] Starting device plugin manager
amd-device-plugin-gpxj8 I0127 18:29:52.707564 1 manager.go:46] Registering for system signal notifications
amd-device-plugin-gpxj8 I0127 18:29:52.707650 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
amd-device-plugin-gpxj8 I0127 18:29:52.707695 1 manager.go:60] Starting Discovery on new plugins
amd-device-plugin-gpxj8 I0127 18:29:52.707700 1 manager.go:66] Handling incoming signals
amd-device-plugin-m29lm I0127 18:29:55.484546 1 main.go:305] AMD GPU device plugin for Kubernetes
amd-device-plugin-m29lm I0127 18:29:55.484601 1 main.go:305] ./k8s-device-plugin version v1.25.2.7-0-g4503704
amd-device-plugin-m29lm I0127 18:29:55.484605 1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
amd-device-plugin-m29lm I0127 18:29:55.484612 1 manager.go:42] Starting device plugin manager
amd-device-plugin-m29lm I0127 18:29:55.484620 1 manager.go:46] Registering for system signal notifications
amd-device-plugin-m29lm I0127 18:29:55.484706 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
amd-device-plugin-m29lm I0127 18:29:55.484762 1 manager.go:60] Starting Discovery on new plugins
amd-device-plugin-m29lm I0127 18:29:55.484769 1 manager.go:66] Handling incoming signals
I never see it registering the gpu plugin.
Interesting, it's waiting for a "signal". Comparing to my logs, the next steps should be:
...
1 manager.go:71] Received new list of plugins: [gpu]
1 manager.go:110] Adding a new plugin "gpu"
...
It looks like the main.go from this ROCm plugin (here) hands off to some k8s plugin manager code. I don't know how that stuff works, or why it's not receiving a signal for your gpu plugin.
I deploy these with FluxCD. My config is here:
https://github.com/dmfrey/home-gitops/tree/main/kubernetes/apps/system/amd-device-plugin/app
I took a deeper look at that code I linked, and it is just the constructor (I don't know Go). It appears the .Run() at the bottom is where the plugin manager starts (here).
Based on this code comment, maybe you don't have the ROCm kernel/driver installed on each of your nodes?
go func() {
    // /sys/class/kfd only exists if ROCm kernel/driver is installed
    var path = "/sys/class/kfd"
    if _, err := os.Stat(path); err == nil {
        l.ResUpdateChan <- []string{"gpu"}
    }
}()
manager.Run()
Does this file/symlink exist on your nodes? ls -la /sys/class/kfd
It exists on my node:
➜ nathan@rodan ~ ls -la /sys/class/kfd
total 0
drwxr-xr-x 2 root root 0 Jan 27 16:56 .
drwxr-xr-x 81 root root 0 Jan 27 16:56 ..
lrwxrwxrwx 1 root root 0 Jan 27 16:56 kfd -> ../../devices/virtual/kfd/kfd
I'm not sure how I got that ROCm module. I don't remember installing it. I'm just running Ubuntu 22.04 server edition.
Maybe I have the wrong extension loaded. Talos Linux has these extension modules, one of which is amdgpu-firmware. I'm guessing that is probably different from the ROCm package.
ls -la /sys/class/kfd
this does not exist on my nodes.
Based on my reading of the pre-reqs links, it's unclear exactly which modules and software you need, but it does say:
ROCm kernel or latest AMD GPU Linux driver
You should be able to confirm whether the amdgpu kernel module is loaded. The lspci -v command shows verbose details of your PCIe devices, including the kernel module associated with each device. Here's mine (I just upgraded my GPU yesterday):
➜ nathan@rodan ~ lspci -v
<skipping lots of other devices...>
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c1) (prog-if 00 [VGA controller])
Subsystem: Sapphire Technology Limited Navi 23 [Radeon RX 6600/6600 XT/6600M]
Flags: bus master, fast devsel, latency 0, IRQ 74
Memory at 7c00000000 (64-bit, prefetchable) [size=8G]
Memory at 7e00000000 (64-bit, prefetchable) [size=256M]
I/O ports at f000 [size=256]
Memory at fcc00000 (32-bit, non-prefetchable) [size=1M]
Expansion ROM at fcd00000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: amdgpu
Kernel modules: amdgpu
Notice the kernel driver in use: amdgpu. When you run the command, does it show an AMD VGA device with that module too?
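If lspci isn't available on your nodes, the same information is exposed in sysfs. A minimal sketch, assuming a POSIX shell on the node (the check_path helper is made up for illustration):

```shell
# Hypothetical helper for illustration: report whether a sysfs path exists.
check_path() {
  if [ -e "$1" ]; then
    echo "$1: present"
  else
    echo "$1: missing"
  fi
}

# /sys/module/amdgpu exists when the amdgpu kernel module is loaded.
check_path /sys/module/amdgpu
# /sys/class/kfd exists only when the ROCm/KFD interface is available.
check_path /sys/class/kfd
```

The first path tells you the graphics driver is loaded; the second is the compute interface the device plugin actually checks for, per the code comment above.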
On a side note: I found an MR for Talos where someone wanted ROCm support too: siderolabs/extensions#39. The MR looks abandoned, but recent comments link to another MR for the AMD firmware that got merged. You have the firmware extension, so I wonder why it's not working. I suggest following up with the Talos community; maybe they have a Discord or something like that. You could point to that MR and maybe they can help you figure out why ROCm isn't working.
I couldn't find a pod with lspci installed, so I started a busybox pod. Its output is very limited, but amdgpu is loaded.
05:00.0 Class 0300: 1002:1638 amdgpu
I'll reach out on the Talos Slack channel. Thanks for all your help.
I've posted to the Slack support channel and created an issue to track it in their GitHub: