ROCm / k8s-device-plugin

Kubernetes (k8s) device plugin to enable registration of AMD GPU to a container cluster

EKS Support

aldmbmtl opened this issue · comments

Hello!

I am trying to get this to work on EKS, but the device plugin doesn't seem to see the GPU. I am using g4ad instances with a custom AMI running the latest version of the AMD GPU Pro drivers that I could get from Amazon (20.20). When scaling from zero, the cluster autoscaler also isn't detecting the resource "amd.com/gpu: 1", but I think that is a separate problem, and fixing it would not solve this one.

When I launch a node and then deploy the device plugin, the pod still won't be scheduled to the node. Any idea as to why?

```
I0724 03:19:09.793086       1 main.go:305] ./k8s-device-plugin version v1.18.1-21-g2e5bbc7
I0724 03:19:09.793089       1 main.go:305] hwloc: _VERSION: 2.9.1, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I0724 03:19:09.793105       1 manager.go:42] Starting device plugin manager
I0724 03:19:09.793108       1 manager.go:46] Registering for system signal notifications
I0724 03:19:09.793346       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I0724 03:19:09.793400       1 manager.go:60] Starting Discovery on new plugins
I0724 03:19:09.793416       1 manager.go:66] Handling incoming signals
```

This is the log from the device plugin manager. I assume I should be seeing something else? We would love to move off NVIDIA for our containerized workstations, but this has been blocking us. I assume it is because AWS doesn't seem to want to support Radeon :disappointed:

Thanks!