ROCm / k8s-device-plugin

Kubernetes (k8s) device plugin to enable registration of AMD GPU to a container cluster

Permissions for the /dev/{kfd,dri/renderXXXX} devices in containers

elukey opened this issue

Hi folks!

I am trying out the AMD device plugin on my system, deployed as a systemd unit on Debian 11 (so not as a DaemonSet, but directly on the K8s node). Everything works fine, and I can see two devices in my test container:

  • /dev/kfd
  • /dev/dri/renderD128

I am trying to run the container as an unprivileged user, such as nobody, but I am struggling to assign the proper permissions to the above devices. In the container I see something like the following (tested via nsenter):

root@alexnet-tf-gpu-pod:/# ls -l /dev/kfd 
crw-rw---- 1 root 106 242, 0 Apr 18 15:58 /dev/kfd

root@alexnet-tf-gpu-pod:/# ls -l /dev/dri/renderD128 
crw-rw---- 1 root 106 226, 128 Apr 18 15:58 /dev/dri/renderD128

GID 106 is the render group on the underlying "bare metal" K8s worker OS, and it is carried over into the test container unchanged. Because of that, I don't see a clean way to add nobody to render (or an equivalent group) in the Docker image. Is there a best practice that you can suggest?

Thanks in advance!

In the pod's securityContext, you can set supplementalGroups to give the container's processes additional group IDs. Doing so enabled me to use the hardware.

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/#podsecuritycontext-v1-core
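
A minimal sketch of what such a pod spec might look like, assuming the render group on the node has GID 106 as in the ls output above. The container name and image are placeholders, and 65534 is only the usual UID/GID for nobody on Debian-based images, so check the values in your own image. The GID has to match the render group on whichever node the pod lands on, and it can differ between hosts.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alexnet-tf-gpu-pod
spec:
  securityContext:
    runAsUser: 65534         # assumed UID of "nobody" in the image
    runAsGroup: 65534        # assumed primary GID ("nogroup" on Debian)
    supplementalGroups:
      - 106                  # GID owning /dev/kfd and /dev/dri/renderD128 on the node
  containers:
    - name: alexnet-tf-gpu                       # hypothetical container name
      image: example.org/alexnet-tf-gpu:latest   # placeholder image
      resources:
        limits:
          amd.com/gpu: 1     # resource name registered by the AMD device plugin
```

With this in place, running id inside the container should list 106 among the groups, which is what the crw-rw---- mode on the two devices requires for a non-root user.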