ROCm / k8s-device-plugin

Kubernetes (k8s) device plugin to enable registration of AMD GPU to a container cluster

Prometheus metrics of gpu resources

boniek83 opened this issue

I want to monitor GPU statistics for pods that have a GPU assigned. I'm aware of RDC, but it is not sufficient: its metrics carry no pod label, so a GPU reading cannot be tied to the pod using that GPU.
I'm looking for something like https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/, specifically the "Per-pod GPU metrics in a Kubernetes cluster" section.
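
To make the request concrete, here is a rough sketch of the output I have in mind, written in the Prometheus text exposition format. It is an illustration only, not RDC's API: the metric name `hypothetical_gpu_utilization_percent`, the label names, and the functions `escape_label_value` and `render_gpu_metrics` are all made up for this example. The GPU-to-pod mapping is assumed to be supplied by the caller; a real exporter would have to obtain it from the cluster, for example from the kubelet pod-resources API.

```python
def escape_label_value(value):
    """Escape a label value per the Prometheus text exposition format."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_gpu_metrics(samples, gpu_to_pod):
    """Render per-GPU samples as Prometheus text lines with pod labels.

    samples:    {gpu_index: {metric_name: value}}   (hypothetical input shape)
    gpu_to_pod: {gpu_index: (namespace, pod_name)}  (assumed to be provided)
    GPUs without an assigned pod get empty namespace and pod labels.
    """
    lines = []
    for gpu_index in sorted(samples):
        namespace, pod = gpu_to_pod.get(gpu_index, ("", ""))
        labels = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in (
                ("gpu_index", gpu_index),
                ("namespace", namespace),
                ("pod", pod),
            )
        )
        for name in sorted(samples[gpu_index]):
            lines.append(f"{name}{{{labels}}} {samples[gpu_index][name]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical readings: GPU 0 is assigned to a pod, GPU 1 is idle.
    samples = {
        0: {"hypothetical_gpu_utilization_percent": 87.0},
        1: {"hypothetical_gpu_utilization_percent": 0.0},
    }
    gpu_to_pod = {0: ("default", "trainer-0")}
    print(render_gpu_metrics(samples, gpu_to_pod), end="")
```

With labels like these, a dashboard could filter or aggregate GPU readings by `pod` and `namespace` rather than by node and device index alone.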

I've implemented this feature and opened a merge request. For anyone interested, please take a look:
ROCm/rdc#1