Prometheus metrics of GPU resources
boniek83 opened this issue · comments
I want to monitor GPU statistics for pods that have a GPU assigned. I'm aware of RDC, but it is not sufficient for this: the metrics it exposes carry no pod label, so they can't be attributed to a specific pod.
I'm looking for something like what is described in https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/, specifically the "Per-pod GPU metrics in a Kubernetes cluster" section.
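To make the request concrete, here is a rough sketch of the kind of output I mean: per-GPU samples joined with a GPU-to-pod mapping and rendered in the Prometheus text format with `namespace` and `pod` labels. This is not RDC's or DCGM's API. The function names, the metric name `gpu_utilization_percent`, and the label names are all made up for illustration, and the mapping is hard-coded here; in a real cluster it would have to come from something like the kubelet pod-resources API.

```python
from typing import Dict, Tuple


def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_per_pod_metrics(
    samples: Dict[int, float],
    gpu_to_pod: Dict[int, Tuple[str, str]],
    metric_name: str = "gpu_utilization_percent",  # hypothetical metric name
) -> str:
    """Render GPU samples as Prometheus text, adding pod labels where known.

    samples:    GPU index -> sampled value (hypothetical input shape)
    gpu_to_pod: GPU index -> (namespace, pod name) for GPUs assigned to a pod
    """
    lines = [f"# TYPE {metric_name} gauge"]
    for gpu_index in sorted(samples):
        labels = {"gpu_index": str(gpu_index)}
        owner = gpu_to_pod.get(gpu_index)
        if owner is not None:
            # Only GPUs that are assigned to a pod get the extra labels.
            namespace, pod = owner
            labels["namespace"] = namespace
            labels["pod"] = pod
        rendered = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
        )
        lines.append(f"{metric_name}{{{rendered}}} {samples[gpu_index]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example data only: GPU 0 is assigned to a pod, GPU 1 is idle.
    print(render_per_pod_metrics({0: 87.5, 1: 12.0}, {0: ("ml", "trainer-0")}), end="")
```

With labels like these, a PromQL query can filter or aggregate GPU usage by pod, which is what the unlabelled metrics don't allow.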
I've implemented this feature and opened a merge request. If anyone is interested, please take a look:
ROCm/rdc#1