Prometheus metrics of gpu resources

Question

boniek83 opened this issue 3 years ago

I want to monitor GPU statistics for pods that have a GPU assigned. I'm aware of RDC, but it is not sufficient for this: its metrics have no pod label, so GPU usage can't be attributed to a specific pod.
I'm looking for something like https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/, specifically the "Per-pod GPU metrics in a Kubernetes cluster" section.
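To illustrate the kind of output being asked for, here is a minimal sketch in Python that renders one sample in the Prometheus text exposition format, once without and once with pod attribution. The metric name `gpu_utilization_percent`, the label names, and the helper `render_sample` are hypothetical and chosen only for illustration; they are not what RDC or any other exporter actually emits.

```python
def render_sample(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    def escape(text):
        # Label values must escape backslash, double quote, and newline.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    # Sort labels so the rendered line is deterministic.
    label_text = ",".join(
        f'{key}="{escape(str(val))}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_text}}} {value}"


# Hypothetical node-level sample: no way to tell which pod uses the GPU.
node_only = render_sample("gpu_utilization_percent", {"gpu_index": "0"}, 87)

# Hypothetical per-pod sample: the same reading, attributed to a pod.
per_pod = render_sample(
    "gpu_utilization_percent",
    {"gpu_index": "0", "namespace": "ml", "pod": "trainer-0"},
    87,
)

print(node_only)
print(per_pod)
```

With labels like these, a query could group or filter GPU usage by `pod` and `namespace`, which is not possible when the samples only identify the device.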

Rafał Boniecki · Answer 1 · Wed Apr 13 2022 22:31:36 GMT+0800 (China Standard Time)

I've implemented this feature and opened an MR. If you're interested, please take a look:
ROCm/rdc#1