allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs

GPU Utilization for MIG devices

uzmargomez opened this issue

Proposal Summary

I'm currently using MIG devices to spin up agents that serve different ClearML queues. However, the Workers CPU and GPU usage graph shows no GPU usage information. I couldn't find documentation on this, so correct me if I'm wrong, but presumably the graph is built by reading the DCGM_FI_DEV_GPU_UTIL metric, which is not enabled for MIG.

According to this thread, the utilization metric DCGM_FI_DEV_GPU_UTIL is not supported because it is "outdated and with several limitations", and the suggestion there is to use the DCGM_FI_PROF_* metrics instead.

Would it be possible to update the way this graph is obtained? Thanks in advance for your help!
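To illustrate the kind of fallback being asked for, here is a minimal sketch that parses Prometheus-style metric text and prefers DCGM_FI_DEV_GPU_UTIL when present, otherwise falling back to DCGM_FI_PROF_GR_ENGINE_ACTIVE. Several things here are assumptions, not anything confirmed in this thread: that the graph is fed from DCGM metrics at all, that DCGM_FI_PROF_GR_ENGINE_ACTIVE (one member of the DCGM_FI_PROF_* family, reported as a 0 to 1 ratio) is an acceptable stand-in for utilization, and that the `gpu` and `GPU_I_ID` labels identify a MIG instance. The sample values are invented, and the function names are hypothetical.

```python
import re

# One exposition line: metric_name{label="value",...} number
_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')

# Invented sample text; label names are an assumption about the exporter output.
SAMPLE = """
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_ID="1",GPU_I_PROFILE="1g.5gb"} 0.25
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_ID="2",GPU_I_PROFILE="1g.5gb"} 0.5
"""


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def gpu_utilization_percent(samples):
    """Map (gpu, MIG instance id) to a utilization percentage.

    DCGM_FI_DEV_GPU_UTIL wins when it exists (already a percentage);
    otherwise DCGM_FI_PROF_GR_ENGINE_ACTIVE is scaled from a ratio to percent.
    """
    result = {}
    for name, labels, value in samples:
        key = (labels.get("gpu", ""), labels.get("GPU_I_ID", ""))
        if name == "DCGM_FI_DEV_GPU_UTIL":
            result[key] = value
        elif name == "DCGM_FI_PROF_GR_ENGINE_ACTIVE":
            # Only used when no DEV_GPU_UTIL sample was recorded for this device.
            result.setdefault(key, value * 100.0)
    return result


if __name__ == "__main__":
    for device, percent in sorted(gpu_utilization_percent(parse_metrics(SAMPLE)).items()):
        print(device, percent)
```

Note that graphics engine activity is not an exact equivalent of the legacy utilization counter, so the two numbers may differ on the same workload.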

Hi @uzmargomez, which version of clearml-agent are you using?

Hey @jkhenning, I'm using the clearml-agent-5.1.0 Helm chart.