utkuozdemir / nvidia_gpu_exporter

Nvidia GPU exporter for Prometheus using the nvidia-smi binary


Metric per process/pod

andlogreg opened this issue

Is it possible to see memory utilization per process instead of just the total memory usage on a specific GPU?

If not, this could be quite useful. Given that this information is already available through nvidia-smi, I imagine it should be doable.

The feature does not exist yet, but if it is possible to get that data using nvidia-smi, it should be fairly straightforward to implement. I might look into it at some point, but I don't know when (and no promises).
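The data is indeed available: nvidia-smi's --query-compute-apps flag lists every compute process with its PID, name and used memory. As a rough illustration of what an implementation could look like (the exporter is written in Go), here is a minimal, hypothetical sketch that polls that query and exposes a per-process gauge. The metric name, labels, port and polling interval are assumptions for the example, not anything this project actually defines.

package main

import (
	"encoding/csv"
	"log"
	"net/http"
	"os/exec"
	"strconv"
	"strings"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric; name and labels are illustrative, not the exporter's own.
var processMemory = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "nvidia_smi_compute_process_used_memory_bytes",
		Help: "GPU memory used per compute process, as reported by nvidia-smi.",
	},
	[]string{"gpu_uuid", "pid", "process_name"},
)

func collect() error {
	// With nounits, used_memory is reported as a plain number of MiB.
	out, err := exec.Command("nvidia-smi",
		"--query-compute-apps=gpu_uuid,pid,process_name,used_memory",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		return err
	}
	r := csv.NewReader(strings.NewReader(string(out)))
	r.TrimLeadingSpace = true
	records, err := r.ReadAll()
	if err != nil {
		return err
	}
	processMemory.Reset() // drop series for processes that have exited
	for _, rec := range records {
		if len(rec) != 4 {
			continue
		}
		mib, err := strconv.ParseFloat(rec[3], 64)
		if err != nil {
			continue
		}
		processMemory.WithLabelValues(rec[0], rec[1], rec[2]).Set(mib * 1024 * 1024)
	}
	return nil
}

func main() {
	prometheus.MustRegister(processMemory)
	go func() {
		for {
			if err := collect(); err != nil {
				log.Printf("collect: %v", err)
			}
			time.Sleep(15 * time.Second)
		}
	}()
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9835", nil))
}

One wrinkle a real implementation would need to handle is label churn: PIDs come and go, so resetting the gauge vector before each poll (as above) keeps series for dead processes from lingering in the scrape output.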

👍 for this functionality.

If it is of any use: a few months ago I needed to know which process and user was running what on each GPU of a server (plus memory and elapsed time). Since nvidia-smi did not provide such details, I ended up combining it with simple ps calls (which depend on the host OS, so they may not be a great option). Basically, it went something like this:

# get gpus
nvidia-smi --query-gpu=index,uuid,gpu_name --format=csv

# get running processes
nvidia-smi --query-compute-apps=pid,process_name,gpu_name,gpu_uuid,used_memory --format=csv

# for each process, get the owning user; add -o lstart, -o etimes or -o cmd for other details
ps -ww -p "$pid" -o user=

I initially tried to get everything about each process in one go, but had trouble parsing the output, so I just hacked this up, which served the purpose at the time. The results were then turned into a decent table, taking into account that a process might be running on multiple GPUs and that several processes can share a single GPU.
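For completeness, here is a small hypothetical sketch of the same join in Go (the language this exporter is written in): list compute processes via nvidia-smi, then ask ps for the owning user of each PID. As noted above, the ps step assumes a procps-style host OS, and a process can exit between the two calls, so the lookup has to tolerate failure.

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os/exec"
	"strings"
)

// userForPID shells out to ps; the trailing "=" in "user=" suppresses the header.
func userForPID(pid string) string {
	out, err := exec.Command("ps", "-ww", "-p", pid, "-o", "user=").Output()
	if err != nil {
		return "unknown" // the process may have exited between the two calls
	}
	return strings.TrimSpace(string(out))
}

func main() {
	out, err := exec.Command("nvidia-smi",
		"--query-compute-apps=gpu_uuid,pid,process_name,used_memory",
		"--format=csv,noheader").Output()
	if err != nil {
		log.Fatal(err)
	}
	r := csv.NewReader(strings.NewReader(string(out)))
	r.TrimLeadingSpace = true
	records, err := r.ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	for _, rec := range records {
		if len(rec) != 4 {
			continue
		}
		fmt.Printf("gpu=%s pid=%s user=%s name=%s mem=%s\n",
			rec[0], rec[1], userForPID(rec[1]), rec[2], rec[3])
	}
}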

Hi. Are there any plans for an update on this?

Hi, lately I don't have any time for my personal projects, so I'm afraid this won't land anytime soon.

PRs are always welcome though - if they are of good quality and have tests (so they don't require much effort on my side), I'd merge them and make new releases.

Duplicate of #190, let's track this on that one.