NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)

GetProcessUtilization says: Insufficient Size

qisikai opened this issue · comments

I run PyTorch in a Docker container on the 4th GPU.
I run go-nvml in the host environment.

Code:

_, ret := dev.GetProcessUtilization(ts)
if ret != nvml.SUCCESS {
	log.Printf("[x] Unable to call GetProcessUtilization", nvml.ErrorString(ret))
}

OUTPUTS:

   Unable to call GetProcessUtilization%!(EXTRA string=Insufficient Size)
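A side note on the log line itself: the `%!(EXTRA string=...)` suffix comes from Go's fmt package, not from NVML. It appears when arguments are passed to a Printf-style call whose format string has no verb for them. A minimal sketch of the corrected formatting, where `describeFailure` is a hypothetical helper and a plain string stands in for the result of `nvml.ErrorString(ret)`:

```go
package main

import "fmt"

// describeFailure is a hypothetical helper; msg stands in for the string
// returned by nvml.ErrorString(ret).
func describeFailure(msg string) string {
	// The %s verb consumes msg, so fmt no longer appends an EXTRA marker.
	return fmt.Sprintf("[x] Unable to call GetProcessUtilization: %s", msg)
}

func main() {
	fmt.Println(describeFailure("Insufficient Size"))
}
```

This only tidies the message; the "Insufficient Size" return value is the actual problem discussed in this issue.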

I modified the code, and this version works:

var ProcessSamplesCount uint32 = 100
Utilization := make([]ProcessUtilizationSample, 100)
ret := nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)

Thanks @qisikai. It seems as if the first call to nvmlDeviceGetProcessUtilization, which should be setting the number of samples, is not returning the expected value in ProcessSamplesCount. Could you check what value ProcessSamplesCount has after the first call?

As a matter of interest, what is the timestamp value that you pass?

@elezar Thanks.

1

value: ProcessSamplesCount = 100 (the same value on all the GPUs, with or without load).

2

ts := uint64(time.Now().Unix()-tm) * 1000
_, ret := dev.GetProcessUtilization(ts)

I have tried ts = 0, ts = current second * 1000, and ts = (current second - 10s) * 1000 microseconds,
but none of them work.
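On units: this thread never settles whether the timestamp played a role (the failure turned out to have a different cause). As far as I can tell from the NVML header, sample timestamps are expressed in microseconds, whereas seconds multiplied by 1000 gives milliseconds. A small sketch of building a microsecond cutoff under that assumption; `secondsToMicros` and `cutoffMicros` are hypothetical helpers, not part of go-nvml:

```go
package main

import (
	"fmt"
	"time"
)

// secondsToMicros converts Unix seconds to microseconds.
func secondsToMicros(sec int64) uint64 {
	return uint64(sec) * 1_000_000
}

// cutoffMicros returns the microsecond timestamp that lies `window`
// before nowMicros.
func cutoffMicros(nowMicros int64, window time.Duration) uint64 {
	return uint64(nowMicros - window.Microseconds())
}

func main() {
	ts := cutoffMicros(time.Now().UnixMicro(), 10*time.Second)
	fmt.Println("lastSeenTimeStamp candidate:", ts)
	// ts would then be passed on, e.g. dev.GetProcessUtilization(ts).
}
```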

3

I found:

  1. When calling ret := nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp), ProcessSamplesCount must be 100; 0 or 10 doesn't work.

  2. &Utilization[0] must not be nil.

4

The following code works:

// nvml.DeviceGetProcessUtilization()
func DeviceGetProcessUtilization(Device Device, LastSeenTimeStamp uint64) ([]ProcessUtilizationSample, Return) {
	var ProcessSamplesCount uint32 = 100
	Utilization := make([]ProcessUtilizationSample, 100)

	ret := nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)
	if ret != SUCCESS {
		return nil, ret
	}
	if ProcessSamplesCount >= 100 || ProcessSamplesCount == 0 {
		return []ProcessUtilizationSample{}, ret
	}
	UtilizationRes := make([]ProcessUtilizationSample, 0)
	var i uint32 = 0
	for ; i < ProcessSamplesCount; i++ {
		UtilizationRes = append(UtilizationRes, Utilization[i])
	}
	return UtilizationRes, ret
}

I checked with the NVML maintainer last night on this.

Apparently, the first call to get the sample size is meant to return an NVML_ERROR_INSUFFICIENT_SIZE and not an NVML_SUCCESS.

So we need to change the code to:

func DeviceGetProcessUtilization(Device Device, LastSeenTimeStamp uint64) ([]ProcessUtilizationSample, Return) {
	var ProcessSamplesCount uint32
	ret := nvmlDeviceGetProcessUtilization(Device, nil, &ProcessSamplesCount, LastSeenTimeStamp)
	if ret != ERROR_INSUFFICIENT_SIZE {
		return nil, ret
	}
	if ProcessSamplesCount == 0 {
		return []ProcessUtilizationSample{}, ret
	}
	Utilization := make([]ProcessUtilizationSample, ProcessSamplesCount)
	ret = nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)
	return Utilization[:ProcessSamplesCount], ret
}

Unfortunately, this isn't obvious from the documentation:
https://github.com/NVIDIA/go-nvml/blob/master/gen/nvml/nvml.h#L5844

Which is why we missed it the first time around.
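To try the size-then-fetch control flow without a GPU, the behavior described above can be imitated with a stub. This is a sketch only: `fakeDriver`, `Sample`, and `query` are hypothetical names, the `Return` type and its constants are local stand-ins for the go-nvml ones, and the stub models just what was reported in this thread (a nil or too-small buffer yields ERROR_INSUFFICIENT_SIZE and writes the required size into the count).

```go
package main

import "fmt"

// Return and its constants are local stand-ins for the go-nvml types.
type Return int

const (
	SUCCESS                 Return = 0
	ERROR_INSUFFICIENT_SIZE Return = 7
)

// Sample stands in for ProcessUtilizationSample (two fields only).
type Sample struct {
	Pid    uint32
	SmUtil uint32
}

// fakeDriver is a hypothetical stub for the NVML call.
type fakeDriver struct {
	capacity uint32   // buffer size the driver asks for (100 in this thread)
	samples  []Sample // samples currently available
}

// getProcessUtilization rejects a nil or undersized buffer and reports the
// required size; otherwise it copies the samples and reports how many.
func (d fakeDriver) getProcessUtilization(buf []Sample, count *uint32) Return {
	if buf == nil || *count < d.capacity {
		*count = d.capacity
		return ERROR_INSUFFICIENT_SIZE
	}
	*count = uint32(copy(buf, d.samples))
	return SUCCESS
}

// query follows the same two-call flow as the corrected binding above.
func query(d fakeDriver) ([]Sample, Return) {
	var count uint32
	ret := d.getProcessUtilization(nil, &count)
	if ret != ERROR_INSUFFICIENT_SIZE {
		return nil, ret
	}
	if count == 0 {
		return []Sample{}, ret
	}
	buf := make([]Sample, count)
	ret = d.getProcessUtilization(buf, &count)
	return buf[:count], ret
}

func main() {
	d := fakeDriver{capacity: 100, samples: []Sample{{Pid: 4242, SmUtil: 37}}}
	samples, ret := query(d)
	fmt.Println(len(samples), ret == SUCCESS) // prints "1 true"
}
```

The stub does not model the idle-GPU case; it only shows why a first call with a nil buffer is expected to fail with ERROR_INSUFFICIENT_SIZE rather than succeed.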

@qisikai Can you run the variant of the function above and make sure it fixes things for you?

@qisikai Did you mean to write more in your comment above?

@elezar @klueska Thanks. The code works and returns info for running processes, but I ran into the following situation on GPUs without any load:

I added a Printf statement:

func DeviceGetProcessUtilization(Device Device, LastSeenTimeStamp uint64) ([]ProcessUtilizationSample, Return) {
	var ProcessSamplesCount uint32
	ret := nvmlDeviceGetProcessUtilization(Device, nil, &ProcessSamplesCount, LastSeenTimeStamp)
	if ret != ERROR_INSUFFICIENT_SIZE {
		return nil, ret
	}
	if ProcessSamplesCount == 0 {
		return []ProcessUtilizationSample{}, ret
	}
	log.Printf("ProcessSamplesCount= %d \n", ProcessSamplesCount)
	Utilization := make([]ProcessUtilizationSample, ProcessSamplesCount)
	ret = nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)
	return Utilization[:ProcessSamplesCount], ret
}

Output

2021/03/01 17:24:19 ProcessSamplesCount= 100 

When there is no process running on a given GPU, DeviceGetProcessUtilization returns a slice with 100 items.

Thanks for the quick test @qisikai. With regards to:

when there is no process running on certain gpu, DeviceGetProcessUtilization will return a slice with 100 items.

Is this not expected? According to the docs:

One utilization sample structure is returned per process running, that had some non-zero utilization during the last sample period.

Could it be that these are old samples that are being returned?
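If stale samples are the explanation (a guess raised here, not something confirmed in this thread), an application could discard entries whose timestamp is not newer than the cutoff it passed in. A sketch under that assumption; `Sample` is a local stand-in that mirrors only two fields of `ProcessUtilizationSample`, and `newerThan` is a hypothetical helper:

```go
package main

import "fmt"

// Sample stands in for ProcessUtilizationSample (two fields only).
type Sample struct {
	Pid       uint32
	TimeStamp uint64
}

// newerThan keeps only the samples recorded strictly after lastSeen.
func newerThan(samples []Sample, lastSeen uint64) []Sample {
	kept := make([]Sample, 0, len(samples))
	for _, s := range samples {
		if s.TimeStamp > lastSeen {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	samples := []Sample{{Pid: 10, TimeStamp: 500}, {Pid: 11, TimeStamp: 1500}}
	fmt.Println(newerThan(samples, 1000))
}
```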

I don't have a good answer as to why the API returns 100 if nothing is running, but I just did a quick check on the underlying C API, and it returns the same thing:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>

#include <nvml.h>

int main()
{
    nvmlReturn_t ret;
    nvmlDevice_t device;
    // NVML declares this parameter as unsigned int *.
    unsigned int processSamplesCount = 0;

    ret = nvmlInit();
    printf("nvmlInit: %s\n", nvmlErrorString(ret));

    ret = nvmlDeviceGetHandleByIndex(0, &device);
    printf("nvmlDeviceGetHandleByIndex: %s\n", nvmlErrorString(ret));

    // Size query: no buffer is passed, only the count is requested.
    ret = nvmlDeviceGetProcessUtilization(device, NULL, &processSamplesCount, 0);
    printf("nvmlDeviceGetProcessUtilization: %u, %s\n", processSamplesCount, nvmlErrorString(ret));

    ret = nvmlShutdown();
    printf("nvmlShutdown: %s\n", nvmlErrorString(ret));
    return 0;
}

$ gcc nvml.c -o nvml -lnvidia-ml
$ ./nvml
nvmlInit: Success
nvmlDeviceGetHandleByIndex: Success
nvmlDeviceGetProcessUtilization: 100, Insufficient Size
nvmlShutdown: Success

Yes, this is strange behavior.
So I added an if statement:

if ProcessSamplesCount >= 100 || ProcessSamplesCount == 0 {
    return []ProcessUtilizationSample{}, ret
}

As bindings, our job is to simply pass through the values returned by the underlying C API. So we will make the change to handle ERROR_INSUFFICIENT_SIZE correctly, but we won't be adding the if statement you have above. I would recommend adding that to your application code for now if you need it, i.e.:

utilization, ret := device.GetProcessUtilization(0)
if ret != nvml.SUCCESS {
    .....
}
if len(utilization) >= 100 || len(utilization) == 0 {
    ....
}

I will check with the NVML team in the meantime, to see what they say about this.
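The application-side check could be wrapped in a small helper so callers don't repeat the magic number. A sketch only: `usableSamples` and `idleSentinel` are hypothetical names, the value 100 is simply what was observed in this thread and may differ between drivers, and `Sample` stands in for `nvml.ProcessUtilizationSample`.

```go
package main

import "fmt"

// Sample stands in for nvml.ProcessUtilizationSample.
type Sample struct {
	Pid uint32
}

// idleSentinel is the slice length observed on idle GPUs in this thread.
// It is an assumption, not a documented NVML constant.
const idleSentinel = 100

// usableSamples reports false for an empty result or one that is at least
// idleSentinel long, mirroring the check suggested above.
func usableSamples(samples []Sample) ([]Sample, bool) {
	if len(samples) == 0 || len(samples) >= idleSentinel {
		return nil, false
	}
	return samples, true
}

func main() {
	_, idleOK := usableSamples(make([]Sample, 100))
	busy, busyOK := usableSamples([]Sample{{Pid: 4242}})
	fmt.Println(idleOK, busyOK, len(busy)) // prints "false true 1"
}
```

One trade-off to keep in mind: a GPU that really does have 100 or more active processes would also be treated as idle by this check.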

@klueska Ok, Thanks. That is enough. I will add a check statement after calling GetProcessUtilization.