GetProcessUtilization says: Insufficient Size

Question

qisikai opened this issue 3 years ago

I run PyTorch in a Docker container on the 4th GPU.
I run go-nvml in the host environment.

Code:

_, ret := dev.GetProcessUtilization(ts)
if ret != nvml.SUCCESS {
	log.Printf("[x] Unable to call GetProcessUtilization", nvml.ErrorString(ret))
}

OUTPUTS:

   Unable to call GetProcessUtilization%!(EXTRA string=Insufficient Size)
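
A side note on the output above: the `%!(EXTRA string=Insufficient Size)` suffix comes from Go's fmt package, which appends it when an argument is passed but the format string has no verb to consume it. The NVML error itself is just "Insufficient Size". A minimal sketch of the logging call with a verb added; `describeFailure` is a hypothetical helper, and the error text is passed in as a plain string here rather than taken from `nvml.ErrorString(ret)`:

```go
package main

import "fmt"

// describeFailure is a hypothetical helper. errText stands in for the
// string that nvml.ErrorString(ret) would return in the real program.
func describeFailure(errText string) string {
	// The %s verb consumes errText, so fmt has nothing left to flag as EXTRA.
	return fmt.Sprintf("[x] Unable to call GetProcessUtilization: %s", errText)
}

func main() {
	// Prints: [x] Unable to call GetProcessUtilization: Insufficient Size
	fmt.Println(describeFailure("Insufficient Size"))
}
```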

qisikai · Answer 1 · Sun Feb 28 2021 21:00:42 GMT+0800 (China Standard Time)

I modified the code, and this version works:

var ProcessSamplesCount uint32 = 100
Utilization := make([]ProcessUtilizationSample, 100)
ret := nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)

Evan Lezar · Answer 2 · Mon Mar 01 2021 17:00:50 GMT+0800 (China Standard Time)

Thanks @qisikai. It seems that the first call to nvmlDeviceGetProcessUtilization, which should set the number of samples, is not returning the expected value in ProcessSamplesCount. Could you check what value ProcessSamplesCount has after the first call?

As a matter of interest, what is the timestamp value that you pass?

qisikai · Answer 3 · Mon Mar 01 2021 17:10:49 GMT+0800 (China Standard Time)

@elezar Thanks.

1

Value: ProcessSamplesCount = 100 (the same value on all the GPUs, with or without load).

2

ts := uint64(time.Now().Unix()-tm) * 1000
_, ret := dev.GetProcessUtilization(ts)

I have tried ts = 0, ts = current second * 1000, and ts = (current second - 10 s) * 1000 microseconds,
but none of them work.

3

I found:

When calling ret := nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp), ProcessSamplesCount must be 100; 0 or 10 doesn't work.
&Utilization[0] must not be nil.

4

The following code works:

// nvml.DeviceGetProcessUtilization()
func DeviceGetProcessUtilization(Device Device, LastSeenTimeStamp uint64) ([]ProcessUtilizationSample, Return) {
	var ProcessSamplesCount uint32 = 100
	Utilization := make([]ProcessUtilizationSample, 100)

	ret := nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)
	if ret != SUCCESS {
		return nil, ret
	}
	if ProcessSamplesCount >= 100 || ProcessSamplesCount == 0 {
		return []ProcessUtilizationSample{}, ret
	}
	UtilizationRes := make([]ProcessUtilizationSample, 0)
	var i uint32 = 0
	for ; i < ProcessSamplesCount; i++ {
		UtilizationRes = append(UtilizationRes, Utilization[i])
	}
	return UtilizationRes, ret
}

Kevin Klues · Answer 4 · Mon Mar 01 2021 17:12:24 GMT+0800 (China Standard Time)

I checked with the NVML maintainer last night on this.

Apparently, the first call to get the sample size is meant to return an NVML_ERROR_INSUFFICIENT_SIZE and not an NVML_SUCCESS.

So we need to change the code to:

func DeviceGetProcessUtilization(Device Device, LastSeenTimeStamp uint64) ([]ProcessUtilizationSample, Return) {
	var ProcessSamplesCount uint32
	ret := nvmlDeviceGetProcessUtilization(Device, nil, &ProcessSamplesCount, LastSeenTimeStamp)
	if ret != ERROR_INSUFFICIENT_SIZE {
		return nil, ret
	}
	if ProcessSamplesCount == 0 {
		return []ProcessUtilizationSample{}, ret
	}
	Utilization := make([]ProcessUtilizationSample, ProcessSamplesCount)
	ret = nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)
	return Utilization[:ProcessSamplesCount], ret
}
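
The two-call pattern in the function above can be tried without a GPU. The sketch below is only an illustration: `Return`, `Sample`, and `fakeGetProcessUtilization` are local stand-ins invented for this example, not the go-nvml definitions, and the fake imitates the behavior described in this comment (a size query answers with ERROR_INSUFFICIENT_SIZE and fills in the count). Unlike the function above, it returns nil if the second call fails, to avoid slicing past the buffer.

```go
package main

import "fmt"

// Return mimics the shape of NVML return codes; a local stand-in only.
type Return int

const (
	SUCCESS Return = iota
	ERROR_INSUFFICIENT_SIZE
)

// Sample is a simplified, hypothetical stand-in for ProcessUtilizationSample.
type Sample struct {
	Pid    uint32
	SmUtil uint32
}

// fakeGetProcessUtilization imitates the behavior reported for the C call:
// with a nil or too-small buffer it writes the required count and returns
// ERROR_INSUFFICIENT_SIZE; with a large enough buffer it fills the buffer.
func fakeGetProcessUtilization(buf []Sample, count *uint32) Return {
	available := []Sample{{Pid: 1234, SmUtil: 40}, {Pid: 5678, SmUtil: 7}}
	if buf == nil || int(*count) < len(available) {
		*count = uint32(len(available))
		return ERROR_INSUFFICIENT_SIZE
	}
	*count = uint32(copy(buf, available))
	return SUCCESS
}

// getSamples applies the two-call pattern: query the size, then fetch.
func getSamples() ([]Sample, Return) {
	var count uint32
	ret := fakeGetProcessUtilization(nil, &count)
	if ret != ERROR_INSUFFICIENT_SIZE {
		// Anything else on the size query is passed straight through.
		return nil, ret
	}
	if count == 0 {
		return []Sample{}, ret
	}
	buf := make([]Sample, count)
	ret = fakeGetProcessUtilization(buf, &count)
	if ret != SUCCESS {
		return nil, ret
	}
	return buf[:count], ret
}

func main() {
	samples, ret := getSamples()
	// Prints: 2 true
	fmt.Println(len(samples), ret == SUCCESS)
}
```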

Kevin Klues · Answer 5 · Mon Mar 01 2021 17:15:48 GMT+0800 (China Standard Time)

Unfortunately, this isn't obvious from the documentation:
https://github.com/NVIDIA/go-nvml/blob/master/gen/nvml/nvml.h#L5844

Which is why we missed it the first time around.

Kevin Klues · Answer 6 · Mon Mar 01 2021 17:17:19 GMT+0800 (China Standard Time)

@qisikai Can you run the variant of the function above and make sure it fixes things for you?

qisikai · Answer 7 · Mon Mar 01 2021 17:17:53 GMT+0800 (China Standard Time)

@qisikai Can you run the variant of the function above and make sure it fixes things for you.

Kevin Klues · Answer 8 · Mon Mar 01 2021 17:23:53 GMT+0800 (China Standard Time)

@qisikai Did you mean to write more in your comment above?

qisikai · Answer 9 · Mon Mar 01 2021 17:27:56 GMT+0800 (China Standard Time)

@qisikai Can you run the variant of the function above and make sure it fixes things for you.

@elezar @klueska Thanks. The code works and returns info for running processes, but I ran into the following situation on GPUs without any load:

I added a Printf statement:

func DeviceGetProcessUtilization(Device Device, LastSeenTimeStamp uint64) ([]ProcessUtilizationSample, Return) {
	var ProcessSamplesCount uint32
	ret := nvmlDeviceGetProcessUtilization(Device, nil, &ProcessSamplesCount, LastSeenTimeStamp)
	if ret != ERROR_INSUFFICIENT_SIZE {
		return nil, ret
	}
	if ProcessSamplesCount == 0 {
		return []ProcessUtilizationSample{}, ret
	}
	log.Printf("ProcessSamplesCount= %d \n", ProcessSamplesCount)
	Utilization := make([]ProcessUtilizationSample, ProcessSamplesCount)
	ret = nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)
	return Utilization[:ProcessSamplesCount], ret
}

Output

2021/03/01 17:24:19 ProcessSamplesCount= 100

qisikai · Answer 10 · Mon Mar 01 2021 17:31:38 GMT+0800 (China Standard Time)

When there is no process running on a given GPU, DeviceGetProcessUtilization returns a slice with 100 items.

Evan Lezar · Answer 11 · Mon Mar 01 2021 17:39:58 GMT+0800 (China Standard Time)

Thanks for the quick test @qisikai. With regards to:

when there is no process running on certain gpu, DeviceGetProcessUtilization will return a slice with 100 items.

Is this not expected? According to the docs:

One utilization sample structure is returned per process running, that had some non-zero utilization during the last sample period.

Could it be that these are old samples that are being returned?

Kevin Klues · Answer 12 · Mon Mar 01 2021 17:40:04 GMT+0800 (China Standard Time)

I don't have a good answer as to why the API returns 100 if nothing is running, but I just did a quick check on the underlying C API, and it returns the same thing:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>

#include <nvml.h>

int main(void)
{
    nvmlReturn_t ret;
    nvmlDevice_t device;
    /* Start at 0: with a NULL buffer the call only reports the sample count. */
    unsigned int processSamplesCount = 0;

    ret = nvmlInit();
    printf("nvmlInit: %s\n", nvmlErrorString(ret));

    ret = nvmlDeviceGetHandleByIndex(0, &device);
    printf("nvmlDeviceGetHandleByIndex: %s\n", nvmlErrorString(ret));

    ret = nvmlDeviceGetProcessUtilization(device, NULL, &processSamplesCount, 0);
    printf("nvmlDeviceGetProcessUtilization: %u, %s\n", processSamplesCount, nvmlErrorString(ret));

    ret = nvmlShutdown();
    printf("nvmlShutdown: %s\n", nvmlErrorString(ret));
    return 0;
}

$ gcc nvml.c -o nvml -lnvidia-ml
$ ./nvml
nvmlInit: Success
nvmlDeviceGetHandleByIndex: Success
nvmlDeviceGetProcessUtilization: 100, Insufficient Size
nvmlShutdown: Success

qisikai · Answer 13 · Mon Mar 01 2021 17:42:37 GMT+0800 (China Standard Time)

Yes, this is strange behavior.
So I added an if statement:

if ProcessSamplesCount >= 100 || ProcessSamplesCount == 0 {
    return []ProcessUtilizationSample{}, ret
}

Kevin Klues · Answer 14 · Mon Mar 01 2021 17:47:48 GMT+0800 (China Standard Time)

As bindings, our job is simply to pass through the values returned by the underlying C API. So we will make the change to handle the ERROR_INSUFFICIENT_SIZE return value, but we won't be adding the if statement you have above. I would recommend adding that to your application code for now if you need it, i.e.:

utilization, ret := device.GetProcessUtilization(0)
if ret != nvml.SUCCESS {
    .....
}
if len(utilization) >= 100 || len(utilization) == 0 {
    ....
}

I will check with the NVML team in the meantime, to see what they say about this.
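
The application-side check suggested above could be wrapped in a small helper. This is a sketch only: `Sample` is a simplified stand-in for the go-nvml sample type, and `looksIdle` and `idleSentinel` are hypothetical names. The value 100 is the count observed in this thread on GPUs with no running processes, not a documented constant.

```go
package main

import "fmt"

// Sample is a simplified, hypothetical stand-in for nvml.ProcessUtilizationSample.
type Sample struct {
	Pid uint32
}

// idleSentinel is the sample count observed in this thread on idle GPUs.
// It is an observation, not a value documented by NVML.
const idleSentinel = 100

// looksIdle reports whether a result should be treated as "no process data":
// either nothing came back, or the count hit the observed idle value.
func looksIdle(samples []Sample) bool {
	return len(samples) == 0 || len(samples) >= idleSentinel
}

func main() {
	fmt.Println(looksIdle(make([]Sample, 100)))   // true
	fmt.Println(looksIdle([]Sample{{Pid: 1234}})) // false
}
```

One trade-off of this check: a GPU that really has 100 or more active processes would also be treated as idle.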

qisikai · Answer 15 · Mon Mar 01 2021 17:49:19 GMT+0800 (China Standard Time)

@klueska OK, thanks. That is enough. I will add a check after calling GetProcessUtilization.