GetProcessUtilization says: Insufficient Size
qisikai opened this issue · comments
I run PyTorch in a Docker container, using the 4th GPU.
I run go-nvml in the host environment.
Code:
_, ret := dev.GetProcessUtilization(ts)
if ret != nvml.SUCCESS {
log.Printf("[x] Unable to call GetProcessUtilization", nvml.ErrorString(ret))
}
OUTPUTS:
Unable to call GetProcessUtilization%!(EXTRA string=Insufficient Size)
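A side note on the log line above: the `%!(EXTRA string=Insufficient Size)` suffix is how Go's fmt package reports an argument that has no matching verb in the format string. The sketch below reproduces that with `fmt.Sprintf` (the helper names are made up for this illustration) and shows the same message with a `%s` verb added.

```go
package main

import "fmt"

// formatWithoutVerb mirrors the call shape from the report: the format string
// has no verb, so fmt appends the unused argument as %!(EXTRA ...).
func formatWithoutVerb(msg string) string {
	format := "[x] Unable to call GetProcessUtilization"
	return fmt.Sprintf(format, msg)
}

// formatWithVerb gives the argument a %s verb so it is rendered normally.
func formatWithVerb(msg string) string {
	format := "[x] Unable to call GetProcessUtilization: %s"
	return fmt.Sprintf(format, msg)
}

func main() {
	fmt.Println(formatWithoutVerb("Insufficient Size"))
	fmt.Println(formatWithVerb("Insufficient Size"))
}
```

This only affects how the message is printed; the Insufficient Size return value itself is the actual problem being reported.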
I modified the code, and this version works:
var ProcessSamplesCount uint32 = 100
Utilization := make([]ProcessUtilizationSample, 100)
ret := nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)
Thanks @qisikai. It seems as if the first call to nvmlDeviceGetProcessUtilization, which should set the number of samples, is not returning the expected value in ProcessSamplesCount. Could you check what value ProcessSamplesCount has after the first call?
As a matter of interest, what is the timestamp value that you pass?
@elezar Thanks.
1. Value: ProcessSamplesCount = 100 (the same value on all the GPUs, with or without load).
2. Timestamp:
ts := uint64(time.Now().Unix()-tm) * 1000
_, ret := dev.GetProcessUtilization(ts)
I have tried ts = 0, ts = current second * 1000, and ts = (current second - 10s) * 1000 microseconds, but none of them work.
3. I found that when calling
ret := nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)
- ProcessSamplesCount must be 100; 0 or 10 doesn't work.
- &Utilization[0] must not be nil.
4. The following code works:
// nvml.DeviceGetProcessUtilization()
func DeviceGetProcessUtilization(Device Device, LastSeenTimeStamp uint64) ([]ProcessUtilizationSample, Return) {
	var ProcessSamplesCount uint32 = 100
	Utilization := make([]ProcessUtilizationSample, 100)
	ret := nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)
	if ret != SUCCESS {
		return nil, ret
	}
	if ProcessSamplesCount >= 100 || ProcessSamplesCount == 0 {
		return []ProcessUtilizationSample{}, ret
	}
	UtilizationRes := make([]ProcessUtilizationSample, 0)
	var i uint32 = 0
	for ; i < ProcessSamplesCount; i++ {
		UtilizationRes = append(UtilizationRes, Utilization[i])
	}
	return UtilizationRes, ret
}
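On the timestamp in item 2: multiplying Unix seconds by 1000 gives milliseconds, not microseconds. If the API expects microseconds (my reading of nvml.h is that the sample's timeStamp field is described as a CPU timestamp in microseconds, but treat that as an assumption), the conversion would look roughly like the sketch below. The helper name `lastSeenMicros` is hypothetical. Since ts = 0 also failed, this would not by itself explain the Insufficient Size return.

```go
package main

import (
	"fmt"
	"time"
)

// lastSeenMicros returns a Unix timestamp in microseconds for the instant
// windowSec seconds before nowUnixSec.
// Assumption: the lastSeenTimeStamp argument is compared against sample
// timestamps that are expressed in microseconds.
func lastSeenMicros(nowUnixSec, windowSec int64) uint64 {
	return uint64(nowUnixSec-windowSec) * 1000000
}

func main() {
	now := time.Now().Unix()
	// Seconds * 1000 yields milliseconds, a factor of 1000 too small
	// if microseconds are expected.
	fmt.Println("milliseconds:", uint64(now-10)*1000)
	fmt.Println("microseconds:", lastSeenMicros(now, 10))
}
```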
I checked with the NVML maintainer last night on this.
Apparently, the first call to get the sample size is meant to return an NVML_ERROR_INSUFFICIENT_SIZE and not an NVML_SUCCESS.
So we need to change the code to:
func DeviceGetProcessUtilization(Device Device, LastSeenTimeStamp uint64) ([]ProcessUtilizationSample, Return) {
	var ProcessSamplesCount uint32
	ret := nvmlDeviceGetProcessUtilization(Device, nil, &ProcessSamplesCount, LastSeenTimeStamp)
	if ret != ERROR_INSUFFICIENT_SIZE {
		return nil, ret
	}
	if ProcessSamplesCount == 0 {
		return []ProcessUtilizationSample{}, ret
	}
	Utilization := make([]ProcessUtilizationSample, ProcessSamplesCount)
	ret = nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)
	return Utilization[:ProcessSamplesCount], ret
}
Unfortunately, this isn't obvious from the documentation:
https://github.com/NVIDIA/go-nvml/blob/master/gen/nvml/nvml.h#L5844
Which is why we missed it the first time around.
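For readers unfamiliar with the convention, here is a self-contained sketch of the two-call pattern described above: a size query with a nil buffer that is expected to report "insufficient size", followed by a fetch into a buffer of the reported length. Everything in it (`status`, `sample`, `fakeQuery`, `getSamples`) is a stand-in written for this illustration; it does not call NVML and is not the go-nvml implementation.

```go
package main

import "fmt"

// status and sample are stand-ins defined here for illustration;
// they are not the go-nvml definitions.
type status int

const (
	success status = iota
	errInsufficientSize
)

type sample struct {
	Pid    uint32
	SmUtil uint32
}

// fakeQuery imitates the size-query convention: if buf is nil or too small it
// stores the required length in *count and reports errInsufficientSize;
// otherwise it fills buf and stores the number of samples written.
func fakeQuery(available []sample, buf []sample, count *uint32) status {
	if len(buf) < len(available) {
		*count = uint32(len(available))
		return errInsufficientSize
	}
	*count = uint32(copy(buf, available))
	return success
}

// getSamples performs the two-call sequence: size query, allocate, fetch.
func getSamples(available []sample) ([]sample, status) {
	var count uint32
	ret := fakeQuery(available, nil, &count)
	if ret == success {
		// In this fake, success on the size query means there is nothing to fetch.
		return []sample{}, success
	}
	if ret != errInsufficientSize {
		return nil, ret
	}
	buf := make([]sample, count)
	ret = fakeQuery(available, buf, &count)
	return buf[:count], ret
}

func main() {
	got, ret := getSamples([]sample{{Pid: 10, SmUtil: 40}, {Pid: 11, SmUtil: 5}})
	fmt.Println(len(got), ret == success)
}
```

The key point is that the "error" from the first call is the expected outcome of a size query, so the wrapper treats it as the signal to allocate rather than as a failure.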
@qisikai Can you run the variant of the function above and make sure it fixes things for you.
@qisikai Did you mean to write more in your comment above?
@elezar @klueska Thanks. The code works and returns info for running processes, but I ran into the following situation on GPUs without any load.
I added a Printf statement:
func DeviceGetProcessUtilization(Device Device, LastSeenTimeStamp uint64) ([]ProcessUtilizationSample, Return) {
	var ProcessSamplesCount uint32
	ret := nvmlDeviceGetProcessUtilization(Device, nil, &ProcessSamplesCount, LastSeenTimeStamp)
	if ret != ERROR_INSUFFICIENT_SIZE {
		return nil, ret
	}
	if ProcessSamplesCount == 0 {
		return []ProcessUtilizationSample{}, ret
	}
	log.Printf("ProcessSamplesCount= %d \n", ProcessSamplesCount)
	Utilization := make([]ProcessUtilizationSample, ProcessSamplesCount)
	ret = nvmlDeviceGetProcessUtilization(Device, &Utilization[0], &ProcessSamplesCount, LastSeenTimeStamp)
	return Utilization[:ProcessSamplesCount], ret
}
Output
2021/03/01 17:24:19 ProcessSamplesCount= 100
When there is no process running on a certain GPU, DeviceGetProcessUtilization will return a slice with 100 items.
Thanks for the quick test @qisikai. With regards to:
"When there is no process running on a certain GPU, DeviceGetProcessUtilization will return a slice with 100 items."
Is this not expected? According to the docs:
"One utilization sample structure is returned per process running, that had some non-zero utilization during the last sample period."
Could it be that these are old samples that are being returned?
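If stale samples do turn out to be the cause, one way an application could check is to drop every sample whose timestamp is not newer than the last one it has already seen. The sketch below uses a local stand-in struct; the type, its field names, and the helper `newerThan` are assumptions made for illustration, not the binding's definitions.

```go
package main

import "fmt"

// utilSample is a local stand-in with a subset of fields; the field names
// are assumptions for this sketch.
type utilSample struct {
	Pid       uint32
	TimeStamp uint64 // assumed to be in microseconds
	SmUtil    uint32
}

// newerThan keeps only samples whose timestamp is strictly greater than lastSeen.
func newerThan(samples []utilSample, lastSeen uint64) []utilSample {
	kept := make([]utilSample, 0, len(samples))
	for _, s := range samples {
		if s.TimeStamp > lastSeen {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	samples := []utilSample{
		{Pid: 0, TimeStamp: 0},
		{Pid: 4242, TimeStamp: 2000, SmUtil: 35},
		{Pid: 4243, TimeStamp: 900, SmUtil: 10},
	}
	for _, s := range newerThan(samples, 1000) {
		fmt.Printf("pid=%d sm=%d%%\n", s.Pid, s.SmUtil)
	}
}
```

Comparing the length of the filtered slice with the raw count would show whether the 100 entries are old or empty samples.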
I don't have a good answer as to why the API returns 100 if nothing is running, but I just did a quick check on the underlying C API, and it returns the same thing:
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>
#include <nvml.h>
int main()
{
    nvmlReturn_t ret;
    nvmlDevice_t device;
    uint32_t processSamplesCount;
    ret = nvmlInit();
    printf("nvmlInit: %s\n", nvmlErrorString(ret));
    ret = nvmlDeviceGetHandleByIndex(0, &device);
    printf("nvmlDeviceGetHandleByIndex: %s\n", nvmlErrorString(ret));
    ret = nvmlDeviceGetProcessUtilization(device, NULL, &processSamplesCount, 0);
    printf("nvmlDeviceGetProcessUtilization: %d, %s\n", processSamplesCount, nvmlErrorString(ret));
    ret = nvmlShutdown();
    printf("nvmlShutdown: %s\n", nvmlErrorString(ret));
}
$ gcc nvml.c -o nvml -lnvidia-ml
$ ./nvml
nvmlInit: Success
nvmlDeviceGetHandleByIndex: Success
nvmlDeviceGetProcessUtilization: 100, Insufficient Size
nvmlShutdown: Success
Yes, this is strange behavior.
That is why I added an if statement:
	if ProcessSamplesCount >= 100 || ProcessSamplesCount == 0 {
		return []ProcessUtilizationSample{}, ret
	}
As bindings, our job is simply to pass through the values returned by the underlying C API. So we will make the change to handle the ERROR_INSUFFICIENT_SIZE issue, but we won't be adding the if statement you have above. I would recommend adding that to your application code for now if you need it, i.e.:
utilization, ret := device.GetProcessUtilization(0)
if ret != nvml.SUCCESS {
	.....
}
if len(utilization) >= 100 || len(utilization) == 0 {
	....
}
I will check with the NVML team in the meantime, to see what they say about this.