NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)

Wrong output of `device.GetComputeRunningProcesses()` given multiple processes

qzweng opened this issue · comments

Hi Kevin & Evan,

When multiple processes run on one GPU, I found that the output of device.GetComputeRunningProcesses() is wrong -- the values are misplaced across the different processes' ProcessInfo structs. The bug appears on both V100-SXM2-16GB and GTX 1080 Ti, with CUDA version 10.2.

The testing code snippet is as follows.

package main

import "fmt"
import "github.com/NVIDIA/go-nvml/pkg/nvml"

func main() {
	// Return values are ignored for brevity in this minimal reproduction.
	nvml.Init()
	device, _ := nvml.DeviceGetHandleByIndex(0)
	processInfos, _ := device.GetComputeRunningProcesses()
	for i, processInfo := range processInfos {
		fmt.Printf("\t[%2d] ProcessInfo: %v\n", i, processInfo)
	}
}

On V100 machines, I got this.

$nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-?) # UUID manually removed

$nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   55C    P0   130W / 300W |  13244MiB / 16160MiB |     53%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     48959      C   python                                      1179MiB |
|    0     72754      C   python                                      9611MiB |
|    0     73422      C   python                                      2443MiB |
+-----------------------------------------------------------------------------+

$go run main.go
	[ 0] ProcessInfo: {72754 10077863936 73422 0}
	[ 1] ProcessInfo: {2561671168 48959 1236271104 0}
	[ 2] ProcessInfo: {0 0 0 0}
# it is expected to be
#	[ 0] ProcessInfo: {72754 10077863936 0 0} # {PID, 9611 MiB, 0, 0}
#	[ 1] ProcessInfo: {73422 2561671168 0 0}  # {PID, 2443 MiB, 0, 0}
#	[ 2] ProcessInfo: {48959 1236271104 0 0}  # {PID, 1179 MiB, 0, 0}

On GTX 1080 Ti machines, I got this.

$nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 2: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 3: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 4: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 5: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 6: GeForce GTX 1080 Ti (UUID: GPU-?)
GPU 7: GeForce GTX 1080 Ti (UUID: GPU-?) # UUID manually removed

$nvidia-smi -i 0
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:3D:00.0 Off |                  N/A |
| 38%   67C    P2   247W / 250W |   6630MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     25907      C   python                                      2581MiB |
|    0     27576      C   python                                      4039MiB |
+-----------------------------------------------------------------------------+

$go run main.go
	[ 0] ProcessInfo: {25907 2706374656 27576 0}
	[ 1] ProcessInfo: {4235198464 0 0 0}
# it is expected to be
#	[ 0] ProcessInfo: {25907 2706374656 0 0} # {PID, 2581 MiB, 0, 0}
#	[ 1] ProcessInfo: {27576 4235198464 0 0} # {PID, 4039 MiB, 0, 0}

As a quick fix, I wrote a wrapper function to correct the faulty processInfo as it is returned; I hope this bug can be fixed in the near future.

In any case, thanks for providing such a nice library, especially the useful "thread-safe" feature. :)

Thanks for the report @qzweng, and sorry for taking so long to respond. Considering that you're on CUDA 10.2, our current hypothesis is that the nvmlProcessInfo_st returned by the NVML call has had fields added in the CUDA 11.x API and that we are not handling this correctly.

We will look into it. As a side note, could you check whether GetGraphicsRunningProcesses shows the same behaviour?
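For reference, a minimal sketch of such a check, mirroring the test program above (return values are ignored for brevity, as in the original snippet):

package main

import "fmt"
import "github.com/NVIDIA/go-nvml/pkg/nvml"

func main() {
	nvml.Init()
	defer nvml.Shutdown()
	device, _ := nvml.DeviceGetHandleByIndex(0)
	// GetGraphicsRunningProcesses uses the same call-and-convert pattern as
	// GetComputeRunningProcesses, so a layout mismatch should surface here too.
	graphicsInfos, _ := device.GetGraphicsRunningProcesses()
	for i, info := range graphicsInfos {
		fmt.Printf("\t[%2d] Graphics ProcessInfo: %v\n", i, info)
	}
}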

Update: looking at nvml.h from CUDA 10.2 we have:

typedef struct nvmlProcessInfo_st
{
    unsigned int pid;                 //!< Process ID
    unsigned long long usedGpuMemory; //!< Amount of used GPU memory in bytes.
                                      //! Under WDDM, \ref NVML_VALUE_NOT_AVAILABLE is always reported
                                      //! because Windows KMD manages all the memory and not the NVIDIA driver
} nvmlProcessInfo_t;

whereas in CUDA 11 this is

typedef struct nvmlProcessInfo_st
{
    unsigned int        pid;                //!< Process ID
    unsigned long long  usedGpuMemory;      //!< Amount of used GPU memory in bytes.
                                            //! Under WDDM, \ref NVML_VALUE_NOT_AVAILABLE is always reported
                                            //! because Windows KMD manages all the memory and not the NVIDIA driver
    unsigned int        gpuInstanceId;      //!< If MIG is enabled, stores a valid GPU instance ID. gpuInstanceId is set to
                                            //  0xFFFFFFFF otherwise.
    unsigned int        computeInstanceId;  //!< If MIG is enabled, stores a valid compute instance ID. computeInstanceId is set to
                                            //  0xFFFFFFFF otherwise.
} nvmlProcessInfo_t;

Which would explain the behaviour.
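For illustration only (this is not go-nvml code), here is a minimal Go sketch of why that size difference scrambles the fields: a buffer written with 16-byte CUDA 10.x entries is reinterpreted as 24-byte CUDA 11 entries. The struct names are hypothetical, the values are the three V100 processes from the report above, and on a little-endian 64-bit machine the output reproduces the misplaced fields. Requires Go 1.17+ for unsafe.Slice.

package main

import (
	"fmt"
	"unsafe"
)

// Hypothetical mirror of the CUDA 10.x layout: 16 bytes (4 + 4 padding + 8).
type processInfoV1 struct {
	Pid           uint32
	_             uint32 // padding the C compiler inserts to align the next field
	UsedGpuMemory uint64
}

// Hypothetical mirror of the CUDA 11.x layout: 24 bytes.
type processInfoV2 struct {
	Pid               uint32
	_                 uint32
	UsedGpuMemory     uint64
	GpuInstanceId     uint32
	ComputeInstanceId uint32
}

func main() {
	// What a CUDA 10.2 driver writes for the three V100 processes above.
	v1 := []processInfoV1{
		{Pid: 72754, UsedGpuMemory: 10077863936},
		{Pid: 73422, UsedGpuMemory: 2561671168},
		{Pid: 48959, UsedGpuMemory: 1236271104},
	}
	// Reinterpret the same 48 bytes as two 24-byte CUDA 11 entries, which is
	// effectively what a binding built against the CUDA 11 header does.
	v2 := unsafe.Slice((*processInfoV2)(unsafe.Pointer(&v1[0])), 2)
	for i, p := range v2 {
		fmt.Printf("[%d] Pid=%d UsedGpuMemory=%d GpuInstanceId=%d ComputeInstanceId=%d\n",
			i, p.Pid, p.UsedGpuMemory, p.GpuInstanceId, p.ComputeInstanceId)
	}
	// Prints Pid=72754 ... GpuInstanceId=73422 and Pid=2561671168 UsedGpuMemory=48959 ...,
	// matching the wrong output reported at the top of this issue.
}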

Thanks for the detailed feedback! GetGraphicsRunningProcesses returns nothing under our image-resizing and watermarking benchmark that uses nvJPEG. The encoder and decoder utilization (via device.GetEncoderUtilization(); see the full output below) is also zero.

Could you suggest any benchmark that may utilize the encoder/decoder/graphics? I'm afraid we don't have many workloads at hand other than the AI/ML-related ones.

$nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   54C    P0   234W / 300W |  10199MiB / 16160MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     80857      C   ./build/imageResizeWatermark                7017MiB |
|    0     80978      C   ./build/imageResizeWatermark                3171MiB |
+-----------------------------------------------------------------------------+

$go run main.go
Compute Running Processes
	[ 0] ProcessInfo: {80857 8364490752 80978 0}
	[ 1] ProcessInfo: {38797312 0 0 0}
Graphics Running Processes
	# i.e., no processInfo returned

EncoderCapacity= 100
EncoderUtilization= 0
SamplingPeriodUs= 167000
DecoderUtilization= 0
SamplingPeriodUs= 167000

GetProcessUtilization
	[ 0] processUtilInfo: {80857 1624440381802281 16 3 0 0}
	[ 1] processUtilInfo: {80978 1624440382304243 63 15 0 0}

The output of GetGraphicsRunningProcesses is not critical in this case; I just noted that it uses the same pattern to call the API and convert to Go structs, and as such I would expect the behaviour to be the same.

@qzweng one quick question.

You mentioned:

As a quick fix, I wrote a wrapper function to correct the faulty processInfo as it is returned; I hope this bug can be fixed in the near future.

Could you paste a snippet here?

Assuming that a valid pid and usedGpuMemory are never 0 (is that always true?), I go through all the values in the returned list of ProcessInfo, skip the zeros, and rebuild the whole list as output.

func CorrectProcessInfo(infos []nvml.ProcessInfo) []nvml.ProcessInfo {
    outInfos := make([]nvml.ProcessInfo, len(infos))
    var c uint32
    for _, info := range infos {
        // Flatten each entry into its four raw values and drop the zeros.
        data := [4]uint64{uint64(info.Pid), info.UsedGpuMemory, uint64(info.GpuInstanceId), uint64(info.ComputeInstanceId)}
        for _, d := range data {
            if d != 0 {
                // Re-pack the surviving values pairwise as {Pid, UsedGpuMemory}.
                if c%2 == 0 {
                    outInfos[c/2].Pid = uint32(d)
                } else {
                    outInfos[c/2].UsedGpuMemory = d
                }
                c += 1
            }
        }
    }
    return outInfos
}
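
For completeness, a small usage sketch of the wrapper above, assuming the same setup as the test program at the top of the issue (hypothetical glue code; error returns are ignored for brevity):

package main

import "fmt"
import "github.com/NVIDIA/go-nvml/pkg/nvml"

func main() {
	nvml.Init()
	defer nvml.Shutdown()
	device, _ := nvml.DeviceGetHandleByIndex(0)
	rawInfos, _ := device.GetComputeRunningProcesses()
	// Re-pack the possibly misaligned entries before consuming them.
	for i, info := range CorrectProcessInfo(rawInfos) {
		fmt.Printf("\t[%2d] ProcessInfo: %v\n", i, info)
	}
}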

(2021.07.26) P.S. Please refer to @elezar's solution in #22 for better compatibility. This code snippet may cause a runtime error when the ProcessInfo is already correct or in other formats (e.g., on CUDA 11).

This is not a go-nvml issue. NVML itself will always return (at least) 100 entries, with zeroed entries for the invalid slots.

Please see the following for more info:
#11 (comment)

I also found this problem.

wrong:    [{1078 6498025472 1876 0} {2599419904 0 0 0}]
expected: [{1078 6498025472 0 0} {1876 2599419904 0 0}]

@klueska @elezar @qzweng Could I use qzweng's wrapper function to solve the problem?

@zhangxinming1991 the reason gpu-monitoring-tools works under CUDA 10.x is that the struct that is returned by the NVML call matches the definition in CUDA 10.x.

We will work on handling slices of structs better across CUDA versions, and in the meantime something like @qzweng's wrapper could be used, although I would consider adapting it a bit to handle the actual memory layout and not rely on the assumption that certain values are zero.

@zhangxinming1991 @qzweng I have a PR out #22 with a quick attempt to detect whether the conversion is required. I think it can still be improved significantly, but it would be great if either of you could validate the approach.

We do not see this as a long-term solution for slices of structs in this wrapper, but this should unblock users such as yourselves.

@elezar Thank you very much! I will validate the approach and report the result, although I have already solved the problem in my own way: I added a new nvmlProcessInfo_st_v1 struct to nvml.h, used it in all the v1 interfaces, and finally disabled the replacement in init.go so that nvmlDeviceGetComputeRunningProcesses_v1 and nvmlDeviceGetGraphicsRunningProcesses_v1 are used directly.

@elezar Some errors occur in vgpu.go and device.go when building; please help look into them. Details follow:

// nvml.VgpuInstanceGetGpuInstanceId()
func VgpuInstanceGetGpuInstanceId(VgpuInstance VgpuInstance) (int, Return) {
	var gpuInstanceId uint32
	ret := nvmlVgpuInstanceGetGpuInstanceId(VgpuInstance, &gpuInstanceId)
	return int(gpuInstanceId), SessionInfo, ret
}

Too many arguments to return

// nvml.DeviceSetTemperatureThreshold()
func DeviceSetTemperatureThreshold(Device Device, ThresholdType TemperatureThresholds, Temp int) Return {
	ret := nvmlDeviceSetTemperatureThreshold(Device, ThresholdType, &Temp)
	return ret
}

Cannot use '&Temp' (type *int) as the type *int32
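
For reference, a plausible minimal correction for both signatures (hypothetical sketches only, based on the compiler errors above; the actual fix that landed via #23 may differ):

// Return only the two values declared in the signature.
func VgpuInstanceGetGpuInstanceId(VgpuInstance VgpuInstance) (int, Return) {
	var gpuInstanceId uint32
	ret := nvmlVgpuInstanceGetGpuInstanceId(VgpuInstance, &gpuInstanceId)
	return int(gpuInstanceId), ret
}

// Convert to the int32 the low-level binding expects instead of passing *int.
func DeviceSetTemperatureThreshold(Device Device, ThresholdType TemperatureThresholds, Temp int) Return {
	temp := int32(Temp)
	return nvmlDeviceSetTemperatureThreshold(Device, ThresholdType, &temp)
}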

@zhangxinming1991 you are right. The update to 11.2 in #20 introduced these bugs.

I have created #23 to address these and also add examples for some basic testing on our side. I have also rebased #22 on this new branch, so in theory you should be able to make examples (or make docker-examples) on your system and run the generated compute-processes example to verify that the new mechanism works as expected.

OK, I will try it.

@zhangxinming1991 @qzweng I have a PR out #22 with a quick attempt to detect whether the conversion is required. I think it can still be improved significantly, but it would be great if either of you could validate the approach.

We do not see this as a long-term solution for slices of structs in this wrapper, but this should unblock users such as yourselves.

Hi @elezar, thanks for the fix; it seems more robust and fundamental than my wrapper function, from which I have learned a lot. Thanks!

However, there are still some problems where "Unknown Error" and wrong output are returned (see below):

commit 11adf3bc708d869a4e98a9f950525d5321dbbd78
Author: Evan Lezar <elezar@nvidia.com>
Date:   Mon Jun 28 10:56:19 2021 +0200

    Fix GetComputeRunningProcesses on CUDA 10.x
    ...

1) Unknown Error on T4

$nvidia-smi
Tue Jul 27 10:32:35 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:40:00.0 Off |                    0 |
| N/A   65C    P0    53W /  70W |  12860MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:E4:00.0 Off |                    0 |
| N/A   65C    P0    53W /  70W |  12860MiB / 15109MiB |     18%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    232822      C   python                                      4283MiB |
|    0    232824      C   python                                      4283MiB |
|    0    232825      C   python                                      4283MiB |
|    1     22978      C   python                                      2251MiB |
|    1     22980      C   python                                      2251MiB |
|    1     22981      C   python                                      2251MiB |
|    1     22983      C   python                                      2251MiB |
|    1     22985      C   python                                      2251MiB |
+-----------------------------------------------------------------------------+

I ran the example code in examples/compute-processes/main.go and got the output below:

$./main
2021/07/27 10:48:04 Unable to get process info for device at index 0: Unknown Error

2) Wrong Output on V100

A more serious problem is that the returned value of UsedGpuMemory may be corrupted due to the 32-bit limit of uint32. See the example below:

$nvidia-smi
Tue Jul 27 11:05:12 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   50C    P0   137W / 300W |  15753MiB / 16160MiB |     45%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    103351      C   python                                      9611MiB |
|    0    198942      C   python                                      6131MiB |
+-----------------------------------------------------------------------------+
$./main
Found 2 processes on device 0
	[ 0] ProcessInfo: {Pid:198942 UsedGpuMemory:6428819456 GpuInstanceId:103351 ComputeInstanceId:0}
	[ 1] ProcessInfo: {Pid:1487929344 UsedGpuMemory:0 GpuInstanceId:0 ComputeInstanceId:0}

Here, 198942 -- 6428819456 (i.e., 6131 MiB) is correct, while 103351 -- 1487929344 (i.e., 1419 MiB) is wrong even after reorganization. The reason is that the top 2 bits of the real memory usage are chopped off.

   0101,1000,1011,0000,0000,0000,0000,0000 (32 bits binary) ==  1487929344 Bytes (dec) == 1419 MiB (dec) -- ×
10,0101,1000,1011,0000,0000,0000,0000,0000 (34 bits binary) == 10077863936 Bytes (dec) == 9611 MiB (dec) -- √
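
For reference, the truncation is easy to reproduce in Go (a standalone check, not library code):

package main

import "fmt"

func main() {
	// 9611 MiB as reported by nvidia-smi, in bytes.
	var full uint64 = 10077863936
	// Keeping only the low 32 bits drops the top two bits of the value.
	truncated := uint32(full)
	fmt.Println(truncated)             // 1487929344
	fmt.Println(truncated / (1 << 20)) // 1419 (MiB)
}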

Recalling the ProcessInfo struct, the UsedGpuMemory value should not spill into other fields.

type ProcessInfo struct {
	Pid               uint32
	UsedGpuMemory     uint64
	GpuInstanceId     uint32
	ComputeInstanceId uint32
}

Is there any method to fix this? Your help will be greatly appreciated. Thanks.

In L#953 of @elezar's fixed device.go, the conditional check for whether the ProcessInfo is corrupted may not be fully correct.

	if Infos[InfoCount-1].Pid == 0 && Infos[InfoCount-1].UsedGpuMemory == 0 {
		// in the case of the _v1 API we need to adjust the size of the process info data structure
		adjusted, err := adjustProcessInfoSlice(Infos[:InfoCount])
		if err != nil {
			return nil, ERROR_UNKNOWN
		}
		return adjusted, SUCCESS
	}
	return Infos[:InfoCount], SUCCESS

However, in my cases,

Found 2 processes on device 0
	[ 0] ProcessInfo: {Pid:198942 UsedGpuMemory:6428819456 GpuInstanceId:103351 ComputeInstanceId:0}
	[ 1] ProcessInfo: {Pid:1487929344 UsedGpuMemory:0 GpuInstanceId:0 ComputeInstanceId:0}

The last ProcessInfo has a non-zero Pid but a zero UsedGpuMemory, which does not trigger adjustProcessInfoSlice. Might it be better as if Infos[InfoCount-1].Pid == 0 || Infos[InfoCount-1].UsedGpuMemory == 0?

Even after I fix the condition, adjustProcessInfoSlice still does not seem to work correctly; it reported

error reading intermediate values: unexpected EOF # via fmt.Println(err)
Unable to get process info for device at index 0: Unknown Error

Still working on it.

Same issue.

@zhangxinming1991 you are right. The update to 11.2 in #20 introduced these bugs.

I have created #23 to address these and also add examples for some basic testing on our side. I have also rebased #22 on this new branch, so in theory you should be able to make examples (or make docker-examples) on your system and run the generated compute-processes example to verify that the new mechanism works as expected.

@elezar, I have tested the fix and it works fine. Comparing nvidia-smi pmon -s m with go-nvml:

/var/paas/nvidia/bin/nvidia-smi pmon -s m
0 7941 C 6209 python
0 5486 C 6405 python

The collection example I wrote myself using go-nvml:
containerId:ae8ff3843daddb124e095e590b015ad5b5fa8b60ce50596ad8ddc694bec74cd0, cardId:0, gpuUtil:45, gpuMemUsed:6209, gpuMemUsage:38.422000
containerId:a1aca972f493c7a723fd5531c28c01762678e5527caff6ceb18b2fd5c1b4616a, cardId:0, gpuUtil:67, gpuMemUsed:6405, gpuMemUsage:39.635000

containerId:ae8ff3843dad => pid:7941
containerId:a1aca972f493 => pid:5486

PS: the example in compute-processes also works fine.

Hi @elezar, thanks for the fix; it seems more robust and fundamental than my wrapper function, from which I have learned a lot. Thanks!

However, there are still some problems where "Unknown Error" and wrong output are returned -- see the T4 "Unknown Error" and V100 wrong-output examples above.

@qzweng Regarding your problem 2), it is OK in my test case, so I wonder whether you have updated the go-nvml code. Regarding 1), I have not tested on a T4, so I don't know what happens there.

@qzweng Regarding your problem 2), it is OK in my test case, so I wonder whether you have updated the go-nvml code. Regarding 1), I have not tested on a T4, so I don't know what happens there.

Hi @zhangxinming1991, thanks for the reply. Could you kindly point out which commit you are using?

@zhangxinming1991 Thanks for the reply. But I re-ran the experiment, and it still does not work. :(

@qzweng Regarding your problem 2), it is OK in my test case, so I wonder whether you have updated the go-nvml code. Regarding 1), I have not tested on a T4, so I don't know what happens there.

-- If I had not updated the code, it would not show the "Unknown Error". The main.go is from Evan's example code.

[~/go/src/github.com/elezar/go-nvml-11adf3b/examples/compute-processes]
$git status
# HEAD detached at 11adf3b
nothing to commit, working directory clean

[~/go/src/github.com/elezar/go-nvml-11adf3b/examples/compute-processes]
$nvidia-smi
Wed Aug  4 00:08:57 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   56C    P0   160W / 300W |  11180MiB / 16160MiB |     57%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    104259      C   python                                      2187MiB |
|    0    104260      C   python                                      2443MiB |
|    0    104271      C   python                                      6539MiB |
+-----------------------------------------------------------------------------+

[~/go/src/github.com/elezar/go-nvml-11adf3b/examples/compute-processes]
$go run main.go
2021/08/04 00:09:01 Unable to get process info for device at index 0: Unknown Error
exit status 1

@qzweng I don't know what happened. I show the NVIDIA driver version in my environment below; I wonder whether your driver version causes the problem. I think @elezar can provide a more precise reason for your problem.
[root@p2v-with-disk-a97ba compute-processes]# go run main.go
Found 2 processes on device 0
[ 0] ProcessInfo: {Pid:10825 UsedGpuMemory:6487539712 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
[ 1] ProcessInfo: {Pid:14988 UsedGpuMemory:6563037184 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:21:01.0 Off |                    0 |

@qzweng I don't know what happened. I show the NVIDIA driver version in my environment below; I wonder whether your driver version causes the problem. I think @elezar can provide a more precise reason for your problem.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+

Thanks for the information! Now we can say that the fixed code may work fine under CUDA 11 (your case) but still fails under CUDA 10 (my case). The reason is explained in Evan's comment above.

Yes, I tested the case on CUDA 10 and I hit your problem as well. Results below:

[root@train-gpu-v100-32351-e6plj compute-processes]# go run main.go
Found 2 processes on device 0
[ 0] ProcessInfo: {Pid:30006 UsedGpuMemory:6498025472 GpuInstanceId:30090 ComputeInstanceId:0}
[ 1] ProcessInfo: {Pid:2226126848 UsedGpuMemory:0 GpuInstanceId:0 ComputeInstanceId:0}

[root@train-gpu-v100-32351-e6plj compute-processes]# /var/paas/nvidia/bin/nvidia-smi
Sat Aug 7 14:14:15 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:21:01.0 Off |                    0 |
| N/A   55C    P0   233W / 300W |  12427MiB / 16160MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     30006      C   python                                      6197MiB |
|    0     30090      C   python                                      6219MiB |
+-----------------------------------------------------------------------------+

@qzweng Actually, I have solved the problem in my own way, which is not very elegant (#21 (comment)). Results with my approach below:

[root@train-gpu-v100-32351-e6plj compute-processes]# go run main.go
Found 2 processes on device 0
[ 0] ProcessInfo: {Pid:9980 UsedGpuMemory:6498025472}
[ 1] ProcessInfo: {Pid:10121 UsedGpuMemory:6521094144}

[root@train-gpu-v100-32351-e6plj compute-processes]# /var/paas/nvidia/bin/nvidia-smi
Sat Aug 7 15:20:49 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:21:01.0 Off |                    0 |
| N/A   57C    P0   231W / 300W |  12427MiB / 16160MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9980      C   python                                      6197MiB |
|    0     10121      C   python                                      6219MiB |
+-----------------------------------------------------------------------------+

Same issue on CUDA 10.0 (driver 430.64) due to the struct size change of nvmlProcessInfo_st (#21 (comment)).

A similar issue exists for the official NVML Python bindings (nvidia-ml-py): nvidia-ml-py>=11.450.129 is not compatible with CUDA 10.x either (same reason as #21 (comment)). BTW, does anyone know how to report this to the maintainer of nvidia-ml-py? I cannot find any bug-report link on PyPI.

Thanks @XuehaiPan. I am trying to track down the maintainer of nvidia-ml-py.

Thanks for all the comments here. I have updated the conditional check as mentioned in #21 (comment). I now always adjust the values depending on whether the _v1 APIs are used or not. I am also in communication with the NVML team to see if there is a better way to address this.

@qzweng @zhangxinming1991 would you be able to check the updated implementation in #23? One issue that we may see is that the memory will be reported incorrectly due to 32-bit truncation (#21 (comment)).

The patch (10a3a25, PR #23) still gets wrong results on CUDA 10.x, and elezar/go-nvml@d566199 (PR #22) causes an unknown error on CUDA 10.x. But CUDA 11.x always works fine.

Results on Ubuntu 16.04 LTS with NVIDIA driver 430.64 (CUDA 10.x)

$ git clone git@github.com:NVIDIA/go-nvml.git "$GOROOT/src/go-nvml"
$ cd "$GOROOT/src/go-nvml"
$ git rev-parse HEAD
10a3a255c928e24f42c3d43ed6f0a087ece02e7a

# Create one process
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[1] 27513

$ go run go-nvml/examples/compute-processes
Found 1 processes on device 0
	[ 0] ProcessInfo: {Pid:27513 UsedGpuMemory:173015040 GpuInstanceId:0 ComputeInstanceId:0}
Found 0 processes on device 1
Found 0 processes on device 2

# Create more processes
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[2] 27783
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[3] 27817

$ go run go-nvml/examples/compute-processes
Found 3 processes on device 0
	[ 0] ProcessInfo: {Pid:27513 UsedGpuMemory:173015040 GpuInstanceId:27783 ComputeInstanceId:0}
	[ 1] ProcessInfo: {Pid:173015040 UsedGpuMemory:27817 GpuInstanceId:173015040 ComputeInstanceId:0}
	[ 2] ProcessInfo: {Pid:0 UsedGpuMemory:0 GpuInstanceId:0 ComputeInstanceId:0}
Found 0 processes on device 1
Found 0 processes on device 2

$ git remote add patch git@github.com:elezar/go-nvml.git
$ git fetch --all
$ git checkout -t patch/fix-get-running-processes
$ go run go-nvml/examples/compute-processes
2021/08/12 19:48:15 Unable to get process info for device at index 0: Unknown Error
exit status 1

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
...

$ nvidia-smi | grep -i version
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |

Results on Ubuntu 20.04 LTS with NVIDIA driver 470.57.02 (CUDA 11.x)

$ git clone git@github.com:NVIDIA/go-nvml.git "$GOROOT/src/go-nvml"
$ cd "$GOROOT/src/go-nvml"
$ git rev-parse HEAD
10a3a255c928e24f42c3d43ed6f0a087ece02e7a

# Create processes on device 0
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[1] 82581
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[2] 82943
$ python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
[3] 83049

$ go run go-nvml/examples/compute-processes
Found 3 processes on device 0
	[ 0] ProcessInfo: {Pid:82581 UsedGpuMemory:189792256 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
	[ 1] ProcessInfo: {Pid:82943 UsedGpuMemory:189792256 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
	[ 2] ProcessInfo: {Pid:83049 UsedGpuMemory:189792256 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 1
	[ 0] ProcessInfo: {Pid:141273 UsedGpuMemory:8781824000 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 2
	[ 0] ProcessInfo: {Pid:441233 UsedGpuMemory:9322889216 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 3
	[ 0] ProcessInfo: {Pid:442050 UsedGpuMemory:8437891072 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 4
	[ 0] ProcessInfo: {Pid:442892 UsedGpuMemory:8777629696 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 5
	[ 0] ProcessInfo: {Pid:437948 UsedGpuMemory:9280946176 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 6
	[ 0] ProcessInfo: {Pid:38104 UsedGpuMemory:9966714880 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 7
	[ 0] ProcessInfo: {Pid:38134 UsedGpuMemory:9966714880 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 0 processes on device 8
Found 0 processes on device 9

$ git remote add patch git@github.com:elezar/go-nvml.git
$ git fetch --all
$ git checkout -t patch/fix-get-running-processes
$ go run go-nvml/examples/compute-processes
Found 3 processes on device 0
	[ 0] ProcessInfo: {Pid:82581 UsedGpuMemory:189792256 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
	[ 1] ProcessInfo: {Pid:82943 UsedGpuMemory:189792256 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
	[ 2] ProcessInfo: {Pid:83049 UsedGpuMemory:189792256 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 1
	[ 0] ProcessInfo: {Pid:141273 UsedGpuMemory:8781824000 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 2
	[ 0] ProcessInfo: {Pid:441233 UsedGpuMemory:9322889216 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 3
	[ 0] ProcessInfo: {Pid:442050 UsedGpuMemory:8437891072 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 4
	[ 0] ProcessInfo: {Pid:442892 UsedGpuMemory:8777629696 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 5
	[ 0] ProcessInfo: {Pid:437948 UsedGpuMemory:9280946176 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 6
	[ 0] ProcessInfo: {Pid:38104 UsedGpuMemory:9966714880 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 1 processes on device 7
	[ 0] ProcessInfo: {Pid:38134 UsedGpuMemory:9966714880 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
Found 0 processes on device 8
Found 0 processes on device 9

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
...

$ nvidia-smi | grep -i version
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |

where GI ID = CI ID = 4294967295 = 0xFFFFFFFF, the sentinel value used when MIG is not enabled.

Thanks for the detailed steps @XuehaiPan. I have found the issue in my fix and updated #22.

My output now looks as follows:

$ ./compute-processes
Found 1 processes on device 0
        [ 0] ProcessInfo: {Pid:77849 UsedGpuMemory:456130560 GpuInstanceId:0 ComputeInstanceId:0}

which matches:

$ nvidia-smi -i 0
Thu Aug 12 13:01:18 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   30C    P0   ERR! / 160W |    446MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     77849      C   python3                                      435MiB |
+-----------------------------------------------------------------------------+

It works now (no exceptions). But it gets wrong results when there are more processes.

$ python3 -c ... # create more processes
$ go run go-nvml/examples/compute-processes
Found 10 processes on device 0
        [ 0] ProcessInfo: {Pid:11652 UsedGpuMemory:173015040 GpuInstanceId:0 ComputeInstanceId:0}
        [ 1] ProcessInfo: {Pid:11682 UsedGpuMemory:743093938516131840 GpuInstanceId:0 ComputeInstanceId:0}
        [ 2] ProcessInfo: {Pid:11709 UsedGpuMemory:743093938516131840 GpuInstanceId:0 ComputeInstanceId:0}
        [ 3] ProcessInfo: {Pid:0 UsedGpuMemory:743093938516143802 GpuInstanceId:0 ComputeInstanceId:0}
        [ 4] ProcessInfo: {Pid:0 UsedGpuMemory:11987 GpuInstanceId:0 ComputeInstanceId:0}
        [ 5] ProcessInfo: {Pid:173015040 UsedGpuMemory:12012 GpuInstanceId:0 ComputeInstanceId:0}
        [ 6] ProcessInfo: {Pid:173015040 UsedGpuMemory:55332063674368 GpuInstanceId:0 ComputeInstanceId:0}
        [ 7] ProcessInfo: {Pid:173015040 UsedGpuMemory:57217554317312 GpuInstanceId:0 ComputeInstanceId:0}
        [ 8] ProcessInfo: {Pid:0 UsedGpuMemory:58184094973952 GpuInstanceId:0 ComputeInstanceId:0}
        [ 9] ProcessInfo: {Pid:0 UsedGpuMemory:173015040 GpuInstanceId:0 ComputeInstanceId:0}
Found 0 processes on device 1
Found 0 processes on device 2

Compare to unpatched results:

$ go run go-nvml/examples/compute-processes
Found 3 processes on device 0
	[ 0] ProcessInfo: {Pid:27513 UsedGpuMemory:173015040 GpuInstanceId:27783 ComputeInstanceId:0}
	[ 1] ProcessInfo: {Pid:173015040 UsedGpuMemory:27817 GpuInstanceId:173015040 ComputeInstanceId:0}
	[ 2] ProcessInfo: {Pid:0 UsedGpuMemory:0 GpuInstanceId:0 ComputeInstanceId:0}
Found 0 processes on device 1
Found 0 processes on device 2

I think this is probably caused by data structure alignment in the C library.

/**
 * Information about running compute processes on the GPU
 */
typedef struct nvmlProcessInfo_st
{
    unsigned int        pid;                // 4 bytes, 0-3
    char                padding[4];         // 4 bytes. 4-7, added by compiler, align to 8 bytes
    unsigned long long  usedGpuMemory;      // 8 bytes, 8-15
    unsigned int        gpuInstanceId;      // 4 bytes, 16-19
    unsigned int        computeInstanceId;  // 4 bytes, 20-23
} nvmlProcessInfo_t;  // 24 bytes in total (!= 4 + 8 + 4 + 4)

It could be non-trivial to handle this. Simply writing the v2 struct into a byte stream and then reading it out as the v1 struct would not work, and doing so may cause different behavior on different machines (32-bit vs. 64-bit, x86 vs. ARM, etc.).
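
To make the alignment point concrete, here is a small standalone Go check that mirrors the two C layouts above with explicit padding (hypothetical struct names; the sizes are what a 64-bit x86 compiler typically produces):

package main

import (
	"fmt"
	"unsafe"
)

// Mirror of the CUDA 10.x nvmlProcessInfo_st.
type processInfoV1 struct {
	Pid           uint32
	_             uint32 // alignment padding before the 8-byte field
	UsedGpuMemory uint64
}

// Mirror of the CUDA 11.x nvmlProcessInfo_st.
type processInfoV2 struct {
	Pid               uint32
	_                 uint32
	UsedGpuMemory     uint64
	GpuInstanceId     uint32
	ComputeInstanceId uint32
}

func main() {
	var v2 processInfoV2
	fmt.Println(unsafe.Sizeof(processInfoV1{}))    // 16, not 4 + 8
	fmt.Println(unsafe.Sizeof(v2))                 // 24, not 4 + 8 + 4 + 4
	fmt.Println(unsafe.Offsetof(v2.UsedGpuMemory)) // 8, because of the padding
	fmt.Println(unsafe.Offsetof(v2.GpuInstanceId)) // 16
}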