NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)

undefined symbol: nvmlDeviceGetGpuInstanceId

xiyichan opened this issue · comments

On CentOS 7, I get an error when I try to read the GpuInstanceId.

package main

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	ret := nvml.Init()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer func() {
		ret := nvml.Shutdown()
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to shutdown NVML: %v", nvml.ErrorString(ret))
		}
	}()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get device count: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to get device at index %d: %v", i, nvml.ErrorString(ret))
		}
		id, ret := device.GetGpuInstanceId()
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to get GPU instance ID of device at index %d: %v", i, nvml.ErrorString(ret))
		}
		log.Printf("device %d: GPU instance ID %d", i, id)
	}
}

error

./gpu: symbol lookup error: ./gpu: undefined symbol: nvmlDeviceGetGpuInstanceId

@xiyichan what version of the CUDA driver / nvml library are you using?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      On   | 00000000:00:06.0 Off |                    0 |
| N/A   23C    P8    16W / 250W |      0MiB / 22945MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      On   | 00000000:00:07.0 Off |                    0 |
| N/A   22C    P8    17W / 250W |      0MiB / 22945MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I tried other functions and they work.

Also, nvmlDeviceGetGpuInstanceId is specifically for MIG devices. Looking at your sample, I don't think you're interested in the GPU instance ID. What information are you trying to extract?

With regards to the missing symbol: it is likely that this was added in a later CUDA version, as the bindings are currently based on CUDA 11.
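In the meantime, a minimal sketch along these lines (illustrative only, using the SystemGetDriverVersion / SystemGetNVMLVersion helpers from these bindings) can confirm which driver and NVML library your binary is actually loading:

package main

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// Print the driver and NVML library versions so you can check whether the
// installed libnvidia-ml.so is new enough to export symbols such as
// nvmlDeviceGetGpuInstanceId.
func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	driver, ret := nvml.SystemGetDriverVersion()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get driver version: %v", nvml.ErrorString(ret))
	}
	nvmlVersion, ret := nvml.SystemGetNVMLVersion()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get NVML version: %v", nvml.ErrorString(ret))
	}
	log.Printf("driver: %s, NVML: %s", driver, nvmlVersion)
}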

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      On   | 00000000:00:06.0 Off |                    0 |
| N/A   25C    P0    57W / 250W |    144MiB / 22945MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      On   | 00000000:00:07.0 Off |                    0 |
| N/A   24C    P0    58W / 250W |    174MiB / 22945MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8330      C   python                                       134MiB |
|    1      6293      C   python                                       163MiB |
+-----------------------------------------------------------------------------+

GpuInstanceId appears in the process info. I want to know which GPU each process is using.

GPU Instance ID is a MIG specific construct, and is not applicable to full GPUs.
For more information on MIG, please see:
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html

As such, I don't think this is actually what you want. Are you trying to figure out the meaning of the 0 and the 1 in the nvidia-smi output? That is just the index of the GPU, which you already have in your example code.
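If the goal is just to see which processes are running on which GPU, something like the following sketch (illustrative only, built on the existing GetComputeRunningProcesses binding) does that with the GPU index alone, no MIG involved:

package main

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// List compute processes per GPU index. The index i here is the same 0 / 1
// shown in the nvidia-smi "GPU" column, so this answers "which GPU is a
// given process using" without touching GPU instance IDs.
func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get device count: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to get device at index %d: %v", i, nvml.ErrorString(ret))
		}
		procs, ret := device.GetComputeRunningProcesses()
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to get processes on device %d: %v", i, nvml.ErrorString(ret))
		}
		for _, p := range procs {
			log.Printf("GPU %d: pid %d uses %d bytes of GPU memory", i, p.Pid, p.UsedGpuMemory)
		}
	}
}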

You want the GPU index, not the GPU Instance ID.
Got it.

type ProcessInfo struct {
	Pid               uint32
	UsedGpuMemory     uint64
	GpuInstanceId     uint32
	ComputeInstanceId uint32
}

I looked at the source code; I think it is GpuInstanceId.

Again, that is a MIG only construct.
It helps you dig into which MIG device the process is running on when you have MIG enabled.

However, MIG is only available on A100 GPUs, and is only available through NVML in the R450 driver (or newer).
You are running on a Tesla M40, with driver version 418.87.01.
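If you do end up on MIG-capable hardware with an R450+ driver, a sketch like the following (illustrative only, using the GetMigMode binding) shows how you would check that MIG is enabled before reading GpuInstanceId:

package main

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// Only consult GpuInstanceId when MIG mode is actually enabled. On a
// non-MIG GPU (such as a Tesla M40) or a pre-R450 driver, GetMigMode will
// not succeed, so treat the device as a full GPU and use its index.
func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	device, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get device 0: %v", nvml.ErrorString(ret))
	}

	current, _, ret := device.GetMigMode()
	if ret == nvml.SUCCESS && current == nvml.DEVICE_MIG_ENABLE {
		log.Printf("MIG is enabled; GPU instance IDs are meaningful on this device")
	} else {
		log.Printf("MIG is not enabled (or not supported); use the GPU index instead")
	}
}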

OK, thanks.