NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)

undefined symbol: nvmlDeviceGetGpuInstanceId

xiyichan opened this issue · comments

On CentOS 7, I get an error when I try to read the GpuInstanceId.

package main

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	ret := nvml.Init()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer func() {
		ret := nvml.Shutdown()
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to shutdown NVML: %v", nvml.ErrorString(ret))
		}
	}()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get device count: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to get device at index %d: %v", i, nvml.ErrorString(ret))
		}
		id, ret := device.GetGpuInstanceId()
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to get GPU instance ID of device at index %d: %v", i, nvml.ErrorString(ret))
		}
		log.Printf("device %d: GPU instance ID %d", i, id)
	}
}

error

./gpu: symbol lookup error: ./gpu: undefined symbol: nvmlDeviceGetGpuInstanceId

@xiyichan what version of the CUDA driver / nvml library are you using?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      On   | 00000000:00:06.0 Off |                    0 |
| N/A   23C    P8    16W / 250W |      0MiB / 22945MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      On   | 00000000:00:07.0 Off |                    0 |
| N/A   22C    P8    17W / 250W |      0MiB / 22945MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I tried other functions and they work.

Also, nvmlDeviceGetGpuInstanceId is specifically for MIG devices. Looking at your sample, I don't think you're interested in the GPU instance ID. What information are you trying to extract?

With regards to the missing symbol: it is likely that this was added in a later CUDA version, as the bindings are currently based on CUDA 11.
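In the meantime, a minimal sketch along these lines (illustrative only, using the SystemGetDriverVersion / SystemGetNVMLVersion helpers from these bindings) can confirm which driver and NVML library your binary is actually loading:

package main

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// Print the driver and NVML library versions so you can check whether the
// installed libnvidia-ml.so is new enough to export symbols such as
// nvmlDeviceGetGpuInstanceId.
func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	driver, ret := nvml.SystemGetDriverVersion()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get driver version: %v", nvml.ErrorString(ret))
	}
	nvmlVersion, ret := nvml.SystemGetNVMLVersion()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get NVML version: %v", nvml.ErrorString(ret))
	}
	log.Printf("driver: %s, NVML: %s", driver, nvmlVersion)
}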

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      On   | 00000000:00:06.0 Off |                    0 |
| N/A   25C    P0    57W / 250W |    144MiB / 22945MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      On   | 00000000:00:07.0 Off |                    0 |
| N/A   24C    P0    58W / 250W |    174MiB / 22945MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8330      C   python                                       134MiB |
|    1      6293      C   python                                       163MiB |
+-----------------------------------------------------------------------------+

GpuInstanceId appears in the process info. I want to know which GPU each process is using.

GPU Instance ID is a MIG specific construct, and is not applicable to full GPUs.
For more information on MIG, please see:
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html

As such, I don't think this is actually what you want. Are you trying to figure out the meaning of the 0 and the 1 in the nvidia-smi output? That is just the index of the GPU, which you already have in your example code.
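If the goal is just to see which processes are running on which GPU, something like the following sketch (illustrative only, built on the existing GetComputeRunningProcesses binding) does that with the GPU index alone, no MIG involved:

package main

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// List compute processes per GPU index. The index i here is the same 0 / 1
// shown in the nvidia-smi "GPU" column, so this answers "which GPU is a
// given process using" without touching GPU instance IDs.
func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get device count: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to get device at index %d: %v", i, nvml.ErrorString(ret))
		}
		procs, ret := device.GetComputeRunningProcesses()
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to get processes on device %d: %v", i, nvml.ErrorString(ret))
		}
		for _, p := range procs {
			log.Printf("GPU %d: pid %d uses %d bytes of GPU memory", i, p.Pid, p.UsedGpuMemory)
		}
	}
}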

You want the GPU index, not the GPU Instance ID.
Got it.

type ProcessInfo struct {
	Pid               uint32
	UsedGpuMemory     uint64
	GpuInstanceId     uint32
	ComputeInstanceId uint32
}

I looked at the source code; I think it is GpuInstanceId.

Again, that is a MIG only construct.
It helps you dig into which MIG device the process is running on when you have MIG enabled.

However, MIG is only available on A100 GPUs, and is only available through NVML in the R450 driver (or newer).
You are running on a Tesla M40, with driver version 418.87.01.
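If you do end up on MIG-capable hardware with an R450+ driver, a sketch like the following (illustrative only, using the GetMigMode binding) shows how you would check that MIG is enabled before reading GpuInstanceId:

package main

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// Only consult GpuInstanceId when MIG mode is actually enabled. On a
// non-MIG GPU (such as a Tesla M40) or a pre-R450 driver, GetMigMode will
// not succeed, so treat the device as a full GPU and use its index.
func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	device, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get device 0: %v", nvml.ErrorString(ret))
	}

	current, _, ret := device.GetMigMode()
	if ret == nvml.SUCCESS && current == nvml.DEVICE_MIG_ENABLE {
		log.Printf("MIG is enabled; GPU instance IDs are meaningful on this device")
	} else {
		log.Printf("MIG is not enabled (or not supported); use the GPU index instead")
	}
}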

OK, thanks.