undefined symbol: nvmlDeviceGetGpuInstanceId
xiyichan opened this issue · comments
On CentOS 7, calling GetGpuInstanceId fails with an error.
package main

import (
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	ret := nvml.Init()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer func() {
		ret := nvml.Shutdown()
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to shutdown NVML: %v", nvml.ErrorString(ret))
		}
	}()
	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get device count: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to get device at index %d: %v", i, nvml.ErrorString(ret))
		}
		// This call needs the nvmlDeviceGetGpuInstanceId symbol named in the error below.
		id, ret := device.GetGpuInstanceId()
		if ret != nvml.SUCCESS {
			log.Fatalf("Unable to get id of device at index %d: %v", i, nvml.ErrorString(ret))
		}
		log.Printf("GPU instance ID of device at index %d: %d", i, id)
	}
}
Error:
./gpu: symbol lookup error: ./gpu: undefined symbol: nvmlDeviceGetGpuInstanceId
@xiyichan what version of the CUDA driver / nvml library are you using?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      On   | 00000000:00:06.0 Off |                    0 |
| N/A   23C    P8    16W / 250W |      0MiB / 22945MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      On   | 00000000:00:07.0 Off |                    0 |
| N/A   22C    P8    17W / 250W |      0MiB / 22945MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Other functions work when I try them.
Also, nvmlDeviceGetGpuInstanceId is specifically for MIG devices. Looking at your sample, I don't think you're interested in the GPU instance ID. What information are you trying to extract?
Regarding the missing symbol: it was likely added in a later CUDA version, as the bindings are currently based on CUDA 11.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40 24GB      On   | 00000000:00:06.0 Off |                    0 |
| N/A   25C    P0    57W / 250W |    144MiB / 22945MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40 24GB      On   | 00000000:00:07.0 Off |                    0 |
| N/A   24C    P0    58W / 250W |    174MiB / 22945MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8330      C   python                                       134MiB |
|    1      6293      C   python                                       163MiB |
+-----------------------------------------------------------------------------+
GpuInstanceId appears in the process info. I want to know which GPU each process is using.
GPU Instance ID is a MIG specific construct, and is not applicable to full GPUs.
For more information on MIG, please see:
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
As such, I don't think this is actually what you want. Are you trying to figure out the meaning of the 0 and the 1 in the nvidia-smi output? That is just the index of the GPU, which you already have in your example code.
You want the GPU index, not the GPU Instance ID.
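To make the point about the GPU index concrete, here is a minimal sketch that groups processes by the index of the GPU they were reported on. The `procInfo` type and the `gpuIndexByPid` helper are hypothetical names used only for illustration. The sample data is taken from the nvidia-smi output above. In a real program, the per-device process lists would presumably come from a call such as `device.GetComputeRunningProcesses()` on each handle returned by `nvml.DeviceGetHandleByIndex(i)`; that wiring is an assumption and is left out so the sketch runs without a GPU.

```go
package main

import (
	"fmt"
	"sort"
)

// procInfo is a hypothetical stand-in for the per-process record NVML reports.
type procInfo struct {
	Pid           uint32
	UsedGpuMemory uint64 // bytes
}

// gpuIndexByPid maps each PID to the indices of the GPUs it was reported on.
// perDevice[i] holds the processes reported for the GPU at index i.
func gpuIndexByPid(perDevice [][]procInfo) map[uint32][]int {
	byPid := make(map[uint32][]int)
	for index, procs := range perDevice {
		for _, p := range procs {
			byPid[p.Pid] = append(byPid[p.Pid], index)
		}
	}
	return byPid
}

func main() {
	const MiB = 1 << 20
	// Sample data copied from the nvidia-smi process table in this thread.
	perDevice := [][]procInfo{
		{{Pid: 8330, UsedGpuMemory: 134 * MiB}}, // GPU 0
		{{Pid: 6293, UsedGpuMemory: 163 * MiB}}, // GPU 1
	}
	byPid := gpuIndexByPid(perDevice)

	// Sort the PIDs so the output order is stable.
	pids := make([]int, 0, len(byPid))
	for pid := range byPid {
		pids = append(pids, int(pid))
	}
	sort.Ints(pids)
	for _, pid := range pids {
		fmt.Printf("PID %d is using GPU index(es) %v\n", pid, byPid[uint32(pid)])
	}
}
```

The map value is a slice because one process can hold contexts on more than one GPU.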
Got it.
type ProcessInfo struct {
	Pid               uint32
	UsedGpuMemory     uint64
	GpuInstanceId     uint32
	ComputeInstanceId uint32
}
I looked at the source code, and I think it is GpuInstanceId.
Again, that is a MIG only construct.
It helps you dig into which MIG device the process is running on when you have MIG enabled.
However, MIG is only available on A100 GPUs, and is only exposed through NVML in the R450 driver (or newer).
You are running on a Tesla M40, with driver version 418.87.01.
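One way to avoid the symbol lookup error is to check the driver branch before calling any MIG-specific function. The sketch below parses a driver version string such as the `418.87.01` shown above and compares its branch against R450. The helper names `driverBranch` and `migQueriesAvailable` are hypothetical. In a real program the version string would presumably come from `nvml.SystemGetDriverVersion()`; that is an assumption and is not exercised here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minMIGDriverBranch is the first driver branch that exposes MIG through NVML,
// according to this thread (R450).
const minMIGDriverBranch = 450

// driverBranch extracts the leading branch number from a version like "418.87.01".
func driverBranch(version string) (int, error) {
	parts := strings.SplitN(strings.TrimSpace(version), ".", 2)
	branch, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, fmt.Errorf("unexpected driver version %q: %w", version, err)
	}
	return branch, nil
}

// migQueriesAvailable reports whether the driver is new enough for MIG queries.
// It does not check the hardware: the GPU itself must also support MIG.
func migQueriesAvailable(version string) (bool, error) {
	branch, err := driverBranch(version)
	if err != nil {
		return false, err
	}
	return branch >= minMIGDriverBranch, nil
}

func main() {
	// Driver version reported by nvidia-smi in this thread.
	ok, err := migQueriesAvailable("418.87.01")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("MIG queries available:", ok)
}
```

A driver check alone is necessary but not sufficient, since a Tesla M40 would not support MIG even on a newer driver.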
OK, thanks.