utkuozdemir / nvidia_gpu_exporter

Nvidia GPU exporter for prometheus using nvidia-smi binary

nvidia_gpu_exporter doesn't work for NVIDIA A10

JoephomChen opened this issue

Describe the bug
The exporter doesn't work on a lab machine with an NVIDIA A10. It cannot collect GPU information normally.

Console output
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: 2023/07/18 07:23:20.045" query_field_name=timestamp raw_value="2023/07/18 07:23:20.045"
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: 535.54.03" query_field_name=driver_version raw_value=535.54.03
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: [n/a]" query_field_name=vgpu_driver_capability.heterogenous_multivGPU raw_value=[N/A]
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: gpu-5e10b7bc-91f1-640a-e927-963f7f82de44" query_field_name=uuid raw_value=GPU-5e10b7bc-91f1-640a-e927-963f7f82de44
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: 00000000:00:0c.0" query_field_name=pci.bus_id raw_value=00000000:00:0C.0
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: [n/a]" query_field_name=vgpu_device_capability.fractional_multiVgpu raw_value=[N/A]
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: [n/a]" query_field_name=vgpu_device_capability.heterogeneous_timeSlice_profile raw_value=[N/A]
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: [n/a]" query_field_name=vgpu_device_capability.heterogeneous_timeSlice_sizes raw_value=[N/A]
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: [n/a]" query_field_name=pcie.link.gen.hostmax raw_value=[N/A]
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: none" query_field_name=addressing_mode raw_value=None
ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug error="could not parse number from value: [n/a]" query_field_name=driver_model.current raw_value=[N/A]
ts=202
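
The debug lines above use a key=value layout, so the fields that failed numeric parsing can be listed with a short script. This is a minimal sketch for triaging the log and is not part of the exporter. The function names `parse_logfmt` and `unparsed_fields` are hypothetical, and the script assumes every line uses the same quoting as the console output above.

```python
import shlex


def parse_logfmt(line):
    """Split one key=value log line into a dict (hypothetical helper)."""
    fields = {}
    # shlex.split keeps quoted values such as error="..." together as one token.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def unparsed_fields(lines):
    """Return (query_field_name, raw_value) for each line reporting a parse failure."""
    result = []
    for line in lines:
        fields = parse_logfmt(line)
        if "could not parse number" in fields.get("error", ""):
            result.append((fields.get("query_field_name"), fields.get("raw_value")))
    return result


# Two lines copied from the console output above.
sample = [
    'ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug '
    'error="could not parse number from value: 535.54.03" '
    'query_field_name=driver_version raw_value=535.54.03',
    'ts=2023-07-18T07:23:20.116Z caller=exporter.go:209 level=debug '
    'error="could not parse number from value: [n/a]" '
    'query_field_name=vgpu_driver_capability.heterogenous_multivGPU raw_value=[N/A]',
]

for name, raw in unparsed_fields(sample):
    print(f"{name}: {raw}")
```

Every field listed in the excerpt holds either a text value (timestamp, driver version, UUID, PCI bus ID) or `[N/A]`, which may help narrow down whether any numeric field is affected.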

Model and Version

  • GPU Model: NVIDIA A10
  • Operating System: Ubuntu Server 20.04
  • Nvidia GPU driver version: 535.54.03

Additional context
$ dpkg -l | grep nvidia
ii libnvidia-cfg1-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-525 525.125.06-0ubuntu0.20.04.3 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA libcompute package
rc libnvidia-compute-535:amd64 535.54.03-0ubuntu0.20.04.4 amd64 NVIDIA libcompute package
ii libnvidia-decode-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-525:amd64 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii nvidia-compute-utils-525 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA compute utilities
ii nvidia-dkms-525 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA DKMS package
ii nvidia-driver-525 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA driver metapackage
ii nvidia-driver-local-repo-ubuntu2004-515.105.01 1.0-1 amd64 nvidia-driver-local repository configuration files
ii nvidia-kernel-common-525 525.125.06-0ubuntu0.20.04.3 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-525 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA kernel source package
ii nvidia-prime 0.8.16~0.20.04.2 all Tools to enable NVIDIA's Prime
ii nvidia-settings 470.57.01-0ubuntu0.20.04.3 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-525 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA driver support binaries
ii screen-resolution-extra 0.18build1 all Extension for the nvidia-settings control panel
ii xserver-xorg-video-nvidia-525 525.125.06-0ubuntu0.20.04.3 amd64 NVIDIA binary Xorg driver

$ nvidia-smi
Tue Jul 18 08:37:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10                     Off | 00000000:00:0C.0 Off |                    0 |
|  0%   54C    P0              63W / 150W |   8594MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     28269      C   python                                     8582MiB |
+---------------------------------------------------------------------------------------+