utkuozdemir / nvidia_gpu_exporter

Nvidia GPU exporter for prometheus using nvidia-smi binary

command failed. stderr: err: exit status 12

lex-em opened this issue · comments

commented

Describe the bug
The exporter fails with the error command failed. stderr: err: exit status 12 when running in Docker.

To Reproduce
docker-compose.yml

version: "3"
services:
  nvidia_smi_exporter:
    image: utkuozdemir/nvidia_gpu_exporter:0.4.0
    devices:
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia0:/dev/nvidia0
    volumes:
      - /usr/lib/libnvidia-ml.so:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so
      - /usr/lib/libnvidia-ml.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
      - /usr/bin/nvidia-smi:/usr/bin/nvidia-smi
    ports:
      - 9835:9835

Console output
docker-compose service console output

ts=2022-03-05T12:14:57.407Z caller=exporter.go:108 level=warn msg="Failed to auto-determine query field names, falling back to the built-in list"
2022-03-05T12:14:57.408274606Z ts=2022-03-05T12:14:57.408Z caller=main.go:66 level=info msg="Listening on address" address=:9835
2022-03-05T12:14:57.408511827Z ts=2022-03-05T12:14:57.408Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
2022-03-05T12:15:01.058200295Z ts=2022-03-05T12:15:01.058Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:19:49.798104217Z ts=2022-03-05T12:19:49.797Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:20:14.187971066Z ts=2022-03-05T12:20:14.187Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:20:44.187095757Z ts=2022-03-05T12:20:44.186Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:21:14.187231908Z ts=2022-03-05T12:21:14.187Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:21:44.187147375Z ts=2022-03-05T12:21:44.187Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:22:14.186874585Z ts=2022-03-05T12:22:14.186Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:22:44.186995854Z ts=2022-03-05T12:22:44.186Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"
2022-03-05T12:23:14.188342901Z ts=2022-03-05T12:23:14.187Z caller=exporter.go:157 level=error error="command failed. stderr:  err: exit status 12"

Model and Version
OS: Fedora Linux 35
Qt version: 5.15.2
Kernel version: 5.16.11-200.fc35.x86_64
CPU: i7-11800H
GPU: NVIDIA GeForce RTX 3050 Ti Laptop GPU/PCIe/SSE2
NVIDIA Driver Version: 510.47.03
NVML Version: 11.510.47.03

$ ll /dev | grep nvidia
crw-rw-rw-.  1 root root    195,     0 Mar  5 16:14 nvidia0
crw-rw-rw-.  1 root root    195,   255 Mar  5 16:14 nvidiactl
crw-rw-rw-.  1 root root    195,   254 Mar  5 16:14 nvidia-modeset
crw-rw-rw-.  1 root root    505,     0 Mar  5 16:14 nvidia-uvm
crw-rw-rw-.  1 root root    505,     1 Mar  5 16:14 nvidia-uvm-tools
$ ll /usr/lib | grep nvidia
lrwxrwxrwx.  1 root root        26 Feb  1 20:33 libEGL_nvidia.so.0 -> libEGL_nvidia.so.510.47.03
-rwxr-xr-x.  1 root root   1224012 Jan 25 03:35 libEGL_nvidia.so.510.47.03
lrwxrwxrwx.  1 root root        32 Feb  1 20:33 libGLESv1_CM_nvidia.so.1 -> libGLESv1_CM_nvidia.so.510.47.03
-rwxr-xr-x.  1 root root     71120 Jan 25 03:34 libGLESv1_CM_nvidia.so.510.47.03
lrwxrwxrwx.  1 root root        29 Feb  1 20:33 libGLESv2_nvidia.so.2 -> libGLESv2_nvidia.so.510.47.03
-rwxr-xr-x.  1 root root    128464 Jan 25 03:34 libGLESv2_nvidia.so.510.47.03
lrwxrwxrwx.  1 root root        26 Feb  1 20:33 libGLX_nvidia.so.0 -> libGLX_nvidia.so.510.47.03
-rwxr-xr-x.  1 root root   1082980 Jan 25 03:34 libGLX_nvidia.so.510.47.03
lrwxrwxrwx.  1 root root        32 Feb  1 20:33 libnvidia-allocator.so.1 -> libnvidia-allocator.so.510.47.03
-rwxr-xr-x.  1 root root    121408 Jan 25 03:34 libnvidia-allocator.so.510.47.03
-rwxr-xr-x.  1 root root  59574832 Jan 25 03:56 libnvidia-compiler.so.510.47.03
-rwxr-xr-x.  1 root root  28190356 Jan 25 03:48 libnvidia-eglcore.so.510.47.03
lrwxrwxrwx.  1 root root        29 Feb  1 20:33 libnvidia-encode.so.1 -> libnvidia-encode.so.510.47.03
-rwxr-xr-x.  1 root root    124048 Jan 25 03:34 libnvidia-encode.so.510.47.03
lrwxrwxrwx.  1 root root        26 Feb  1 20:33 libnvidia-fbc.so.1 -> libnvidia-fbc.so.510.47.03
-rwxr-xr-x.  1 root root    136828 Jan 25 03:34 libnvidia-fbc.so.510.47.03
-rwxr-xr-x.  1 root root  30472084 Jan 25 03:49 libnvidia-glcore.so.510.47.03
-rwxr-xr-x.  1 root root    613928 Jan 25 03:35 libnvidia-glsi.so.510.47.03
-rwxr-xr-x.  1 root root  18955008 Jan 25 03:53 libnvidia-glvkspirv.so.510.47.03
lrwxrwxrwx.  1 root root        25 Feb  1 20:33 libnvidia-ml.so -> libnvidia-ml.so.510.47.03
lrwxrwxrwx.  1 root root        25 Feb  1 20:33 libnvidia-ml.so.1 -> libnvidia-ml.so.510.47.03
-rwxr-xr-x.  1 root root   1702708 Jan 25 03:36 libnvidia-ml.so.510.47.03
lrwxrwxrwx.  1 root root        29 Feb  1 20:33 libnvidia-opencl.so.1 -> libnvidia-opencl.so.510.47.03
-rwxr-xr-x.  1 root root  17126348 Jan 25 03:56 libnvidia-opencl.so.510.47.03
lrwxrwxrwx.  1 root root        34 Feb  1 20:33 libnvidia-opticalflow.so.1 -> libnvidia-opticalflow.so.510.47.03
-rwxr-xr-x.  1 root root     46224 Jan 25 03:33 libnvidia-opticalflow.so.510.47.03
lrwxrwxrwx.  1 root root        37 Feb  1 20:33 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.510.47.03
-rwxr-xr-x.  1 root root  12802792 Jan 25 03:40 libnvidia-ptxjitcompiler.so.510.47.03
-rwxr-xr-x.  1 root root     13560 Jan 25 03:33 libnvidia-tls.so.510.47.03
drwxr-xr-x.  2 root root      4096 Feb  6 18:06 nvidia
$ ll /usr/bin | grep nvidia
-rwxr-xr-x.  1 root root       36981 Jan 25 04:57 nvidia-bug-report.sh
-rwxr-xr-x.  1 root root       47528 Feb 14 17:03 nvidia-container-cli
-rwxr-xr-x.  1 root root     2260408 Feb 14 17:04 nvidia-container-runtime
lrwxrwxrwx.  1 root root          33 Feb 17 08:09 nvidia-container-runtime-hook -> /usr/bin/nvidia-container-toolkit
-rwxr-xr-x.  1 root root     2156344 Feb 14 17:04 nvidia-container-toolkit
-rwxr-xr-x.  1 root root       49920 Jan 25 04:09 nvidia-cuda-mps-control
-rwxr-xr-x.  1 root root       14488 Jan 25 04:09 nvidia-cuda-mps-server
-rwxr-xr-x.  1 root root      260912 Jan 25 03:48 nvidia-debugdump
-rwxr-xr-x.  1 root root         721 Feb 14 17:05 nvidia-docker
-rwxr-xr-x.  1 root root     3896400 Jan 25 03:49 nvidia-ngx-updater
-rwxr-xr-x.  1 root root       45272 Feb  2 01:13 nvidia-persistenced
-rwxr-xr-x.  1 root root      978560 Jan 25 03:49 nvidia-powerd
-rwxr-xr-x.  1 root root      323128 Feb  2 01:29 nvidia-settings
-rwxr-xr-x.  1 root root         904 Jan 25 03:45 nvidia-sleep.sh
-rwxr-xr-x.  1 root root      690808 Jan 25 03:49 nvidia-smi

Hi,

I have never tried this on Fedora Linux, only on Ubuntu. Can you please run the following command on the host system (not inside a container) and see if it works:

nvidia-smi --query-gpu="timestamp,driver_version" --format=csv

If it's not working, it's an issue with the driver installation on the host.

If it is working on the host, then you need to experiment a bit: find the correct locations of libnvidia-ml.so, libnvidia-ml.so.1 and nvidia-smi on your Fedora system, mount them into the container, and then run the same query manually inside the container with different mount configurations until you get it working.
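For example, a minimal check inside the running container could look like the sketch below (the service name nvidia_smi_exporter comes from your compose file above; this assumes the container is up and that docker-compose exec is available in your setup):

# Hypothetical check: run the same query inside the running exporter container
# to verify that the mounted nvidia-smi binary and NVML libraries are usable there.
docker-compose exec nvidia_smi_exporter \
  nvidia-smi --query-gpu="timestamp,driver_version" --format=csv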

If you find a working config for Fedora+Docker, please share it here so I can add it to the documentation.

Also, looking at your error code (12) - if I remember correctly, nvidia-smi returns 12 when it cannot find or load the NVML shared library - this might be helpful: influxdata/telegraf#4388

Please see the solutions suggested in that ticket - they might help with your problem.

commented

I found my mistake - I was mounting the wrong source libraries. The correct ones are:

    volumes:
      - /usr/lib64/libnvidia-ml.so:/usr/lib64/libnvidia-ml.so:ro
      - /usr/lib64/libnvidia-ml.so.1:/usr/lib64/libnvidia-ml.so.1:ro
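
Putting it together, a Fedora-oriented compose file based on these corrected mounts might look like the sketch below (devices, ports and the nvidia-smi mount are carried over unchanged from the reproduction config above; treat it as an untested sketch outside this setup):

version: "3"
services:
  nvidia_smi_exporter:
    image: utkuozdemir/nvidia_gpu_exporter:0.4.0
    devices:
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia0:/dev/nvidia0
    volumes:
      # On Fedora the NVML libraries live under /usr/lib64 rather than /usr/lib
      - /usr/lib64/libnvidia-ml.so:/usr/lib64/libnvidia-ml.so:ro
      - /usr/lib64/libnvidia-ml.so.1:/usr/lib64/libnvidia-ml.so.1:ro
      - /usr/bin/nvidia-smi:/usr/bin/nvidia-smi
    ports:
      - 9835:9835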