D34DC3N73R / netdata-glibc

netdata with glibc package for use with nvidia-docker2

libnvidia-ml.so

oamster opened this issue

I'm having trouble getting netdata to work with NVIDIA. I am able to run nvidia-smi on the host machine (OpenMediaVault) as well as in another Docker container (Plex Media Server). I was getting the same error in the Plex container as in netdata; editing config.toml to use ldconfig = "/sbin/ldconfig.real" fixed the issue for Plex, but it doesn't help netdata.
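
(For context, here's a quick way to check which ldconfig variants actually exist on the host; the paths below are the usual Debian/OMV ones and may differ on other systems:)

# Debian-based hosts usually ship the real binary as /sbin/ldconfig.real,
# with /sbin/ldconfig being a wrapper script
ls -l /sbin/ldconfig /sbin/ldconfig.real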

Here's my kernel version and Docker version:
Linux 5.10.0-0.bpo.9-amd64 #1 SMP Debian 5.10.70-1~bpo10+1 (2021-10-10) x86_64 GNU/Linux

Client: Docker Engine - Community
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.16.12
 Git commit:        e91ed57
 Built:             Mon Dec 13 11:45:37 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.12
  Git commit:       459d0df
  Built:            Mon Dec 13 11:43:46 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.12
  GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
 nvidia:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

I'm getting this error when running nvidia-smi in the container:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
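
A quick way to see whether the library was mounted into the container at all (assuming the container is named netdata and the image has ldconfig available):

# Look for the NVIDIA management library anywhere in the container's filesystem
docker exec netdata find / -name 'libnvidia-ml.so*' 2>/dev/null
# Check whether the container's linker cache knows about it
docker exec netdata bash -c 'ldconfig -p | grep -i nvidia-ml'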

I'm also seeing errors like these in the error log:

2022-01-13 21:05:35: go.d ERROR: prometheus[nvidia_gpu_exporter_local] Get "http://127.0.0.1:9445/metrics": dial tcp 127.0.0.1:9445: connect: connection refused

2022-01-13 21:05:35: go.d ERROR: prometheus[nvidia_gpu_exporter_local] check failed

2022-01-13 21:05:35: go.d ERROR: prometheus[nvidia_smi_exporter_local] Get "http://127.0.0.1:9454/metrics": dial tcp 127.0.0.1:9454: connect: connection refused

2022-01-13 21:05:35: go.d ERROR: prometheus[nvidia_smi_exporter_local] check failed

2022-01-13 21:05:35: python.d INFO: plugin[main] : [nvidia_smi] built 1 job(s) configs

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/usr/bin/nvidia-smi' (disk '_usr_bin_nvidia-smi', filesystem 'ext4', root '/usr/lib/nvidia/current/nvidia-smi') is not a directory. (errno 22, Invalid argument)

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/usr/bin/nvidia-debugdump' (disk '_usr_bin_nvidia-debugdump', filesystem 'ext4', root '/usr/lib/nvidia/current/nvidia-debugdump') is not a directory.

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/usr/lib64/libnvidia-ml.so.460.73.01' (disk '_usr_lib64_libnvidia-ml.so.460.73.01', filesystem 'ext4', root '/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.460.73.01') is not a directory. (errno 22, Invalid argument)

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/usr/lib64/libcuda.so.460.73.01' (disk '_usr_lib64_libcuda.so.460.73.01', filesystem 'ext4', root '/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.460.73.01') is not a directory.

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/usr/lib64/libnvidia-ptxjitcompiler.so.460.73.01' (disk '_usr_lib64_libnvidia-ptxjitcompiler.so.460.73.01', filesystem 'ext4', root '/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.460.73.01') is not a directory. (errno 22, Invalid argument)

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/dev/nvidiactl' (disk '_dev_nvidiactl', filesystem 'devtmpfs', root '/nvidiactl') is not a directory.

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/dev/nvidia-uvm' (disk '_dev_nvidia-uvm', filesystem 'devtmpfs', root '/nvidia-uvm') is not a directory. (errno 22, Invalid argument)

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/dev/nvidia-uvm-tools' (disk '_dev_nvidia-uvm-tools', filesystem 'devtmpfs', root '/nvidia-uvm-tools') is not a directory.

2022-01-13 21:05:36: netdata ERROR : PLUGIN[diskspace] : DISKSPACE: Mount point '/dev/nvidia0' (disk '_dev_nvidia0', filesystem 'devtmpfs', root '/nvidia0') is not a directory. (errno 22, Invalid argument)

2022-01-13 21:06:06: python.d ERROR: nvidia_smi[nvidia_smi] : xml parse failed: "b"NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.\nPlease also try adding directory that contains libnvidia-ml.so to your system PATH.\n"", error: syntax error: line 1, column 0

2022-01-13 21:06:06: python.d INFO: plugin[main] : nvidia_smi[nvidia_smi] : check failed
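
The python.d nvidia_smi collector parses XML output from nvidia-smi, so the parse error above is just the missing-library message being handed to the XML parser. Roughly the same call can be reproduced by hand (container name assumed):

# nvidia_smi reads nvidia-smi's XML query output; running it manually shows the raw failure
docker exec netdata nvidia-smi -x -q | head -n 5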

I haven't tested or run OpenMediaVault before, but this sounds similar to issue #3.
Does it work if you run:
docker exec netdata bash -c 'LDCONFIG=$(find /usr/lib64/ -name libnvidia-ml.so.*) nvidia-smi'

Here's the output:

~# docker exec netdata bash -c 'LDCONFIG=$(find /usr/lib64/ -name libnvidia-ml.so.*) nvidia-smi'
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.

My libnvidia libraries on the host machine are in:
/usr/lib/x86_64-linux-gnu/

I'm not sure if that's why it's not working, but my other containers work fine with it. For now I've resorted to Grafana.
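
(For reference, the host-side view can be checked like this; nvidia-container-cli ships with the container toolkit, and the grep patterns are just illustrative:)

# Where the host's linker cache has the driver library registered
ldconfig -p | grep libnvidia-ml
# What the NVIDIA container runtime would expose to containers
nvidia-container-cli list | grep -i nvidia-ml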

/usr/lib/x86_64-linux-gnu/ is also where libnvidia is on my host system (Ubuntu 20.04), but inside the container it should be in /usr/lib64/. What steps did you take to install the NVIDIA Container Toolkit and the NVIDIA drivers?

Edit: I also found this regarding OMV + NVIDIA:
https://forum.openmediavault.org/index.php?thread/40883-nvidia-working-with-omv-6/

Also see this if you're running OMV 5:
https://forum.openmediavault.org/index.php?thread/39413-nvidia-smi-couldn-t-find-libnvidia-ml-so-library-in-your-system-please-make-sure/

I had actually used this guide to set everything up, both the drivers and the NVIDIA toolkit:
https://forum.openmediavault.org/index.php?thread/38013-howto-nvidia-hardware-transcoding-on-omv-5-in-a-plex-docker-container/

I removed and reinstalled the drivers, but did not manually remove /usr/lib/x86_64-linux-gnu/ or anything in that directory. Maybe I should give that a try.
It's just strange that everything else works with the GPU, just not the official netdata image or yours.

Edit: Maybe it's an issue with /etc/nvidia-container-runtime/config.toml, as mine is:
#ldconfig = "@/sbin/ldconfig"
#ldconfig = "/sbin/ldconfig"
ldconfig = "/sbin/ldconfig.real"

Edit: But Plex and other containers error out when ldconfig is set to anything other than ldconfig.real.

config.toml is the default

$ cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
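
One thing worth noting: the prestart hook reads config.toml when a container is created, so after editing it the container has to be recreated. A sketch, assuming the container is named netdata and is managed with compose:

# Recreate the container so the NVIDIA prestart hook re-reads config.toml
docker rm -f netdata
# ...then bring it back up with the same options as before, e.g.
docker compose up -d netdata    # or your original 'docker run' command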

Did reinstalling help at all?

I tried reinstalling, but it didn't help. I changed my config.toml to ldconfig = "@/sbin/ldconfig" and am getting this error when deploying the container:

OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory: unknown

There's no error when using ldconfig = "/sbin/ldconfig.real", but I still get the python.d error.

I've resorted to using Prometheus, the nvidia-smi exporter, and Grafana, which works, but I still cannot get it to work with netdata.

Any update or progress on this? I'm having the exact same issue on OMV 6.

I got it to work. I followed the guide below on OMV 6 to install the NVIDIA drivers and nvidia-docker2.

https://forum.openmediavault.org/index.php?thread/31206-how-to-setup-nvidia-in-plex-docker-for-hardware-transcoding/

That guide indicates that ldconfig should be set to /sbin/ldconfig.real in /etc/nvidia-container-runtime/config.toml, but leaving it set to @/sbin/ldconfig (the default after I installed) works for both the Plex container and netdata.

Note that I also downgraded the NVIDIA packages as per the post below. Using up-to-date NVIDIA packages causes the Plex container to not work with the configuration noted above; the netdata-glibc container does work.

https://forums.developer.nvidia.com/t/issue-with-setting-up-triton-on-jetson-nano/248485/2
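
For anyone checking their own setup, the relevant line and a quick verification look like this (container name assumed to be netdata; the '@' prefix means the path is resolved on the host):

# Confirm what ldconfig is set to in the runtime config
grep '^ldconfig' /etc/nvidia-container-runtime/config.toml
# After recreating the container, confirm the library is found inside it
docker exec netdata nvidia-smi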

@cryptoDevTrader you may also want to give the dev image & instructions a try. We'll be moving to that with the next netdata release.
image: d34dc3n73r/netdata-glibc:dev
instructions: https://github.com/D34DC3N73R/netdata-glibc/tree/dev

When the official release happens you'll have to change the image to :stable or :latest depending on your preference.
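
A minimal sketch of switching over (the --gpus flag and published port here are illustrative; the linked dev README has the full recommended set of mounts, capabilities, and options):

docker pull d34dc3n73r/netdata-glibc:dev
docker rm -f netdata
# Recreate from the :dev tag; switch to :latest or :stable once the next release lands
docker run -d --name=netdata --gpus all -p 19999:19999 d34dc3n73r/netdata-glibc:dev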

This was hugely helpful!

I am running both netdata-glibc and Plex via docker-compose. netdata-glibc was already working properly with the previous config using the NVIDIA_VISIBLE_DEVICES env var and the nvidia runtime. Plex, however, was not working with the same configuration and the latest NVIDIA packages (older versions worked fine). Upgrading the NVIDIA packages to the latest versions and using the deploy method described in the dev branch worked for both deployments.
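
For anyone comparing the two approaches, this is roughly what the change looks like in a compose file. The file name and service fragment below are illustrative only, with field names per the Compose specification; the actual service definition in the dev README may differ:

cat > docker-compose.gpu.yml <<'EOF'
services:
  netdata:
    image: d34dc3n73r/netdata-glibc:dev
    # Old approach: nvidia runtime + NVIDIA_VISIBLE_DEVICES env var
    #   runtime: nvidia
    #   environment:
    #     - NVIDIA_VISIBLE_DEVICES=all
    # Deploy-style GPU reservation (Compose device requests):
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF
docker compose -f docker-compose.gpu.yml up -d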

Closing this, but feel free to reopen if it can be reproduced with the newest updates.