D34DC3N73R / netdata-glibc

netdata with glibc package for use with nvidia-docker2

Can't run nvidia-smi in container

mathieu-b opened this issue · comments

Hello

First of all, thanks for figuring out a way to get NVIDIA GPU monitoring working by just extending the base netdata image 🙏

I followed the instructions as reported on the DockerHub page.
I can start the container, and then access the web server running at :19999.
However, I can't see any section hinting at GPU / nvidia-smi monitoring.
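For reference, the run command follows the DockerHub instructions, roughly along these lines (a sketch; the exact volumes and options on my machine may differ slightly):

$ docker run -d --name=netdata \
    -p 19999:19999 \
    -v /proc:/host/proc:ro \
    -v /sys:/host/sys:ro \
    -v /var/run/docker.sock:/var/run/docker.sock:ro \
    --cap-add SYS_PTRACE \
    --security-opt apparmor=unconfined \
    --runtime=nvidia \
    d34dc3n73r/netdata-glibc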

Not seeing any stats, I thought there might be some issue with the execution of nvidia-smi (assuming netdata uses it internally).

I tried executing nvidia-smi in the container:

docker exec netdata  nvidia-smi

but received this error:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
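(The library itself does seem to be present in the image, it is just not on the default linker search path; a find inside the container is how I located it under /usr/lib64:)

$ docker exec netdata find / -name "libnvidia-ml.so*" 2>/dev/null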

The only way I found to get nvidia-smi to execute successfully via docker exec was the following:

docker exec netdata bash -c 'LD_PRELOAD=$(find /usr/lib64/ -name "libnvidia-ml.so.*")  nvidia-smi'

based on this StackOverflow answer

Any clues about how this issue could be solved?

Maybe I'll take a peek at netdata's sources to see if I can "patch" the system (supposing the solution is indeed to use LD_PRELOAD).
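If LD_PRELOAD does turn out to be the fix, maybe something as simple as a wrapper script placed ahead of the real binary in the container's PATH would work (just a sketch of the idea, untested; the /usr/local/bin and /usr/bin paths are assumptions about this image):

#!/bin/bash
# hypothetical wrapper, e.g. saved as /usr/local/bin/nvidia-smi inside the container
# preload the driver's NVML library so the real nvidia-smi can find it
export LD_PRELOAD=$(find /usr/lib64/ -name "libnvidia-ml.so.*" | head -n 1)
exec /usr/bin/nvidia-smi "$@"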

Best regards.

Have you installed nvidia drivers on the host system? If so, how did you accomplish that? (There are a couple of ways, but I'd recommend adding the graphics-drivers ppa). Can you execute nvidia-smi on the host system? Have you installed the nvidia-container-toolkit? Are you using docker run or docker-compose?
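(For reference, driver installation via the graphics-drivers PPA plus the container toolkit usually looks something like the following on Ubuntu; the driver version here is only an example, and the NVIDIA apt repository for the toolkit is assumed to be configured already.)

# NVIDIA driver from the graphics-drivers PPA
$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt-get update
$ sudo apt-get install nvidia-driver-430

# container toolkit, then restart the daemon
$ sudo apt-get install nvidia-container-toolkit
$ sudo systemctl restart docker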

Hi

Here's some info:

Docker engine version:

$ docker --version
Docker version 18.06.2-ce, build 6d37f41

nvidia-smi on host machine:

$ nvidia-smi
Tue Nov 12 13:10:42 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 44%   64C    P2   115W / 250W |   3439MiB / 10989MiB |     19%      Default |
+-------------------------------+----------------------+----------------------+

Docker runtime:

$ docker info | grep "Runtime"
Runtimes: nvidia runc
Default Runtime: nvidia
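(The default runtime is presumably set in /etc/docker/daemon.json; for nvidia-docker2 that file typically looks roughly like this, though I haven't verified the exact contents on this machine:)

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}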

nvidia-smi in container:

$ docker container run nvidia/cuda:10.1-devel-ubuntu16.04 nvidia-smi
Tue Nov 12 12:14:08 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 44%   64C    P2   113W / 250W |   3439MiB / 10989MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

The system was installed and configured by another person; however, what I know is:

I see that on the main page of the GitHub repository, NVIDIA seems to have updated their "main" instructions for a more recent version of the Docker Engine, and it looks like they have deprecated the "old" instructions:

Maybe a newer version / updated installation might fix the issue...

Regards

It does seem similar to this issue raised on the nvidia-docker package: NVIDIA/nvidia-docker#854

I'd recommend updating Docker, the NVIDIA drivers, and nvidia-docker/nvidia-container-toolkit. If you're using docker run, a separate runtime is not required since Docker v19.03. See the Docker 19.03 + nvidia-container-toolkit example.
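(With Docker 19.03+ and nvidia-container-toolkit installed, the run command would use the --gpus flag instead of the nvidia runtime, roughly like this; the other flags are a sketch, see the image's README for the full set:)

$ docker run -d --name=netdata \
    -p 19999:19999 \
    --gpus all \
    --cap-add SYS_PTRACE \
    --security-opt apparmor=unconfined \
    d34dc3n73r/netdata-glibc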

I see, thanks for the heads-up.
I'm not sure how soon I'll be able to test the newer version and instructions.
If that happens, I'll try to report back in this thread.

Regards

Going to close this issue, but feel free to open another if you have trouble after updating.