Can't run nvidia-smi in container
mathieu-b opened this issue
Hello,
First of all, thanks for figuring out a way to get NVIDIA GPU monitoring working by just extending the base netdata image!
I followed the instructions as reported on the DockerHub page.
I can start the container and then access the web server running at :19999.
However, I don't see any section hinting at GPU / nvidia-smi metrics.
Not seeing any stats, I thought that maybe there was some issue with the execution of nvidia-smi (if netdata uses it internally).
I tried executing nvidia-smi in the container:
docker exec netdata nvidia-smi
but received this error:
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
The only way I found to get nvidia-smi to execute successfully via docker exec was the following:
docker exec netdata bash -c 'LD_PRELOAD=$(find /usr/lib64/ -name "libnvidia-ml.so.*") nvidia-smi'
based on this StackOverflow answer.
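For completeness, another workaround that seems to work here (just a sketch, assuming the driver libraries live under /usr/lib64 as the find above suggests) is to register that directory with the container's dynamic linker instead of preloading the library:
# write the driver library path into the linker config and rebuild the cache
docker exec netdata bash -c 'echo /usr/lib64 > /etc/ld.so.conf.d/nvidia.conf && ldconfig'
# nvidia-smi should then locate libnvidia-ml.so without LD_PRELOAD
docker exec netdata nvidia-smi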
Any clues about how this could be solved?
Maybe I'll take a peek at netdata's sources to see if I can "patch" the system (supposing the solution is indeed using LD_PRELOAD).
Best regards.
A few questions:
- Have you installed NVIDIA drivers on the host system? If so, how did you accomplish that? (There are a couple of ways, but I'd recommend adding the graphics-drivers PPA.)
- Can you execute nvidia-smi on the host system?
- Have you installed the nvidia-container-toolkit?
- Are you using docker run or docker-compose?
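For reference, installing drivers from the graphics-drivers PPA on Ubuntu usually looks something like this (a sketch; the driver package version below is just an example, pick one suited to your GPU):
$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt-get update
$ sudo apt-get install nvidia-driver-430   # example version only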
Hi,
Here is some info:
Docker engine version:
$ docker --version
Docker version 18.06.2-ce, build 6d37f41
nvidia-smi on the host machine:
$ nvidia-smi
Tue Nov 12 13:10:42 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 44% 64C P2 115W / 250W | 3439MiB / 10989MiB | 19% Default |
+-------------------------------+----------------------+----------------------+
Docker runtime:
$ docker info | grep "Runtime"
Runtimes: nvidia runc
Default Runtime: nvidia
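For context, the default runtime comes from /etc/docker/daemon.json; as far as I can tell, ours looks roughly like the stock nvidia-docker2 config (sketched here, not copied verbatim from the machine):
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}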
nvidia-smi in a container:
$ docker container run nvidia/cuda:10.1-devel-ubuntu16.04 nvidia-smi
Tue Nov 12 12:14:08 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 44% 64C P2 113W / 250W | 3439MiB / 10989MiB | 22% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
The system was installed and configured by another person; however, what I know is:
- these instructions from NVIDIA were followed (the relevant steps are sketched below): https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
- for nvidia-docker2 to work, that exact version of the Docker engine had to be used.
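For what it's worth, the final install steps from that wiki page (Ubuntu flavor) look like this; this is a sketch of the documented procedure, not a transcript of what was actually run on this machine:
# after adding the nvidia-docker repository per the wiki
$ sudo apt-get install nvidia-docker2
$ sudo pkill -SIGHUP dockerd   # reload the Docker daemon configuration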
I see that on the main page of the GitHub repository, NVIDIA seems to have updated their "main" instructions for a more recent version of the Docker Engine, and it looks like they have deprecated these "old" instructions.
Maybe a newer version / updated installation will fix the issue...
Regards
It does seem similar to this issue raised on the nvidia-docker package: NVIDIA/nvidia-docker#854
I'd recommend updating Docker, the NVIDIA drivers, and nvidia-docker/nvidia-container-toolkit. If you're using docker run, a separate runtime has not been required since Docker v19.03. See the Docker 19.03 + nvidia-container-toolkit example.
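As a quick sketch of that route (the netdata flags here are illustrative and minimal, not a full recommended command):
$ # verify the toolkit works via the new --gpus flag
$ docker run --rm --gpus all nvidia/cuda:10.1-base nvidia-smi
$ # then run netdata with GPU access
$ docker run -d --name netdata --gpus all -p 19999:19999 netdata/netdata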
I see, thanks for the heads-up.
I'm not sure how soon I'll be able to test the newer version and instructions.
If that happens, I'll try to report back in this thread.
Regards
Going to close this issue, but feel free to open another if you have trouble after updating.