Can't run nvidia-smi in container
mathieu-b opened this issue
Hello,
First of all, thanks for figuring out a way to get NVIDIA GPU monitoring working by just extending the base netdata image!
I followed the instructions as reported on the DockerHub page.
I can start the container and then access the web server running at :19999.
However, I don't see any section hinting at GPU / nvidia-smi metrics.
Not seeing any stats, I thought that maybe there was some issue with the execution of nvidia-smi (if netdata uses it internally).
I tried executing nvidia-smi in the container:
docker exec netdata nvidia-smi
but received this error:
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
The only way I found to get nvidia-smi to execute successfully via docker exec was the following:
docker exec netdata bash -c 'LD_PRELOAD=$(find /usr/lib64/ -name "libnvidia-ml.so.*") nvidia-smi'
based on this StackOverflow answer.
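For completeness, another workaround that seems to work here (just a sketch, assuming the driver libraries live under /usr/lib64 as the find above suggests) is to register that directory with the container's dynamic linker instead of preloading the library:
# write the driver library path into the linker config and rebuild the cache
docker exec netdata bash -c 'echo /usr/lib64 > /etc/ld.so.conf.d/nvidia.conf && ldconfig'
# nvidia-smi should then locate libnvidia-ml.so without LD_PRELOAD
docker exec netdata nvidia-smi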
Any clues about how this could be solved?
Maybe I'll take a peek at netdata's sources to see if I can "patch" the system (supposing the solution is indeed using LD_PRELOAD).
Best regards.
A few questions:
- Have you installed NVIDIA drivers on the host system? If so, how did you accomplish that? (There are a couple of ways, but I'd recommend adding the graphics-drivers PPA.)
- Can you execute nvidia-smi on the host system?
- Have you installed the nvidia-container-toolkit?
- Are you using docker run or docker-compose?
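For reference, installing drivers from the graphics-drivers PPA on Ubuntu usually looks something like this (a sketch; the driver package version below is just an example, pick one suited to your GPU):
$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt-get update
$ sudo apt-get install nvidia-driver-430   # example version only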
Hi,
Here is some info:
Docker engine version:
$ docker --version
Docker version 18.06.2-ce, build 6d37f41
nvidia-smi on the host machine:
$ nvidia-smi
Tue Nov 12 13:10:42 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 44% 64C P2 115W / 250W | 3439MiB / 10989MiB | 19% Default |
+-------------------------------+----------------------+----------------------+
Docker runtime:
$ docker info | grep "Runtime"
Runtimes: nvidia runc
Default Runtime: nvidia
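For context, the default runtime comes from /etc/docker/daemon.json; as far as I can tell, ours looks roughly like the stock nvidia-docker2 config (sketched here, not copied verbatim from the machine):
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}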
nvidia-smi in a container:
$ docker container run nvidia/cuda:10.1-devel-ubuntu16.04 nvidia-smi
Tue Nov 12 12:14:08 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 44% 64C P2 113W / 250W | 3439MiB / 10989MiB | 22% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
The system was installed and configured by another person; however, what I know is:
- these instructions from NVIDIA were followed (the relevant steps are sketched below): https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
- for nvidia-docker2 to work, that exact version of the Docker engine had to be used.
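For what it's worth, the final install steps from that wiki page (Ubuntu flavor) look like this; this is a sketch of the documented procedure, not a transcript of what was actually run on this machine:
# after adding the nvidia-docker repository per the wiki
$ sudo apt-get install nvidia-docker2
$ sudo pkill -SIGHUP dockerd   # reload the Docker daemon configuration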
I see that on the main page of the GitHub repository, NVIDIA seems to have updated their "main" instructions for a more recent version of the Docker Engine, and it looks like they have deprecated these "old" instructions.
Maybe a newer version / updated installation will fix the issue...
Regards
It does seem similar to this issue raised on the nvidia-docker package: NVIDIA/nvidia-docker#854
I'd recommend updating Docker, the NVIDIA drivers, and nvidia-docker/nvidia-container-toolkit. If you're using docker run, a separate runtime has not been required since Docker v19.03. See the Docker 19.03 + nvidia-container-toolkit example.
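As a quick sketch of that route (the netdata flags here are illustrative and minimal, not a full recommended command):
$ # verify the toolkit works via the new --gpus flag
$ docker run --rm --gpus all nvidia/cuda:10.1-base nvidia-smi
$ # then run netdata with GPU access
$ docker run -d --name netdata --gpus all -p 19999:19999 netdata/netdata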
I see, thanks for the heads-up.
I'm not sure how soon I'll be able to test the newer version and instructions.
If that happens, I'll try to report back in this thread.
Regards
Going to close this issue, but feel free to open another if you have trouble after updating.