NVIDIA / libnvidia-container

NVIDIA container runtime library

Fails to start on second run: libs being set to 0 size

dbkinghorn opened this issue

Now I've got a show stopper. I basically cannot use enroot with the current driver and libnvidia-container*.

Ubuntu Server 22.04
Driver Version: 535.113.01
nvidia-container-cli --version
cli-version: 1.14.2
lib-version: 1.14.2
enroot version 3.4.1

Example:
enroot import docker://nvcr.io#nvidia/cuda:12.2.0-runtime-ubuntu22.04
enroot create --name cuda12.2 nvidia+cuda+12.2.0-runtime-ubuntu22.04.sqsh
enroot start cuda12.2 # runs correctly

In the container on first run:

/lib/x86_64-linux-gnu$ ls -l | grep nvidia
lrwxrwxrwx  1 kinghorn kinghorn       33 Oct  5 11:27 libnvidia-allocator.so.1 -> libnvidia-allocator.so.535.113.01
-rw-r--r--  1 nobody   nogroup    160552 Sep 25 02:45 libnvidia-allocator.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn       27 Oct  5 11:27 libnvidia-cfg.so.1 -> libnvidia-cfg.so.535.113.01
-rw-r--r--  1 nobody   nogroup    270840 Sep 25 02:45 libnvidia-cfg.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn       26 Oct  5 11:27 libnvidia-ml.so.1 -> libnvidia-ml.so.535.113.01
-rw-r--r--  1 nobody   nogroup   1819968 Sep 25 02:45 libnvidia-ml.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn       28 Oct  5 11:27 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.535.113.01
-rw-r--r--  1 nobody   nogroup  86140736 Sep 25 02:45 libnvidia-nvvm.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn       30 Oct  5 11:27 libnvidia-opencl.so.1 -> libnvidia-opencl.so.535.113.01
-rw-r--r--  1 nobody   nogroup  24224408 Sep 25 02:45 libnvidia-opencl.so.535.113.01
-rw-r--r--  1 nobody   nogroup     10176 Sep 25 02:45 libnvidia-pkcs11-openssl3.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn       38 Oct  5 11:27 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.535.113.01
-rw-r--r--  1 nobody   nogroup  23348992 Sep 25 02:45 libnvidia-ptxjitcompiler.so.535.113.01

On the host system the libs are already clobbered:

~/.local/share/enroot/cuda12.2/lib/x86_64-linux-gnu$ ls -l | grep nvidia
lrwxrwxrwx  1 kinghorn kinghorn      33 Oct  5 11:10 libnvidia-allocator.so.1 -> libnvidia-allocator.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-allocator.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn      27 Oct  5 11:10 libnvidia-cfg.so.1 -> libnvidia-cfg.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-cfg.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn      26 Oct  5 11:10 libnvidia-ml.so.1 -> libnvidia-ml.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-ml.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn      28 Oct  5 11:10 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-nvvm.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn      30 Oct  5 11:10 libnvidia-opencl.so.1 -> libnvidia-opencl.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-opencl.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-pkcs11-openssl3.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn      38 Oct  5 11:10 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-ptxjitcompiler.so.535.113.01

exit
enroot start cuda12.2 # fails

nvidia-container-cli: initialization error: load library failed: /home/kinghorn/.local/share/enroot/cuda12.2/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file too short
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
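
A quick way to confirm that a rootfs is in this state before starting it is to look for zero-byte driver libraries like the ones in the host listing above. This is only a sketch: the helper name `find_empty_nvidia_libs` is hypothetical, the path is the one from this report, and `-maxdepth` assumes GNU or BSD `find`.

```shell
# Hypothetical helper: list zero-byte NVIDIA driver libraries in a rootfs lib directory.
find_empty_nvidia_libs() {
    libdir="$1"
    # -type f skips the .so.1 symlinks; -size 0 matches only empty regular files.
    find "$libdir" -maxdepth 1 -type f -name 'libnvidia-*.so.*' -size 0
}

# Path taken from the report above (assumption: default enroot data directory).
libdir="${HOME}/.local/share/enroot/cuda12.2/lib/x86_64-linux-gnu"
if [ -d "$libdir" ]; then
    find_empty_nvidia_libs "$libdir"
fi
```

Any path printed here is an empty file that the loader would reject with "file too short" if it were picked up from the rootfs.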

I don't know whether this is an enroot issue. Should I be reporting this somewhere else?

Did you end up doing #222 (comment)?
This might be the culprit, since libnvidia-container will attempt to load NVML before the driver is mounted.
Not sure what you can do until #222 is fixed.
Maybe LD_PRELOAD instead of LD_LIBRARY_PATH would do it:

export LD_PRELOAD="${ENROOT_ROOTFS}/lib/x86_64-linux-gnu/libnvidia-container-go.so.1"

Ahhh! When I initially tested #222 I made a mistake! I tried it again and it does take care of my test case.