Fails to start on second run: libs being set to 0 size
dbkinghorn opened this issue
Now I've got a show stopper: I basically cannot use enroot with the current driver and libnvidia-container*.
Ubuntu Server 22.04
Driver Version: 535.113.01
nvidia-container-cli --version
cli-version: 1.14.2
lib-version: 1.14.2
enroot version 3.4.1
Example:
enroot import docker://nvcr.io#nvidia/cuda:12.2.0-runtime-ubuntu22.04
enroot create --name cuda12.2 nvidia+cuda+12.2.0-runtime-ubuntu22.04.sqsh
enroot start cuda12.2 # runs correctly
In the container on first run:
/lib/x86_64-linux-gnu$ ls -l | grep nvidia
lrwxrwxrwx 1 kinghorn kinghorn 33 Oct 5 11:27 libnvidia-allocator.so.1 -> libnvidia-allocator.so.535.113.01
-rw-r--r-- 1 nobody nogroup 160552 Sep 25 02:45 libnvidia-allocator.so.535.113.01
lrwxrwxrwx 1 kinghorn kinghorn 27 Oct 5 11:27 libnvidia-cfg.so.1 -> libnvidia-cfg.so.535.113.01
-rw-r--r-- 1 nobody nogroup 270840 Sep 25 02:45 libnvidia-cfg.so.535.113.01
lrwxrwxrwx 1 kinghorn kinghorn 26 Oct 5 11:27 libnvidia-ml.so.1 -> libnvidia-ml.so.535.113.01
-rw-r--r-- 1 nobody nogroup 1819968 Sep 25 02:45 libnvidia-ml.so.535.113.01
lrwxrwxrwx 1 kinghorn kinghorn 28 Oct 5 11:27 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.535.113.01
-rw-r--r-- 1 nobody nogroup 86140736 Sep 25 02:45 libnvidia-nvvm.so.535.113.01
lrwxrwxrwx 1 kinghorn kinghorn 30 Oct 5 11:27 libnvidia-opencl.so.1 -> libnvidia-opencl.so.535.113.01
-rw-r--r-- 1 nobody nogroup 24224408 Sep 25 02:45 libnvidia-opencl.so.535.113.01
-rw-r--r-- 1 nobody nogroup 10176 Sep 25 02:45 libnvidia-pkcs11-openssl3.so.535.113.01
lrwxrwxrwx 1 kinghorn kinghorn 38 Oct 5 11:27 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.535.113.01
-rw-r--r-- 1 nobody nogroup 23348992 Sep 25 02:45 libnvidia-ptxjitcompiler.so.535.113.01
Viewed from the host, the libs in the container rootfs are already clobbered (zero size):
~/.local/share/enroot/cuda12.2/lib/x86_64-linux-gnu$ ls -l | grep nvidia
lrwxrwxrwx 1 kinghorn kinghorn 33 Oct 5 11:10 libnvidia-allocator.so.1 -> libnvidia-allocator.so.535.113.01
-rw-r--r-- 1 kinghorn kinghorn 0 Oct 5 11:10 libnvidia-allocator.so.535.113.01
lrwxrwxrwx 1 kinghorn kinghorn 27 Oct 5 11:10 libnvidia-cfg.so.1 -> libnvidia-cfg.so.535.113.01
-rw-r--r-- 1 kinghorn kinghorn 0 Oct 5 11:10 libnvidia-cfg.so.535.113.01
lrwxrwxrwx 1 kinghorn kinghorn 26 Oct 5 11:10 libnvidia-ml.so.1 -> libnvidia-ml.so.535.113.01
-rw-r--r-- 1 kinghorn kinghorn 0 Oct 5 11:10 libnvidia-ml.so.535.113.01
lrwxrwxrwx 1 kinghorn kinghorn 28 Oct 5 11:10 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.535.113.01
-rw-r--r-- 1 kinghorn kinghorn 0 Oct 5 11:10 libnvidia-nvvm.so.535.113.01
lrwxrwxrwx 1 kinghorn kinghorn 30 Oct 5 11:10 libnvidia-opencl.so.1 -> libnvidia-opencl.so.535.113.01
-rw-r--r-- 1 kinghorn kinghorn 0 Oct 5 11:10 libnvidia-opencl.so.535.113.01
-rw-r--r-- 1 kinghorn kinghorn 0 Oct 5 11:10 libnvidia-pkcs11-openssl3.so.535.113.01
lrwxrwxrwx 1 kinghorn kinghorn 38 Oct 5 11:10 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.535.113.01
-rw-r--r-- 1 kinghorn kinghorn 0 Oct 5 11:10 libnvidia-ptxjitcompiler.so.535.113.01
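For anyone hitting this, a quick sanity check from the host is to look for zero-size driver libraries under the container rootfs. This is just a sketch: the demo directory below stands in for the real rootfs so the snippet is self-contained; on a real system point `ROOTFS` at the enroot container instead.

```shell
# Detect truncated ("file too short") NVIDIA driver libraries in a rootfs.
# On a real system you would use something like:
#   ROOTFS="$HOME/.local/share/enroot/cuda12.2"
# Here we build a throwaway directory with one zero-size lib as a stand-in.
ROOTFS=$(mktemp -d)
mkdir -p "$ROOTFS/lib/x86_64-linux-gnu"
: > "$ROOTFS/lib/x86_64-linux-gnu/libnvidia-ml.so.535.113.01"  # zero bytes, as in this bug

# Any output here means the rootfs driver libraries were clobbered:
find "$ROOTFS/lib" -name 'libnvidia-*' -size 0 -print
```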
exit
enroot start cuda12.2 # fails
nvidia-container-cli: initialization error: load library failed: /home/kinghorn/.local/share/enroot/cuda12.2/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file too short
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
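Once the rootfs is in this state, one way to recover (a sketch, assuming the original .sqsh image from the import step is still on disk) is to discard the broken container and unpack it again:

```shell
# Recovery sketch: remove the rootfs containing the zero-size libraries and
# recreate it from the squashfs image imported earlier. Names match this
# report; adjust for your container. Guarded so it is a no-op where enroot
# is not installed.
if command -v enroot >/dev/null 2>&1; then
    enroot remove -f cuda12.2   # -f skips the confirmation prompt
    enroot create --name cuda12.2 nvidia+cuda+12.2.0-runtime-ubuntu22.04.sqsh
else
    echo "enroot not found; skipping"
fi
```

This of course only works around the symptom; the libraries get truncated again on the next failing start.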
I don't know whether this is actually an enroot issue. Should I be reporting this somewhere else?
Did you end up doing #222 (comment)?
This might be the culprit since libnvidia-container will attempt to load NVML while the driver is still not mounted.
Not sure what you can do until #222 is fixed.
Maybe LD_PRELOAD instead of LD_LIBRARY_PATH would do it:
export LD_PRELOAD="${ENROOT_ROOTFS}/lib/x86_64-linux-gnu/libnvidia-container-go.so.1"
Ahhh! When I initially tested #222 I made a mistake! I tried it again and it does take care of my test case.