NVIDIA / libnvidia-container

NVIDIA container runtime library

/run/nvidia-persistenced/socket: no such device or address

matyro opened this issue · comments

Hi,
we are currently setting up a new cluster deployment environment with Slurm, Pyxis, and Enroot.
Our machines have DGX OS installed.

Container images like centos
srun --container-image=centos grep PRETTY /etc/os-release

finish without a problem. GPU-based images like
srun --container-image=nvcr.io/nvidia/tensorflow:22.08-tf2-py3 /bin/bash

experience a problem during startup:

nvidia-container-cli: mount error: file creation failed: /raid/enroot-data/user-9011/pyxis_59.0/run/nvidia-persistenced/socket: no such device or address
[ERROR] /raid/enroot//hooks.d/98-nvidia.sh exited with return code 1

I think it is some misconfiguration, but at the moment I am not able to spot it.

Hi @matyro. Could you file an issue at https://github.com/NVIDIA/enroot instead, as they may have a better idea of what is happening here? If there is an issue with the NVIDIA Container CLI, please update this issue with the relevant context.

It is unclear where exactly the problem originates.
At the moment we are using a DGX A100 node with no major changes to DGX OS for testing.

Thanks @matyro. As a matter of interest, which version of libnvidia-container-tools is being used?

Hi, not the newest release:

nvidia-container-cli -V
cli-version: 1.7.0
lib-version: 1.7.0
build date: 2021-11-30T19:53+00:00
build revision: f37bb387ad05f6e501069d99e4135a97289faf1f
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Updating to the newest available version did not help:

enroot start -e NVIDIA_VISIBLE_DEVICES=all --root --rw cuda_root
nvidia-container-cli: container error: stat failed: /raid/enroot-data/cuda_root/proc/12474: no such file or directory
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1


root@ml2ran03:~# nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Plain enroot is working if GPUs are deactivated:
enroot start -e NVIDIA_VISIBLE_DEVICES=void --root --rw cuda_root

So the problem must be somewhere in the interaction between the two.

@matyro this doesn't show the same error creating /run though, or are the error messages in fact identical?

You are right, it seems /proc was not mounted correctly; it's empty:

root@ml2ran03:/# ll /
total 84
drwxrwxr-x 17 root root  4096 Oct  4 19:29 ./
drwxrwxr-x 17 root root  4096 Oct  4 19:29 ../
-rw-r--r--  1 root root     0 Oct  6 13:01 .lock
-rw-r--r--  1 root root 16047 Aug  8 21:08 NGC-DL-CONTAINER-LICENSE
lrwxrwxrwx  1 root root     7 Aug  1 13:22 bin -> usr/bin/
drwxr-xr-x  2 root root  4096 Apr 15  2020 boot/
drwxr-xr-x  2 root root  4096 Oct  5 10:17 dev/
drwxrwxr-x 36 root root  4096 Oct  4 18:57 etc/
drwxr-xr-x  2 root root  4096 Apr 15  2020 home/
lrwxrwxrwx  1 root root     7 Aug  1 13:22 lib -> usr/lib/
lrwxrwxrwx  1 root root     9 Aug  1 13:22 lib32 -> usr/lib32/
lrwxrwxrwx  1 root root     9 Aug  1 13:22 lib64 -> usr/lib64/
lrwxrwxrwx  1 root root    10 Aug  1 13:22 libx32 -> usr/libx32/
drwxr-xr-x  2 root root  4096 Aug  1 13:22 media/
drwxr-xr-x  2 root root  4096 Aug  1 13:22 mnt/
drwxr-xr-x  2 root root  4096 Aug  1 13:22 opt/
drwxr-xr-x  2 root root  4096 Apr 15  2020 proc/
drwx------  2 root root  4096 Oct  5 17:46 root/
drwxr-xr-x  5 root root  4096 Aug  1 13:25 run/
lrwxrwxrwx  1 root root     8 Aug  1 13:22 sbin -> usr/sbin/
drwxr-xr-x  2 root root  4096 Aug  1 13:22 srv/
drwxr-xr-x  3 root root  4096 Oct  5 10:17 sys/
drwxrwxrwt  2 root root  4096 Aug  1 13:25 tmp/
drwxr-xr-x 13 root root  4096 Aug  1 13:22 usr/
drwxr-xr-x 11 root root  4096 Aug  1 13:25 var/

The error depends on the config and/or version installed.
With the default config files and an up-to-date version, /proc seems to be the first error.

Hi @matyro, with regards to the error related to /run/nvidia-persistenced/socket: after seeing similar behaviour in NVIDIA/nvidia-docker#1690, I was able to reproduce this locally (by including the -v /var/run:/var/run flag when starting a docker container).
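Roughly along these lines (the image tag here is only for illustration):

docker run --rm --gpus all -v /var/run:/var/run nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi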

I have updated the "mount" code in libnvidia-container to check whether a file exists before creating it; see https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/187.
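Conceptually the change looks something like this (a simplified C sketch of the idea, not the actual diff from the MR):

#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>

/* Only create the bind-mount target if it does not already exist. */
static int ensure_mount_target(const char *path)
{
        struct stat s;

        /* If the path is already present (e.g. because /var/run was bind-mounted
           into the container and already contains the socket), reuse it as-is. */
        if (stat(path, &s) == 0)
                return 0;
        if (errno != ENOENT)
                return -1;

        /* Otherwise create an empty regular file to serve as the mount target. */
        int fd = open(path, O_CREAT | O_WRONLY | O_CLOEXEC, 0644);
        if (fd < 0)
                return -1;
        close(fd);
        return 0;
}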

Are you in a position to test this patch?

We are currently having a hardware problem on one of our nodes and are in contact with enterprise support. When the hardware exchange is done next week, we will have a spare node available and should be able to run some tests.

The machine is back. What is the simplest way to test it now?
I already cloned the repository and switched the branch to your MR.

Running make -f mk/docker.mk ubuntu18.04-amd64 from the libnvidia-container repo root should generate Debian packages in ./dist/ubuntu18.04/amd64 (replace this with your distro of choice).

You can then install the .deb files directly with: sudo dpkg -i dist/ubuntu18.04/amd64/libnvidia-container1_1.12.0~rc.1-1_amd64.deb dist/ubuntu18.04/amd64/libnvidia-container-tools_1.12.0~rc.2-1_amd64.deb

This will then be available system-wide. You can confirm the commit used to build the nvidia-container-cli by running:

nvidia-container-cli --version
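For example, to check just the revision line (same output format as the -V output you posted above):

nvidia-container-cli -V | grep "build revision"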

A (forced) reinstall of the libnvidia-container1 and libnvidia-container-tools packages from our public repos should restore the state.
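On Ubuntu / DGX OS that would look something like this (a sketch; adjust to your package manager and pinned versions):

sudo apt-get install --reinstall libnvidia-container1 libnvidia-container-tools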

Hi,
I deployed the fixed version today on our test machine and enroot is working fine.
GPUs are limited to the selected device and the container starts without an error message.

root@ml2ran01:~# NVIDIA_VISIBLE_DEVICES=1 enroot start --root --rw cuda10.2-U18.04 nvidia-smi
Thu Oct 27 08:55:45 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   28C    P0    50W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@ml2ran01:~#

Only SLURM is still a bit problematic, but this should have nothing to do with libnvidia-container:

root@gwkilab:~# srun --gres=gpu:1 --container-image=nvcr.io/nvidia/cuda:10.2-devel-ubuntu18.04 nvidia-smi
pyxis: importing docker image: nvcr.io/nvidia/cuda:10.2-devel-ubuntu18.04
pyxis: imported docker image: nvcr.io/nvidia/cuda:10.2-devel-ubuntu18.04
Thu Oct 27 08:49:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   28C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4E:00.0 Off |                    0 |
| N/A   29C    P0    54W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:87:00.0 Off |                    0 |
| N/A   34C    P0    57W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:90:00.0 Off |                    0 |
| N/A   31C    P0    55W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   31C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   32C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Thanks for the confirmation @matyro. That's good news.

I'm not sure how srun interacts with the NVIDIA container stack to provide the isolation. Do you have a link to their documentation on this?

https://slurm.schedmd.com/gres.html#GPU_Management
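For context, the GPU GRES mechanism described there is configured roughly like this (illustrative snippet, not our exact configuration):

# gres.conf on the compute node
Name=gpu File=/dev/nvidia[0-7]

# slurm.conf (plus the usual CPU/memory attributes on the NodeName line)
GresTypes=gpu
NodeName=dgx-node Gres=gpu:8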

In the Slurm job, CUDA_VISIBLE_DEVICES is set correctly. But even directly on the node,
CUDA_VISIBLE_DEVICES=0 nvidia-smi

shows all GPUs.

Since nvidia-smi uses the low-level NVIDIA Management Library under the hood, I don't think it checks CUDA_VISIBLE_DEVICES to filter the list. This seems like an issue with how enroot is being triggered (through pyxis) when invoking srun.

For clarification, the NVIDIA container stack uses the NVIDIA_VISIBLE_DEVICES environment variable in the container to determine which devices to inject. This is processed specifically by the NVIDIA Container Runtime Hook and is (as far as I am aware) not applicable to enroot directly.
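To make the distinction concrete, reusing the commands from earlier in this thread (device indices are just examples):

# Selects which GPUs get injected into the container (as in the enroot example above); only GPU 0 is visible inside:
NVIDIA_VISIBLE_DEVICES=0 enroot start --root --rw cuda_root nvidia-smi

# Only filters what CUDA applications enumerate; nvidia-smi queries NVML directly and still lists every GPU:
CUDA_VISIBLE_DEVICES=0 nvidia-smi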

Everything works now, and I think the problem can be closed from my side.
The only remaining question would be whether there is an estimated date for when the patched libnvidia-container will be available directly in DGX OS.

Thanks for your help
Dominik

Thanks @matyro. Our current timeline is to release the final version including this fix at the start of December. It should be picked up for distribution through the DGX repos shortly after that.

Note that we will release an rc.2 with these changes between now and then, and this will be installable through our public experimental repositories.

@matyro sorry for the delayed response from my end. We have released 1.12.0-rc.2 including this fix. Once the final version is released (still at the start of December), that should be picked up by the DGX repositories.

@matyro v1.12.0 has now been released. Sorry for the delay on this.

Please reopen the issue if you're still having problems.