NVIDIA / libcudacxx

[ARCHIVED] The C++ Standard Library for your entire system. See https://github.com/NVIDIA/cccl

Home Page: https://nvidia.github.io/libcudacxx


Nvidia nvmlInit() blocks simultaneous calls.

chenja2000 opened this issue · comments

This issue appears when multiple GPU applications are running and they call nvmlInit() from the NVIDIA Management Library (NVML) simultaneously.

The symptom is that the GPU applications hang for a while inside the nvmlInit() call.

How to reproduce?
The delay can be seen by running a few hundred "time nvidia-smi &" commands simultaneously on one GPU node:
time nvidia-smi & time nvidia-smi & time nvidia-smi & ...

Example test with 200 simultaneous runs.
Result: the first run takes 1.646s, but the last takes over 12 seconds.
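The one-liner above can be sketched as a small script that launches N concurrent runs and prints each run's wall-clock time. This is only an illustrative sketch: launch_concurrent is my own helper name, and it assumes a Linux shell with GNU date and awk; the original report simply chained "time nvidia-smi &" about 200 times.

```shell
# Hedged sketch of the reproduction above: start N background invocations
# of a command and print each one's wall-clock time.
launch_concurrent() {
  n="$1"; cmd="$2"
  for i in $(seq 1 "$n"); do
    {
      t0=$(date +%s.%N)              # start timestamp for this run
      "$cmd" > /dev/null 2>&1        # the call that contends on nvmlInit()
      t1=$(date +%s.%N)              # end timestamp
      echo "run $i: $(awk "BEGIN{printf \"%.3f\", $t1 - $t0}")s"
    } &
  done
  wait   # block until every background run has printed its time
}

# Usage on a GPU node (mirrors the 200-run test above):
# launch_concurrent 200 nvidia-smi
```

With nvidia-smi as the command, the printed times should spread out as later runs queue behind earlier nvmlInit() calls.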

[root@supp20 ~]# Mon Nov 1 09:40:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   29C    P0    30W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:81:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

real 0m1.646s
user 0m0.003s
sys 0m0.986s

// The last "time nvidia-smi" takes over 12 seconds
real 0m12.057s
user 0m0.006s
sys 0m0.568s

This issue prevents IBM Spectrum LSF from using NVIDIA GPUs properly, and it has become very urgent for our customers.
Any advice or solutions from NVIDIA would be appreciated.

This is not an issue with libcudacxx. Please forward issues to https://forums.developer.nvidia.com/

@wmaxey Thanks! I created one topic there: https://forums.developer.nvidia.com/t/nvidia-nvmlinit-blocks-simultaneous-calls/193837
Not sure if an NVIDIA developer will pick it up.