NVIDIA 'runtime' images don't have necessary CUDA components
hkelley opened this issue
Describe the bug
When using the runtime flavor of NVIDIA images (https://github.com/f0cker/crackq/blob/675a5b62191cd999b3f3a5304138ef021800e156/docker/nvidia/ubuntu/Dockerfile#L1C2-L1C2), hashcat does not recognize NVIDIA T4 GPUs, even though nvidia-smi does.
To Reproduce
Steps to reproduce the behavior:
- Build the containers.
- Open a shell in the crackq container:
sudo docker exec -it crackq /bin/bash
- Run nvidia-smi:
crackq@crackq:/opt/crackq/build$ nvidia-smi
Tue Aug 8 13:12:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 66C P0 30W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:00:05.0 Off | 0 |
| N/A 67C P0 28W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:00:06.0 Off | 0 |
| N/A 67C P0 30W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:00:07.0 Off | 0 |
| N/A 67C P0 30W / 70W | 2MiB / 15360MiB | 8% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
- Run hashcat -I or a benchmark:
clGetPlatformIDs(): CL_PLATFORM_NOT_FOUND_KHR
ATTENTION! No OpenCL-compatible or CUDA-compatible platform found.
You are probably missing the OpenCL or CUDA runtime installation.
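The clGetPlatformIDs() error indicates that neither an OpenCL ICD nor a CUDA runtime is visible to hashcat inside the container. A minimal diagnostic sketch to run inside the crackq container (assuming a standard NVIDIA Container Toolkit setup; the paths and environment variable below are general conventions, not CrackQ-specific):

# Hypothetical in-container checks; adjust paths for your distribution
ls /etc/OpenCL/vendors/                              # OpenCL ICD files (e.g. nvidia.icd) should appear here
ldconfig -p | grep -E 'libcuda|libnvrtc|libOpenCL'   # driver / CUDA runtime libraries visible to the loader
echo "$NVIDIA_DRIVER_CAPABILITIES"                   # should include "compute" (or "all")
hashcat -I                                           # list the compute devices hashcat can detect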
Expected behavior
Hashcat recognizes CUDA-compatible GPUs.
Additional context
This seems to work if you use the devel flavor of the image, e.g.
FROM nvidia/cuda:12.2.0-devel-ubuntu20.04
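A sketch of rebuilding against the devel base (the build context, tag, and smoke test below are assumptions; adapt them to however you normally build and run CrackQ):

# After changing the FROM line in docker/nvidia/ubuntu/Dockerfile to:
#   FROM nvidia/cuda:12.2.0-devel-ubuntu20.04
docker build -f docker/nvidia/ubuntu/Dockerfile -t crackq .
docker run --rm --gpus all crackq hashcat -I   # hypothetical smoke test; the image entrypoint may differ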
Per NVIDIA:
Three flavors of images are provided:
- base: Includes the CUDA runtime (cudart)
- runtime: Builds on the base and includes the [CUDA math libraries](https://developer.nvidia.com/gpu-accelerated-libraries), and [NCCL](https://developer.nvidia.com/nccl). A runtime image that also includes [cuDNN](https://developer.nvidia.com/cudnn) is available.
- devel: Builds on the runtime and includes headers, development tools for building CUDA images. These images are particularly useful for multi-stage builds.
Even once the devel images are used, this (tangential) issue is still present and will also need to be resolved for maximum stability.
Some workarounds are presented in that link; symlink creation may be the best one for CrackQ (a sketch follows the quoted explanation below).
When the container loses access to the GPU, you will see the following error message from the console output:
Failed to initialize NVML: Unknown Error
The container needs to be deleted once the issue occurs.
When it is restarted (manually or automatically depending on the use of a container orchestration platform), it will regain access to the GPU.
The issue originates from the fact that recent versions of runc require that symlinks be present under /dev/char to any device nodes being injected into a container. Unfortunately, these symlinks are not present for NVIDIA devices, and the NVIDIA GPU driver does not (currently) provide a means for them to be created automatically.
A fix will be present in the next patch release of all supported NVIDIA GPU drivers.
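For reference, a rough host-side sketch of the symlink workaround (an assumption based on the quoted runc behaviour, which wants /dev/char/<major>:<minor> links for injected device nodes; newer NVIDIA Container Toolkit releases may ship a dedicated helper for this):

# Hypothetical bash sketch: create /dev/char/<major>:<minor> symlinks for the NVIDIA device nodes
for dev in /dev/nvidia*; do
  [ -c "$dev" ] || continue                       # skip anything that is not a character device
  major=$(( 16#$(stat -c '%t' "$dev") ))          # stat reports the major number in hex
  minor=$(( 16#$(stat -c '%T' "$dev") ))          # stat reports the minor number in hex
  sudo ln -sf "$dev" "/dev/char/${major}:${minor}"
done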
Thanks for reporting this. Can you try the v0.1.2 branch and let me know if you're still seeing this issue?