NVIDIA 'runtime' images don't have necessary CUDA components
hkelley opened this issue
Describe the bug
When using the runtime flavor of NVIDIA images (https://github.com/f0cker/crackq/blob/675a5b62191cd999b3f3a5304138ef021800e156/docker/nvidia/ubuntu/Dockerfile#L1C2-L1C2), hashcat does not recognize NVIDIA T4 GPUs, even though nvidia-smi does.
To Reproduce
Steps to reproduce the behavior:
- Build the containers.
- Open a shell in the crackq container:
sudo docker exec -it crackq /bin/bash
- Run nvidia-smi:
crackq@crackq:/opt/crackq/build$ nvidia-smi
Tue Aug 8 13:12:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 66C P0 30W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:00:05.0 Off | 0 |
| N/A 67C P0 28W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:00:06.0 Off | 0 |
| N/A 67C P0 30W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:00:07.0 Off | 0 |
| N/A 67C P0 30W / 70W | 2MiB / 15360MiB | 8% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
- Run hashcat -I or a benchmark:
clGetPlatformIDs(): CL_PLATFORM_NOT_FOUND_KHR
ATTENTION! No OpenCL-compatible or CUDA-compatible platform found.
You are probably missing the OpenCL or CUDA runtime installation.
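The clGetPlatformIDs() error indicates that neither an OpenCL ICD nor a CUDA runtime is visible to hashcat inside the container. A minimal diagnostic sketch to run inside the crackq container (assuming a standard NVIDIA Container Toolkit setup; the paths and environment variable below are general conventions, not CrackQ-specific):

# Hypothetical in-container checks; adjust paths for your distribution
ls /etc/OpenCL/vendors/                              # OpenCL ICD files (e.g. nvidia.icd) should appear here
ldconfig -p | grep -E 'libcuda|libnvrtc|libOpenCL'   # driver / CUDA runtime libraries visible to the loader
echo "$NVIDIA_DRIVER_CAPABILITIES"                   # should include "compute" (or "all")
hashcat -I                                           # list the compute devices hashcat can detect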
Expected behavior
Hashcat recognizes CUDA-compatible GPUs.
Additional context
This seems to work if you use the devel flavor of the image, e.g.
FROM nvidia/cuda:12.2.0-devel-ubuntu20.04
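A sketch of rebuilding against the devel base (the build context, tag, and smoke test below are assumptions; adapt them to however you normally build and run CrackQ):

# After changing the FROM line in docker/nvidia/ubuntu/Dockerfile to:
#   FROM nvidia/cuda:12.2.0-devel-ubuntu20.04
docker build -f docker/nvidia/ubuntu/Dockerfile -t crackq .
docker run --rm --gpus all crackq hashcat -I   # hypothetical smoke test; the image entrypoint may differ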
Per NVIDIA:
Three flavors of images are provided:
- base: Includes the CUDA runtime (cudart)
- runtime: Builds on the base and includes the [CUDA math libraries](https://developer.nvidia.com/gpu-accelerated-libraries), and [NCCL](https://developer.nvidia.com/nccl). A runtime image that also includes [cuDNN](https://developer.nvidia.com/cudnn) is available.
- devel: Builds on the runtime and includes headers, development tools for building CUDA images. These images are particularly useful for multi-stage builds.
Even once the devel images are used, this (tangential) issue is still present and will also need to be resolved for maximum stability.
Some workarounds are presented in that link; symlink creation may be the best one for CrackQ (a sketch follows the quoted explanation below).
When the container loses access to the GPU, you will see the following error message from the console output:
Failed to initialize NVML: Unknown Error
The container needs to be deleted once the issue occurs.
When it is restarted (manually or automatically depending on the use of a container orchestration platform), it will regain access to the GPU.
The issue originates from the fact that recent versions of runc require that symlinks be present under /dev/char to any device nodes being injected into a container. Unfortunately, these symlinks are not present for NVIDIA devices, and the NVIDIA GPU driver does not (currently) provide a means for them to be created automatically.
A fix will be present in the next patch release of all supported NVIDIA GPU drivers.
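For reference, a rough host-side sketch of the symlink workaround (an assumption based on the quoted runc behaviour, which wants /dev/char/<major>:<minor> links for injected device nodes; newer NVIDIA Container Toolkit releases may ship a dedicated helper for this):

# Hypothetical bash sketch: create /dev/char/<major>:<minor> symlinks for the NVIDIA device nodes
for dev in /dev/nvidia*; do
  [ -c "$dev" ] || continue                       # skip anything that is not a character device
  major=$(( 16#$(stat -c '%t' "$dev") ))          # stat reports the major number in hex
  minor=$(( 16#$(stat -c '%T' "$dev") ))          # stat reports the minor number in hex
  sudo ln -sf "$dev" "/dev/char/${major}:${minor}"
done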
Thanks for reporting this. Can you try the v0.1.2 branch and let me know if you're still seeing this issue?