Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
Hey guys, I am trying to run the Mistral 7B model using the guide on the page.
I am running:

```bash
docker run --gpus all \
  -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
  ghcr.io/mistralai/mistral-src/vllm:latest \
  --host 0.0.0.0 \
  --model mistralai/Mistral-7B-Instruct-v0.2
```
and I am getting the following error:

```
└─$ docker run --gpus '"device=0"' -e HF_TOKEN=$HF_TOKEN -p 8000:8000 ghcr.io/mistralai/mistral-src/vllm:latest --host 0.0.0.0 --model mistralai/Mistral-7B-Instruct-v0.2
The HF_TOKEN environment variable set, logging to Hugging Face.
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
INFO 04-15 15:25:32 api_server.py:719] args: Namespace(host='0.0.0.0', port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
config.json: 100%|██████████| 596/596 [00:00<00:00, 6.74MB/s]
INFO 04-15 15:25:33 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
tokenizer_config.json: 100%|██████████| 1.46k/1.46k [00:00<00:00, 19.4MB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 9.14MB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 3.16MB/s]
special_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 953kB/s]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 729, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 109, in __init__
    self._init_workers(distributed_init_method)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 141, in _init_workers
    self._run_workers(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 750, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 724, in _run_workers_in_batch
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 59, in init_model
    torch.cuda.set_device(self.device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 298, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
```
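To rule out vLLM itself, one way to reproduce the failure directly is to ask PyTorch inside the same image whether it can initialize CUDA at all. This is just a sketch: it assumes the image's entrypoint can be overridden and that `python3` and `torch` are on its PATH.

```bash
# Sketch: bypass the vLLM entrypoint (assuming docker lets us override it for
# this image) and probe CUDA initialization directly from PyTorch.
docker run --rm --gpus all --entrypoint python3 \
  ghcr.io/mistralai/mistral-src/vllm:latest \
  -c "import torch; print(torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"
```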
I tried several suggested fixes, including those in:
- lllyasviel/Fooocus#2169
- https://stackoverflow.com/questions/66371130/cuda-initialization-unexpected-error-from-cudagetdevicecount

but nothing worked. I also ran some default NVIDIA containers to check that CUDA is working, and everything seems fine (see the last section below).
My `nvidia-smi` output:

```
└─$ nvidia-smi
Mon Apr 15 12:29:36 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:09:00.0 On | N/A |
| 30% 45C P0 58W / 170W | 490MiB / 12288MiB | 12% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1800 G /usr/lib/xorg/Xorg 179MiB |
| 0 N/A N/A 1999 G /usr/bin/gnome-shell 47MiB |
| 0 N/A N/A 2214 G /usr/bin/nvidia-settings 0MiB |
| 0 N/A N/A 2757 G ...--variations-seed-version 45MiB |
| 0 N/A N/A 2900 G ...b/firefox-esr/firefox-esr 114MiB |
| 0 N/A N/A 3904 G ...on=20240414-180149.278000 98MiB |
+-----------------------------------------------------------------------------+
```
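The driver line above reports `CUDA Version: 12.0`. If I read the 804 error right, it fires when a container ships a CUDA runtime newer than what the driver supports and the forward-compatibility path kicks in, which as far as I know is only supported on data-center GPUs, not on a GeForce RTX 3060. A sketch for comparing the two versions; the in-image tooling and paths are assumptions on my part:

```bash
# What driver (and hence maximum natively supported CUDA) does the host have?
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# What CUDA runtime does the image ship? (Assumes bash and a /usr/local/cuda*
# layout inside the image; adjust if the image is laid out differently.)
docker run --rm --gpus all --entrypoint bash \
  ghcr.io/mistralai/mistral-src/vllm:latest \
  -c 'ls -d /usr/local/cuda* 2>/dev/null; python3 -c "import torch; print(torch.version.cuda)"'
```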
My `/etc/nvidia-container-runtime/config.toml` file:

```toml
#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"
[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false
[nvidia-ctk]
path = "nvidia-ctk"
```

Note: if I change the `no-cgroups` flag to `true`, I get a `No CUDA GPUs available` error.
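In case it matters, toggling that flag can be done like this (a sketch; the `sed` pattern assumes the single `no-cgroups` entry in the default file layout shown above):

```bash
# Flip no-cgroups in the NVIDIA container runtime config, then restart Docker
# so the change is picked up.
sudo sed -i 's/^no-cgroups = false/no-cgroups = true/' \
  /etc/nvidia-container-runtime/config.toml
sudo systemctl restart docker
```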
OS:

```
└─$ neofetch
guido@kali
----------
OS: Kali GNU/Linux Rolling x86_64
Kernel: 6.6.9-amd64
Uptime: 37 mins
Packages: 2981 (dpkg), 12 (snap)
Shell: bash 5.2.21
Resolution: 1920x1080, 1920x1080
DE: GNOME 45.3
WM: Mutter
WM Theme: Kali-Purple-Dark
Theme: Kali-Purple-Dark [GTK2/3]
Icons: Flat-Remix-Blue-Light [GTK2/3]
Terminal: terminator
CPU: AMD Ryzen 7 5800X (16) @ 4.200GHz
GPU: NVIDIA GeForce RTX 3060 Lite Hash Rate
Memory: 4669MiB / 32013MiB
```

Also, if I run an example CUDA container to check that everything works... everything works:

```
└─$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Mon Apr 15 15:53:55 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:09:00.0 On | N/A |
| 30% 45C P0 57W / 170W | 383MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
```
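Note that the `ubuntu` test above only exercises the driver via `nvidia-smi`; it never loads a CUDA runtime. A closer comparison might be a CUDA base image pinned to the driver's CUDA 12.0 (the exact tag is an assumption on my part):

```bash
# Unlike the plain ubuntu image, this one ships a CUDA 12.0 runtime matching
# the 525 driver, so it should not need the forward-compatibility path.
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
```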