bug: Can't run llama2-7b with WasmEdge on RTX 4080
alabulei1 opened this issue
Summary
I can't run llama2-7b on RTX 4080.
Current State
The following error messages are returned:
CUDA error 222 at /mnt/build_nv20_cuda_120_b1656/_deps/llama-src/ggml-cuda.cu:7788: the provided PTX was compiled with an unsupported toolchain.
current device: 0
GGML_ASSERT: /mnt/build_nv20_cuda_120_b1656/_deps/llama-src/ggml-cuda.cu:7788: !"CUDA error"
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
/dev/fd/63: line 379: 304930 Aborted (core dumped) wasmedge --dir .:. --nn-preload default:GGML:AUTO:$model_file llama-chat.wasm --stream-stdout --prompt-template $prompt_template $log_stat
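For reference, CUDA error 222 is cudaErrorUnsupportedPtxVersion: the PTX embedded in the binary was compiled with a CUDA toolkit newer than what the installed driver can run. A quick way to compare the two sides (a sketch, assuming a driver recent enough to support the query flags):

nvidia-smi | head -n 4                                          # header shows the highest CUDA version the driver supports
nvidia-smi --query-gpu=driver_version --format=csv,noheader    # driver version alone, machine-readable
nvcc --version | grep release                                   # local toolkit version (only relevant for local builds)

If the plugin was built with a toolkit newer than the driver's supported CUDA version, error 222 is the expected failure mode.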
Expected State
I can interact with llama2-7b.
Reproduction steps
- Run the following command:
bash <(curl -sSfL 'https://code.flows.network/webhook/iwYN1SdN3AmPgR5ao5Gt/run-llm.sh')
- Choose 1) install WasmEdge, then 1) Llama2-7b-chat, then 1) run with CLI, and finally 1) Yes
- Ask a question such as "Where is Paris?". The following error messages are then returned:
[You]:
Where is Paris?
---------------- [LOG: STATISTICS] -----------------
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 2048.00 MB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 2307.22 MiB
llama_new_context_with_model: VRAM scratch buffer: 2304.03 MiB
llama_new_context_with_model: total VRAM used: 8826.96 MiB (model: 4474.93 MiB, context: 4352.03 MiB)
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 2048.00 MB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 2307.22 MiB
llama_new_context_with_model: VRAM scratch buffer: 2304.03 MiB
llama_new_context_with_model: total VRAM used: 8826.96 MiB (model: 4474.93 MiB, context: 4352.03 MiB)
CUDA error 222 at /mnt/build_nv20_cuda_120_b1656/_deps/llama-src/ggml-cuda.cu:7788: the provided PTX was compiled with an unsupported toolchain.
current device: 0
GGML_ASSERT: /mnt/build_nv20_cuda_120_b1656/_deps/llama-src/ggml-cuda.cu:7788: !"CUDA error"
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
/dev/fd/63: line 379: 304930 Aborted (core dumped) wasmedge --dir .:. --nn-preload default:GGML:AUTO:$model_file llama-chat.wasm --stream-stdout --prompt-template $prompt_template $log_stat
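Note that the ptrace/gdb lines above ("Could not attach to process", "No stack.") are secondary: after the assertion fires, llama.cpp tries to attach a debugger to print a backtrace, and Ubuntu's Yama ptrace hardening blocks that. If a backtrace is wanted on the next crash, a minimal sketch (assuming stock Ubuntu defaults; this does not affect the CUDA error itself):

cat /proc/sys/kernel/yama/ptrace_scope        # 1 = restricted, the Ubuntu default
sudo sysctl -w kernel.yama.ptrace_scope=0     # relax until the next reboot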
Screenshots
No response
Any logs you want to share for showing the specific issue
(base) ai4080@Ai4080-System:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
(base) ai4080@Ai4080-System:~$ nvidia-smi
Fri Dec 29 19:40:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4080         On | 00000000:01:00.0 Off |                  N/A |
|  0%   39C    P8                8W / 320W|   3825MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1323      G   /usr/lib/xorg/Xorg                          155MiB |
|    0   N/A  N/A      1591      G   /usr/bin/gnome-shell                         30MiB |
|    0   N/A  N/A      3890      G   ...irefox/3068/usr/lib/firefox/firefox      161MiB |
|    0   N/A  N/A     12777      C   python3                                    3470MiB |
+---------------------------------------------------------------------------------------+
(base) ai4080@Ai4080-System:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy
Components
CLI
WasmEdge Version or Commit you used
0.13.5
Operating system information
Ubuntu 22.04.2 LTS
Hardware Architecture
x86_64
Compiler flags and options
No response
Related discussion: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/discussions/23
Looks like your NVIDIA driver is too old for the CUDA toolchain the plugin was built with. I guess the only solution is to upgrade the NVIDIA driver and nvcc.
From the official developer forums:
It indicates a mismatch between driver and compilation toolchain. Not having any other details (GPU in use, CUDA version in use from nvcc --version, GPU driver version) from you, I can't be specific. My recommendation (based on the 11.2 in your cub path) would be to update your driver to the latest one available for your GPU. If your driver is less than 460.39, I would update your driver.
Sounds like updating the driver is the only solution to fix it.
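If it helps, one possible upgrade path on Ubuntu 22.04 (a sketch; the driver package below is only an example, check what your system actually offers first):

ubuntu-drivers list                  # show driver packages available for this GPU
sudo apt install nvidia-driver-535   # example pick: new enough for CUDA 12.x-built binaries
sudo reboot                          # the new kernel module loads after a reboot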
@alabulei1 Please check if this issue still needs to be fixed.
Since WasmEdge now supports CUDA 12.1, the problem is solved. Thanks!