bug: Can't run llama2-7b with WasmEdge on RTX 4080
alabulei1 opened this issue
Summary
I can't run llama2-7b on RTX 4080.
Current State
The following error messages are returned:
CUDA error 222 at /mnt/build_nv20_cuda_120_b1656/_deps/llama-src/ggml-cuda.cu:7788: the provided PTX was compiled with an unsupported toolchain.
current device: 0
GGML_ASSERT: /mnt/build_nv20_cuda_120_b1656/_deps/llama-src/ggml-cuda.cu:7788: !"CUDA error"
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
/dev/fd/63: line 379: 304930 Aborted (core dumped) wasmedge --dir .:. --nn-preload default:GGML:AUTO:$model_file llama-chat.wasm --stream-stdout --prompt-template $prompt_template $log_stat
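For reference, CUDA error 222 is cudaErrorUnsupportedPtxVersion: the PTX embedded in the binary was compiled with a CUDA toolkit newer than what the installed driver can run. A quick way to compare the two sides (a sketch, assuming a driver recent enough to support the query flags):

nvidia-smi | head -n 4                                          # header shows the highest CUDA version the driver supports
nvidia-smi --query-gpu=driver_version --format=csv,noheader    # driver version alone, machine-readable
nvcc --version | grep release                                   # local toolkit version (only relevant for local builds)

If the plugin was built with a toolkit newer than the driver's supported CUDA version, error 222 is the expected failure mode.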
Expected State
I can interact with llama2-7b.
Reproduction steps
- Run the following command:
bash <(curl -sSfL 'https://code.flows.network/webhook/iwYN1SdN3AmPgR5ao5Gt/run-llm.sh')
- Choose 1) install WasmEdge, then 1) Llama2-7b-chat, then 1) run with CLI, and finally 1) Yes
- Ask a question such as "Where is Paris?". The following error messages are then returned:
[You]:
Where is Paris?
---------------- [LOG: STATISTICS] -----------------
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 2048.00 MB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 2307.22 MiB
llama_new_context_with_model: VRAM scratch buffer: 2304.03 MiB
llama_new_context_with_model: total VRAM used: 8826.96 MiB (model: 4474.93 MiB, context: 4352.03 MiB)
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 2048.00 MB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 2307.22 MiB
llama_new_context_with_model: VRAM scratch buffer: 2304.03 MiB
llama_new_context_with_model: total VRAM used: 8826.96 MiB (model: 4474.93 MiB, context: 4352.03 MiB)
CUDA error 222 at /mnt/build_nv20_cuda_120_b1656/_deps/llama-src/ggml-cuda.cu:7788: the provided PTX was compiled with an unsupported toolchain.
current device: 0
GGML_ASSERT: /mnt/build_nv20_cuda_120_b1656/_deps/llama-src/ggml-cuda.cu:7788: !"CUDA error"
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
/dev/fd/63: line 379: 304930 Aborted (core dumped) wasmedge --dir .:. --nn-preload default:GGML:AUTO:$model_file llama-chat.wasm --stream-stdout --prompt-template $prompt_template $log_stat
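Note that the ptrace/gdb lines above ("Could not attach to process", "No stack.") are secondary: after the assertion fires, llama.cpp tries to attach a debugger to print a backtrace, and Ubuntu's Yama ptrace hardening blocks that. If a backtrace is wanted on the next crash, a minimal sketch (assuming stock Ubuntu defaults; this does not affect the CUDA error itself):

cat /proc/sys/kernel/yama/ptrace_scope        # 1 = restricted, the Ubuntu default
sudo sysctl -w kernel.yama.ptrace_scope=0     # relax until the next reboot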
Screenshots
No response
Any logs you want to share for showing the specific issue
(base) ai4080@Ai4080-System:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
(base) ai4080@Ai4080-System:~$ nvidia-smi
Fri Dec 29 19:40:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4080         On | 00000000:01:00.0 Off |                  N/A |
|  0%   39C    P8                8W / 320W|   3825MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1323      G   /usr/lib/xorg/Xorg                          155MiB |
|    0   N/A  N/A      1591      G   /usr/bin/gnome-shell                         30MiB |
|    0   N/A  N/A      3890      G   ...irefox/3068/usr/lib/firefox/firefox      161MiB |
|    0   N/A  N/A     12777      C   python3                                    3470MiB |
+---------------------------------------------------------------------------------------+
(base) ai4080@Ai4080-System:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy
Components
CLI
WasmEdge Version or Commit you used
0.13.5
Operating system information
Ubuntu 22.04.2 LTS
Hardware Architecture
x86_64
Compiler flags and options
No response
Related discussion: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/discussions/23
Looks like your NVIDIA driver is too old for the CUDA toolchain the plugin was built with. I guess the only solution is to upgrade the NVIDIA driver and nvcc.
From the official developer forums:
It indicates a mismatch between driver and compilation toolchain. Not having any other details (GPU in use, CUDA version in use from nvcc --version, GPU driver version) from you, I can't be specific. My recommendation (based on the 11.2 in your cub path) would be to update your driver to the latest one available for your GPU. If your driver is less than 460.39, I would update your driver.
Sounds like updating the driver is the only solution to fix it.
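If it helps, one possible upgrade path on Ubuntu 22.04 (a sketch; the driver package below is only an example, check what your system actually offers first):

ubuntu-drivers list                  # show driver packages available for this GPU
sudo apt install nvidia-driver-535   # example pick: new enough for CUDA 12.x-built binaries
sudo reboot                          # the new kernel module loads after a reboot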
@alabulei1 Please check if this issue still needs to be fixed.
Since WasmEdge now supports CUDA 12.1, the problem is solved. Thanks!