WasmEdge / WasmEdge

WasmEdge is a lightweight, high-performance, and extensible WebAssembly runtime for cloud native, edge, and decentralized applications. It powers serverless apps, embedded functions, microservices, smart contracts, and IoT devices.

Home Page: https://WasmEdge.org


WASI-NN with GPU on Jetson Orin Nano

hetvishastri opened this issue

Summary

Hi,
I am trying to run the LLM inference example (https://wasmedge.org/docs/develop/rust/wasinn/llm_inference) on a Jetson Orin Nano using the GPU.

I tried to build the WASI-NN plugin from source to make it compatible with the Jetson Orin Nano by setting CUDAARCHS to 87:

cd <path/to/your/wasmedge/source/folder>

# Building the CUDA-related files produces some warnings.
# Stop treating warnings as errors to avoid build failures.
export CXXFLAGS="-Wno-error"
# Please make sure you set up the correct CUDAARCHS.
# 87 is for NVIDIA Jetson Orin Nano
export CUDAARCHS=87

# BLAS cannot work with CUBLAS
cmake -GNinja -Bbuild -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
  -DWASMEDGE_PLUGIN_WASI_NN_BACKEND="GGML" \
  -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_BLAS=OFF \
  -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_CUBLAS=ON \
  .

cmake --build build

# For the WASI-NN plugin, you should install this project.
cmake --install build
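
To check that the freshly built plugin is the one WasmEdge loads, here is a minimal sketch (the install path is CMake's default prefix and the build-tree layout is an assumption; both may differ on your system):

# Verify the plugin library landed in the default install location.
ls /usr/local/lib/wasmedge/libwasmedgePluginWasiNN.so

# Or point WasmEdge at the build tree to test without installing.
export WASMEDGE_PLUGIN_PATH=$PWD/build/plugins/wasi_nn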

When I run the command for the LLM chat inference, I do not get any error, but I lose the connection to my board. I see the following output, and after this point the board becomes unreachable.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm
[INFO] Model alias: default
[INFO] Prompt context size: 512
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 512
[INFO] Temperature for sampling: 0.8
[INFO] Top-p sampling (1.0 = disabled): 0.9
[INFO] Penalize repeat sequence of tokens: 1.1
[INFO] presence penalty (0.0 = disabled): 0
[INFO] frequency penalty (0.0 = disabled): 0
[INFO] Use default system prompt
[INFO] Prompt template: Llama2Chat
[INFO] Log prompts: false
[INFO] Log statistics: false
[INFO] Log all information: false
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
[INFO] Plugin version: b2230 (commit 89febfed)

================================== Running in interactive mode. ===================================

    - Press [Ctrl+C] to interject at any time.
    - Press [Return] to end the input.
    - For multi-line inputs, end each line with '\' and press [Return] to get another line.


[You]: 
I have two apples, each costing 5 dollars. What is the total cost of these apple?

[Bot]:

Is it possible to use the GPU on the Jetson Orin Nano board when using WasmEdge with WASI-NN?

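A dropped connection on a Jetson board often points to the system exhausting memory while the model loads, since the CPU and GPU share the same RAM (the maintainer's reply below suggests the same cause). As a sketch using the stock Jetson tooling, you can watch memory from a second terminal while reproducing the issue:

# Print memory and GPU utilization once per second (interval in ms).
sudo tegrastats --interval 1000

# Or watch free memory shrink as the model loads.
watch -n 1 free -h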

Llama2-7b requires at least 8GB of RAM. For a smaller device, I would recommend TinyLlama or Gemma-2b.

https://x.com/realwasmedge/status/1725538013780890100

https://x.com/realwasmedge/status/1760758628347195601
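
As a concrete sketch of the suggestion above (the model URL, file name, and the --prompt-template flag are assumptions; check the model card and your version of llama-chat.wasm):

# Download a TinyLlama chat model in GGUF format.
curl -LO https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf

# Run it with the chatml prompt template instead of Llama2Chat.
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf \
  llama-chat.wasm --prompt-template chatml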

Choose a model that fits within your VRAM instead, or set ngl (the number of GPU layers) to a small value to reduce VRAM usage, as in the sketch below.
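
For example (a sketch; the --n-gpu-layers flag name should be verified against your build of llama-chat.wasm):

# Offload only 20 layers to the GPU; the rest run on the CPU,
# trading speed for lower VRAM usage.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  llama-chat.wasm --n-gpu-layers 20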

OK, thank you so much for your help. I had a general question. I tried two methods:
1) Used the pre-built WasmEdge and WASI-NN plugin binaries.
2) Built from source with compute capability 87 for the Jetson Orin Nano.

I found that both methods worked for TinyLlama. According to my understanding and the documentation, method 1) targets the Jetson Orin AGX, which has a different compute capability (CUDAARCHS) from the Jetson Orin Nano, so it should not have worked on the Orin Nano. Is my understanding correct?

The Jetson Orin AGX build uses CUDAARCHS=72. Since 87 is newer than 72, I think it's totally fine to use the pre-built one.
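
This works because a CUDA library built for an older architecture typically also embeds PTX, which the driver can JIT-compile for a newer GPU such as sm_87. To confirm what a pre-built plugin actually carries, here is a sketch using CUDA's cuobjdump (the library path is an assumption; adjust it to where your plugin lives):

# List the embedded cubins (SASS) and PTX entries; PTX for an older
# compute capability can be JIT-compiled by the driver for sm_87.
/usr/local/cuda/bin/cuobjdump --list-elf /usr/local/lib/wasmedge/libwasmedgePluginWasiNN.so
/usr/local/cuda/bin/cuobjdump --list-ptx /usr/local/lib/wasmedge/libwasmedgePluginWasiNN.so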

OK, got it. Thank you for your reply.

It seems the issue is resolved. Closing it.