WasmEdge / WasmEdge

WasmEdge is a lightweight, high-performance, and extensible WebAssembly runtime for cloud native, edge, and decentralized applications. It powers serverless apps, embedded functions, microservices, smart contracts, and IoT devices.

Home Page: https://WasmEdge.org


WASI-NN with GPU on Jetson Orin Nano

hetvishastri opened this issue

Summary

Hi,
I am trying to run the LLM inference example (https://wasmedge.org/docs/develop/rust/wasinn/llm_inference) on a Jetson Orin Nano using the GPU.

I tried to build the WASI-NN plugin from source to make it compatible with the Jetson Orin Nano by setting CUDAARCHS to 87:

cd <path/to/your/wasmedge/source/folder>

# Building the CUDA-related files produces some warnings.
# Stop treating warnings as errors to avoid build failures.
export CXXFLAGS="-Wno-error"
# Please make sure you set up the correct CUDAARCHS.
# 87 is for NVIDIA Jetson Orin Nano
export CUDAARCHS=87

# BLAS cannot work with CUBLAS
cmake -GNinja -Bbuild -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
  -DWASMEDGE_PLUGIN_WASI_NN_BACKEND="GGML" \
  -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_BLAS=OFF \
  -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_CUBLAS=ON \
  .

cmake --build build

# For the WASI-NN plugin, you should install this project.
cmake --install build
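
To check that the freshly built plugin is the one WasmEdge loads, here is a minimal sketch (the install path is CMake's default prefix and the build-tree layout is an assumption; both may differ on your system):

# Verify the plugin library landed in the default install location.
ls /usr/local/lib/wasmedge/libwasmedgePluginWasiNN.so

# Or point WasmEdge at the build tree to test without installing.
export WASMEDGE_PLUGIN_PATH=$PWD/build/plugins/wasi_nn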

When I run the command for the LLM chat inference, I do not get any error, but I lose the connection to my board. I see the following output, and after this point the board becomes unreachable.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf llama-chat.wasm
[INFO] Model alias: default
[INFO] Prompt context size: 512
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 512
[INFO] Temperature for sampling: 0.8
[INFO] Top-p sampling (1.0 = disabled): 0.9
[INFO] Penalize repeat sequence of tokens: 1.1
[INFO] presence penalty (0.0 = disabled): 0
[INFO] frequency penalty (0.0 = disabled): 0
[INFO] Use default system prompt
[INFO] Prompt template: Llama2Chat
[INFO] Log prompts: false
[INFO] Log statistics: false
[INFO] Log all information: false
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
[INFO] Plugin version: b2230 (commit 89febfed)

================================== Running in interactive mode. ===================================

    - Press [Ctrl+C] to interject at any time.
    - Press [Return] to end the input.
    - For multi-line inputs, end each line with '\' and press [Return] to get another line.


[You]: 
I have two apples, each costing 5 dollars. What is the total cost of these apple?

[Bot]:

Is it possible to use the GPU on the Jetson Orin Nano board when using WasmEdge with WASI-NN?

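A dropped connection on a Jetson board often points to the system exhausting memory while the model loads, since the CPU and GPU share the same RAM (the maintainer's reply below suggests the same cause). As a sketch using the stock Jetson tooling, you can watch memory from a second terminal while reproducing the issue:

# Print memory and GPU utilization once per second (interval in ms).
sudo tegrastats --interval 1000

# Or watch free memory shrink as the model loads.
watch -n 1 free -h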

Llama2-7b requires at least 8GB of RAM. For a smaller device, I would recommend TinyLlama or Gemma-2b.

https://x.com/realwasmedge/status/1725538013780890100

https://x.com/realwasmedge/status/1760758628347195601
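
As a concrete sketch of the suggestion above (the model URL, file name, and the --prompt-template flag are assumptions; check the model card and your version of llama-chat.wasm):

# Download a TinyLlama chat model in GGUF format.
curl -LO https://huggingface.co/second-state/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf

# Run it with the chatml prompt template instead of Llama2Chat.
wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf \
  llama-chat.wasm --prompt-template chatml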

Choose a model that fits within your VRAM instead, or set ngl (the number of GPU layers) to a small value to reduce VRAM usage, as in the sketch below.
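
For example (a sketch; the --n-gpu-layers flag name should be verified against your build of llama-chat.wasm):

# Offload only 20 layers to the GPU; the rest run on the CPU,
# trading speed for lower VRAM usage.
wasmedge --dir .:. --nn-preload default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf \
  llama-chat.wasm --n-gpu-layers 20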

OK, thank you so much for your help. I had a general question. I tried two methods:
1) Used the pre-built WasmEdge and WASI-NN plugin binaries.
2) Built from source with compute capability 87 for the Jetson Orin Nano.

I found that both methods worked for TinyLlama. According to my understanding and the documentation, method 1) targets the Jetson Orin AGX, which has a different compute capability (CUDAARCHS) from the Jetson Orin Nano, so it should not have worked on the Orin Nano. Is my understanding correct?

The Jetson Orin AGX build uses CUDAARCHS=72. Since 87 is newer than 72, I think it's totally fine to use the pre-built one.
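
This works because a CUDA library built for an older architecture typically also embeds PTX, which the driver can JIT-compile for a newer GPU such as sm_87. To confirm what a pre-built plugin actually carries, here is a sketch using CUDA's cuobjdump (the library path is an assumption; adjust it to where your plugin lives):

# List the embedded cubins (SASS) and PTX entries; PTX for an older
# compute capability can be JIT-compiled by the driver for sm_87.
/usr/local/cuda/bin/cuobjdump --list-elf /usr/local/lib/wasmedge/libwasmedgePluginWasiNN.so
/usr/local/cuda/bin/cuobjdump --list-ptx /usr/local/lib/wasmedge/libwasmedgePluginWasiNN.so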

OK, got it. Thank you for your reply.

It seems the issue is resolved. Closing it.