WasmEdge / WasmEdge

WasmEdge is a lightweight, high-performance, and extensible WebAssembly runtime for cloud native, edge, and decentralized applications. It powers serverless apps, embedded functions, microservices, smart contracts, and IoT devices.

Home Page: https://WasmEdge.org

feat: Crun + GGML plugin tracking issue

CaptainVincent opened this issue · comments

Summary

A solution that uses the WasmEdge runtime's GGML plugin inside a container.

Details

Support a WebAssembly application based on the wasi_nn proposal, where the application uses the GGML plugin.
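For context, this is a rough sketch of what such an application looks like on the guest side, assuming the wasmedge-wasi-nn Rust crate; the buffer size and prompt here are illustrative, and exact signatures should be checked against the crate documentation:

// Sketch only: load the model preloaded via WASMEDGE_WASINN_PRELOAD
// (alias "default"), run one prompt, and print the completion.
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to load the preloaded GGUF model");
    let mut ctx = graph.init_execution_context().expect("failed to create context");

    let prompt = "Who is Robert Oppenheimer?";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
        .expect("failed to set the prompt tensor");
    ctx.compute().expect("inference failed");

    let mut out = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut out).expect("failed to read the output tensor");
    println!("{}", String::from_utf8_lossy(&out[..n]));
}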

Steps:

  • Install experimental crun and containerd Ref
  • Install WasmEdge's plugin through the installer and collect all dependencies Ref
  • Download model Ref
  • Pull demo image
    sudo ctr image pull ghcr.io/captainvincent/runwasi-demo:llama-api-server
  • Run ggml demo (llama api server) with containerd
sudo ctr run --rm --net-host --runc-binary crun --runtime io.containerd.runc.v2 \
    --mount type=bind,src=$HOME/.wasmedge/plugin/,dst=/opt/containerd/lib,options=bind:ro \
    --mount type=bind,src=$PWD,dst=/resource,options=bind:ro \
    --env WASMEDGE_PLUGIN_PATH=/opt/containerd/lib \
    --env WASMEDGE_WASINN_PRELOAD=default:GGML:CPU:/resource/llama-2-7b-chat.Q5_K_M.gguf \
    --label module.wasm.image/variant=compat-smart ghcr.io/captainvincent/runwasi-demo:llama-api-server ggml \
    /app.wasm -p llama-2-chat
  • Test api server
curl -X POST http://localhost:8080/v1/chat/completions \
    -H 'accept:application/json' \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "Who is Robert Oppenheimer?"}], "model":"llama-2-chat"}'

Output

{"id":"f1e08777-cff2-4f26-97c7-cfdaff4e76f5","object":"chat.completion","created":1707981676,"model":"llama-2-chat","choices":[{"index":0,"message":{"role":"assistant","content":"Ah, an excellent question! Robert Oppenheimer (1904-1967) was an American theoretical physicist and science administrator who is best known for his leading role in the development of the atomic bomb during World War II. He is often referred to as the \"Father of the Atomic Bomb.\"\nOppenheimer was born in New York City and grew up in a family that valued education and intellectual pursuits. He showed a keen interest in science and mathematics from an early age and went on to study physics at Harvard University, where he earned his bachelor's degree in 1925. He then went on to pursue graduate studies at Cambridge University and the University of Göttingen, where he worked under the supervision of some of the most renowned physicists of the time, including Max Planck and Niels Bohr.\nIn 1932, Oppenheimer joined the faculty of the University of California, Berkeley, where he established a research group that became known as the \"Berkeley Group.\" This group, which included scientists such as Ernest Lawrence and Edward Teller, made significant contributions to the development of quantum mechanics and the theory of nuclear reactions.\nDuring World War II, Oppenheimer played a crucial role in the Manhattan Project, a secret research and development project led by the United States government to develop an atomic bomb. He was appointed as the director of the Los Alamos National Laboratory in New Mexico, where he oversaw the development of the first nuclear weapon ever detonated, known as \"Trinity.\"\nAfter the war, Oppenheimer became a prominent figure in the American scientific community and served as the director of the Institute for Advanced Study in Princeton, New Jersey. He was also a vocal advocate for international cooperation in science and a critic of nuclear weapons proliferation.\nDespite his groundbreaking contributions to physics, Oppenheimer's legacy is complex and controversial. He was known to be a brilliant thinker and a gifted teacher, but he was also criticized for his elitist views and his involvement in the development of nuclear weapons. Nevertheless, his work continues to inspire scientists and engineers around the world, and he remains one of the most important figures in the history of modern physics."},"finish_reason":"stop"}],"usage":{"prompt_tokens":13,"completion_tokens":360,"total_tokens":373}}% 

Current Status:

  • GPU (CUDA) support: if we execute the above command on a machine with an NVIDIA graphics card, the error below is encountered, and execution falls back to CPU mode.
WARNING: failed to allocate 2048.00 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 128.05 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 0.00 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 2464.00 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 2048.00 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 128.05 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 0.00 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 2464.00 MB of pinned memory: no CUDA-capable device is detected

Appendix

No response

Does this help? It seems that you can access CUDA devices from Podman + crun.

NVIDIA/nvidia-container-toolkit#46
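For reference, one common way to expose a GPU to Podman + crun today is the NVIDIA Container Toolkit's CDI support; a sketch (commands taken from the toolkit's documentation, not verified in this setup):

# Generate a CDI spec describing the installed NVIDIA devices/driver
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Podman (with crun) can then inject the GPU into a container via CDI
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi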

I am not sure ~
IIUC, this involves taking a different approach with the NVIDIA runtime.
Running an application with the NVIDIA runtime doesn't actually coexist with the other runtime (crun + wasmedge handler).

The way I understand this is that there is an "Nvidia runtime" that acts as a runc / crun replacement.

But there is also an NVIDIA library that works with CRI-O / containers + runc (i.e. a prestart hook, etc.) to accomplish the same. I guess we could explore this approach for crun as well?
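For reference, that hook-based approach drops a JSON file into an OCI hooks directory (e.g. /usr/share/containers/oci/hooks.d/) that CRI-O and Podman pick up; a sketch of what the NVIDIA prestart hook file typically looks like (paths and fields may differ by toolkit version):

{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/bin/nvidia-container-toolkit",
    "args": ["nvidia-container-toolkit", "prestart"]
  },
  "when": {
    "always": true,
    "commands": [".*"]
  },
  "stages": ["prestart"]
}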

GPU (CUDA) support: if we execute the above command on a machine with an NVIDIA graphics card, the error below is encountered, and execution falls back to CPU mode.

I'm interested in running these models in CPU mode. Can I execute them on my local machine? What adjustments must I make to ensure it works with the CRI-O + crun (wasm) combination?

@CaptainVincent Could you provide more information on the above question?

Certainly, it can be run on a local machine. Later, I will add a CI workflow using crun to run the GGML demo. Once I've completed it, I will provide the link here.

I added a pure crun + llama example here.

The steps align with the description in this issue. It's worth noting that this experimental feature is available only in our forked version of crun. The modifications we made differ from the official version only in the wasmedge handler. If you wish to try a newer version of crun, migrating the changes is easily achievable. The decision to maintain this specific version is partly influenced by compatibility constraints with k8s.

Here is my test WASM image source. You can build your own image containing your WASM application, too. There are various tools for building Docker images, and the most straightforward approach is to refer to our example, write a Dockerfile to compile your application, and copy it into the image.
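As an illustration of that Dockerfile approach, a hypothetical multi-stage build that compiles a Rust application to wasm32-wasi and copies only the .wasm into a scratch image could look like this (crate and file names are placeholders):

# Build stage: compile the application to a WASI target
FROM rust:1.75 AS build
RUN rustup target add wasm32-wasi
WORKDIR /src
COPY . .
RUN cargo build --release --target wasm32-wasi

# Final stage: ship only the .wasm in an empty image
FROM scratch
COPY --from=build /src/target/wasm32-wasi/release/app.wasm /app.wasm
ENTRYPOINT [ "/app.wasm" ]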

If you have any questions, feel free to bring them up.

@CaptainVincent Thanks for the information. I was wondering what needs to be changed to follow a similar workflow for CRI-O.

If you are still using our fork of crun as the runtime for cri-o, you should pay attention to the following modifications.

  1. You need to bind the plugin folder to a specific path so that the cri-o service can look up the location of the dynamic libraries (following the instructions in the CI action; at this point, the folder will contain the plugin library and its dependencies). Alternatively, you can modify the LD_LIBRARY_PATH of the cri-o service (see the sketch after this list).
    The path below is for containerd only:
    --mount type=bind,src=$HOME/.wasmedge/plugin/,dst=/opt/containerd/lib,options=bind:ro
  2. Additionally, update the following environment variable so that WasmEdge knows where to load the plugins from:
    --env WASMEDGE_PLUGIN_PATH=/opt/containerd/lib
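For the LD_LIBRARY_PATH alternative mentioned in step 1, a sketch of a systemd drop-in for the cri-o service (the drop-in path and library directory here are assumptions, adjust to where the plugin dependencies actually live):

# /etc/systemd/system/crio.service.d/10-wasmedge-plugin.conf
[Service]
Environment="LD_LIBRARY_PATH=/opt/containerd/lib"

# Apply the drop-in
sudo systemctl daemon-reload
sudo systemctl restart crio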

@CaptainVincent Thanks for your pointers. I have made a few changes to crun to load the wasmedge plugin to /usr/lib/wasmedge. Here's the PR for it. This works fine with crun-wasm. I'm using this config.json to create a container.

Config.json
$ cat config.json
{
  "ociVersion": "1.0.2-dev",
  "process": {
  	"terminal": true,
  	"user": {
  		"uid": 0,
  		"gid": 0
  	},
  	"args": [
  		"/llama-chat.wasm"
  	],
  	"env": [
  		"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
  		"WASMEDGE_PLUGIN_PATH=/usr/lib/wasmedge",
  		"WASMEDGE_WASINN_PRELOAD=default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf",
  		"TERM=xterm"
  	],
  	"cwd": "/",
  	"capabilities": {
  		"bounding": [
  			"CAP_AUDIT_WRITE",
  			"CAP_KILL",
  			"CAP_NET_BIND_SERVICE"
  		],
  		"effective": [
  			"CAP_AUDIT_WRITE",
  			"CAP_KILL",
  			"CAP_NET_BIND_SERVICE"
  		],
  		"permitted": [
  			"CAP_AUDIT_WRITE",
  			"CAP_KILL",
  			"CAP_NET_BIND_SERVICE"
  		],
  		"ambient": [
  			"CAP_AUDIT_WRITE",
  			"CAP_KILL",
  			"CAP_NET_BIND_SERVICE"
  		]
  	},
  	"rlimits": [
  		{
  			"type": "RLIMIT_NOFILE",
  			"hard": 1024,
  			"soft": 1024
  		}
  	],
  	"noNewPrivileges": true
  },
  "root": {
  	"path": "rootfs",
  	"readonly": false
  },
  "hostname": "wasm-test",
  "mounts": [
  	{
  		"destination": "/proc",
  		"type": "proc",
  		"source": "proc"
  	},
  	{
  		"destination": "/dev",
  		"type": "tmpfs",
  		"source": "tmpfs",
  		"options": [
  			"nosuid",
  			"strictatime",
  			"mode=755",
  			"size=65536k"
  		]
  	},
  	{
  		"destination": "/dev/pts",
  		"type": "devpts",
  		"source": "devpts",
  		"options": [
  			"nosuid",
  			"noexec",
  			"newinstance",
  			"ptmxmode=0666",
  			"mode=0620",
  			"gid=5"
  		]
  	},
  	{
  		"destination": "/dev/shm",
  		"type": "tmpfs",
  		"source": "shm",
  		"options": [
  			"nosuid",
  			"noexec",
  			"nodev",
  			"mode=1777",
  			"size=65536k"
  		]
  	},
  	{
  		"destination": "/dev/mqueue",
  		"type": "mqueue",
  		"source": "mqueue",
  		"options": [
  			"nosuid",
  			"noexec",
  			"nodev"
  		]
  	},
  	{
  		"destination": "/sys",
  		"type": "sysfs",
  		"source": "sysfs",
  		"options": [
  			"nosuid",
  			"noexec",
  			"nodev",
  			"ro"
  		]
  	},
  	{
  		"destination": "/sys/fs/cgroup",
  		"type": "cgroup",
  		"source": "cgroup",
  		"options": [
  			"nosuid",
  			"noexec",
  			"nodev",
  			"relatime",
  			"ro"
  		]
  	}
  ],
  "linux": {
  	"resources": {
  		"devices": [
  			{
  				"allow": false,
  				"access": "rwm"
  			}
  		]
  	},
  	"namespaces": [
  		{
  			"type": "pid"
  		},
  		{
  			"type": "network"
  		},
  		{
  			"type": "ipc"
  		},
  		{
  			"type": "uts"
  		},
  		{
  			"type": "mount"
  		},
  		{
  			"type": "cgroup"
  		}
  	],
  	"maskedPaths": [
  		"/proc/acpi",
  		"/proc/asound",
  		"/proc/kcore",
  		"/proc/keys",
  		"/proc/latency_stats",
  		"/proc/timer_list",
  		"/proc/timer_stats",
  		"/proc/sched_debug",
  		"/sys/firmware",
  		"/proc/scsi"
  	],
  	"readonlyPaths": [
  		"/proc/bus",
  		"/proc/fs",
  		"/proc/irq",
  		"/proc/sys",
  		"/proc/sysrq-trigger"
  	]
  }
}
$ ls -ltr rootfs/
total 4673668
-rw-r--r--. 1 skunkerk skunkerk 4783156800 Mar  8 13:54 llama-2-7b-chat-q5_k_m.gguf
-rwxr-xr-x. 1 skunkerk skunkerk    2675405 Mar  8 13:54 llama-chat.wasm
drwxr-xr-x. 1 skunkerk skunkerk          0 Mar  8 13:54 sys
drwxr-xr-x. 1 skunkerk skunkerk          0 Mar  8 13:54 proc
drwxr-xr-t. 1 skunkerk skunkerk          0 Mar  8 13:54 dev
drwxr-xr-t. 1 skunkerk skunkerk          6 Mar  8 13:54 usr

$ WASMEDGE_PLUGIN_MOUNT=/home/skunkerk/.wasmedge/plugin crun-wasm run wasm-test
RECEIVED WASMEDGE_PLUGIN_MOUNT value: /home/skunkerk/.wasmedge/plugin
[INFO] Model alias: default
[INFO] Prompt context size: 512
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 512
[INFO] Temperature for sampling: 0.8
[INFO] Top-p sampling (1.0 = disabled): 0.9
[INFO] Penalize repeat sequence of tokens: 1.1
[INFO] presence penalty (0.0 = disabled): 0
[INFO] frequency penalty (0.0 = disabled): 0
[INFO] Use default system prompt
[INFO] Prompt template: Llama2Chat
[INFO] Log prompts: false
[INFO] Log statistics: false
[INFO] Log all information: false
[INFO] Plugin version: b2230 (commit 89febfed)

================================== Running in interactive mode. ===================================

    - Press [Ctrl+C] to interject at any time.
    - Press [Return] to end the input.
    - For multi-line inputs, end each line with '\' and press [Return] to get another line.


[You]:

Now, I tried integrating it with CRI-O and Kubernetes. I made a few changes to CRI-O to pass the env variable WASMEDGE_PLUGIN_MOUNT via the crio config.
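For reference, one way to inject an environment variable into every container through stock CRI-O configuration (rather than a code change) is the default_env option; a sketch, with the drop-in path as an assumption:

# /etc/crio/crio.conf.d/10-wasmedge.conf
[crio.runtime]
default_env = [
    "WASMEDGE_PLUGIN_MOUNT=/home/skunkerk/.wasmedge/plugin",
]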

$ cat llm-wasm.yml
apiVersion: v1
kind: Pod
metadata:
  name: llama-pod
spec:
  containers:
  - name: llama-container
    image: quay.io/sohankunkerkar/llama-crun:v1
    command: ["./llama-chat.wasm"]
    securityContext:
      privileged: true
  restartPolicy: Always
  
$ cat Containerfile
# Use a base image
FROM scratch

ENV WASMEDGE_WASINN_PRELOAD=default:GGML:AUTO:/app/model.gguf
ENV WASMEDGE_PLUGIN_PATH=/usr/lib/wasmedge

WORKDIR /app

# Copy wasm files to the directory
COPY llama-chat.wasm /app

COPY model.gguf /app

$ k get po
NAME        READY   STATUS      RESTARTS   AGE
llama-pod   0/1     Completed   0          3h58m

$  k logs llama-pod
[INFO] Model alias: default
[INFO] Prompt context size: 512
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 512
[INFO] Temperature for sampling: 0.8
[INFO] Top-p sampling (1.0 = disabled): 0.9
[INFO] Penalize repeat sequence of tokens: 1.1
[INFO] presence penalty (0.0 = disabled): 0
[INFO] frequency penalty (0.0 = disabled): 0
[INFO] Use default system prompt
[INFO] Prompt template: Llama2Chat
[INFO] Log prompts: false
[INFO] Log statistics: false
[INFO] Log all information: false
Error: "Fail to load model into wasi-nn: Backend Error: WASI-NN Backend Error: Not Found"
RECEIVED WASMEDGE_PLUGIN_MOUNT value: /home/skunkerk/.wasmedge/plugin

I added the debug statement RECEIVED WASMEDGE_PLUGIN_MOUNT value: /home/skunkerk/.wasmedge/plugin to crun to investigate potential issues. Upon inspection, it appears to be populating the correct value.
The /app directory does have both files:

root:/var/lib/containers/storage/overlay/323126b6306097007bf68ea1957508ca275cf3a3ee361cd3860a1bc7243986a4/merged# ls
app  dev  etc  proc  run  sys  usr  var

root:/var/lib/containers/storage/overlay/323126b6306097007bf68ea1957508ca275cf3a3ee361cd3860a1bc7243986a4/merged# ls -ltr app
total 4673668
-rw-r--r--. 1 root root 4783156800 Feb 29 00:13 model.gguf
-rwxr-xr-x. 1 root root    2675405 Mar  2 09:47 llama-chat.wasm

I'm still intrigued by this failure -> Error: "Fail to load model into wasi-nn: Backend Error: WASI-NN Backend Error: Not Found".
Any idea what could be the problem here?

I'm not entirely sure, but I'll give it a try on my machine as well.

I have a question: is the order of environment variables the same in crun and cri-o? I noticed that the current implementation relies on the order of WASMEDGE_WASINN_PRELOAD and WASMEDGE_PLUGIN_PATH in the environment variables to determine the flow. However, WasmEdge has a strong limitation: nn_preload must precede plugin loading, because the plugin initialization process handles preloading the model. I'm concerned that the order here might cause nn_preload to occur later than plugin loading.
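To make the ordering concern concrete: if the handler simply walks the env array in order, the preload entry would have to appear before the plugin path, e.g. (sketch only, assuming array order is what the handler sees):

"env": [
    "WASMEDGE_WASINN_PRELOAD=default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf",
    "WASMEDGE_PLUGIN_PATH=/usr/lib/wasmedge"
]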

Closing this item because all related issues it tracked have been closed, so the goal is considered accomplished.