WasmEdge / WasmEdge

WasmEdge is a lightweight, high-performance, and extensible WebAssembly runtime for cloud native, edge, and decentralized applications. It powers serverless apps, embedded functions, microservices, smart contracts, and IoT devices.

Home Page: https://WasmEdge.org

feat: Crun + GGML plugin tracking issue

CaptainVincent opened this issue · comments

Summary

A solution that uses the WasmEdge runtime's GGML plugin inside a container.

Details

Support a WebAssembly application based on the wasi_nn proposal, where the application uses the GGML plugin.
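For context, this is a rough sketch of what such an application looks like on the guest side, assuming the wasmedge-wasi-nn Rust crate; the buffer size and prompt here are illustrative, and exact signatures should be checked against the crate documentation:

// Sketch only: load the model preloaded via WASMEDGE_WASINN_PRELOAD
// (alias "default"), run one prompt, and print the completion.
use wasmedge_wasi_nn::{ExecutionTarget, GraphBuilder, GraphEncoding, TensorType};

fn main() {
    let graph = GraphBuilder::new(GraphEncoding::Ggml, ExecutionTarget::AUTO)
        .build_from_cache("default")
        .expect("failed to load the preloaded GGUF model");
    let mut ctx = graph.init_execution_context().expect("failed to create context");

    let prompt = "Who is Robert Oppenheimer?";
    ctx.set_input(0, TensorType::U8, &[1], prompt.as_bytes())
        .expect("failed to set the prompt tensor");
    ctx.compute().expect("inference failed");

    let mut out = vec![0u8; 4096];
    let n = ctx.get_output(0, &mut out).expect("failed to read the output tensor");
    println!("{}", String::from_utf8_lossy(&out[..n]));
}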

Steps:

  • Install experimental crun and containerd Ref
  • Install WasmEdge's plugin through the installer and collect all dependencies Ref
  • Download model Ref
  • Pull demo image
    sudo ctr image pull ghcr.io/captainvincent/runwasi-demo:llama-api-server
  • Run ggml demo (llama api server) with containerd
sudo ctr run --rm --net-host --runc-binary crun --runtime io.containerd.runc.v2 \
    --mount type=bind,src=$HOME/.wasmedge/plugin/,dst=/opt/containerd/lib,options=bind:ro \
    --mount type=bind,src=$PWD,dst=/resource,options=bind:ro \
    --env WASMEDGE_PLUGIN_PATH=/opt/containerd/lib \
    --env WASMEDGE_WASINN_PRELOAD=default:GGML:CPU:/resource/llama-2-7b-chat.Q5_K_M.gguf \
    --label module.wasm.image/variant=compat-smart ghcr.io/captainvincent/runwasi-demo:llama-api-server ggml \
    /app.wasm -p llama-2-chat
  • Test api server
curl -X POST http://localhost:8080/v1/chat/completions \
    -H 'accept:application/json' \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"system", "content": "You are a helpful assistant."}, {"role":"user", "content": "Who is Robert Oppenheimer?"}], "model":"llama-2-chat"}'

Output

{"id":"f1e08777-cff2-4f26-97c7-cfdaff4e76f5","object":"chat.completion","created":1707981676,"model":"llama-2-chat","choices":[{"index":0,"message":{"role":"assistant","content":"Ah, an excellent question! Robert Oppenheimer (1904-1967) was an American theoretical physicist and science administrator who is best known for his leading role in the development of the atomic bomb during World War II. He is often referred to as the \"Father of the Atomic Bomb.\"\nOppenheimer was born in New York City and grew up in a family that valued education and intellectual pursuits. He showed a keen interest in science and mathematics from an early age and went on to study physics at Harvard University, where he earned his bachelor's degree in 1925. He then went on to pursue graduate studies at Cambridge University and the University of Göttingen, where he worked under the supervision of some of the most renowned physicists of the time, including Max Planck and Niels Bohr.\nIn 1932, Oppenheimer joined the faculty of the University of California, Berkeley, where he established a research group that became known as the \"Berkeley Group.\" This group, which included scientists such as Ernest Lawrence and Edward Teller, made significant contributions to the development of quantum mechanics and the theory of nuclear reactions.\nDuring World War II, Oppenheimer played a crucial role in the Manhattan Project, a secret research and development project led by the United States government to develop an atomic bomb. He was appointed as the director of the Los Alamos National Laboratory in New Mexico, where he oversaw the development of the first nuclear weapon ever detonated, known as \"Trinity.\"\nAfter the war, Oppenheimer became a prominent figure in the American scientific community and served as the director of the Institute for Advanced Study in Princeton, New Jersey. He was also a vocal advocate for international cooperation in science and a critic of nuclear weapons proliferation.\nDespite his groundbreaking contributions to physics, Oppenheimer's legacy is complex and controversial. He was known to be a brilliant thinker and a gifted teacher, but he was also criticized for his elitist views and his involvement in the development of nuclear weapons. Nevertheless, his work continues to inspire scientists and engineers around the world, and he remains one of the most important figures in the history of modern physics."},"finish_reason":"stop"}],"usage":{"prompt_tokens":13,"completion_tokens":360,"total_tokens":373}}% 

Current Status:

  • GPU (CUDA) support: if we execute the above command on a machine with an NVIDIA graphics card, the error below is encountered, and execution falls back to CPU mode.
WARNING: failed to allocate 2048.00 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 128.05 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 0.00 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 2464.00 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 2048.00 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 128.05 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 0.00 MB of pinned memory: no CUDA-capable device is detected
WARNING: failed to allocate 2464.00 MB of pinned memory: no CUDA-capable device is detected

Appendix

No response

Does this help? It seems that you can access CUDA devices from Podman + crun.

NVIDIA/nvidia-container-toolkit#46
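For reference, one common way to expose a GPU to Podman + crun today is the NVIDIA Container Toolkit's CDI support; a sketch (commands taken from the toolkit's documentation, not verified in this setup):

# Generate a CDI spec describing the installed NVIDIA devices/driver
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Podman (with crun) can then inject the GPU into a container via CDI
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi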

I am not sure ~
IIUC, this involves taking a different approach with the NVIDIA runtime.
Running an application with the NVIDIA runtime doesn't actually coexist with the other runtime (crun + wasmedge handler).

The way I understand this is that there is an "Nvidia runtime" that acts as a runc / crun replacement.

But there is also an NVIDIA library that works with CRI-O / containers + runc (i.e. a prestart hook, etc.) to accomplish the same. I guess we could explore this approach for crun as well?
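For reference, that hook-based approach drops a JSON file into an OCI hooks directory (e.g. /usr/share/containers/oci/hooks.d/) that CRI-O and Podman pick up; a sketch of what the NVIDIA prestart hook file typically looks like (paths and fields may differ by toolkit version):

{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/bin/nvidia-container-toolkit",
    "args": ["nvidia-container-toolkit", "prestart"]
  },
  "when": {
    "always": true,
    "commands": [".*"]
  },
  "stages": ["prestart"]
}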

GPU (CUDA) support: if we execute the above command on a machine with an NVIDIA graphics card, the error below is encountered, and execution falls back to CPU mode.

I'm interested in running these models in CPU mode. Can I execute them on my local machine? What adjustments must I make to ensure it works with the CRI-O + crun (wasm) combination?

@CaptainVincent Could you provide more information on the above question?

Certainly, it can be run on a local machine. Later, I will add a CI workflow using crun to run the GGML demo. Once I've completed it, I will provide the link here.

I added a pure crun + llama example here.

The steps align with the description in this issue. It's worth noting that this experimental feature is available only in our forked version of crun. The modifications we made differ from the official version only in the wasmedge handler. If you wish to try a newer version of crun, migrating the changes is easily achievable. The decision to maintain this specific version is partly influenced by compatibility constraints with k8s.

Here is my test WASM image source. You can build your own image containing your WASM application, too. There are various tools for building Docker images, and the most straightforward approach is to refer to our example, write a Dockerfile to compile your application, and copy it into the image.
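As an illustration of that Dockerfile approach, a hypothetical multi-stage build that compiles a Rust application to wasm32-wasi and copies only the .wasm into a scratch image could look like this (crate and file names are placeholders):

# Build stage: compile the application to a WASI target
FROM rust:1.75 AS build
RUN rustup target add wasm32-wasi
WORKDIR /src
COPY . .
RUN cargo build --release --target wasm32-wasi

# Final stage: ship only the .wasm in an empty image
FROM scratch
COPY --from=build /src/target/wasm32-wasi/release/app.wasm /app.wasm
ENTRYPOINT [ "/app.wasm" ]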

If you have any questions, feel free to bring them up.

@CaptainVincent Thanks for the information. I was wondering what needs to be changed to follow a similar workflow for CRI-O.

If you are still using our fork of crun as the runtime for cri-o, you should pay attention to the following modifications.

  1. You need to bind the plugin folder to a specific path so that the cri-o service can look up the location of the dynamic libraries (following the instructions in the CI action; at this point, the folder will contain the plugin library and its dependencies). Alternatively, you can modify the LD_LIBRARY_PATH of the cri-o service (see the sketch after this list).
    The path below is for containerd only:
    --mount type=bind,src=$HOME/.wasmedge/plugin/,dst=/opt/containerd/lib,options=bind:ro
  2. Additionally, update the following environment variable so that WasmEdge knows where to load the plugins from:
    --env WASMEDGE_PLUGIN_PATH=/opt/containerd/lib
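For the LD_LIBRARY_PATH alternative mentioned in step 1, a sketch of a systemd drop-in for the cri-o service (the drop-in path and library directory here are assumptions, adjust to where the plugin dependencies actually live):

# /etc/systemd/system/crio.service.d/10-wasmedge-plugin.conf
[Service]
Environment="LD_LIBRARY_PATH=/opt/containerd/lib"

# Apply the drop-in
sudo systemctl daemon-reload
sudo systemctl restart crio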

@CaptainVincent Thanks for your pointers. I have made a few changes to crun to load the wasmedge plugin to /usr/lib/wasmedge. Here's the PR for it. This works fine with crun-wasm. I'm using this config.json to create a container.

Config.json
$ cat config.json
{
  "ociVersion": "1.0.2-dev",
  "process": {
  	"terminal": true,
  	"user": {
  		"uid": 0,
  		"gid": 0
  	},
  	"args": [
  		"/llama-chat.wasm"
  	],
  	"env": [
  		"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
  		"WASMEDGE_PLUGIN_PATH=/usr/lib/wasmedge",
  		"WASMEDGE_WASINN_PRELOAD=default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf",
  		"TERM=xterm"
  	],
  	"cwd": "/",
  	"capabilities": {
  		"bounding": [
  			"CAP_AUDIT_WRITE",
  			"CAP_KILL",
  			"CAP_NET_BIND_SERVICE"
  		],
  		"effective": [
  			"CAP_AUDIT_WRITE",
  			"CAP_KILL",
  			"CAP_NET_BIND_SERVICE"
  		],
  		"permitted": [
  			"CAP_AUDIT_WRITE",
  			"CAP_KILL",
  			"CAP_NET_BIND_SERVICE"
  		],
  		"ambient": [
  			"CAP_AUDIT_WRITE",
  			"CAP_KILL",
  			"CAP_NET_BIND_SERVICE"
  		]
  	},
  	"rlimits": [
  		{
  			"type": "RLIMIT_NOFILE",
  			"hard": 1024,
  			"soft": 1024
  		}
  	],
  	"noNewPrivileges": true
  },
  "root": {
  	"path": "rootfs",
  	"readonly": false
  },
  "hostname": "wasm-test",
  "mounts": [
  	{
  		"destination": "/proc",
  		"type": "proc",
  		"source": "proc"
  	},
  	{
  		"destination": "/dev",
  		"type": "tmpfs",
  		"source": "tmpfs",
  		"options": [
  			"nosuid",
  			"strictatime",
  			"mode=755",
  			"size=65536k"
  		]
  	},
  	{
  		"destination": "/dev/pts",
  		"type": "devpts",
  		"source": "devpts",
  		"options": [
  			"nosuid",
  			"noexec",
  			"newinstance",
  			"ptmxmode=0666",
  			"mode=0620",
  			"gid=5"
  		]
  	},
  	{
  		"destination": "/dev/shm",
  		"type": "tmpfs",
  		"source": "shm",
  		"options": [
  			"nosuid",
  			"noexec",
  			"nodev",
  			"mode=1777",
  			"size=65536k"
  		]
  	},
  	{
  		"destination": "/dev/mqueue",
  		"type": "mqueue",
  		"source": "mqueue",
  		"options": [
  			"nosuid",
  			"noexec",
  			"nodev"
  		]
  	},
  	{
  		"destination": "/sys",
  		"type": "sysfs",
  		"source": "sysfs",
  		"options": [
  			"nosuid",
  			"noexec",
  			"nodev",
  			"ro"
  		]
  	},
  	{
  		"destination": "/sys/fs/cgroup",
  		"type": "cgroup",
  		"source": "cgroup",
  		"options": [
  			"nosuid",
  			"noexec",
  			"nodev",
  			"relatime",
  			"ro"
  		]
  	}
  ],
  "linux": {
  	"resources": {
  		"devices": [
  			{
  				"allow": false,
  				"access": "rwm"
  			}
  		]
  	},
  	"namespaces": [
  		{
  			"type": "pid"
  		},
  		{
  			"type": "network"
  		},
  		{
  			"type": "ipc"
  		},
  		{
  			"type": "uts"
  		},
  		{
  			"type": "mount"
  		},
  		{
  			"type": "cgroup"
  		}
  	],
  	"maskedPaths": [
  		"/proc/acpi",
  		"/proc/asound",
  		"/proc/kcore",
  		"/proc/keys",
  		"/proc/latency_stats",
  		"/proc/timer_list",
  		"/proc/timer_stats",
  		"/proc/sched_debug",
  		"/sys/firmware",
  		"/proc/scsi"
  	],
  	"readonlyPaths": [
  		"/proc/bus",
  		"/proc/fs",
  		"/proc/irq",
  		"/proc/sys",
  		"/proc/sysrq-trigger"
  	]
  }
}
$ ls -ltr rootfs/
total 4673668
-rw-r--r--. 1 skunkerk skunkerk 4783156800 Mar  8 13:54 llama-2-7b-chat-q5_k_m.gguf
-rwxr-xr-x. 1 skunkerk skunkerk    2675405 Mar  8 13:54 llama-chat.wasm
drwxr-xr-x. 1 skunkerk skunkerk          0 Mar  8 13:54 sys
drwxr-xr-x. 1 skunkerk skunkerk          0 Mar  8 13:54 proc
drwxr-xr-t. 1 skunkerk skunkerk          0 Mar  8 13:54 dev
drwxr-xr-t. 1 skunkerk skunkerk          6 Mar  8 13:54 usr

$ WASMEDGE_PLUGIN_MOUNT=/home/skunkerk/.wasmedge/plugin crun-wasm run wasm-test
RECEIVED WASMEDGE_PLUGIN_MOUNT value: /home/skunkerk/.wasmedge/plugin
[INFO] Model alias: default
[INFO] Prompt context size: 512
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 512
[INFO] Temperature for sampling: 0.8
[INFO] Top-p sampling (1.0 = disabled): 0.9
[INFO] Penalize repeat sequence of tokens: 1.1
[INFO] presence penalty (0.0 = disabled): 0
[INFO] frequency penalty (0.0 = disabled): 0
[INFO] Use default system prompt
[INFO] Prompt template: Llama2Chat
[INFO] Log prompts: false
[INFO] Log statistics: false
[INFO] Log all information: false
[INFO] Plugin version: b2230 (commit 89febfed)

================================== Running in interactive mode. ===================================

    - Press [Ctrl+C] to interject at any time.
    - Press [Return] to end the input.
    - For multi-line inputs, end each line with '\' and press [Return] to get another line.


[You]:

Now, I tried integrating it with CRI-O and Kubernetes. I made a few changes to CRI-O to pass the env variable WASMEDGE_PLUGIN_MOUNT via the crio config.
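For reference, one way to inject an environment variable into every container through stock CRI-O configuration (rather than a code change) is the default_env option; a sketch, with the drop-in path as an assumption:

# /etc/crio/crio.conf.d/10-wasmedge.conf
[crio.runtime]
default_env = [
    "WASMEDGE_PLUGIN_MOUNT=/home/skunkerk/.wasmedge/plugin",
]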

$ cat llm-wasm.yml
apiVersion: v1
kind: Pod
metadata:
  name: llama-pod
spec:
  containers:
  - name: llama-container
    image: quay.io/sohankunkerkar/llama-crun:v1
    command: ["./llama-chat.wasm"]
    securityContext:
      privileged: true
  restartPolicy: Always
  
$ cat Containerfile
# Use a base image
FROM scratch

ENV WASMEDGE_WASINN_PRELOAD=default:GGML:AUTO:/app/model.gguf
ENV WASMEDGE_PLUGIN_PATH=/usr/lib/wasmedge

WORKDIR /app

# Copy wasm files to the directory
COPY llama-chat.wasm /app

COPY model.gguf /app

$ k get po
NAME        READY   STATUS      RESTARTS   AGE
llama-pod   0/1     Completed   0          3h58m

$  k logs llama-pod
[INFO] Model alias: default
[INFO] Prompt context size: 512
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 512
[INFO] Temperature for sampling: 0.8
[INFO] Top-p sampling (1.0 = disabled): 0.9
[INFO] Penalize repeat sequence of tokens: 1.1
[INFO] presence penalty (0.0 = disabled): 0
[INFO] frequency penalty (0.0 = disabled): 0
[INFO] Use default system prompt
[INFO] Prompt template: Llama2Chat
[INFO] Log prompts: false
[INFO] Log statistics: false
[INFO] Log all information: false
Error: "Fail to load model into wasi-nn: Backend Error: WASI-NN Backend Error: Not Found"
RECEIVED WASMEDGE_PLUGIN_MOUNT value: /home/skunkerk/.wasmedge/plugin

I added the debug statement RECEIVED WASMEDGE_PLUGIN_MOUNT value: /home/skunkerk/.wasmedge/plugin to crun to investigate potential issues. Upon inspection, it appears to be populating the correct value.
The /app directory does have both files:

root:/var/lib/containers/storage/overlay/323126b6306097007bf68ea1957508ca275cf3a3ee361cd3860a1bc7243986a4/merged# ls
app  dev  etc  proc  run  sys  usr  var

root:/var/lib/containers/storage/overlay/323126b6306097007bf68ea1957508ca275cf3a3ee361cd3860a1bc7243986a4/merged# ls -ltr app
total 4673668
-rw-r--r--. 1 root root 4783156800 Feb 29 00:13 model.gguf
-rwxr-xr-x. 1 root root    2675405 Mar  2 09:47 llama-chat.wasm

I'm still intrigued by this failure -> Error: "Fail to load model into wasi-nn: Backend Error: WASI-NN Backend Error: Not Found".
Any idea what could be the problem here?

I'm not entirely sure, but I'll give it a try on my machine as well.

I have a question: is the order of environment variables the same in crun and cri-o? I noticed that the current implementation relies on the order of WASMEDGE_WASINN_PRELOAD and WASMEDGE_PLUGIN_PATH in the environment variables to determine the flow. However, WasmEdge has a strong limitation: nn_preload must precede plugin loading, because the plugin initialization process handles preloading the model. I'm concerned that the order here might cause nn_preload to occur later than plugin loading.
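To make the ordering concern concrete: if the handler simply walks the env array in order, the preload entry would have to appear before the plugin path, e.g. (sketch only, assuming array order is what the handler sees):

"env": [
    "WASMEDGE_WASINN_PRELOAD=default:GGML:AUTO:llama-2-7b-chat-q5_k_m.gguf",
    "WASMEDGE_PLUGIN_PATH=/usr/lib/wasmedge"
]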

Closing this item because all related issues it tracked have been closed, so the goal is considered accomplished.