The Artificial Intelligence Controller Interface (AICI) lets you build Controllers that constrain and direct output of a Large Language Model (LLM) in real time. Controllers are light-weight WebAssembly (Wasm) modules which run on the same machine as the LLM inference engine, utilizing the CPU while the GPU is busy with token generation.
AICI is meant to run both locally and in the cloud, including (eventually) multi-tenant LLM deployments. It is designed to allow control libraries such as Guidance, LMQL, and others to run efficiently and portably across LLM inference and serving engines.
AICI is a prototype, designed and built at Microsoft Research.
Tip
We are looking for a research intern. You have to be accepted or currently enrolled in a PhD program or an equivalent research-oriented program in Computer Science or related STEM field.
AICI is:
- Secure: Controllers are sandboxed and cannot access the filesystem, network, or any other resources
- Fast: Wasm modules are compiled to native code and run in parallel with the LLM inference engine, inducing only a minimal overhead to the generation process
- Flexible: Controllers can be written in any language that can compile to Wasm (Rust, C, C++, ...), or be interpreted inside Wasm (Python, JavaScript, ...)
This repository contains a number of components, and which ones you need depends on your use case.
You can use an existing controller module. We provide PyCtrl and JsCtrl, which let you script controllers using server-side Python and JavaScript, respectively. The pyaici package contains the aici command-line tool that lets you upload and run scripts with any controller (we also provide a REST API definition for the curious).
We anticipate libraries will be built on top of controllers. We provide an example in promptlib - a client-side Python library that interacts with DeclCtrl via the pyaici package; see the example notebooks.
The controllers can be run in a cloud or local AICI-enabled LLM inference engine. You can run the provided reference engine (rLLM) locally with either the libtorch+CUDA or the llama.cpp backend.
To develop a new controller, use the Rust starter project, which shows how to use the aici_abi library; this library simplifies implementing the low-level AICI interface.
To add AICI support to a new LLM inference engine, you will need to implement the LLM side of the protocol that talks to the AICI runtime.
Finally, you may want to modify any of the provided components - PRs are most welcome!
To continue, follow one of the build setups below, then proceed to running the server and interacting with it.
All of the use cases above, except for running an existing controller on a remote server, require a working Rust compiler; compiling rllm-cuda also requires libtorch and CUDA.
- AICI Client-side has Rust and C/C++ compilers for developing controllers, rLLM on llama.cpp and aicirt
- AICI with CUDA has all of the above, plus CUDA and libtorch for rLLM on libtorch; this requires a CUDA-capable GPU (currently only compute capability 8.0 (A100) is supported)
- AICI with CUDA and vLLM (experimental) is for our outdated vLLM integration
If you're not familiar with devcontainers, install the Dev Containers VSCode extension and, from the command palette in VSCode, select Dev Containers: Reopen in Container.... This opens a list of available devcontainers; select the one you want to use.
This should be roughly equivalent to the AICI Client-side devcontainer. See also common.dockerfile.
- install required packages; you likely already have some or all of these, but the list should be exhaustive for a fresh Ubuntu 22.04 install in WSL
sudo apt-get install -y --no-install-recommends \
build-essential ca-certificates ccache \
cmake curl libjpeg-dev libpng-dev \
strace linux-tools-common linux-tools-generic \
llvm-dev libclang-dev clang apache2-utils git-lfs \
screen bsdmainutils pip python3-dev python-is-python3 \
nodejs npm pkg-config
pip install pytest pytest-forked ujson posix_ipc numpy requests
- install rustup and restart current shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- install rustup components:
rustup target add wasm32-wasi
rustup component add rustfmt
- if you already had rust installed, or are getting complaints from cargo about outdated version, run:
rustup update
rustup target add wasm32-wasi
Make sure you have the Xcode command line tools installed by running xcode-select -p; if they are not installed, run xcode-select --install.
Install required packages via brew:
brew install cmake git ccache
Install rustup as per the Linux instructions above.
Build rllm-cpp; it should auto-detect and use Metal acceleration on Apple Silicon.
Please use a devcontainer or WSL2, as per the Linux instructions above.
There is a tracking issue for native Windows support.
If you have CUDA, go to rllm-cuda/ and run ./server.sh orca. This will run the inference server with the Orca-2 13B model (which is expected by the test cases).
If you don't have CUDA, go to rllm-cpp/ and run ./cpp-server.sh phi2 (phi2 is small enough to run on a CPU). You can also pass a GGUF URL on HuggingFace.
Both of these commands first compile aicirt and the inference engine, and then run it. You can also try other models, see README.md files for rllm-cuda and rllm-cpp as well as the shell scripts themselves for details.
To start interacting with a cloud AICI server, first export the API key. If you are running a local server, leave AICI_API_BASE unset.
export AICI_API_BASE="https://inference.example.com/v1/#key=wht_..."
Now, query the model with or without an AICI Controller:
./aici.sh infer "The answer to the ultimate question of life"
./aici.sh run --build pyctrl pyctrl/samples/test.py
./aici.sh run --build jsctrl jsctrl/samples/hello.js
./aici.sh run --build aici_abi::yesno
Run ./aici.sh -h to see usage info.
If the server is running with the Orca-2 13B model, you can also run tests with pytest for the DeclCtrl, with ./scripts/test-pyctrl.sh for PyCtrl, or with ./scripts/test-jsctrl.sh for JsCtrl.
AICI abstracts the LLM inference engine from the controller and vice versa, as in the picture below. The rounded nodes are aspirational. Additional layers can be built on top - we provide promptlib, but we strongly believe that Guidance, LMQL, Outlines, jsonformer, LMFE, etc. can also run on top of AICI (either with custom controllers or utilizing PyCtrl or JsCtrl).
graph TD
PyCtrl -- AICI --> aicirt[AICI-runtime]
JsCtrl -- AICI --> aicirt
guidance([GuidanceCtrl]) -- AICI --> aicirt
lmql([LMQL Ctrl]) -- AICI --> aicirt
aicirt -- POSIX SHM --> rLLM
aicirt -- POSIX SHM --> llama[llama.cpp]
aicirt -- POSIX SHM --> pyaici
pyaici -- Python --> vLLM(vLLM)
pyaici -- Python --> hf(HF Transformers)
The pyaici package makes it easier to integrate AICI with Python-based LLM inference engines. The support for HuggingFace Transformers and the vLLM REST server is currently out of date. Please use rLLM-cuda or rLLM-llama-cpp for now.
- aicirt runs in a separate process, and can run under a different user than the LLM engine
- Wasm modules are sandboxed by Wasmtime
- Wasm modules only have access to aici_host_* functions, implemented in hostimpl.rs
- aicirt also exposes a partial WASI interface; however, almost all the functions are no-ops, except for fd_write, which shims file descriptors 1 and 2 (stdout and stderr) to print debug messages
- each Wasm module runs in a separate process, helping with Spectre/Meltdown mitigation and allowing limits on CPU usage
In particular, Wasm modules cannot access the filesystem, network, or any other resources. They also cannot spawn threads or access any timers (this is relevant for Spectre/Meltdown attacks).
Most of the computation in AICI Controllers occurs on the CPU, in parallel with the logit generation on the GPU. Generation occurs in steps, where logits are generated in parallel for a new token for each sequence in a batch (typically between 1 and 50 sequences). This involves reading the whole model and the KV caches for the sequences in the batch from GPU memory. For optimal batch throughput, the model and KV caches should utilize a major fraction of the GPU memory, and reading the whole memory takes about 40ms on an A100 GPU (80GB).
Thus, each step of generation takes on the order of 20-50ms. With careful engineering, this is more than enough to compute the set of allowed tokens in Rust compiled to Wasm. These constraints can be combined either natively in Rust, or via the Python or JavaScript interpreters we provide.
For example, computing the allowed token set in the 32000-strong vocabulary of the Llama model takes:
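The 40ms figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a peak memory bandwidth of roughly 2 TB/s for the 80GB A100 and that the whole 80GB is streamed once per step; both are simplifying assumptions made for illustration, not measurements from this project.

```python
# Rough estimate of the time needed to read all GPU memory once.
# ASSUMPTION: ~2000 GB/s peak memory bandwidth for an 80GB A100; this number
# is not taken from this README, and sustained bandwidth is lower in practice.
GPU_MEMORY_GB = 80.0
ASSUMED_BANDWIDTH_GB_PER_S = 2000.0


def full_read_time_ms(memory_gb: float, bandwidth_gb_per_s: float) -> float:
    """Return the time in milliseconds to stream `memory_gb` once."""
    return memory_gb / bandwidth_gb_per_s * 1000.0


step_ms = full_read_time_ms(GPU_MEMORY_GB, ASSUMED_BANDWIDTH_GB_PER_S)
print(f"{step_ms:.0f} ms per full read")
```

Under these assumptions a step is bounded below by memory traffic, which is the CPU time window a controller has available for each token.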
- about 2.0ms for Yacc grammar of the C programming language
- about 0.3ms for a regular expression
- about 0.2ms for a substring constraint, from a 4kB string
The above numbers are for a single sequence; however, each sequence is processed in a separate process, so if there are more cores than sequences (which is typical), they do not change. They also include the overhead of calling into the Python interpreter implemented in Wasm, and then back into Rust-generated Wasm code for the constraint itself. They are all well within the 20-50ms budget, so they do not affect the generation time at all.
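To make "computing the allowed token set" concrete, here is a deliberately naive sketch of a substring constraint over a toy vocabulary. The function name, the vocabulary, and the source string are all hypothetical; the project's optimized Rust implementation is not shown here and very likely works differently.

```python
# Naive illustration of a substring constraint: the generated text must
# always remain a substring of `source`. All names here are hypothetical and
# this is NOT the optimized implementation shipped in this repository.

def allowed_tokens(vocab: list[str], generated: str, source: str) -> set[int]:
    """Return ids of tokens that keep `generated + token` a substring of `source`."""
    return {
        token_id
        for token_id, token in enumerate(vocab)
        if token and (generated + token) in source
    }


# Toy vocabulary; a real Llama vocabulary has about 32000 entries.
VOCAB = ["the", " quick", " brown", " fox", "qu", "ick", " slow", "z"]
SOURCE = "the quick brown fox"

print(sorted(allowed_tokens(VOCAB, "", SOURCE)))     # any substring may start
print(sorted(allowed_tokens(VOCAB, "the", SOURCE)))  # only " quick" continues
print(sorted(allowed_tokens(VOCAB, "qu", SOURCE)))   # only "ick" continues
```

This version rescans the source string for every token, so its cost grows with both vocabulary and source size; it is meant only to show what the allowed set is, not how to compute it within the timings listed above.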
There is also some overhead in the critical path of sampling. It comes down to about 0.3ms per generation step when executing 10 sequences in parallel (this is irrespective of the constraint used). The overhead goes up to around 0.7ms for 40 sequences (though it has not been fully optimized yet).
WebAssembly is designed to have minimal overhead compared to native code. In our experience, highly optimized Rust code is less than 2x slower when run in Wasmtime than when run natively. This is 10-100x better than JavaScript or Python.
All measurements were done on an AMD EPYC 7V13 with an NVIDIA A100 GPU with 80GB of VRAM.
The low-level interface that AICI runtime provides allows for:
- interaction with the LLM inference engine before, during, and after every generated token
- constraining decoding to a set of tokens
- backtracking KV-cache to a previous state
- fast-forwarding several tokens at a time (if they are known)
- forking generation into multiple branches
- communication between forks via shared variables
- utility functions for converting between tokens and byte strings
It can be utilized from any language that compiles to Wasm.
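Three of the capabilities listed above (constraining decoding to a set of tokens, backtracking, and fast-forwarding) can be illustrated with plain lists. This is a conceptual sketch only: the helper names below are hypothetical and are not part of aici_abi, PyCtrl, or the AICI protocol, and a token list stands in for the engine's KV-cache.

```python
import math

# Conceptual sketch with HYPOTHETICAL helper names; none of these functions
# come from aici_abi or pyaici. A token list stands in for the KV-cache.


def apply_mask(logits: list[float], allowed: set[int]) -> list[float]:
    """Set the logit of every disallowed token to -inf so it is never chosen."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def greedy_pick(logits: list[float]) -> int:
    """Pick the token id with the highest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


def backtrack(tokens: list[int], n: int) -> list[int]:
    """Drop the last n tokens, mimicking a rollback to an earlier state."""
    return tokens[: len(tokens) - n]


def fast_forward(tokens: list[int], known: list[int]) -> list[int]:
    """Append several already-known tokens at once instead of sampling them."""
    return tokens + known


logits = [0.1, 2.5, 1.7, -0.3]
print(greedy_pick(logits))                      # unconstrained choice
print(greedy_pick(apply_mask(logits, {0, 2})))  # choice restricted to {0, 2}

sequence = [5, 7, 9]
sequence = backtrack(sequence, 2)          # roll back two tokens
sequence = fast_forward(sequence, [1, 2])  # splice in two known tokens
print(sequence)
```

Masking with negative infinity leaves the relative order of the allowed tokens unchanged, so any sampling strategy applied afterwards still respects the constraint.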
This repository provides a Rust library that makes it easy to implement controllers in Rust, and provides efficient implementations of specific constraints (regular expressions, yacc grammars, substrings). We also provide Python and JavaScript interpreters that allow you to glue these constraints together. All of these can be easily extended.
- Flash Attention kernels are copied from flash-attention repo; see BSD LICENSE
- Paged Attention kernels are copied from vLLM repo; see Apache LICENSE
- OpenAI API definitions are copied and modified from candle-vllm; see MIT LICENSE
- cache_engine.rs, config.rs, and scheduler.rs are loosely based on vLLM
- llama.rs, phi.rs and logits.rs are based on candle-transformers
- specific Python library files are copied from RustPython (as we only use a subset of them)
- the example ANSI C grammar is based on https://www.lysator.liu.se/c/ANSI-C-grammar-y.html by Jeff Lee (from 1985)
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.