illama

illama is a fast, lightweight, parallel inference server for Llama-based large language models (LLMs), built on ExLlamaV2.

Features

  • Continuous batching - Handles multiple requests simultaneously.
  • OpenAI-compatible server - Works with the official OpenAI API clients (see the client sketch under Running the Server).
  • Quantization support - Loads any ExLlamaV2-compatible quantized model (GPTQ, EXL2, or SafeTensors).
  • GPU focused - Distributes the model across any number of local GPUs.
  • FlashAttention 2 - Uses FlashAttention 2 with paged attention by default.

Getting Started

To get started, clone the repository:

git clone https://github.com/nickpotafiy/illama.git
cd illama

With Conda

Optionally, create and activate a new conda environment:

conda create -n illama python=3.10
conda activate illama

Install PyTorch

Install the NVIDIA CUDA Toolkit and PyTorch. Ideally, their CUDA versions should match to minimize incompatibilities; PyTorch built for CUDA 12.1 is recommended together with NVIDIA CUDA Toolkit 12.1+.

Install Torch w/ Pip

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Install Torch w/ Conda

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Check Torch CUDA version with: python -c "import torch; print(torch.version.cuda)"
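
To sanity-check the GPU stack end to end before installing illama, a short script like the following can help. This is a minimal sketch using only standard PyTorch calls; it is not part of illama:

import torch

# Compiled CUDA version (e.g. "12.1"); None indicates a CPU-only build.
print("torch CUDA version:", torch.version.cuda)
# True only if a working driver and at least one GPU are visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
# Number of local GPUs the model could be distributed across.
print("GPU count:", torch.cuda.device_count())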

Install illama

First, install the build dependencies:

pip install packaging ninja

Then, install the main package:

pip install .

If installation fails, you may need to set MAX_JOBS=4 (or lower, depending on available system memory) before installing, either inline or with export MAX_JOBS=4. This is a known flash-attn build issue.
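
For example, capping the build at four parallel compile jobs:

MAX_JOBS=4 pip install .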

Running the Server

To start the illama server, run:

python server.py --model-path "<path>" --batch-size 10 --host "0.0.0.0" --port 5000 --verbose

Run python server.py --help to get a list of all available options.
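
Because the server is OpenAI-compatible, the official OpenAI Python client can talk to it. Below is a minimal sketch assuming the server above is running locally on port 5000 and serves the standard /v1 routes; the base path, model name, and dummy API key are assumptions, so confirm the details against python server.py --help:

from openai import OpenAI

# Point the official client at the local illama server instead of api.openai.com.
# The /v1 base path and the dummy API key are assumptions.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama",  # hypothetical model name; use what your server expects
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)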

Troubleshooting

If you get an error saying OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root, that typically means PyTorch was not installed correctly. You can verify the PyTorch installation by activating your environment and running python:

import torch
print(torch.version.cuda)  # e.g. "12.1"; prints None for a CPU-only build

If this does not print a CUDA version, PyTorch was not installed correctly. You may have installed a PyTorch build without CUDA support (such as a Preview build).


License

MIT License

