This is a non-production-ready frontend for LLaMA. Do not expose it to the internet unless you are prepared to have random scanners instantly take control of your machine.
- Linux
- CUDA Toolkit installed (versions up to 12.0.1 are supported, but not 12.1.x)
- Python 3.10+, pip
- NVIDIA GPU with 10GB+ of VRAM (8-bit inference courtesy of https://github.com/tloen/llama-int8)
- Clone this repo, then install dependencies:

  ```
  pip install -r requirements.txt
  pip install -e .
  ```
- Copy the `7B` checkpoints folder and `tokenizer.model` to the repository root.
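The copy step can look like the following, assuming your LLaMA weights were downloaded to `/path/to/llama` (an illustrative path, not one this repo defines):

```shell
# Illustrative paths: replace /path/to/llama with wherever your
# LLaMA weights actually live. Run from the repository root.
cp -r /path/to/llama/7B .
cp /path/to/llama/tokenizer.model .
```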
`flask run` will start the app server on `127.0.0.1:5000`. To enable access from other devices on the network, use `flask run --host 0.0.0.0`.
The checkpoint folder, tokenizer file, and other options can be configured with environment variables:
| Variable | Description | Default value |
|---|---|---|
| `CHECKPOINT_DIR` | Path of the 7B checkpoint folder | `7B` |
| `TOKENIZER_PATH` | Path of the tokenizer model | `tokenizer.model` |
| `CONTEXT_LEN` | Number of context tokens to provide to the model. Reduce if you get cuBLAS or memory errors. | `768` |
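As a sketch of how settings like these are typically resolved (the variable names and defaults come from the table above, but this is not the repo's own config code), each one can be read from the environment with a fallback:

```python
import os

# Defaults mirror the table above; this is an illustrative sketch
# of environment-variable resolution, not code from this repo.
CHECKPOINT_DIR = os.environ.get("CHECKPOINT_DIR", "7B")
TOKENIZER_PATH = os.environ.get("TOKENIZER_PATH", "tokenizer.model")
CONTEXT_LEN = int(os.environ.get("CONTEXT_LEN", "768"))

print(CHECKPOINT_DIR, TOKENIZER_PATH, CONTEXT_LEN)
```

So, for example, `CONTEXT_LEN=512 flask run` starts the server with a smaller context window, which can help with cuBLAS or memory errors.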