This library serves as a platform for utilizing and creating applications based on pre-existing foundation models. Its features include:
- Loading large language models (LLMs) as PyTorch modules.
- Establishing an API server that resembles the ChatGPT API.
- Loading 8-bit and 4-bit quantized models for faster inference.
Illustrative examples for each use case can be found in the examples/ folder.
Use the conda package manager to create an environment with the required dependencies:
conda env create -f env.yml
Install the module's GPTQ dependency:
cd llm_lib/modules/repositories/
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
cd GPTQ-for-LLaMa
python setup_cuda.py install
After installing GPTQ, navigate back to the root folder:
cd ../../../../
Add the library to your Python path by installing it in editable mode:
pip install -e .
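To confirm the installation, a quick sanity check (a minimal sketch, assuming the conda environment created above is active) is to import the package:

```python
# Minimal sanity check: the package and its model loader should import without errors.
import llm_lib
from llm_lib.utils import load_model
```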
The library supports two main use cases:
- Loading LLMs and utilizing them as regular PyTorch modules. This is ideal for users seeking complete control over the model.
- Establishing an API server resembling the ChatGPT API and employing an API client to connect to the API. This is suitable for users who prefer not to modify their model code.
from llm_lib.utils import load_model
transformer, tokenizer = load_model(model_path, **kwargs)
Supported parameters for load_model
Parameter | Description |
---|---|
model_path | Path to the downloaded model weights. (For A2I2 students/researchers, please refer to the Supported Pre-trained Weights section) |
load_in_8bit | Determines whether to load the model with 8-bit precision. This option allows for loading models using fewer GPU resources, with a slight tradeoff in performance. |
auto_devices | Controls whether the model is automatically distributed across multiple GPUs. |
wbits, groupsize | Parameters for GPTQ quantization. For more details, please refer to the GPTQ paper and the GPTQ-for-LLaMa repository. |
Examples are provided in examples/
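As a rough sketch of the PyTorch workflow, the returned transformer and tokenizer can be used for plain text generation. The snippet below assumes they expose the usual Hugging Face transformers interface (a callable tokenizer and transformer.generate), which may differ per model; the model path is a placeholder, and the scripts in examples/ remain the authoritative reference.

```python
import torch
from llm_lib.utils import load_model

# Placeholder path; point this at your downloaded weights.
transformer, tokenizer = load_model(model_path="/path/to/model_weights/")

# Assumes a Hugging Face-style interface on the returned objects.
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(transformer.device)
with torch.no_grad():
    output_ids = transformer.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```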
By default, the API is served at "http://0.0.0.0:8000/v1/" and its documentation at "http://0.0.0.0:8000/docs". Start the server with:
python -m llm_lib.server --model_path PATH_TO_MODEL_WEIGHT
Supported parameters are similar to those in load_model above.
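Since the server resembles the ChatGPT API, it can also be queried with plain HTTP. The sketch below assumes an OpenAI-style /v1/completions route, which is an assumption on my part; the exact paths and request schema are documented at http://0.0.0.0:8000/docs.

```python
import requests

# Hypothetical request body; check http://0.0.0.0:8000/docs for the exact schema.
resp = requests.post(
    "http://0.0.0.0:8000/v1/completions",
    json={"prompt": "Hello, how are you?", "max_tokens": 64, "temperature": 1.0},
)
print(resp.json())
```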
Example of using the LLMClient in Python code:
from llm_lib.client import LLMClient
local_llm = LLMClient(host="http://0.0.0.0:8000/v1")
# Sentence completion
result = local_llm.create_completion(prompt="Hello, How are you?", max_tokens=128, temperature=1.0)
completion = result.response.choices[0].text.strip()
Note: If you host the API server on a different machine, replace "http://0.0.0.0:8000/v1" with that machine's address.
Examples are provided in examples/
The library currently supports loading:
- LLMs with default weights.
- LLMs with 8-bit and 4-bit quantization.
You can automatically download a model from Hugging Face (HF) using download-model.py:
python download-model.py organization/model
For example
python download-model.py facebook/opt-1.3b
A2I2 students and researchers can utilize the downloaded model weights stored in /weka/Projects/local_llms/model_weights/.
Note that models whose names end in 4bits-128g or 4bits require specific flags at execution time. *-4bits-128g models should be run with the flags --wbits 4 --groupsize 128, while *-4bits models only need the --wbits 4 flag.
Here are some snippets showing how to use the downloaded models on weka:
# As a PyTorch module
from llm_lib.utils import load_model
transformer, tokenizer = load_model(model_path="/weka/Projects/local_llms/model_weights/TheBloke_vicuna-13B-1.1-GPTQ-4bit-128g/", wbits=4, groupsize=128)
transformer, tokenizer = load_model(model_path="/weka/Projects/local_llms/model_weights/vicuna13B/")
# As API
python -m llm_lib.server --model_path /weka/Projects/local_llms/model_weights/TheBloke_vicuna-13B-1.1-GPTQ-4bit-128g/ --wbits 4 --groupsize 128
See the examples/ folder for detailed examples.
Proper documentation will be written soon.
Some of the code is borrowed from