FastAPI wrapper for ctranslate2 models that uses a lite version of https://github.com/jncraton/languagemodels to run the inference (currently forcing the use of CPUs). The creation of this repository was motivated by the need to deploy local versions of seq2seq models. A ctranslate2-compatible model can be downloaded locally, and the wrapper can be run by simply pointing to the folder containing the translator (model) and tokenizer files. The wrapper exposes two APIs, `/completions` and `/chat/completions`, that match the OpenAI specifications (https://platform.openai.com/docs/api-reference). In addition, the wrapper guarantees fast inference by loading the model artifacts into memory at start-up. The wrapper has also been containerised to ease its deployment.
A ctranslate2 model and a compatible tokenizer (https://pypi.org/project/tokenizers/) must first be downloaded, and a `bootstrap_config.json` must be created within its directory to instruct the wrapper on how to consume the model. An example script for downloading artifacts from HuggingFace is provided in `artifacts/download_model.py`; it requires `huggingface-hub` to run. Set the `repo_name` variable, e.g., `repo_name = "MBZUAI/LaMini-Flan-T5-248M"`, and launch the script. The script will automatically create the path with the model files under `artifacts`.
# Example of model download
$ python artifacts/download_model.py
> Downloading MBZUAI/LaMini-Flan-T5-248M
> Fetching 11 files: 45%|███████████████████████████████████████
> ...
$ ls artifacts/MBZUAI/LaMini-Flan-T5-248M/
> config.json model.bin shared_vocabulary.txt tokenizer_config.json tokenizer.json
An example bootstrap configuration file is provided in `artifacts/example_bootstrap_config.json`. All three attributes in the file must be configured for the wrapper to run; the values can easily be found by browsing the specs of the model of interest. Only ctranslate2 models can be run.
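To illustrate, a loader for the bootstrap configuration might look like the sketch below. The attribute names used here are hypothetical placeholders; the authoritative keys are the three defined in `artifacts/example_bootstrap_config.json`.

```python
import json
from pathlib import Path

# Hypothetical attribute names: the actual three mandatory keys are
# the ones listed in artifacts/example_bootstrap_config.json.
REQUIRED_KEYS = {"model_name", "tokenizer_file", "max_length"}

def load_bootstrap_config(model_dir):
    """Read bootstrap_config.json from the model directory and
    fail fast if a mandatory attribute is missing."""
    config = json.loads(Path(model_dir, "bootstrap_config.json").read_text())
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"bootstrap_config.json is missing: {sorted(missing)}")
    return config
```

Validating the file at start-up keeps misconfigurations from surfacing only at inference time.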
Ensure that you create a virtual environment in which to install the required dependencies, then install them with `pip install -r env/requirements.txt`. Now you can run the wrapper as follows:
# Configure the environment to point to the model artifacts
BASE_PATH="/home/ghiander/ctranslate2-fastapi/artifacts"
export LLM_ARTIFACT_DIR="${BASE_PATH}/MBZUAI/LaMini-Flan-T5-248M"
# Run the wrapper locally
$ uvicorn --app-dir src main:app
> INFO:main:Loaded 'LaMini-Flan-T5-248M' model into memory
> INFO: Started server process [30596]
> INFO: Waiting for application startup.
> INFO: Application startup complete.
> INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
The wrapper can also be run (including in debug mode) using the commands in the makefile:
$ make run
> INFO:main:Loaded 'LaMini-Flan-T5-248M' model into memory
> INFO: Started server process [30596]
> ...
# Debug mode
$ make run-dev
> INFO: Will watch for changes in these directories: ['/home/kdxr003/llm/ctranslate2-fastapi']
> INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
> INFO: Started reloader process [11842] using WatchFiles
> INFO:main:Loaded 'LaMini-Flan-T5-248M' model into memory
> INFO: Started server process [11845]
> ...
The API Swagger UI is automatically generated by FastAPI at http://127.0.0.1:8000/docs.
An example of a `/completions` request is shown below.
$ curl --request POST \
--url http://127.0.0.1:8000/completions \
--header 'content-type: application/json' \
--data '{"prompt": "What'\''s the first name of the secret agent Bond?"}'
> {
"id": "BY3J4AYTHW",
"model": "LaMini-Flan-T5-248M",
"usage": {
"completion_tokens": 2,
"prompt_tokens": 12,
"total_tokens": 14
},
"choices": [
{
"text": "James."
}
]
}
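The same request can be issued from Python using only the standard library. A minimal sketch, assuming the wrapper is running locally on port 8000:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # the locally running wrapper

def build_completion_request(prompt, base_url=BASE_URL):
    """Assemble a POST request for the /completions endpoint."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completions",
        data=payload,
        headers={"content-type": "application/json"},
        method="POST",
    )

# With the server up:
# with urllib.request.urlopen(build_completion_request(
#         "What's the first name of the secret agent Bond?")) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```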
An example of a `/chat/completions` request is shown below.
$ curl --request POST \
--url http://127.0.0.1:8000/chat/completions \
--header 'content-type: application/json' \
--data '{"messages": [{"role": "system","content": "Respond like a helpful assistant."},{"role": "user","content": "What'\''s a nice reptile?"}]}'
> {
"id": "LB4WLPXHTR",
"model": "LaMini-Flan-T5-248M",
"usage": {
"completion_tokens": 10,
"prompt_tokens": 16,
"total_tokens": 26
},
"choices": [
{
"message": {
"role": "assistant",
"content": "A nice reptile is a snake."
}
}
]
}
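Programmatically, building the chat payload and extracting the reply mirror the OpenAI-style schema shown above. A sketch (the commented-out call requires the server to be reachable):

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # the locally running wrapper

def build_chat_request(messages, base_url=BASE_URL):
    """Assemble a POST request for the /chat/completions endpoint."""
    payload = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"content-type": "application/json"},
        method="POST",
    )

def assistant_reply(response):
    """Pull the assistant message out of a /chat/completions response."""
    return response["choices"][0]["message"]["content"]

# With the server up:
# messages = [
#     {"role": "system", "content": "Respond like a helpful assistant."},
#     {"role": "user", "content": "What's a nice reptile?"},
# ]
# with urllib.request.urlopen(build_chat_request(messages)) as resp:
#     print(assistant_reply(json.load(resp)))
```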
The Docker image build was designed as a two-step process: first, the base wrapper image is built without any model artifacts (i.e., just the code needed to run the wrapper); then, the model artifact files are injected into a child image (i.e., code + model files). The idea is that the base wrapper image can be reused across different model images without rebuilding whenever a new model is created. The two steps are captured by commands in the makefile.
To create the wrapper base image:
$ make build
> Sending build context to Docker daemon 281.6kB
> ...
> Successfully tagged ct2-wrapper:latest
To inject a model that was downloaded previously, the `MODEL_NAME` variable inside `artifacts/inject_model_into_image.sh` must be edited to match the relative path of the model folder. Then the script can be run:
$ make inject
> cd artifacts && ./inject_model_into_image.sh
> ...
> MBZUAI/LaMini-Flan-T5-248M was injected into the image
Now a container can be run:
$ make run-docker
> docker run --rm --name ct2-model ct2-model
> INFO:main:Loaded 'LaMini-Flan-T5-248M' model into memory
> INFO: Started server process [7]
> ...
Note that the dockerised service will be listening on http://0.0.0.0:8000 instead of http://127.0.0.1:8000, since the Docker port mapping points to 0.0.0.0 by default.
A number of tests have been implemented in the `test/` folder using different methods. Please refer to the makefile to see how to run the tests individually.
- https://github.com/jncraton/languagemodels implemented tests using `doctest`. Those were retained where appropriate and can be run sequentially via `test/test_doctest.py`.
- The main functions of `languagemodels` and the wrapper are tested in `test/test_pytest.py`, which requires `pytest` (see `env/requirements_dev.txt`).
- The APIs are tested in `test/test_api.py`, which requires `httpx` (see `env/requirements_dev.txt`).
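The doctest-based runner follows the standard library pattern; a minimal sketch of the idea (not the repo's exact script):

```python
import doctest

def run_doctests(modules):
    """Run the doctests of each module and return the total failure count."""
    failures = 0
    for mod in modules:
        result = doctest.testmod(mod, verbose=False)
        failures += result.failed
    return failures

# Usage (with the repo on the path):
# import languagemodels
# run_doctests([languagemodels])
```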
To run the full test suite, make sure you have activated the environment with all the necessary dependencies and that `LLM_ARTIFACT_DIR` points to a folder with the model and tokenizer:
$ make test
> python test/test_doctest.py
> Modules to be tested: [<module 'languagemodels' from '/home/kdxr003/llm/ctranslate2-fastapi/lib/languagemodels/__init__.py'>, <module 'languagemodels.bootstrap' from '/home/kdxr003/llm/ctranslate2-fastapi/lib/languagemodels/bootstrap.py'>, <module 'languagemodels.inference' from '/home/kdxr003/llm/ctranslate2-fastapi/lib/languagemodels/inference.py'>, <module 'languagemodels.models' from '/home/kdxr003/llm/ctranslate2-fastapi/lib/languagemodels/models.py'>]
> ...
> pytest -vvs test/test_pytest.py
> ...
> test/test_pytest.py::test_load_artifacts PASSED
> test/test_pytest.py::test_completions_lazy_loading PASSED
> ...
> pytest -vvs test/test_api.py
> test/test_api.py::test_health PASSED
> test/test_api.py::test_completions PASSED
> test/test_api.py::test_chat PASSED