FastAPI wrapper for ctranslate2 models that uses a lite version of https://github.com/jncraton/languagemodels to run the inference (currently forcing the use of CPUs). The creation of this repository was motivated by the need to deploy local versions of seq2seq models. A ctranslate2-compatible model can be downloaded locally, and the wrapper can be run by simply pointing to the folder containing the translator (model) and tokenizer files. The wrapper exposes two APIs, `/completions` and `/chat/completions`, that match the OpenAI specifications (https://platform.openai.com/docs/api-reference). In addition, the wrapper guarantees fast inference by loading the model artifacts into memory at start-up. The wrapper has also been containerised to ease its deployment.
A ctranslate2 model and a compatible tokenizer (https://pypi.org/project/tokenizers/) must first be downloaded, and a `bootstrap_config.json` must be created within its directory to instruct the wrapper on how to consume the model. An example script for downloading artifacts from HuggingFace is provided in `artifacts/download_model.py`; it requires `huggingface-hub` to run. Set the `repo_name` variable, e.g., `repo_name = "MBZUAI/LaMini-Flan-T5-248M"`, and launch the script. The script will automatically create the path with the model files under `artifacts`.
# Example of model download
$ python artifacts/download_model.py
> Downloading MBZUAI/LaMini-Flan-T5-248M
> Fetching 11 files: 45%|███████████████████████████████████████
> ...
$ ls artifacts/MBZUAI/LaMini-Flan-T5-248M/
> config.json model.bin shared_vocabulary.txt tokenizer_config.json tokenizer.json
An example bootstrap configuration file is provided in `artifacts/example_bootstrap_config.json`. All three attributes in the file must be configured for the wrapper to run; the values can easily be found by browsing the specs of the model of interest. Only ctranslate2 models can be run.
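To illustrate, a loader for the bootstrap configuration might look like the sketch below. The attribute names used here are hypothetical placeholders; the authoritative keys are the three defined in `artifacts/example_bootstrap_config.json`.

```python
import json
from pathlib import Path

# Hypothetical attribute names: the actual three mandatory keys are
# the ones listed in artifacts/example_bootstrap_config.json.
REQUIRED_KEYS = {"model_name", "tokenizer_file", "max_length"}

def load_bootstrap_config(model_dir):
    """Read bootstrap_config.json from the model directory and
    fail fast if a mandatory attribute is missing."""
    config = json.loads(Path(model_dir, "bootstrap_config.json").read_text())
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"bootstrap_config.json is missing: {sorted(missing)}")
    return config
```

Validating the file at start-up keeps misconfigurations from surfacing only at inference time.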
Ensure that you create a virtual environment in which to install the required dependencies, then install them with `pip install -r env/requirements.txt`. Now you can run the wrapper as follows:
# Configure the environment to point to the model artifacts
BASE_PATH="/home/ghiander/ctranslate2-fastapi/artifacts"
export LLM_ARTIFACT_DIR="${BASE_PATH}/MBZUAI/LaMini-Flan-T5-248M"
# Run the wrapper locally
$ uvicorn --app-dir src main:app
> INFO:main:Loaded 'LaMini-Flan-T5-248M' model into memory
> INFO: Started server process [30596]
> INFO: Waiting for application startup.
> INFO: Application startup complete.
> INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
The wrapper can also be run (including in debug mode) using the commands in the makefile:
$ make run
> INFO:main:Loaded 'LaMini-Flan-T5-248M' model into memory
> INFO: Started server process [30596]
> ...
# Debug mode
$ make run-dev
> INFO: Will watch for changes in these directories: ['/home/kdxr003/llm/ctranslate2-fastapi']
> INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
> INFO: Started reloader process [11842] using WatchFiles
> INFO:main:Loaded 'LaMini-Flan-T5-248M' model into memory
> INFO: Started server process [11845]
> ...
The API Swagger UI is automatically generated by FastAPI at http://127.0.0.1:8000/docs.
An example of a `/completions` request is shown below.
$ curl --request POST \
--url http://127.0.0.1:8000/completions \
--header 'content-type: application/json' \
--data '{"prompt": "What'\''s the first name of the secret agent Bond?"}'
> {
"id": "BY3J4AYTHW",
"model": "LaMini-Flan-T5-248M",
"usage": {
"completion_tokens": 2,
"prompt_tokens": 12,
"total_tokens": 14
},
"choices": [
{
"text": "James."
}
]
}
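The same request can be issued from Python using only the standard library. A minimal sketch, assuming the wrapper is running locally on port 8000:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # the locally running wrapper

def build_completion_request(prompt, base_url=BASE_URL):
    """Assemble a POST request for the /completions endpoint."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completions",
        data=payload,
        headers={"content-type": "application/json"},
        method="POST",
    )

# With the server up:
# with urllib.request.urlopen(build_completion_request(
#         "What's the first name of the secret agent Bond?")) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```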
An example of a `/chat/completions` request is shown below.
$ curl --request POST \
--url http://127.0.0.1:8000/chat/completions \
--header 'content-type: application/json' \
--data '{"messages": [{"role": "system","content": "Respond like a helpful assistant."},{"role": "user","content": "What'\''s a nice reptile?"}]}'
> {
"id": "LB4WLPXHTR",
"model": "LaMini-Flan-T5-248M",
"usage": {
"completion_tokens": 10,
"prompt_tokens": 16,
"total_tokens": 26
},
"choices": [
{
"message": {
"role": "assistant",
"content": "A nice reptile is a snake."
}
}
]
}
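Programmatically, building the chat payload and extracting the reply mirror the OpenAI-style schema shown above. A sketch (the commented-out call requires the server to be reachable):

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # the locally running wrapper

def build_chat_request(messages, base_url=BASE_URL):
    """Assemble a POST request for the /chat/completions endpoint."""
    payload = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"content-type": "application/json"},
        method="POST",
    )

def assistant_reply(response):
    """Pull the assistant message out of a /chat/completions response."""
    return response["choices"][0]["message"]["content"]

# With the server up:
# messages = [
#     {"role": "system", "content": "Respond like a helpful assistant."},
#     {"role": "user", "content": "What's a nice reptile?"},
# ]
# with urllib.request.urlopen(build_chat_request(messages)) as resp:
#     print(assistant_reply(json.load(resp)))
```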
The Docker image build was designed as a two-step process: first, the base wrapper image is built without any model artifacts (i.e., just the code needed to run the wrapper); then, the model artifact files are injected into a child image (i.e., code + model files). The idea is that the base wrapper image can be reused across different model images without rebuilding whenever a new model is created. The two steps are captured by commands in the makefile.
To create the wrapper base image:
$ make build
> Sending build context to Docker daemon 281.6kB
> ...
> Successfully tagged ct2-wrapper:latest
To inject a model that was downloaded previously, the `MODEL_NAME` variable inside `artifacts/inject_model_into_image.sh` must be edited to match the relative path of the model folder. Then the script can be run:
$ make inject
> cd artifacts && ./inject_model_into_image.sh
> ...
> MBZUAI/LaMini-Flan-T5-248M was injected into the image
Now a container can be run:
$ make run-docker
> docker run --rm --name ct2-model ct2-model
> INFO:main:Loaded 'LaMini-Flan-T5-248M' model into memory
> INFO: Started server process [7]
> ...
Note that the dockerised service will be listening on http://0.0.0.0:8000 instead of http://127.0.0.1:8000, since the Docker port mapping points to 0.0.0.0 by default.
A number of tests have been implemented in the `test/` folder using different methods. Please refer to the makefile to see how to run the tests individually.
- https://github.com/jncraton/languagemodels implemented tests using `doctest`. Those were retained where appropriate and can be run sequentially via `test/test_doctest.py`.
- The main functions of `languagemodels` and the wrapper are tested in `test/test_pytest.py`, which requires `pytest` (see `env/requirements_dev.txt`).
- The APIs are tested in `test/test_api.py`, which requires `httpx` (see `env/requirements_dev.txt`).
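The doctest-based runner follows the standard library pattern; a minimal sketch of the idea (not the repo's exact script):

```python
import doctest

def run_doctests(modules):
    """Run the doctests of each module and return the total failure count."""
    failures = 0
    for mod in modules:
        result = doctest.testmod(mod, verbose=False)
        failures += result.failed
    return failures

# Usage (with the repo on the path):
# import languagemodels
# run_doctests([languagemodels])
```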
To run the full test suite, make sure you have activated the environment with all the necessary dependencies and that `LLM_ARTIFACT_DIR` points to a folder with the model and tokenizer:
$ make test
> python test/test_doctest.py
> Modules to be tested: [<module 'languagemodels' from '/home/kdxr003/llm/ctranslate2-fastapi/lib/languagemodels/__init__.py'>, <module 'languagemodels.bootstrap' from '/home/kdxr003/llm/ctranslate2-fastapi/lib/languagemodels/bootstrap.py'>, <module 'languagemodels.inference' from '/home/kdxr003/llm/ctranslate2-fastapi/lib/languagemodels/inference.py'>, <module 'languagemodels.models' from '/home/kdxr003/llm/ctranslate2-fastapi/lib/languagemodels/models.py'>]
> ...
> pytest -vvs test/test_pytest.py
> ...
> test/test_pytest.py::test_load_artifacts PASSED
> test/test_pytest.py::test_completions_lazy_loading PASSED
> ...
> pytest -vvs test/test_api.py
> test/test_api.py::test_health PASSED
> test/test_api.py::test_completions PASSED
> test/test_api.py::test_chat PASSED