
This reference can be used with any existing OpenAI-integrated app to run TRT-LLM inference locally on a GeForce GPU on Windows instead of in the cloud.

🚀 TensorRT-LLM as OpenAI API on Windows 🤖

Drop-in replacement REST API compatible with OpenAI API spec using TensorRT-LLM as the inference backend.

Set up a local Llama 2 or Code Llama web server using TRT-LLM for compatibility with the OpenAI Chat and legacy Completions APIs. This enables accelerated inference natively on Windows, while retaining compatibility with the wide array of projects built on the OpenAI API.

Follow this README to set up your own web server for Llama 2 and Code Llama.

Getting Started

Ensure you have the prerequisites in place:

  1. Install TensorRT-LLM for Windows using the instructions here.

  2. Ensure you have access to the Llama 2 repository on Hugging Face.

  3. This repo provides instructions for setting up an OpenAI API compatible server with either the Llama 2 13B or Code Llama 13B model, both optimized using AWQ 4-bit quantization. First, compile a TensorRT engine tailored to your specific GPU by following the engine-build instructions below.

Building TRT Engine

Follow these steps to build your TRT engine:

Download models and quantized weights

  • CodeLlama-13B-instruct AWQ int4
  • Llama-2-13b-chat AWQ int4
    • Download Llama-2-13b-chat model from Llama-2-13b-chat-hf
    • Download Llama-2-13b-chat AWQ int4 checkpoints from here

Clone the TensorRT LLM repository:

git clone https://github.com/NVIDIA/TensorRT-LLM.git

For the Code Llama engine, navigate to the examples\llama directory and run the following command:

python build.py --model_dir <path to CodeLlama model> --quant_ckpt_path <path to CodeLlama .npz file> --dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --max_batch_size 1 --max_input_len 15360 --max_output_len 1024 --output_dir <TRT engine folder> --rotary_base 1000000 --vocab_size 32064

For the Llama 2 engine, navigate to the examples\llama directory and run the following command:

python build.py --model_dir <path to llama13_chat model> --quant_ckpt_path <path to Llama2 .npz file> --dtype float16 --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --use_weight_only --weight_only_precision int4_awq --per_group --enable_context_fmha --max_batch_size 1 --max_input_len 3500 --max_output_len 1024 --output_dir <TRT engine folder>
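The long build commands above are easy to mistype. As a sketch, the Llama 2 invocation can be composed programmatically (paths here are placeholders; the flags mirror the documented build.py call):

```python
# Sketch: compose the Llama 2 engine build command from the documented flags.
# Paths are placeholders for illustration only.
import shlex

def llama2_build_cmd(model_dir: str, quant_ckpt: str, out_dir: str) -> str:
    args = [
        "python", "build.py",
        "--model_dir", model_dir,
        "--quant_ckpt_path", quant_ckpt,        # AWQ int4 .npz checkpoint
        "--dtype", "float16",
        "--use_gpt_attention_plugin", "float16",
        "--use_gemm_plugin", "float16",
        "--use_weight_only",
        "--weight_only_precision", "int4_awq",  # 4-bit AWQ quantization
        "--per_group",
        "--enable_context_fmha",
        "--max_batch_size", "1",
        "--max_input_len", "3500",
        "--max_output_len", "1024",
        "--output_dir", out_dir,
    ]
    return shlex.join(args)  # properly quotes paths containing spaces

print(llama2_build_cmd("Llama-2-13b-chat-hf", "llama2_awq.npz", "model/engine"))
```

shlex.join quotes any path containing spaces, which is a common pitfall with Windows model directories.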

Setup Steps

  1. Clone this repository:

    git clone https://github.com/NVIDIA/trt-llm-as-openai-windows.git
    cd trt-llm-as-openai-windows
    
  2. Download the tokenizer and config.json from HuggingFace and place them in the model/ directory.

  3. Build the TRT engine by following the instructions provided above and place the TensorRT engine files for the Llama 2 / Code Llama model in the model/engine directory.

  4. Install the necessary libraries:

    pip install -r requirements.txt
    
  5. Launch the application using the following command:

    • Llama-2-13B-chat model
    python app.py --trt_engine_path <TRT Engine folder> --trt_engine_name <TRT Engine file>.engine --tokenizer_dir_path <tokenizer folder> --port <optional port>
    
    • The CodeLlama-13B-instruct model requires the following additional parameter appended to the command above:
    --no_system_prompt True
    

    For CodeLlama, the full command becomes:

    python app.py --trt_engine_path model/ --trt_engine_name llama_float16_tp1_rank0.engine --tokenizer_dir_path model/ --port 8081 --no_system_prompt True
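Before pointing clients at the server, it can help to confirm it is actually listening. A stdlib-only sketch (host and port are assumptions matching the launch command above):

```python
# Sketch: check whether the local inference server is accepting connections.
import socket

def is_server_up(host: str = "127.0.0.1", port: int = 8081,
                 timeout: float = 1.0) -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if is_server_up():
    print("server is up")
else:
    print("server not reachable yet")
```

This only verifies the TCP port is open; the first request may still block while the engine finishes loading.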
    

Test the API

  1. Install the 'openai' client library in your Python environment.

    pip install openai==0.28
    
  2. In your Python environment, set a placeholder API key and the base URL, then run the code below.

import openai

openai.api_key = "ABC"  # any non-empty string works; the local server does not validate it
openai.api_base = "http://127.0.0.1:8081"
response = openai.ChatCompletion.create(
    model="Llama2",
    messages=[{"role": "user", "content": "Hello! How are you?"}],
)
print(response)
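Equivalently, any HTTP client can talk to the server. A stdlib sketch that builds the /v1/chat/completions request (sending is commented out because it requires the server to be running):

```python
# Sketch: build an OpenAI-style chat completion request with the stdlib only.
import json
import urllib.request

payload = {
    "model": "Llama2",
    "messages": [{"role": "user", "content": "Hello! How are you?"}],
}
req = urllib.request.Request(
    "http://127.0.0.1:8081/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer ABC",  # key is not validated by the local server
    },
)
# With the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
print(req.full_url)
```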

Detailed Command References

python app.py --trt_engine_path <TRT Engine folder> --trt_engine_name <TRT Engine file>.engine --tokenizer_dir_path <tokenizer folder> --port <port>

Arguments

| Name | Details |
| --- | --- |
| --trt_engine_path <> | Directory containing the built TensorRT engine |
| --trt_engine_name <> | Engine file name (e.g. llama_float16_tp1_rank0.engine) |
| --tokenizer_dir_path <> | Directory with the HF-downloaded tokenizer files and config.json, e.g. Llama-2-13b-chat-hf or CodeLlama-13b-Instruct-hf |
| --port <> | Port for the OpenAI compatible server on localhost (default: 8081) |
| --max_output_tokens <> | Optional override for the maximum number of output tokens (default: 2048) |
| --max_input_tokens <> | Optional override for the maximum number of input tokens (default: 2048) |
| --no_system_prompt | Disables the default system prompt (required for CodeLlama-13B-instruct) |
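The flag table can be mirrored in an argparse sketch (defaults are taken from the table above; app.py's actual parser may differ):

```python
# Sketch: an argparse definition matching the documented flags and defaults.
import argparse

parser = argparse.ArgumentParser(
    description="OpenAI-compatible TRT-LLM server (flag sketch)")
parser.add_argument("--trt_engine_path", required=True,
                    help="Directory of the built TRT engine")
parser.add_argument("--trt_engine_name", required=True,
                    help="Engine file, e.g. llama_float16_tp1_rank0.engine")
parser.add_argument("--tokenizer_dir_path", required=True,
                    help="HF tokenizer files and config.json")
parser.add_argument("--port", type=int, default=8081)
parser.add_argument("--max_output_tokens", type=int, default=2048)
parser.add_argument("--max_input_tokens", type=int, default=2048)
# Note: with type=bool, ANY non-empty value (including "False") parses as
# True -- a known argparse quirk, consistent with `--no_system_prompt True`.
parser.add_argument("--no_system_prompt", type=bool, default=False)

args = parser.parse_args([
    "--trt_engine_path", "model/",
    "--trt_engine_name", "llama_float16_tp1_rank0.engine",
    "--tokenizer_dir_path", "model/",
])
print(args.port)  # unspecified flags fall back to their defaults
```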

Supported APIs

  • /completions
  • /chat/completions
  • /v1/completions
  • /v1/chat/completions

Examples

Continue.dev Visual Studio Code Extension with CodeLlama-13B

  1. Run this app with the CodeLlama-13B-instruct AWQ int4 model as described above.
  2. Install Continue.dev from the Visual Studio Marketplace.
  3. Configure it to use the OpenAI API compatible local server from the UI:
    1. Open the Continue.dev plugin from the Visual Studio Code left panel.
    2. Click "+" to add a new model.
    3. Select "Other OpenAI-compatible API".
    4. Expand "Advanced (optional)":
      1. Set apiBase to the local server URL, e.g. http://localhost:8081/v1
      2. Set contextLength to 16384
    5. Select the "CodeLlama 13b instruct" option.
  4. Alternatively, config.json can be edited directly to include the model entry below:
    1. Open C:\Users\<user>\.continue\config.json in any editor.
    2. Add the model config below:
      {
         "models": [
            {
               "title": "CodeLlama-13b-Instruct",
               "provider": "openai",
               "model": "codellama:13b-instruct",
               "apiBase": "http://localhost:8081",
               "contextLength": 16384
            }
         ]
      }
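A stray trailing comma makes config.json unparseable, so it is worth validating the file after hand-editing. A stdlib sketch (the keys follow the example entry above):

```python
# Sketch: build the Continue.dev model entry in Python and round-trip it,
# since json.dumps always emits strictly valid JSON (no trailing commas).
import json

config = {
    "models": [
        {
            "title": "CodeLlama-13b-Instruct",
            "provider": "openai",
            "model": "codellama:13b-instruct",
            "apiBase": "http://localhost:8081",
            "contextLength": 16384,
        }
    ]
}
text = json.dumps(config, indent=2)
json.loads(text)  # raises ValueError if the file would be malformed
print(text)
```

The same json.loads check can be run against an existing config.json to catch syntax errors before Continue.dev silently fails to load it.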

This project requires additional third-party open source software projects as specified in the documentation. Review the license terms of these open source projects before use.
