kshitizgupta21 / triton-trt-oss

Example showing Triton hosting of TensorRT HuggingFace T5 and BART models

Example for Hosting TensorRT OSS HuggingFace Models on Triton Inference Server

To build the TensorRT (TRT) Engines

  1. Build the TRT 8.5 OSS container
bash build_trt_oss_docker.sh
  2. Launch the container
bash run_trt_oss_docker.sh
  3. Change directory and pip install the HF demo requirements
cd demo/HuggingFace
pip install -r requirements.txt
  4. Run build_t5_trt.py to build the T5 TRT engines and build_bart_trt.py to build the BART engines (a rough sketch of engine building follows this list).
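For reference, the build scripts wrap the TRT OSS HuggingFace demo. The following is only a rough, hedged sketch of what building an engine from an exported ONNX model looks like with the standard TensorRT Python API; the file names, workspace size, and absence of optimization profiles are illustrative assumptions, not taken from the repo's scripts.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def build_engine(onnx_path, engine_path):
    builder = trt.Builder(TRT_LOGGER)
    # Explicit-batch network, as required for ONNX models
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GB workspace (assumed)
    # Real T5/BART builds also add optimization profiles for dynamic batch and sequence lengths.
    serialized = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized)

build_engine("t5_encoder.onnx", "t5_encoder.engine")  # hypothetical file names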

Triton Inference

The Triton model repository is located at model_repository. Each model has a model.py and a config.pbtxt associated with it, along with the T5/BART TRT OSS code dependencies.
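Each model.py implements Triton's Python backend interface and wraps the TRT T5/BART generation code. Below is a minimal sketch of that interface; the INPUT_TEXT/OUTPUT_TEXT tensor names are placeholders, and the real names and generation logic live in each model's config.pbtxt and model.py.

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the serialized TRT engines and the HuggingFace tokenizer here (omitted).
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            # Placeholder tensor name; see the model's config.pbtxt for the real one.
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_TEXT")
            texts = in_tensor.as_numpy()
            # Run TRT encoder/decoder generation here (omitted) and return the decoded text.
            out_tensor = pb_utils.Tensor("OUTPUT_TEXT", texts.astype(object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses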

We showcase two models here: T5 and BART. TRT T5 currently supports both beam search and greedy search, while TRT BART currently supports only greedy search.

  • trt_t5_bs1_beam2 = TRT T5 model with max batch size 1 and beam search (2 beams)
  • trt_bart_bs1_greedy = TRT BART model with max batch size 1 and greedy search

Currently, the TensorRT engines for T5 and BART don't produce correct output for batch sizes > 1 (this bug is being worked on), so we only show batch size 1 examples for T5 and BART here.

Steps for Triton TRT Inference

  1. Build the custom Triton container with TRT and other dependencies. The Dockerfile is docker/triton_trt.Dockerfile
cd docker
bash build_triton_trt_docker.sh
cd ..
  2. Launch the custom Triton container
bash run_triton_trt_docker.sh
  3. Launch JupyterLab at port 8888
bash start_jupyter.sh
  4. Run through 1_triton_server.ipynb to launch Triton Server.
  5. Run through 2_triton_client.ipynb to perform sample inference on the T5 and BART TRT OSS HuggingFace models using Triton Server (a minimal client sketch follows this list).
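The client notebook in step 5 uses the Triton client libraries. As a hedged sketch of an HTTP inference call against the trt_t5_bs1_beam2 model; the INPUT_TEXT/OUTPUT_TEXT tensor names and the prompt are placeholder assumptions, so check the model's config.pbtxt or the notebook for the actual I/O names.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# String inputs are sent as BYTES tensors; tensor names here are assumed placeholders.
text = np.array([["translate English to German: Triton is serving T5."]], dtype=object)
inp = httpclient.InferInput("INPUT_TEXT", text.shape, "BYTES")
inp.set_data_from_numpy(text)
out = httpclient.InferRequestedOutput("OUTPUT_TEXT")

result = client.infer("trt_t5_bs1_beam2", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT_TEXT"))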


License: Apache License 2.0

