kmkolasinski / triton-saved-model


Triton Python Backend for Multi-Model/Multi-Signature Inference Demo

  • This repository contains code for running multiple models on Triton Inference Server with the Python backend
  • The Python backend code can handle multiple models and multiple signatures per model
  • This project was prepared for educational purposes, to show how Triton Inference Server with the Python backend can expose an API similar to TF Serving's (a minimal backend sketch follows this list)
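The repository's actual model.py is not reproduced here; the following is a minimal sketch of a Triton Python backend that loads a SavedModel and dispatches to a named signature. The directory layout, the signature_name input, and the tensor names are assumptions made for illustration, not names taken from this repo.

    import tensorflow as tf
    import triton_python_backend_utils as pb_utils


    class TritonPythonModel:
        def initialize(self, args):
            # Assumed layout: the SavedModel sits under <model_repository>/<version>/model.savedmodel
            model_dir = f'{args["model_repository"]}/{args["model_version"]}/model.savedmodel'
            self.signatures = tf.saved_model.load(model_dir).signatures

        def execute(self, requests):
            responses = []
            for request in requests:
                # A "signature_name" input selects which exported signature to run,
                # mimicking TF Serving's per-signature requests (assumption).
                sig = pb_utils.get_input_tensor_by_name(
                    request, "signature_name").as_numpy()[0].decode()
                images = pb_utils.get_input_tensor_by_name(request, "images").as_numpy()
                outputs = self.signatures[sig](images=tf.constant(images))
                tensors = [pb_utils.Tensor(name, value.numpy())
                           for name, value in outputs.items()]
                responses.append(pb_utils.InferenceResponse(output_tensors=tensors))
            return responses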
To build the Docker images, run the following command

docker compose build

Running notebooks

  • First, use export-classifier.ipynb to export the classifier variants used by the servers (a sketch of the export code follows this section)

Running servers
  • To start the Triton server, run the following command
docker compose up triton_server 
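Once the Triton server is up, it can be queried with the official tritonclient package. A minimal sketch; the model name "resnet50" and the output name "predictions" are placeholders, not names taken from this repo:

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")
    images = np.random.rand(100, 224, 224, 3).astype(np.float32)  # dummy batch
    inp = httpclient.InferInput("images", list(images.shape), "FP32")
    inp.set_data_from_numpy(images)
    result = client.infer("resnet50", inputs=[inp])  # model name is a placeholder
    print(result.as_numpy("predictions").shape)      # output name is a placeholder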
  • To start the TensorFlow Serving server, run the following command
docker compose up tf_serving_server
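For reference, a minimal sketch of what the export notebook presumably does for the XLA/AMP variant, assuming TF 2.x Keras (the standard variant is the same without the mixed-precision policy and without jit_compile; the actual notebook may differ):

    import tensorflow as tf

    # AMP: enable the mixed_float16 policy before building the model.
    tf.keras.mixed_precision.set_global_policy("mixed_float16")
    model = tf.keras.applications.ResNet50(weights="imagenet")

    # XLA: serve the forward pass through a jit-compiled tf.function,
    # exported as the "images" signature described in the benchmarks below.
    @tf.function(
        jit_compile=True,
        input_signature=[tf.TensorSpec([None, 224, 224, 3], tf.float32, name="images")],
    )
    def serve_images(images):
        return {"predictions": model(images, training=False)}

    tf.saved_model.save(model, "export/resnet50_xla_amp",
                        signatures={"images": serve_images})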

Benchmark results

  • Two architectures were tested; each was exported as a classifier Module in two variants and served by both servers:

    • ResNet50
      • standard SavedModel
      • SavedModel compiled with XLA and AMP
    • EfficientNetB0
      • standard SavedModel
      • SavedModel compiled with XLA and AMP
  • The benchmarks were performed on an NVIDIA RTX A4000 GPU with 8GB of memory

  • Each benchmark ran for 500 iterations, each predicting a batch of 100 images of size 224x224 (50,000 images in total); a sketch of the timing loop follows the first results table below

  • I benchmarked only the images signature, which accepts an images tensor of shape [batch, 224, 224, 3]

  • When running the models locally with the TF Python API, I got the following results:

Model          | Architecture        | Time [s]
ResNet50       | SavedModel          | 57
ResNet50       | SavedModel XLA/AMP  | 25
EfficientNetB0 | SavedModel          | 52
EfficientNetB0 | SavedModel XLA/AMP  | 13
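For reference, a minimal sketch of the local timing loop described above (hypothetical; the path and signature names follow the export sketch earlier, and the actual benchmark code may differ):

    import time
    import numpy as np
    import tensorflow as tf

    infer = tf.saved_model.load("export/resnet50").signatures["images"]  # placeholder path
    batch = tf.constant(np.random.rand(100, 224, 224, 3).astype(np.float32))

    infer(images=batch)  # warm-up; also triggers XLA compilation, if enabled
    start = time.perf_counter()
    for _ in range(500):  # 500 iterations x 100 images = 50,000 images
        infer(images=batch)
    print(f"{time.perf_counter() - start:.1f} s")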
  • Running the same benchmark with Triton Inference Server (4 client threads, 1 server instance), I got the following results:
Model          | Architecture        | Time [s]
ResNet50       | SavedModel          | 73
ResNet50       | SavedModel XLA/AMP  | 23
EfficientNetB0 | SavedModel          | 54
EfficientNetB0 | SavedModel XLA/AMP  | 17
  • Running the same benchmark with Triton Inference Server (4 client threads, 2 server instances), I got the following results:
Model          | Architecture        | Time [s]
ResNet50       | SavedModel          | 76
ResNet50       | SavedModel XLA/AMP  | 20
EfficientNetB0 | SavedModel          | 54
EfficientNetB0 | SavedModel XLA/AMP  | 13
  • For TensorFlow Serving I was not able to test the XLA/AMP models; I got the following error when trying to serve them:
UNIMPLEMENTED: Could not find compiler for platform CUDA: NOT_FOUND: could not find registered compiler for platform CUDA
  • The results for TF Serving were as follows (the XLA/AMP models could not be served):
Model          | Architecture        | Time [s]
ResNet50       | SavedModel          | 59
ResNet50       | SavedModel XLA/AMP  | CUDA: NOT_FOUND error
EfficientNetB0 | SavedModel          | 51
EfficientNetB0 | SavedModel XLA/AMP  | CUDA: NOT_FOUND error

Also, I noticed that GPU memory usage was higher with TF Serving than with Triton Inference Server; with num_workers=10 I was getting an OOM error when allocating a tensor with shape [100,56,56,256].
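I did not test this here, but TF Serving preallocates most of the GPU memory by default; tensorflow_model_server accepts a --per_process_gpu_memory_fraction flag to cap the allocation, which may help with such OOMs (the value below is illustrative):

tensorflow_model_server --per_process_gpu_memory_fraction=0.5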

About

License: MIT License


Languages

Jupyter Notebook 79.4%, Python 18.4%, Dockerfile 2.3%