Proxy server, written in Rust, for a Triton gRPC server that serves an embedding model.
- it refines the request and response formats of the Triton server.
- no `tritonclient` dependency.
- fast & easy to use.
`BAAI/bge-large-en-v1.5` is used as an example. The script converts the PyTorch model into an ONNX model and saves it to `./model_repository/embedding/1/v1.onnx`.
- Currently, `max_batch_size` is limited to 256 due to OOM. You can change this value to fit your environment.
```shell
python3 convert.py
```
- It'll run both the Triton inference server and the proxy server.
- You need to edit the absolute path of the volume (which points to `./model_repository`) in `docker-compose.yml`.

```shell
make run-docker-compose
```
- You can also build and run the triton proxy server with the commands below.

```shell
export RUSTFLAGS="-C target-cpu=native"
make server
```

- Or build the proxy server Docker image.

```shell
make build-docker
```

- And run the Triton inference server directly with Docker.

```shell
docker run --gpus all --rm --ipc=host --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/triton-grpc-proxy-rs/model_repository:/models nvcr.io/nvidia/tritonserver:23.09-py3 bash -c "LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libtcmalloc.so.4:${LD_PRELOAD} && pip install transformers tokenizers && tritonserver --model-repository=/models"
```
- receive request(s) from the user.
  - list of `text (String)` in this case.
- request the Triton gRPC server to get the embeddings.
- post-process (cast and reshape) the embeddings and return them to the users (a sketch of this step follows below).
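A minimal sketch of that cast-and-reshape step, assuming the output tensor comes back from Triton as little-endian `f32` bytes and a fixed embedding size (the function name is hypothetical, not the actual implementation):

```rust
// Sketch only: cast the raw little-endian f32 bytes returned by Triton into floats
// and reshape the flat buffer into one embedding vector per input query.
fn reshape_embeddings(raw: &[u8], embedding_size: usize) -> Vec<Vec<f32>> {
    // cast: interpret every 4 bytes as a little-endian f32
    let flat: Vec<f32> = raw
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();

    // reshape: split the flat buffer into [batch_size, embedding_size]
    flat.chunks(embedding_size).map(|row| row.to_vec()).collect()
}
```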
- parse configuration from the env variables (see the sketch after this list).
  - `SERVER_PORT`: proxy server port. default `8080`.
  - `TRITON_SERVER_URL`: triton inference gRPC server url. default `http://triton-server`.
  - `TRITON_SERVER_GRPC_PORT`: triton inference gRPC server port. default `8001`.
  - `MODEL_VERSION`: model version. default `1`.
  - `MODEL_NAME`: model name. default `model`.
  - `INPUT_NAME`: input name. default `text`.
  - `OUTPUT_NAME`: output name. default `embedding`.
  - `EMBEDDING_SIZE`: size of the embedding. default `2048`.
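A minimal sketch of reading these variables with `std::env` and falling back to the defaults above (the `Config` struct and field names are hypothetical, not necessarily the actual implementation):

```rust
use std::env;

// Hypothetical Config struct; names are for illustration only.
struct Config {
    server_port: u16,
    triton_server_url: String,
    triton_server_grpc_port: u16,
    model_version: String,
    model_name: String,
    input_name: String,
    output_name: String,
    embedding_size: usize,
}

// Read an env variable, or fall back to the documented default.
fn env_or(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

impl Config {
    fn from_env() -> Self {
        Self {
            server_port: env_or("SERVER_PORT", "8080").parse().expect("invalid SERVER_PORT"),
            triton_server_url: env_or("TRITON_SERVER_URL", "http://triton-server"),
            triton_server_grpc_port: env_or("TRITON_SERVER_GRPC_PORT", "8001")
                .parse()
                .expect("invalid TRITON_SERVER_GRPC_PORT"),
            model_version: env_or("MODEL_VERSION", "1"),
            model_name: env_or("MODEL_NAME", "model"),
            input_name: env_or("INPUT_NAME", "text"),
            output_name: env_or("OUTPUT_NAME", "embedding"),
            embedding_size: env_or("EMBEDDING_SIZE", "2048").parse().expect("invalid EMBEDDING_SIZE"),
        }
    }
}
```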
- GET `/health`

```shell
curl -i http://127.0.0.1:8080/health
```

```
HTTP/1.1 200 OK
content-length: 2
date: Sun, 08 Oct 2023 06:33:53 GMT
ok
```
- POST `/v1/embedding`
  - Request Body: `[{'query': 'input'}, ...]`

```shell
curl -H "Content-type:application/json" -X POST http://127.0.0.1:8080/v1/embedding -d "[{\"query\": \"asdf\"}, {\"query\": \"asdf asdf\"}, {\"query\": \"asdf asdf asdf\"}, {\"query\": \"asdf asdf asdf asdf\"}]"
```

  - Response Body: `[{'embedding': '1024 f32 vector'}, ...]`

```
[{"embedding": [-0.8067292,-0.004603,-0.24123234,0.59398544,-0.5583446,...]}, ...]
```
- Environment
  - CPU : i7-7700K (not overclocked)
  - GPU : GTX 1060 6 GB
  - Rust : v1.73.0 stable
  - Triton Server : `23.09-py3`
    - backend : onnxruntime-gpu
    - allocator : tcmalloc
  - payload : `[{'query': 'asdf' * 125}] * batch_size`
- stages
  - request : end-to-end latency (client-side)
  - model : Triton gRPC server latency only (preprocess + tokenize + model)
  - processing : request latency minus model latency
    - json de/serialization
    - serialization (byte string, float vector)
    - cast & reshape of the 2D vectors
| batch size | request  | model    | processing |
|------------|----------|----------|------------|
| 8          | 27.2 ms  | 25.4 ms  | 1.8 ms     |
| 16         | 36.0 ms  | 33.7 ms  | 2.3 ms     |
| 32         | 50.6 ms  | 47.3 ms  | 3.3 ms     |
| 64         | 90.9 ms  | 85.5 ms  | 5.4 ms     |
| 128        | 139.2 ms | 129.9 ms | 9.3 ms     |
| 256        | 307.4 ms | 287.1 ms | 20.3 ms    |
- add `Dockerfile` and `docker-compose` to easily deploy the servers
  - triton inference server
- add model converter script.
- configurations
  - move hard-coded configs to `env`
- optimize the proxy server performance
- README
- move `tokenizer` part from the triton server into the `proxy-server` (see the sketch below)
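For the last item, a minimal sketch of what tokenization inside the proxy could look like with the Hugging Face `tokenizers` crate (the tokenizer file name and flags are assumptions, not the planned implementation):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // Assumption: a tokenizer.json exported from BAAI/bge-large-en-v1.5 ships with the proxy.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Encode one query; `true` adds the model's special tokens (e.g. [CLS]/[SEP]).
    let encoding = tokenizer.encode("asdf asdf", true)?;
    println!("input ids: {:?}", encoding.get_ids());
    Ok(())
}
```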