A Rust crate for managing the inference process for machine learning (ML) models. Currently, we support interacting with a Triton Inference Server, loading models from a MinIO Model Store.
- rust: Minimum Supported Rust Version 1.58
- lld linker: for faster Rust builds.
- See .cargo/config.toml for platform-specific installation instructions and the Rust Performance Book for the reasons for using lld.
- docker: Container engine
- NVIDIA container toolkit for Docker and a GPU supported by the container toolkit.
- It may be possible to avoid the NVIDIA GPU requirement by changing the resource reservations for the
triton
service indocker compose
to not require a GPU...YMMV based on whether the model you're trying to serve inference requests from was already compiled/optimized for GPU-only inference.
- It may be possible to avoid the NVIDIA GPU requirement by changing the resource reservations for the
- docker compose: Multi-container orchestration. NOTE:
docker-compose
is now deprecated and the compose functionality is integrated into thedocker compose
command. To install alongside an existing docker installation, runsudo apt-get install docker-compose-plugin
. ref. - protoc: Google Protocol Buffer compiler. Needed to build protobufs to bind to the Triton gRPC server.
For Debian-based Linux distros, you can install inference
's dependencies (except Docker & NVIDIA container toolkit, that require special repository configuration documented above) with the following command:
apt-get install clang build-essential lld clang protobuf-compiler libprotobuf-dev zstd libzstd-dev make cmake pkg-config libssl-dev
inference
is tested on Ubuntu 22.04 LTS, but welcomes pull requests to fix Windows or MacOS issues.
- Clone repo:
git clone https://github.com/opensensordotdev/inference.git
- Ensure all requirements have been installed, especially the lld linker and
protoc
! Otherwiseinference
won't build!
make
: Download the latest versions of the Triton Inference Server Protocol Buffer files & Triton sample ML modelsdocker compose up
: Start the MinIO and Triton containers + monitoring infrastructure
- If you have a GPU available, uncomment the below section of the
inference.triton
service indocker-compose.yaml
. In order for your GPU to work with Triton, the CUDA versions on your host OS and the CUDA version expected by Triton have to be compatible.deploy: resources: reservations: devices: - driver: nvidia capabilities: [gpu]
- If you don't have a GPU, comment out that section
- Upload the contents of the
sample_models
directory to themodels
bucket vis the MinIO web UI atlocalhost:9001
cargo test
: Verify all cargo tests pass
http://localhost:8000/v2/models/simple
Will print model name and parameters required to set up the inputs and outputs.
proto folder will contain protocol buffers. Only grpc_service.proto
is referenced in the build.rs
because model_config.proto
is included by grpc_service
. Generated code from tonic is in inference.rs
- Json is served on port 8000
- gRPC calls are submitted on on 8001
- Prometheus metrics are on 8002
Submitting requests to a gRPC service requires a mutable reference to a Client. This prohibits you from passing a single Client around to multiple Tasks and creates a bottleneck for async code.
Trying to hide this from users by wrapping what amounts to a synchronous resource in a struct and using async message passing to access it might help some but still doesn't fix the core problem.
While it would be possible to make a connection pool of multiple Client<Channel>
s and hide this pool in
a struct accessed with async message passing, this is complicated.
It also doesn't work to store a tonic.transport.Channel
in the TritonClient struct...it requires the struct to implement
some obscure internal tonic traits. tonic.transport.Channel.
The idiomatic way appears to be storing a single master Client in a struct and then providing a function that returns a clone of the Client since Cloning clients is cheap.
A limitation of this could be that gRPC servers usually have a finite number of connections they can multiplex (100 seems to be the number a lot of places throw out). See gRPC performance best practices.
tonic
seems to have a default buffer size of 1024. Source: DEFAULT_BUFFER_SIZE
channel/mod.rs
This might be useful eventually if you have multiple Triton pods and want to discover which ones are live + update the endpoint list grpc load balancing, github.
Not clear if there's a connection pool under the hood there or how they're able to connect to multiple servers?