Shavvimal / RayLLM


Deploying a Distributed Ray Python Server with Kubernetes, EKS & KubeRay to serve our own LLM

For Ray Serve, the docs recommend deploying to production on Kubernetes, using the RayService controller provided as part of KubeRay.

This integration offers the scalability and user experience of Ray Serve with the operational benefits of Kubernetes, including the ability to integrate with existing Kubernetes-based applications. RayService simplifies production deployment by managing health checks, status reporting, failure recovery, and updates for you.

This repo is intended to go alongside my YouTube video on the same topic:

Deploying a Distributed Ray Python Server with Kubernetes, EKS & KubeRay to serve our own LLM

Getting started

Install Ray:

pip install -U "ray[default, serve]"

As well as the other dependencies:

pip install -r requirements.txt

This repo contains minimal code to test a Ray Serve deployment integrated with FastAPI.
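
As a rough illustration, a Serve deployment wrapping a FastAPI app looks something like the sketch below. The module contents, route names, and handler logic are assumptions for illustration, not the repo's actual code in app/main.py.

from fastapi import FastAPI
from ray import serve

fastapi_app = FastAPI()

@serve.deployment
@serve.ingress(fastapi_app)
class APIIngress:
    @fastapi_app.get("/healthz")
    def health(self) -> dict:
        return {"status": "ok"}

    @fastapi_app.post("/generate")
    async def generate(self, prompt: str) -> dict:
        # A real deployment would call the LLM here; this sketch just echoes the prompt.
        return {"completion": f"echo: {prompt}"}

# serve run app.main:app imports this top-level bound application.
app = APIIngress.bind()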

  1. Test a local Ray Serve deployment (a quick request sketch for checking the endpoint follows this list)
    • serve run app.main:app
  2. Deploy a Serve config locally
    • serve build app.main:app -o config/serve-deployment.yaml
    • Start a local Ray cluster: ray start --head
    • Start the application: serve deploy config/serve-deployment.yaml
    • Stop the application: serve shutdown
    • Stop the local Ray cluster: ray stop
  3. Push to Docker Hub:
    • docker build . -t shavvimal/ray_llm:latest
    • docker image push shavvimal/ray_llm:latest
  4. Deploy on Kubernetes Locally
  5. Deploy on Kubernetes on AWS
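
To smoke-test the local deployment from steps 1 and 2, you can send a request to the HTTP endpoint that serve run exposes on 127.0.0.1:8000 by default. The /generate route and its query parameter are carried over from the hypothetical sketch above, not the repo's actual API.

import requests

# Hit the locally running Serve application started with serve run app.main:app.
resp = requests.post(
    "http://127.0.0.1:8000/generate",
    params={"prompt": "Hello, Ray!"},
)
resp.raise_for_status()
print(resp.json())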

Notes

See my wiki for more details on the deployment process, including Log Persistence, Autoscaling, and more.
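
As one example, autoscaling is configured per deployment directly in the Python code; the replica bounds below are illustrative values rather than the settings used in this repo.

from ray import serve

@serve.deployment(
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
    ray_actor_options={"num_cpus": 1},
)
class ScalableLLM:
    async def __call__(self, request) -> str:
        return "ok"

app = ScalableLLM.bind()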

Languages

Python 86.3%, Dockerfile 13.7%