dudeperf3ct / 16-rayserve-deploy


Ray Serve

Ray Serve is framework agnostic, giving you end-to-end control over the API while delivering scalability and high performance. You can develop with Ray Serve on your laptop, deploy it on a dev box, and scale it out to multiple machines or a K8s cluster without changing a single line of code.

Ray Serve Architecture

Serve runs on Ray and utilizes Ray actors.

There are three kinds of actors that are created to make up a Serve instance:

  • Controller: A global actor unique to each Serve instance that manages the control plane. The Controller is responsible for creating, updating, and destroying other actors. Serve API calls like creating or getting a deployment make remote calls to the Controller.

  • Router: There is one router per node. Each router is a Uvicorn HTTP server that accepts incoming requests, forwards them to replicas, and responds once they are completed.

  • Worker Replica: Worker replicas actually execute the code in response to a request. For example, they may contain an instantiation of an ML model. Each replica processes individual requests from the routers (these may be batched by the replica using @serve.batch; see the batching docs and the sketch below).
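
As a quick illustration of that batching hook, here is a minimal sketch of a replica that groups concurrent requests with @serve.batch; the keyword-based "model" inside is purely a placeholder, not the sentiment model used later.

from typing import List

from ray import serve

@serve.deployment
class BatchedClassifier:
    # @serve.batch collects up to max_batch_size concurrent calls into one list
    # so the replica can score them together instead of one at a time
    @serve.batch(max_batch_size=8)
    async def handle_batch(self, texts: List[str]) -> List[str]:
        # placeholder batched "model"; a real replica would call model.predict(texts)
        return ["POSITIVE" if "like" in t else "NEGATIVE" for t in texts]

    async def __call__(self, request):
        # each caller passes a single item and awaits its own result;
        # Serve splits the batched output back up per request
        return await self.handle_batch(request.query_params["txt"])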

In this exercise, we will deploy a sentiment analysis Hugging Face transformer model using Ray Serve so it can be scaled up and queried over HTTP, using two approaches.

Ray Serve Deployment

Approach 1

Using Ray Serve

import ray
from ray import serve

from sentiment.model import SentimentBertModel

# connect to Ray cluster
ray.init(address="auto", namespace="serve")
# start ray serve runtime
serve.start(detached=True)

def sentiment_classifier(text: str):
    classifier = SentimentBertModel()
    return classifier.predict(text)

# add the decorator @serve.deployment to the router function to turn the function into a Serve Deployment object.
@serve.deployment
# input : Starlette request object
def router(request):
    txt = request.query_params["txt"]
    return sentiment_classifier(txt)

# deploy the router deployment object to the ray serve runtime
router.deploy()
  • ray.init(address="auto"): connects to the running Ray cluster on your machine (started with ray start --head), which allows Serve to use all your CPU cores to serve requests in parallel
  • serve.start(detached=True): starts the Ray Serve runtime; detached=True keeps it running after the script exits
  • @serve.deployment: turns the router function into a Serve Deployment object. By default the deployment is exposed under the function's name (router here); if you want it exposed under a different name in the HTTP request, pass the name keyword parameter to @serve.deployment, as sketched below.
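
For example, a minimal sketch of exposing the same function under a different route name (the name "classify" here is just illustrative):

# serve the same function at /classify instead of /router
@serve.deployment(name="classify")
def router(request):
    txt = request.query_params["txt"]
    return sentiment_classifier(txt)

router.deploy()
# now reachable at http://127.0.0.1:8000/classify?txt=...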

Approach 2

Ray Serve integrates with FastAPI. Using FastAPI, we can add more complex HTTP handling logic along with features such as variable routes, automatic type validation, dependency injection, etc.

import ray
from ray import serve

from fastapi import FastAPI
from fastapi import Query
from starlette.responses import JSONResponse

from sentiment.model import SentimentBertModel

app = FastAPI()
ray.init(address="auto", namespace="classifier")
serve.start(detached=True)

# expose the deployment at the root route and route HTTP requests through the FastAPI app
@serve.deployment(route_prefix="/")
@serve.ingress(app)
class Classifier:
    def __init__(self) -> None:
        self.classifier = SentimentBertModel(
            "distilbert-base-uncased-finetuned-sst-2-english"
        )

    @app.get("/test")
    def root(self):
        return "Sentiment Classifier (0 -> Negative and 1 -> Positive)"

    @app.get("/healthcheck", status_code=200)
    def healthcheck(self):
        return "dummy check! Classifier is all ready to go!"

    @app.post("/classify")
    async def predict_sentiment(self, input_text: str = Query(..., min_length=2)):
        out_dict = self.classifier.predict(input_text)
        return JSONResponse(out_dict)

Classifier.deploy()
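
Both approaches import SentimentBertModel from sentiment/model.py, which is not reproduced here. A minimal sketch of what such a wrapper might look like, assuming it is backed by the Hugging Face transformers sentiment-analysis pipeline (the real implementation may differ):

# sentiment/model.py (sketch)
from transformers import pipeline

class SentimentBertModel:
    def __init__(self, model_name: str = "distilbert-base-uncased-finetuned-sst-2-english"):
        # downloads the model weights on first use, which is why the first request is slow
        self.pipeline = pipeline("sentiment-analysis", model=model_name)

    def predict(self, text: str) -> dict:
        # the pipeline returns a list with one dict, e.g. {"label": "POSITIVE", "score": 0.99}
        return self.pipeline(text)[0]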

Run

Locally

Test the sentiment classifier model

docker build -t sentiment -f sentiment/Dockerfile.sentiment sentiment/
docker run --rm -it sentiment

Test Ray Serve Deployment locally

Note: set --shm-size to a limit appropriate for your system, for example 512M or 2G. A good estimate is roughly 30% of your available memory (this is what Ray uses internally for its object store).

docker build -t rayserve .
docker run --shm-size=2G -p 8000:8000 -p 8265:8265 -it --rm rayserve

Inside the Docker container,

ray start --head --dashboard-host=0.0.0.0
python3 serve_with_rayserve.py

Open 127.0.0.1:8265 to access the Ray dashboard.

We can now test our model over HTTP. The structure of our HTTP query is:

http://127.0.0.1:8000/[Deployment Name]?[Parameter Name-1]=[Parameter Value-1]&[Parameter Name-2]=[Parameter Value-2]&...&[Parameter Name-n]=[Parameter Value-n]

The 127.0.0.1:8000 refers to localhost on port 8000. The [Deployment Name] is router in our case; it refers to either the name of the function that we called .deploy() on or the value of the name keyword parameter in @serve.deployment.

Each [Parameter Name] refers to a field's name in the request's query_params dictionary for our deployed function. In our example, the only parameter we need to pass in is txt. This parameter is referenced in the txt = request.query_params["txt"] line in the router function. Each [Parameter Name] has a corresponding [Parameter Value]. The [Parameter Value] for txt is a string containing the text to classify. We can chain together any number of name-value pairs in the request URL using the & symbol.
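
Concretely, a minimal sketch of such a request from Python using the requests library (the example text is arbitrary):

import requests

# Serve parses ?txt=... into request.query_params["txt"] inside the router deployment
response = requests.get(
    "http://127.0.0.1:8000/router", params={"txt": "Ray Serve makes deployment easy"}
)
print(response.text)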

# test inference request
python3 test_local_http_endpoint.py

The first request takes time (~100 secs) because the model needs to be downloaded.

Alternatively, query the endpoint using httpie:

python -m pip install httpie
http GET 127.0.0.1:8000/router txt==ilikeyou

To stop the Ray cluster:

ray stop

Test using FastAPI and Ray Serve

Inside the same Docker container,

ray start --head --dashboard-host=0.0.0.0
python3 serve_with_fastapi.py
http GET 127.0.0.1:8000/test
http GET 127.0.0.1:8000/healthcheck
http POST 127.0.0.1:8000/classify?input_text=ilikeyou
http POST 127.0.0.1:8000/classify?input_text=i
ray stop

Further Readings:
