kubeflow / kubeflow

Machine Learning Toolkit for Kubernetes

Home Page: https://www.kubeflow.org/

Proposal: Kubeflow Serverless Serving CRD

ellistarn opened this issue

Note: See the latest at https://github.com/kubeflow/kfserving

Hi Kubeflow Community,

I've been toying around with model serving on Knative. I initially prototyped a ksonnet component for serving models using arbitrary model servers and found it to be quite cumbersome, both from a development perspective and from a consumption perspective. The number of YAML files and if-statements is non-trivial, and I thought there must be a better approach.

The high-level idea is that we should be able to distill all of the Kubernetes details down into a few ML-specific parameters and package that concept as a Kubernetes CRD. This CRD can serve as a building block for integration into things like ML Pipelines and model microservice architectures, built on the backbone of Istio and Knative.
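To make that concrete, here is a purely hypothetical sketch of what such a resource might look like. The API group, kind, and every field name below are illustrative assumptions, not the spec from the design doc:

```yaml
# Hypothetical sketch only: apiVersion, kind, and field names are assumed
# for illustration; the real spec is in the linked design doc.
apiVersion: serving.kubeflow.org/v1alpha1
kind: ModelService
metadata:
  name: flowers-classifier
spec:
  default:
    tensorflow:                          # framework-specific shorthand
      modelUri: gs://my-bucket/flowers/model
      resources:
        limits:
          cpu: "1"
          memory: 2Gi
  canaryTrafficPercent: 10               # a single knob for canary routing
```

The point being that everything else (Deployments, Services, autoscaling, routing) is derived by a controller rather than written by the user.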

With that, I'll delegate the rest of the details to the doc:
https://docs.google.com/document/d/1_s8CYdhlrQRu4BX2m7adQhVt_OTr4WSZXgUY0Z77GzY

A prototype is available here:
https://github.com/ellis-bigelow/serving

@cliveseldon would be great to get your input on this.

@ellis-bigelow https://github.com/nuclio/nuclio
How does it compare to Knative?

@ellis-bigelow some initial thoughts from someone working on Seldon

  • I like the CRD idea. This is what we have done in Seldon Core, where the user defines a SeldonDeployment CRD that includes a definition of an inference graph (a DAG of container instances wrapped to run in Seldon Core) and a list of PodTemplateSpecs for those containers; a minimal sketch follows this list. So far we have seen that having a full PodTemplateSpec gives users the customization they need.
  • I suppose a SeldonDeployment is lower level than what you are proposing, where you define a high-level CRD made up of TensorFlow or SKLearn specs. On our side, we allow users to "wrap" their code from any toolkit/language into a component image that can run in Seldon Core, and they define how those building blocks fit together. In this way, handling new toolkits/languages happens below the CRD level.
  • In Seldon Core we like the idea of an inference graph: not just deploying a model, but other components alongside it, such as outlier detectors, concept-drift monitors, feature-transformation units, etc.
  • We use Ambassador to expose REST and gRPC for deployed SeldonDeployments.
  • Canary routing is via Istio, which I suppose is what Knative uses underneath.
  • We haven't focused on serverless so far, but scale-to-zero when there is no traffic is the one part that obviously makes the most sense and is on our roadmap, as opposed to, say, a function abstraction.
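For context, a minimal SeldonDeployment along the lines described above might look like the following (the container images are hypothetical; the structure reflects the v1alpha2 CRD):

```yaml
apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: example
spec:
  name: example
  predictors:
    - name: default
      replicas: 1
      componentSpecs:
        - spec:                    # a full v1.PodTemplateSpec per component group
            containers:
              - name: transformer
                image: my-org/feature-transformer:0.1   # hypothetical image
              - name: classifier
                image: my-org/sklearn-classifier:0.1    # hypothetical image
      graph:                       # the inference graph (a DAG), not just a model
        name: transformer
        type: TRANSFORMER
        children:
          - name: classifier
            type: MODEL
            children: []
```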

@cliveseldon thanks for the input!

I wonder if, at a high level, it would make sense to be able to drop in a "Kubeflow Service" for a "Seldon Model". We both want to support gRPC/REST, canarying, Istio routing, etc. It seems to me that your controller is responsible for instantiating the actual server infrastructure, so there would need to be some sort of inversion of control. Your controller would need some awareness of a Kubeflow service, but I imagine it wouldn't be too difficult for the graph to recognize it via annotations and route to it via Istio VirtualServices instead of creating the resources itself (a sketch follows). Knative scales to zero by default, so it wouldn't bring up any resources until traffic starts to flow (i.e., being routed by Seldon).
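To illustrate, here is a hypothetical Istio VirtualService that a Seldon controller could emit to split traffic between an existing Seldon graph and a Knative-backed Kubeflow service, rather than creating the serving resources itself (the host and service names are made up):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: flowers-routing              # hypothetical name
spec:
  hosts:
    - flowers.example.com            # hypothetical external host
  http:
    - route:
        - destination:
            host: flowers-seldon.default.svc.cluster.local   # existing graph
          weight: 90
        - destination:
            host: flowers-knative.default.svc.cluster.local  # scale-to-zero service
          weight: 10
```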

Does this seem like a workable pattern? Would this approach preclude any of Seldon's features, like drift detection, outlier detection, or transformations?

Responses to your points below.

#1) The Knative team explicitly removed some of the complexity from the PodTemplateSpec. I haven't seen data to give me particularly strong feelings either way, so I've been deferring to their judgement. What pieces of the PodTemplateSpec are your customers using that aren't supported by the Knative API? Off the top of my head, volumes might be an issue. I'm all for enabling these by bypassing Knative with an admission controller; a sketch of that idea follows.
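As a rough sketch of the bypass idea, one could register a mutating admission webhook that re-injects fields (e.g., volumes) into Pods after Knative has stamped them out; the webhook service and names here are hypothetical:

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: serving-pod-mutator            # hypothetical name
webhooks:
  - name: pods.serving.example.com     # hypothetical webhook identifier
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:                         # hypothetical in-cluster webhook server
        name: pod-mutator
        namespace: serving-system
        path: /mutate
```

The webhook server itself would then patch the PodSpec (adding volumes, mounts, etc.) on Pod creation.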

#2) Seldon must have some interface that your deployments conform to. Can you outline that for me?

#3) We absolutely want to encapsulate some of these features into the CRD if possible. As with yours, it should be framework agnostic.

#4) The interface is at the Istio mesh. I'm architecturally agnostic to Ambassador vs Istio vs other ingress (though I've had performance issues with Ambassador in the recent past).

#5) Yep

#6) I think this is one of the most critical pieces. It may make even more sense for Seldon given that nodes in your graph will likely have very different scaling characteristics.

@ellis-bigelow Thanks for your detailed comments.

To start with the more particular questions:

  • Our microservice API allows a variety of components: Models, Combiners (ensembling), Routers (multi-armed bandits, or custom business routers), and Transformers (general input/output transformers; they might change the request and/or metadata). This microservice API is defined in Protocol Buffers/gRPC and OpenAPI, and it defines the interface components must implement. You then define your inference graph as a custom resource, also defined in protocol buffers, which describes how a particular instantiation of your components fits together to serve inference.

  • For the PodTemplateSpec, we are seeing mainly, as you say, Volumes, VolumeMounts, and Resource Requests/Limits as the main additions. Beyond that, PodTemplateSpecs allow users to define which containers will reside together, which can be important for inference latency and pod-to-node placement. I see the Knative specs allow v1.Container specs and then, I assume, check these for additions they specifically disallow. At present we haven't limited what you can use in the PodTemplateSpec, but we may well do so to disallow some things that don't make sense.

  • For autoscaling, we would want to allow users to add HorizontalPodAutoscalerSpecs to our CRD to specify how their pods can be scaled. We are still working on this extension to the CRD; a hedged sketch of what we have in mind follows this list.
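For illustration, this is the kind of HorizontalPodAutoscaler spec we have in mind, shown here as a standalone autoscaling/v2beta1 resource targeting a hypothetical Deployment (how exactly it embeds in the CRD is still open):

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: classifier-hpa               # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: classifier                 # hypothetical Deployment behind one graph node
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 70 # scale on average CPU utilization
```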

It would be great to get further feedback once you have looked at our architecture some more.

In general, my present thinking is along your lines: the Seldon CRD operator would create underlying Knative Service CRDs when running in that environment. Feel free to connect with me directly on Kubeflow/Seldon Slack for further quick chats.

@cliveseldon that overview and links really helped my understanding. Let's connect over slack late this week/early next and then follow up in here.

I'd love to learn about:

  • Common use cases for transformers
  • PodTemplateSpec vs. MLServingSpec differences
  • Potential integration points between Kubeflow Service and Seldon Deployment

Any further thoughts on this proposal? Keep it open or close it?

What is Kubeflow's process for opening/closing proposals? I'm still coordinating with MSFT and Bloomberg engineers on this proposal.

This proposal has now spawned the project: https://github.com/kubeflow/kfserving. Follow from there.