Multicluster-Scheduler

Multicluster-scheduler is a system of Kubernetes controllers that intelligently schedules workloads across clusters. It is simple to use and simple to integrate with other tools.

  1. Install the scheduler in any cluster and the agent in each cluster that you want to federate.
  2. Annotate any pod or pod template (e.g., of a Deployment or Argo Workflow) in any member cluster with multicluster.admiralty.io/elect="" (see the minimal example after this list).
  3. Multicluster-scheduler replaces the elected pods with proxy pods and deploys delegate pods to other clusters.
  4. New in v0.3: Services that target proxy pods are rerouted to their delegates, replicated across clusters, and annotated with io.cilium/global-service=true to be load-balanced across a Cilium cluster mesh, if installed.
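
For reference, here is a minimal sketch of the election annotation from step 2, applied to a bare pod (the pod name, image, and command are illustrative; the getting started guide below uses a Deployment instead):

apiVersion: v1
kind: Pod
metadata:
  name: hello                            # illustrative name
  annotations:
    multicluster.admiralty.io/elect: ""  # elect this pod for multicluster scheduling
spec:
  containers:
  - name: hello
    image: busybox                       # illustrative image
    command: ["sh", "-c", "echo hello && sleep 3600"]

The pod must be created in a namespace labeled multicluster-scheduler=enabled (see the Example section below) for the admission controller to intercept it.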

Check out Admiralty's blog post demonstrating how to run an Argo workflow across clusters to better utilize resources and combine data from different regions or clouds.

Getting Started

We assume that you are a cluster admin for two clusters, associated with, e.g., the contexts "cluster1" and "cluster2" in your kubeconfig. We're going to install a basic scheduler in cluster1 and agents in cluster1 and cluster2. Then, we will deploy a multi-cluster NGINX.

CLUSTER1=cluster1 # change me
CLUSTER2=cluster2 # change me

Installation

Optional: Cilium cluster mesh

For cross-cluster service calls, multicluster-scheduler relies on a Cilium cluster mesh and global services. If you need this feature, install Cilium and set up a cluster mesh. If you install Cilium later, you may have to restart pods.
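
A global service is simply a Service that carries the io.cilium/global-service=true annotation. Multicluster-scheduler adds this annotation for you on services that target proxy pods (see below), and it replicates any global service across clusters, so you can also mark a service global by hand, for example (the service name is hypothetical):

kubectl --context "$CLUSTER1" annotate service my-service io.cilium/global-service=true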

Helm

Starting from v0.4, the recommended way to install multicluster-scheduler is with Helm (v3):

helm repo add admiralty https://charts.admiralty.io

helm install multicluster-scheduler admiralty/multicluster-scheduler \
  --kube-context $CLUSTER1 \
  --set global.clusters[0].name=c1 \
  --set global.clusters[1].name=c2 \
  --set scheduler.enabled=true \
  --set clusters.enabled=true \
  --set agent.enabled=true \
  --set agent.clusterName=c1 \
  --set webhook.enabled=true

helm install multicluster-scheduler admiralty/multicluster-scheduler \
  --kube-context $CLUSTER2 \
  --set agent.enabled=true \
  --set agent.clusterName=c2 \
  --set webhook.enabled=true

Note: the Helm chart is flexible enough to configure multiple federations and/or refine RBAC so clusters can't see each other's observations. See the chart's documentation.

Service Account Exchange

For agents to talk to the scheduler across cluster boundaries (via custom resource definitions, cf. How it Works), we need to export service accounts in the scheduler's cluster as kubeconfig files and save those files inside secrets in the agents' clusters.

Luckily, the kubemcsa export command of multicluster-service-account can prepare the secrets for us. First, install kubemcsa (you don't need to deploy multicluster-service-account):

MCSA_RELEASE_URL=https://github.com/admiraltyio/multicluster-service-account/releases/download/v0.6.1
OS=linux # or darwin (i.e., OS X) or windows
ARCH=amd64 # if you're on a different platform, you must know how to build from source
curl -Lo kubemcsa "$MCSA_RELEASE_URL/kubemcsa-$OS-$ARCH"
chmod +x kubemcsa

Then, run kubemcsa export to generate templates for secrets containing kubeconfigs equivalent to the c1 and c2 service accounts created by Helm in cluster1, and apply the templates with kubectl in cluster1 and cluster2, respectively:

./kubemcsa export --context $CLUSTER1 c1 --as remote | kubectl --context $CLUSTER1 apply -f -
./kubemcsa export --context $CLUSTER1 c2 --as remote | kubectl --context $CLUSTER2 apply -f -

Note: you may wonder why the agent in cluster1 needs a kubeconfig as it runs in the same cluster as the scheduler. We simply like symmetry and didn't want to make the agent's configuration special in that case.

Verification

After a minute, check that node pool objects have been created in the agents' clusters and that observations appear in the scheduler's cluster:

kubectl --context $CLUSTER1 get nodepools # or np
kubectl --context $CLUSTER2 get nodepools # or np

kubectl config use-context $CLUSTER1
kubectl get nodepoolobservations # or npobs
kubectl get nodeobservations # or nodeobs
kubectl get podobservations # or podobs
kubectl get serviceobservations # or svcobs
# or, by category
kubectl get observations --show-kind # or obs

Example

Multicluster-scheduler's pod admission controller operates in namespaces labeled with multicluster-scheduler=enabled. In any of the member clusters, e.g., cluster2, label the default namespace:

kubectl --context "$CLUSTER2" label namespace default multicluster-scheduler=enabled

Then, deploy NGINX in it with the election annotation on the pod template:

cat <<EOF | kubectl --context "$CLUSTER2" apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 10
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        multicluster.admiralty.io/elect: ""
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          requests:
            cpu: 100m
            memory: 32Mi
        ports:
        - containerPort: 80
EOF

Things to check:

  1. The original pods have been transformed into proxy pods. Notice the replacement containers, and the original manifest saved as an annotation (see the inspection command below).
  2. Proxy pod observations have been created in the scheduler's cluster.
  3. Delegate pod decisions have been created in the scheduler's cluster as well. Each decision was made based on all of the observations available at the time.
  4. Delegate pods have been created in either cluster. Notice that their spec matches the original manifest.

kubectl --context "$CLUSTER2" get pods # (-o yaml for details)
kubectl --context "$CLUSTER1" get podobs # (-o yaml)
kubectl --context "$CLUSTER1" get poddecisions # or poddec (-o yaml)

kubectl --context "$CLUSTER1" get pods # (-o yaml)
kubectl --context "$CLUSTER2" get pods # (-o yaml)

Enforcing Placement

In some cases, you may want to specify the target cluster, rather than let the scheduler decide. For example, you may want an Argo workflow to execute certain steps in certain clusters, e.g., to be closer to external dependencies; Admiralty's blog post demonstrates this with multicloud Argo workflows. You can enforce placement using the multicluster.admiralty.io/clustername annotation. To complete this getting started guide, let's annotate our NGINX deployment's pod template to reschedule all pods to cluster1.

kubectl --context "$CLUSTER2" patch deployment nginx -p '{
  "spec":{
    "template":{
      "metadata":{
        "annotations":{
          "multicluster.admiralty.io/clustername":"c1"
        }
      }
    }
  }
}'

After a little while, delegate pods in cluster2 will be terminated and more will be created in cluster1.
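
You can watch the rescheduling as it happens, for example (press Ctrl+C to stop):

kubectl --context "$CLUSTER1" get pods --watch # delegate pods appear in cluster1
kubectl --context "$CLUSTER2" get pods --watch # delegate pods terminate in cluster2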

Optional: Service Reroute and Globalization

Our NGINX deployment isn't much use without a service to expose it. Kubernetes services route traffic to pods based on label selectors. We could directly create a service to match the labels of the delegate pods, but that would make it tightly coupled with multicluster-scheduler. Instead, let's create a service as usual, targeting the proxy pods. If a proxy pod were to receive traffic, it wouldn't know how to handle it, so multicluster-scheduler will change the service's label selector for us, to match the delegate pods instead, whose labels are similar to those of the proxy pods, except that their keys are prefixed with multicluster.admiralty.io/.

If some or all of the delegate pods are in a different cluster, we also need the service to route traffic to them. For that, we rely on a Cilium cluster mesh and global services. Multicluster-scheduler will annotate the service with io.cilium/global-service=true and replicate it across clusters. (Multicluster-scheduler replicates any global service across clusters, not just services targeting proxy pods.)

kubectl --context "$CLUSTER2" expose deployment nginx

We just created a service in cluster2, alongside our deployment. However, in the previous step, we rescheduled all NGINX pods to cluster1. Check that the service was rerouted, globalized, and replicated to cluster1:

kubectl --context "$CLUSTER2" get service nginx -o yaml
# Check the annotations and the selector,
# then check that a copy exists in cluster1:
kubectl --context "$CLUSTER1" get service nginx -o yaml

Now call the delegate pods in cluster1 from cluster2:

kubectl --context "$CLUSTER2" run foo -it --rm --image alpine --command -- sh -c "apk add curl && curl nginx"

How it Works

Multicluster-scheduler is a system of Kubernetes controllers run by two components: the scheduler, deployed in any cluster, and its agents, deployed in the member clusters. The scheduler manages two controllers: the eponymous scheduler controller and the global service controller. Each agent manages six controllers: pod admission, service reroute, observations, decisions, feedback, and node pool.

  1. The pod admission controller, a dynamic, mutating admission webhook, intercepts pod creation requests. If a pod is annotated with multicluster.admiralty.io/elect="", its original manifest is saved as an annotation, and its containers are replaced by proxies that simply await success or failure signals from the feedback controller (see below).
  2. The service reroute controller modifies services whose endpoints target proxy pods. The keys of their label selectors are prefixed with multicluster.admiralty.io/, to match the corresponding delegate pods (see the sketch after this list). Also, the services are annotated with io.cilium/global-service=true, to be load-balanced across a Cilium cluster mesh.
  3. The observations controller, a multi-cluster controller, watches pods (including proxy pods), services (including global services), nodes, and node pools (created by the node pool controller, see below) in the agent's cluster and reconciles corresponding observations in the scheduler's cluster. Observations are images of the source objects' states.
  4. The scheduler watches proxy pod observations and reconciles delegate pod decisions, all in its own cluster. It decides target clusters based on the other observations. The scheduler doesn't push anything to the member clusters.
  5. The global service controller watches global service observations (observations of services annotated with io.cilium/global-service=true, either by the service reroute controller or by another tool or user) and reconciles global service decisions (copies of the originals), in all clusters of the federation.
  6. The decisions controller, another multi-cluster controller, watches pod and service decisions in the scheduler's cluster and reconciles corresponding delegates in the agent's cluster.
  7. The feedback controller watches delegate pod observations and reconciles the corresponding proxy pods. If a delegate pod is annotated (e.g., with Argo outputs), the annotations are replicated upstream, into the corresponding proxy pod. When a delegate pod succeeds or fails, the controller kills the corresponding proxy pod's container with the proper signal. The feedback controller maintains the contract between proxy pods and their controllers, e.g., replica sets or Argo workflows.
  8. The node pool controller automatically creates node pool objects in the agent's cluster. In GKE and AKS, it uses the cloud.google.com/gke-nodepool or agentpool label, respectively; in the absence of those labels, a default node pool object is created. Min/max node counts and pricing information can be updated by the user, or controlled by other tools. Custom node pool objects can also be created using label selectors. Node pool information can be used for scheduling.
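
To make item 2 concrete, here is a sketch (with made-up label values) of how the service reroute controller rewrites a label selector; only the key prefix and the annotation come from the description above:

# before reroute: the selector matches the proxy pods
selector:
  app: nginx

# after reroute: keys are prefixed so the selector matches the delegate pods instead,
# and the service is annotated with io.cilium/global-service: "true"
selector:
  multicluster.admiralty.io/app: nginx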

Observations, decisions, and node pools are custom resources. Node pools are defined (by CRDs) in each member cluster, whereas all observations and decisions are only defined in the scheduler's cluster.
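
To see which of these custom resources are installed where, list the API resources in each cluster; the grep pattern is just a convenience:

kubectl --context "$CLUSTER1" api-resources | grep -E 'observation|decision|nodepool'
kubectl --context "$CLUSTER2" api-resources | grep -E 'observation|decision|nodepool'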

Comparison with Kubefed (Federation v2)

The goal of Kubefed is similar to multicluster-scheduler's. However, they differ in several ways:

  • Kubefed has a broader scope than multicluster-scheduler. Any resource can be federated and deployed via a single control plane, even if the same result could be achieved with continuous delivery, e.g., GitOps. Multicluster-scheduler focuses on scheduling.
  • Multicluster-scheduler doesn't require using new federated resource types (cf. Kubefed's templates, placements and overrides). Instead, pods only need to be annotated to be scheduled to other clusters. This makes adopting multicluster-scheduler painless and ensures compatibility with other tools like Argo.
  • Whereas Kubefed's API resides in a single cluster, multicluster-scheduler's annotated pods can be declared in any member cluster and/or the scheduler's cluster. Teams can keep working in separate clusters, while utilizing available resources in other clusters as needed.
  • Kubefed propagates scheduling resources with a push-sync reconciler. Multicluster-scheduler's agents push observations to, and pull scheduling decisions from, the scheduler's cluster. The scheduler reconciles scheduling decisions with observations, but never calls the Kubernetes APIs of the member clusters. Clusters that allow outbound traffic to, but no inbound traffic from, the scheduler's cluster (e.g., on-prem, in some cases) can join the federation. Also, if the scheduler's cluster is compromised, attackers don't automatically gain access to the entire federation.
  • Kubefed integrates with ExternalDNS to provide cross-cluster service discovery and multicluster ingress. Multicluster-scheduler doesn't solve multicluster ingress at the moment, but integrates with Cilium for cross-cluster service discovery, and everything else Cilium has to offer. A detailed comparison of the two approaches is beyond the scope of this README (but certainly worth a future blog post).

Roadmap

  • Integration with Argo
  • Integration with Cilium cluster mesh and global services
  • One namespace per member cluster in the scheduler's cluster for more granular RBAC
  • Alternative cross-cluster networking implementations: Istio (1.1), Submariner
  • More integrations: Horizontal Pod Autoscaler, Knative, Rook, k3s, kube-batch
  • Advanced scheduling, respecting affinities, anti-affinities, taints, tolerations, quotas, etc.
  • Port-forward between proxy and delegate pods
  • Integrate node pool concept with other tools

API Reference

https://godoc.org/admiralty.io/multicluster-scheduler/

or

go get admiralty.io/multicluster-scheduler
godoc -http=:6060

then http://localhost:6060/pkg/admiralty.io/multicluster-scheduler/

License

Apache License 2.0

