The main purpose of this CRI relay/proxy is to apply various (hardware) resource allocation policies to containers in a system. The relay sits between the kubelet and the container runtime, relaying requests and responses back and forth between the two and potentially altering requests as they fly by.
The details of how requests are altered depend on which policy is active inside the relay. There are several policies available, each geared towards a different set of goals and implementing different hardware allocation strategies.
The relay can be run without any real policy activated. This can be useful if you simply want to inspect the messages passed over CRI between the kubelet (or any other client using the CRI) and the container runtime itself.
For inspecting messages between the kubelet and the runtime (or image) services you need to:
- run dockershim separately from the main kubelet process,
- point the relay to the dockershim socket for the runtime (and image) service, and
- point the kubelet to the relay socket for the runtime and image services.
You can use the scripts/testing/dockershim script to start dockershim
separately or to see how this needs to be done. Basically what you need to do
is to pass the kubelet the --experimental-dockershim
option. For instance:
kubelet --experimental-dockershim --port 11250 --cgroup-driver {systemd|cgroupfs}
choosing the cgroup driver according to your system setup.
For full message dumping you start the CRI relay like this:
./cmd/cri-resmgr/cri-resmgr -policy null -dump 'reset,full:.*' -dump-file /tmp/cri.dump
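If dockershim is not listening on the socket the relay expects by default, you can additionally point the relay at the dockershim socket and choose where the relay itself listens. A minimal sketch, assuming the relay accepts -runtime-socket, -image-socket, and -relay-socket options for these (check cri-resmgr -h for the exact option names):
./cmd/cri-resmgr/cri-resmgr -policy null -dump 'reset,full:.*' -dump-file /tmp/cri.dump -runtime-socket /var/run/dockershim.sock -image-socket /var/run/dockershim.sock -relay-socket /var/run/cri-relay.sock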
You can take a look at the scripts/testing/kubelet script to see how the kubelet can be started pointing at the relay socket for the CRI runtime and image services. Basically you run the kubelet with the same options as you normally do, but also pass the following extra ones:
--container-runtime=remote \
--container-runtime-endpoint=unix:///var/run/cri-relay.sock \
--image-service-endpoint=unix:///var/run/dockershim.sock
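Putting these together, a sketch of the main kubelet invocation could look roughly like this (the kubeconfig path and cgroup driver below are placeholders; keep whatever other options you normally run with):
kubelet --kubeconfig /etc/kubernetes/kubelet.conf --cgroup-driver systemd \
  --container-runtime=remote \
  --container-runtime-endpoint=unix:///var/run/cri-relay.sock \
  --image-service-endpoint=unix:///var/run/dockershim.sock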
If you want to test the relay with active policying enabled, you also need to run a webhook specifically designed to help the policying CRI relay. The webhook inspects passing Pod creation requests and duplicates the resource requirements from the pod's container specs as a CRI relay-specific annotation.
You can build the webhook docker images with
make images
Publish it in a docker registry your cluster can access, edit the webhook deployment file in cmd/webhook accordingly, then configure and deploy the webhook with:
kubectl apply -f cmd/webhook/mutating-webhook-config.yaml
kubectl apply -f cmd/webhook/webhook-deployment.yaml
If you want, you can try your luck by just updating the deployment file so that the image points to your docker registry and seeing if everything will automatically get docker-built, tagged, and published there...
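For example, assuming make produced an image named cri-resmgr-webhook and your registry is my-registry.example.com:5000 (both names are placeholders here), the manual tag-and-push steps could look roughly like this:
docker tag cri-resmgr-webhook:latest my-registry.example.com:5000/cri-resmgr-webhook:latest
docker push my-registry.example.com:5000/cri-resmgr-webhook:latest
Then update the image reference in cmd/webhook/webhook-deployment.yaml to point at the pushed image before applying it.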
There is a separate daemon, cri-resmgr-agent, that is expected to be running on each node alongside cri-resmgr. The node agent is responsible for all communication with the Kubernetes control plane. It has two purposes:
- watching for changes in the ConfigMap containing the dynamic cri-resmgr configuration and relaying any updates to cri-resmgr
- relaying any cluster operations (i.e. accesses to the control plane) from cri-resmgr and its policies to the Kubernetes API server
The communication between the node agent and the resource manager happens via gRPC APIs over local unix domain sockets.
When starting the node agent, you need to provide the name of the Kubernetes Node via an environment variable, as well as a valid kubeconfig. For example:
NODE_NAME=<my node name> cri-resmgr-agent -kubeconfig <path to kubeconfig>
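If you run the agent as a DaemonSet instead of starting it by hand, the node name is typically injected using the Kubernetes downward API; a sketch of the relevant container environment entry:
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName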
You can enable active policying of containers by using an appropriate ConfigMap or a configuration file and setting the Active field of the policy section to the desired policy implementation. Note, however, that currently you cannot switch the active policy when you reconfigure cri-resmgr by updating its ConfigMap.
For instance, you can use the following configuration to enable the static
policy:
policy:
  ReservedResources:
    CPU: 1
  Active: static
This will start the relay with the kubelet/CPU Manager-equivalent static policy enabled, running with 1 CPU reserved for system- and kube-tasks. Similarly, you can start the relay with the static-plus policy using the following configuration:
policy:
  ReservedResources:
    CPU: 1
  Active: static-plus
The list of available policies can be queried with the --list-policies option.
NOTE: The currently available policies are work-in-progress.
cri-resmgr can be configured statically using command line options or a configuration file. The configuration file accepts the same options as the command line, one option per line, without the leading dashes (-).
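For example, a configuration file roughly equivalent to the message-dumping command line shown earlier could look like this (a sketch; verify the exact option/value syntax accepted by your version):
policy null
dump reset,full:.*
dump-file /tmp/cri.dump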
For a list of the available command line/configuration file options, see cri-resmgr -h.
NOTE: some of the policies can be configured with policy-specific configuration files as well. Those files are different from the one referred to here. See the documentation of the policies themselves for further details about such files and their syntax. The preferred way to provide these policy configurations is through a Kubernetes ConfigMap - see Dynamic Configuration below for more details.
cri-resmgr can be configured dynamically using cri-resmgr-agent, the CRI Resource Manager node agent, and Kubernetes ConfigMaps. To run the agent, set the environment variable NODE_NAME to the name of the node the agent is running on and, if necessary, pass credentials for accessing Kubernetes using the -kubeconfig command line option.
The agent monitors two ConfigMaps for the node: the primary node-specific ConfigMap, and the secondary group-specific or default one, depending on whether the node belongs to a configuration group. The node-specific ConfigMap always takes precedence if it exists; otherwise the secondary one is used to configure the node.
The names of these ConfigMaps are:
- cri-resmgr-config.node.$NODE_NAME: primary, node-specific configuration
- cri-resmgr-config.group.$GROUP_NAME: secondary, group-specific node configuration
- cri-resmgr-config.default: secondary, default node configuration
You can assign a node to a configuration group by setting the cri-resource-manager.intel.com/group label on the node to the name of the configuration group. For instance, the command
kubectl label --overwrite nodes cl0-slave1 cri-resource-manager.intel.com/group=foo
assigns node cl0-slave1 to the foo configuration group.
You can remove a node from its group by deleting the node group label, for instance like this:
kubectl label nodes cl0-slave1 cri-resource-manager.intel.com/group-
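To check the current group assignments, you can for instance list the nodes with the group label shown as an extra column:
kubectl get nodes -L cri-resource-manager.intel.com/group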
There is a sample ConfigMap spec that contains a node-specific, a group-specific, and a default sample ConfigMap. See any available policy-specific documentation for more information on the policy configurations.
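A minimal sketch of what such a default ConfigMap could look like, reusing the static policy configuration shown earlier (the kube-system namespace and the layout of the data section are assumptions here; see the sample ConfigMap spec for the authoritative format):
apiVersion: v1
kind: ConfigMap
metadata:
  name: cri-resmgr-config.default
  namespace: kube-system
data:
  policy: |
    ReservedResources:
      CPU: 1
    Active: static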
You can control logging and debugging with the --logger-* command line options. By default, logging is globally enabled and debugging is globally disabled. You can turn on full debugging with the --logger-debug '*' command line option.
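For instance, to run the relay with the null policy and full debugging enabled, using only options already mentioned above:
./cmd/cri-resmgr/cri-resmgr -policy null -logger-debug '*'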