confidential-containers / documentation

Documentation for the confidential containers project

[RFC] proposal of dynamic resource allocation for CoCo

LindaYu17 opened this issue · comments

Problem statement

  • default_vcpus and default_memory are specified statically in the Kata configuration file configuration.toml, e.g. https://github.com/confidential-containers/kata-containers-CCv0/blob/CCv0/src/runtime/config/configuration-qemu-tdx.toml.in
  • Currently, TEEs do not support CPU and memory hotplug because of security considerations.
  • Large pods (e.g. containers with large models, or containers running heavy AI workloads) may need to request more CPUs and memory than the default settings by declaring resource requests and limits in the pod YAML file, but these resource definitions do not take effect in a TEE environment.

Kata boot flow

[Image: Kata boot flow diagram]

When a pod is created, CreateSandbox() runs first and default_vcpus and default_memory are allocated for the VM itself. The container is then created, and at that point additional vCPUs and memory may be hotplugged and allocated for the container if they are defined in the pod YAML. This resource allocation process does not apply to CoCo, because hotplug is not supported in a TEE environment.
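
A highly simplified sketch of that flow, with placeholder types and functions (not the actual kata-containers runtime API), just to show where hotplug would normally happen and why the extra resources are lost in a TEE:

```go
// Illustration only; the types and functions below are placeholders.
package main

import "fmt"

type vm struct{ vcpus, memMB int }

// createSandbox boots the VM with the static defaults from configuration.toml.
func createSandbox(defaultVCPUs, defaultMemMB int) *vm {
	return &vm{vcpus: defaultVCPUs, memMB: defaultMemMB}
}

// createContainer would normally hotplug the extra resources declared in the
// pod YAML; in a TEE this step is unavailable, so the request has no effect.
func (v *vm) createContainer(reqVCPUs, reqMemMB int, hotplugSupported bool) {
	if hotplugSupported {
		v.vcpus += reqVCPUs
		v.memMB += reqMemMB
	}
}

func main() {
	v := createSandbox(1, 2048)       // CreateSandbox(): VM gets default_vcpus/default_memory
	v.createContainer(4, 8192, false) // CoCo: no hotplug inside the TEE
	fmt.Printf("VM ends up with %d vCPUs and %d MiB\n", v.vcpus, v.memMB)
}
```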

Proposal of dynamic resource allocation for CoCo

First of all, the requested vCPUs and memory must be known before the VM is created. A mutating webhook that watches pod creation can help here. When a pod is created, the webhook reads the requested vCPUs and memory from the pod spec (if set) and calculates the total resources needed:

total_vcpus = default_vcpus(1core) + requested_vcpus
total_memory = default_memory(2GB) + requested_memory

The webhook then sets the totals as pod annotations:

io.katacontainers.config.hypervisor.default_vcpus: total_vcpus
io.katacontainers.config.hypervisor.default_memory: total_memory

Meanwhile, these annotations must be enabled in the Kata configuration:
enable_annotations = ["default_vcpus", "default_memory"]

In this way, sufficient resources can be allocated for the pod.
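
For illustration, a minimal sketch of the mutation the webhook could perform, assuming the defaults above (1 vCPU, 2 GB, with memory expressed in MiB); the helper name and its signature are hypothetical:

```go
// Hypothetical sketch of the webhook mutation; the helper and the defaults
// (1 vCPU, 2048 MiB) are assumptions for illustration only.
package main

import (
	"fmt"
	"strconv"
)

const (
	defaultVCPUs    int64 = 1    // default_vcpus
	defaultMemoryMB int64 = 2048 // default_memory, in MiB
)

// setKataSizingAnnotations writes the totals (defaults + pod requests) into
// the annotations the Kata runtime reads when creating the sandbox VM.
func setKataSizingAnnotations(annotations map[string]string, requestedVCPUs, requestedMemoryMB int64) {
	totalVCPUs := defaultVCPUs + requestedVCPUs
	totalMemoryMB := defaultMemoryMB + requestedMemoryMB

	annotations["io.katacontainers.config.hypervisor.default_vcpus"] = strconv.FormatInt(totalVCPUs, 10)
	annotations["io.katacontainers.config.hypervisor.default_memory"] = strconv.FormatInt(totalMemoryMB, 10)
}

func main() {
	ann := map[string]string{}
	// e.g. a pod requesting 4 vCPUs and 8 GiB of memory
	setKataSizingAnnotations(ann, 4, 8192)
	fmt.Println(ann)
}
```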

Proposal of dynamic resource allocation for CoCo

The proposal suggests we build a webhook and then we're done solving the problem. What was discussed yesterday was:

  1. start with the webhook (written in Go, using sigs.k8s.io/controller-runtime / kubebuilder functionality)
  2. work towards a proper solution by improving CRI and/or building on containerd Sandbox API.

In the meeting chat I also pointed to earlier work in kubernetes, containerd, and Kata that looks similar. Did you try static_sandbox_resource_mgmt?

For better sandbox sizing controls (not just CPU/memory, but all container resources during sandbox creation), there's a desire in the CRI community to improve the CRI Sandbox. CoCo should influence that work with your needs.

Thank you Mikko, I will do further investigation and provide feedback here.

I have been experimenting with static sandbox resource management a bit, and I believe that at this stage it makes sense to leverage, improve, and properly document this approach for CoCo first.

In a nutshell, when static resource management is enabled, we add the (sum of) CPU and memory limits for a pod to default_vcpus and default_memory (these defaults can be set at compile time via DEFVCPUS, DEFMEMSZ). See the relevant code snippet here: https://github.com/kata-containers/kata-containers/blob/main/src/runtime/pkg/oci/utils.go#L978
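
Roughly, the sizing behaves like the following sketch (simplified names, not the real shim code; see the link above for the actual implementation):

```go
// Rough approximation of static_sandbox_resource_mgmt sizing in the linked
// utils.go: the workload's summed CPU/memory limits are added to the
// configured defaults. Names here are simplified illustrations.
package main

import "fmt"

type sandboxSizing struct {
	workloadVCPUs    uint32 // sum of the containers' CPU limits, rounded up
	workloadMemoryMB uint32 // sum of the containers' memory limits, in MiB
}

func staticSandboxSize(defaultVCPUs, defaultMemoryMB uint32, s sandboxSizing) (vcpus, memMB uint32) {
	return defaultVCPUs + s.workloadVCPUs, defaultMemoryMB + s.workloadMemoryMB
}

func main() {
	// default_vcpus = 1, default_memory = 2048 MiB; pod limits: 3 CPUs, 4096 MiB
	v, m := staticSandboxSize(1, 2048, sandboxSizing{workloadVCPUs: 3, workloadMemoryMB: 4096})
	fmt.Printf("VM created with %d vCPUs and %d MiB\n", v, m) // 4 vCPUs, 6144 MiB
}
```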

As was previously concluded, resource requests are not passed down to the Kata shim for now. mythi made a comment about the containerd Sandbox API that may help us; here's a link to a relevant discussion: kubernetes/kubernetes#104886. It looks like refactoring the shim to make use of the containerd Sandbox API implementation (merged since v1.7) could be a promising solution.

I'm just trying to describe the current reality for now - there are various inconsistencies between a) what an end user may expect when specifying resource limits and b) what kubernetes sees and what the shim actually does. Here are some examples:

  • When an end user specifies limits only, the limits are passed down to the shim and reflected as 'requests' at the kubernetes level. However, under the hood, the shim will allocate more resources than kubernetes is aware of (default_vcpus and default_memory come on top of the limits).
    • This leads to an inconsistent view between the resources kubernetes believes can still be allocated on the node and what was actually requested (e.g., create a pod and then look up the remaining allocatable resources with kubectl describe node).
      • Potentially podOverhead can provide mitigation here if it aligns with the defaults?
    • Further, when e.g. default_vcpus = 2 and 3 CPU units (or 3,000m) are requested, a VM with 5 vCPUs is spun up, and this can fail. There is not much checking of resource availability on the shim side (potentially because the idea is that this is done at the kubernetes level).
  • When an end user specifies limits and resource requests at the same time (where the requests are smaller than the limits), this may lead to really bad behavior: for instance, when we specify a memory request of 1 GB and a limit of 5 GB but only 3 GB is left, kubernetes will accept the pod YAML, but under the hood the shim will request a VM of size 5 GB + default_memory, and VM creation will then fail.
    • Potentially, for now, specifying both could be prohibited at the kubernetes level (e.g. via a webhook?).
  • When an end user specifies no limits (and no requests), default_vcpus and default_memory are used for resource allocation. Again, kubernetes may have a different view of the consumed resources.
    • Potentially LimitRanges can help here to define proper, matching defaults (see the sketch after this list).
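
As a sketch of the LimitRange idea from the last bullet, something like the following could give kubernetes and the shim a matching view when a pod omits requests and limits. The object name, namespace handling, and values are assumptions and would need to match the Kata defaults actually configured on the cluster:

```go
// Sketch only: a LimitRange whose defaults mirror the sandbox defaults.
package limitrange

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// kataDefaultsLimitRange gives every container that omits requests/limits a
// default matching the sandbox defaults, so kubernetes' accounting and the
// shim's allocation line up.
func kataDefaultsLimitRange(namespace string) *corev1.LimitRange {
	defaults := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("1"),
		corev1.ResourceMemory: resource.MustParse("2Gi"),
	}
	return &corev1.LimitRange{
		ObjectMeta: metav1.ObjectMeta{Name: "kata-defaults", Namespace: namespace},
		Spec: corev1.LimitRangeSpec{
			Limits: []corev1.LimitRangeItem{{
				Type:           corev1.LimitTypeContainer,
				Default:        defaults, // applied as limits when none are set
				DefaultRequest: defaults, // applied as requests when none are set
			}},
		},
	}
}
```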

While some of the issues I've described may pertain to my setup, and while I'm still learning, there may be aspects I have neglected or not described properly - such as how exactly cgroups come into play here. However, there is clear room for improvement. Having a consistent view of resource allocation between kubernetes and the shim will be important for usability; otherwise an end user will need explicit knowledge of all this to properly write their pod YAMLs.

One more note: I have also been experimenting with a patch at the shim level that adds a configurable amount of memory to the default memory size when no limits are provided (to ensure that workloads have at least a certain amount of memory). This turns out to be problematic as well, since kubernetes is not aware of it. However, it led me to an idea: we could use default_vcpus and default_memory as the amount of CPUs/memory to allocate when no limits are given, and use the limit values otherwise - instead of adding the values together, as highlighted above.
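
A sketch of that alternative policy, as an assumption about how it might look rather than an existing shim option:

```go
package main

import "fmt"

// sandboxMemoryMB sketches the alternative policy: use default_memory only
// when no limit is given, otherwise use the limit itself (rather than
// default + limit). Illustration only, not an existing shim option.
func sandboxMemoryMB(defaultMemMB, limitMemMB int64) int64 {
	if limitMemMB == 0 { // no memory limit specified in the pod spec
		return defaultMemMB
	}
	return limitMemMB
}

func main() {
	fmt.Println(sandboxMemoryMB(2048, 0))    // no limit -> 2048 MiB
	fmt.Println(sandboxMemoryMB(2048, 5120)) // 5 GiB limit -> 5120 MiB, not 7168
}
```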

The problem of memory and CPU allocation is not restricted to confidential containers. The current definitions cause problems under memory pressure, as described in kata-containers/kata-containers#6533.

Here is an example showing how it can fail. Consider a container requiring 10G of memory. With hotplugging and the current implementation, we start the VM with 2G, hotplug 10G, and tell Kubernetes the overhead is 160M. Host-side cgroups will therefore assume that we can reach 10.16G, but in reality, even with guest-side cgroups clamping the container at 10G, the guest can still use 12G in total (file buffer caches, etc.), and can then get OOM-killed by the host.
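
Spelling out the arithmetic from that example (values taken from the paragraph above):

```go
// Host-side cgroup limit (container memory + reported overhead) vs what the
// guest can actually consume (boot memory + hotplugged memory): the guest
// can exceed the host limit, so the sandbox can be OOM-killed by the host.
package main

import "fmt"

func main() {
	const (
		bootMemMB      = 2048  // VM started with 2G
		hotplugMemMB   = 10240 // 10G hotplugged for the container
		containerMemMB = 10240 // what kubernetes accounts for the container
		overheadMB     = 160   // pod overhead reported to kubernetes
	)

	hostCgroupLimitMB := containerMemMB + overheadMB // 10400 MiB (~10.16G)
	guestUsableMB := bootMemMB + hotplugMemMB        // 12288 MiB (12G)

	fmt.Printf("host cgroup limit: %d MiB, guest can use up to: %d MiB, slack: %d MiB\n",
		hostCgroupLimitMB, guestUsableMB, guestUsableMB-hostCgroupLimitMB)
}
```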

There were multiple suggestions that the Sandbox API provides a solution to this problem, but at the moment I do not understand how. In the sandbox.proto file, I see nothing that would let us do it, except through the UpdateSandbox API, which has resources and annotations fields. But whereas I see calls to StartSandbox in containerd, I don't see any call to UpdateSandbox, so I don't know how to invoke it. Maybe I missed it?

In short, what I don't know at the moment is when an UpdateSandbox would be triggered, and where it would get the resources from.

I know that @fidencio, @bergwolf and @sameo have looked at this problem too, so maybe they can answer, and we could get a "correct" solution going. Or maybe we need to update containerd to actually update resources over the sandbox API.

Looks like we can close this issue once https://github.com/kata-containers/kata-containers/pull/6819 is merged, with a request to keep watching and guiding the Sandbox API work, as @mythi suggests, in order to get a generic, clean solution covering all resources/cgroups.

Agree to close this issue.