acorn-io / runtime

A simple application deployment framework built on Kubernetes

Home Page: https://docs.acorn.io/


Support over-provisioning with computeClasses

cjellick opened this issue · comments

This feature will allow us to configure a compute class such that we can "over-provision" pods on a node based on memory, meaning we can schedule more pods to it than we currently can.

Background

ComputeClasses are acorn's abstraction around scheduling pods. They primarily handle two things:

  • scheduling rules such as tolerations, affinity, runtimeClass, and priorityClass
  • resource requests and limits (right now limited to memory and CPU)

The intention is that administrators configure these nitty-gritty details behind the scenes and end users just need to say they want the "memory-optimized" or "gpu" compute class.

For the context of this feature, we are only concerned with the resource request/limit aspect of computeClasses, not the scheduling aspect.

When you create an acorn, you do two things related to resource consumption:

  1. You say how much memory each container should get (or you get the default)
  2. You say which computeClass it should use (or you get the default)

Our controllers translate that into resource requests and limits for memory and CPU. The user never directly specifies CPU. The computeClass has a CPUScaler field which basically says "for X memory, give Y millicpu." We do this for two reasons: we think users more easily understand the implications of memory than of CPU scheduling, and in clouds like AWS the ratio is always predictable (with M5 instances you always get 1 CPU for every 4 GB of memory), so by controlling the ratio we schedule more efficiently.
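The CPUScaler math described above can be sketched as follows. This is an illustrative helper, not the runtime's actual implementation; the function name and the per-GiB convention are assumptions:

```go
package main

import "fmt"

// cpuRequestMillis sketches how a computeClass's CPUScaler could map a
// memory amount to a CPU request: cpuScaler cores per GiB of memory,
// expressed in millicpu. Hypothetical helper for illustration only.
func cpuRequestMillis(memoryBytes int64, cpuScaler float64) float64 {
	const gib = 1 << 30 // one GiB of memory
	return float64(memoryBytes) / gib * cpuScaler * 1000
}

func main() {
	// A 1:4 CPU:memory ratio (0.25 cores per GiB, as on AWS M5s) means
	// a 4 GiB container gets one full CPU (1000m) requested.
	fmt.Println(cpuRequestMillis(4<<30, 0.25)) // 1000
}
```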

Actual feature

Right now, if you say your acorn gets 1G of memory, that is exactly what gets set for both the memory limit and the memory request. For CPU, we scale that 1G against the cpuScaler to calculate your CPU request. We do not set a CPU limit.

To handle over-provisioning, we want to add a new memoryScaler field (name subject to change) that will change what is set for the memory and CPU requests. It should be a decimal representing the fraction of the original value. For example, if my computeClass has memoryScaler: .25 and I set memory on my acorn to 1G, then the actual memory request on the pod should be .25G (in whatever the right unit is). Additionally, we should use the scaled value to calculate the CPU request. So, continuing the example, if my cpuScaler were also .25, then the CPU request for my pod would be 1000 millicpu * .25 * .25, or 62.5 millicpu. The memory limit should remain exactly what the user requested (1G in this case).
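The proposed scaling can be sketched in code. This is a sketch of the math under the assumptions above (the function and field names are illustrative, not the runtime's API; decimal G is used to match the 1G example):

```go
package main

import "fmt"

// scaledResources sketches the proposed memoryScaler behavior: memory
// and CPU *requests* shrink by the scaler, while the memory *limit*
// stays at exactly what the user asked for.
func scaledResources(memoryBytes int64, memoryScaler, cpuScaler float64) (memLimit, memRequest int64, cpuMillis float64) {
	memLimit = memoryBytes // limit is untouched
	memRequest = int64(float64(memoryBytes) * memoryScaler)
	const g = 1e9 // decimal gigabyte
	cpuMillis = float64(memRequest) / g * cpuScaler * 1000
	return
}

func main() {
	// 1G of memory with memoryScaler .25 and cpuScaler .25:
	// limit stays 1G, request drops to .25G, CPU request is 62.5m.
	limit, req, cpu := scaledResources(1e9, 0.25, 0.25)
	fmt.Println(limit, req, cpu) // 1000000000 250000000 62.5
}
```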

When this lands, we should handle "old" computeClasses that didn't have the field by assuming they have a memoryScaler of 1.

We'll need to make sure this field gets synced up properly in manager, and ultimately we will need to change the computeClasses created as part of manager to have a memoryScaler value of something less than 1, to actually take advantage of the feature. You'll have to look at our workloads in the SaaS and see what a reasonable scaler value would be. We may also need to consider having different computeClasses for, say, sandbox vs pro vs BYOC vs control plane. But we'll want to start with something reasonably simple here.

@cloudnautique recently added a field to computeClasses, so looking at that PR would be a good place to start. @tylerslaton did the initial implementation, so he'll be a resource you can leverage.

This shouldn't be too hard, just adding the field and then wiring up the math correctly. Happy to help answer any questions that come up.

Tested with acorn version v0.10.0-rc2-9-g43dbcbf4+43dbcbf4

  1. Able to create a computeClass with requestScaler set:

```yaml
kind: ClusterComputeClass
apiVersion: admin.acorn.io/v1
metadata:
  name: cc2
default: false
description: Large compute for linux on arm64
cpuScaler: 0.75
supportedRegions:
  - local
memory:
  default: 20M
  requestScaler: 0.25
  values:
    - 10M
    - 20M
    - 30M
```
  2. Deploy an app with this computeClass. Pods get created with the following container resource spec, as expected:

```json
"resources": {
    "limits": {
        "memory": "20M"
    },
    "requests": {
        "cpu": "4m",
        "memory": "5M"
    }
},
```

Thanks for doing the initial verification on the value itself. If you could just do a last check that this value is set and takes effect in free-tier regions, we should be good to close this.

When deploying apps in free-tier regions (with the default compute class, default), there is a requestScaler of 0.1 which is applied to the deployed apps' requests, as expected.