Support for tidal colocation with HPA and node pool management

Question

caohe opened this issue a year ago · comments

What would you like to be added?

This issue proposes a new feature that enables time-shared node reuse on Kubernetes. The feature aims to improve resource utilization by allowing multiple types of workloads, such as online services and batch jobs, to share a node in a time-sliced manner. It consists of two key capabilities: HPA enhancement and node pool management.

HPA Enhancement: Extend the existing HPA functionality to support scaling workloads based on a schedule, i.e. CronHPA. This enhancement will enable workloads to reduce their resource footprint when demand is low, freeing up resources for other workloads.
Node Pool Management: Introduce a node pool management mechanism that dynamically reallocates nodes between different types of workloads. When a workload is scaled down, the vacant nodes will be identified and assigned to another workload that is experiencing higher demand. This will facilitate the efficient utilization of nodes and prevent resource wastage.
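
To make the two capabilities concrete, here is a minimal Python sketch of how they could fit together. It is an illustration only, not the proposed implementation or any existing Kubernetes API: the names `ScheduleRule`, `desired_replicas`, `NodePools`, and `reassign`, the pool names, and the time windows are all hypothetical. It assumes each schedule window falls within a single day (no window crosses midnight) and that the set of vacant nodes has already been determined elsewhere.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class ScheduleRule:
    """Hypothetical CronHPA-style rule: run `replicas` pods within [start, end)."""
    start: time
    end: time
    replicas: int


def desired_replicas(rules, now, default):
    """Return the replica count of the first rule whose window contains `now`.

    Falls back to `default` when no rule matches. Windows that cross
    midnight are not handled in this sketch.
    """
    for rule in rules:
        if rule.start <= now < rule.end:
            return rule.replicas
    return default


@dataclass
class NodePools:
    """Hypothetical bookkeeping of which nodes belong to which workload pool."""
    pools: dict[str, set[str]]

    def reassign(self, vacant, src, dst):
        """Move nodes that are vacant AND currently in `src` over to `dst`.

        Returns the moved node names in sorted order.
        """
        movable = sorted(set(vacant) & self.pools[src])
        for node in movable:
            self.pools[src].remove(node)
            self.pools[dst].add(node)
        return movable


# Illustrative values only: an online service that scales up during the day.
rules = [ScheduleRule(time(8, 0), time(22, 0), replicas=10)]
night_replicas = desired_replicas(rules, time(2, 0), default=2)

# After the scale-down, nodes reported as vacant are handed to the batch pool.
node_pools = NodePools({"online": {"n1", "n2", "n3"}, "batch": {"n4"}})
moved = node_pools.reassign(["n2", "n3"], src="online", dst="batch")
print(night_replicas, moved, node_pools.pools)
```

In a real controller the vacancy check would also need to drain or confirm that no pods of the source workload remain on a node before it changes pools; that step is omitted here.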

Why is this needed?

Tidal colocation is needed to improve resource utilization and cost efficiency within Kubernetes clusters. Currently, workloads often run on dedicated nodes, which leaves those nodes underutilized during off-peak hours. Implementing time-shared node reuse with HPA and node pool management brings several benefits:

Resource Efficiency: Many workloads experience varying levels of demand throughout the day. By allowing workloads to scale down during low-traffic periods and releasing their nodes to other workloads, we can ensure that resources are used more effectively.
Workload Isolation: With the proposed mechanism, different types of workloads run on different nodes at any given time, avoiding interference and resource contention between workloads.
Dynamic Scaling: The time-shared node reuse feature will enable dynamic scaling, allowing clusters to adapt more efficiently to changing workloads without manual intervention.

Junhao Xia commented a year ago

/assign