kubernetes-sigs / scheduler-plugins

Repository for out-of-tree scheduler plugins based on scheduler framework.

A new scheduler plugin for rescheduling already-scheduled pods that encounter errors onto other nodes

freelizhun opened this issue

Area

  • Scheduler
  • Controller
  • Helm Chart
  • Documents

Other components

No response

What happened?

The current Kubernetes scheduler does not support rescheduling already-scheduled pods that encounter errors onto other nodes. For example, a pod may be OOM-killed because its memory usage keeps growing while the node's system memory is insufficient. The pod's in-place restart policy does not resolve such failures, so manual intervention is required to reschedule the pod onto another node. This limits the system's automatic fault tolerance and reduces the reliability and availability of the service.

A pod rescheduling plugin would reschedule pods that encounter errors onto other nodes, excluding the nodes on which the pod has already been scheduled.

What did you expect to happen?

A KEP proposal for a pod rescheduling scheduler plugin, together with a corresponding implementation.

How can we reproduce it (as minimally and precisely as possible)?

No response

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# paste output here

Scheduler Plugins version

/remove-kind bug
/kind feature

@freelizhun this problem domain falls more into the scope of external controllers such as descheduler than of the scheduler itself. I think separating the concerns works best for both the scheduler and descheduler-like remediation solutions.

Thank you @Huang-Wei for your reply!

Our scheme consists of two components. The first is a controller that watches for pods in the cluster that encounter errors and records, in an annotation, the names of the nodes each such pod has already been scheduled to.

The second is a scheduler plugin that extends the PreFilter phase of kube-scheduler: it filters out the previously used nodes recorded in the pod's annotation, so that the pod is scheduled to a different node.
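
As a rough illustration of the filtering step only (not the implementation discussed in this issue), the exclusion logic could look like the sketch below. It uses plain Go with no Kubernetes dependencies; the annotation key `example.com/previously-scheduled-nodes`, the comma-separated value format, and the function names are all assumptions made for the example, and wiring the logic into the scheduler framework's PreFilter extension point is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// excludedNodesAnnotation is a hypothetical key; the issue does not name one.
const excludedNodesAnnotation = "example.com/previously-scheduled-nodes"

// parseExcludedNodes reads a comma-separated node list from the annotations
// (assumed format) and returns it as a set. A missing or empty annotation
// yields an empty set.
func parseExcludedNodes(annotations map[string]string) map[string]struct{} {
	excluded := make(map[string]struct{})
	for _, name := range strings.Split(annotations[excludedNodesAnnotation], ",") {
		if name = strings.TrimSpace(name); name != "" {
			excluded[name] = struct{}{}
		}
	}
	return excluded
}

// filterCandidates returns the candidate nodes that are not in the excluded
// set, preserving the input order.
func filterCandidates(candidates []string, excluded map[string]struct{}) []string {
	var kept []string
	for _, node := range candidates {
		if _, skip := excluded[node]; !skip {
			kept = append(kept, node)
		}
	}
	return kept
}

func main() {
	annotations := map[string]string{excludedNodesAnnotation: "node-a, node-c"}
	candidates := []string{"node-a", "node-b", "node-c", "node-d"}
	fmt.Println(filterCandidates(candidates, parseExcludedNodes(annotations)))
	// prints: [node-b node-d]
}
```

One open design question with this approach is what to do when every candidate node has been excluded; the sketch simply returns an empty list, which would leave the pod unschedulable.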

The scheduler plugin component that we implemented could potentially be contributed to the community.

> The first is a controller that watches for pods in the cluster that encounter errors and records, in an annotation, the names of the nodes each such pod has already been scheduled to.

I suppose this component would also clear .spec.nodeName at the same time? And would it also overwrite the previous annotation when the pod is OutOfXYZ?

> The second is a scheduler plugin that extends the PreFilter phase of kube-scheduler: it filters out the previously used nodes recorded in the pod's annotation, so that the pod is scheduled to a different node.

SGTM.

Thank you @Huang-Wei for your reply!
We will write a detailed KEP for your review.

> I suppose this component would also clear .spec.nodeName at the same time? And would it also overwrite the previous annotation when the pod is OutOfXYZ?

Yes, we will delete the failing pods directly to trigger rescheduling, and record each pod's pod.spec.nodeName in the annotations of its controller (such as a Deployment).
For failing pods that have no controller, we will record pod.spec.nodeName in memory after the deletion and resubmit the pod programmatically.
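
To make the two recording paths above concrete, here is a minimal sketch in plain Go. It is only a model of the idea, not our controller code: the `failedPod`, `workload`, and `recorder` types, the annotation key, and the comma-separated format are hypothetical stand-ins, and real code would go through the Kubernetes API instead of mutating in-memory structs.

```go
package main

import (
	"fmt"
	"strings"
)

// failedNodesAnnotation is a hypothetical key used only in this example.
const failedNodesAnnotation = "example.com/previously-scheduled-nodes"

// workload is a simplified stand-in for an owning controller such as a Deployment.
type workload struct {
	Name        string
	Annotations map[string]string
}

// failedPod is a simplified stand-in for a pod that hit an error.
type failedPod struct {
	Name     string
	NodeName string    // mirrors pod.spec.nodeName
	Owner    *workload // nil for a pod without a controller
}

// recorder keeps the node history of controller-less pods in memory.
type recorder struct {
	inMemory map[string][]string
}

// appendUnique appends item to list unless it is already present.
func appendUnique(list []string, item string) []string {
	for _, existing := range list {
		if existing == item {
			return list
		}
	}
	return append(list, item)
}

// record stores the pod's node on the owner's annotations when an owner
// exists, and falls back to the in-memory map otherwise.
func (r *recorder) record(p failedPod) {
	if p.NodeName == "" {
		return // the pod was never bound to a node
	}
	if p.Owner == nil {
		r.inMemory[p.Name] = appendUnique(r.inMemory[p.Name], p.NodeName)
		return
	}
	if p.Owner.Annotations == nil {
		p.Owner.Annotations = make(map[string]string)
	}
	isComma := func(c rune) bool { return c == ',' }
	nodes := strings.FieldsFunc(p.Owner.Annotations[failedNodesAnnotation], isComma)
	p.Owner.Annotations[failedNodesAnnotation] = strings.Join(appendUnique(nodes, p.NodeName), ",")
}

func main() {
	r := &recorder{inMemory: make(map[string][]string)}
	deploy := &workload{Name: "web"}

	r.record(failedPod{Name: "web-1", NodeName: "node-a", Owner: deploy})
	r.record(failedPod{Name: "web-2", NodeName: "node-b", Owner: deploy})
	r.record(failedPod{Name: "bare-pod", NodeName: "node-c"})

	fmt.Println(deploy.Annotations[failedNodesAnnotation]) // prints: node-a,node-b
	fmt.Println(r.inMemory["bare-pod"])                    // prints: [node-c]
}
```

Note that the in-memory path does not survive a controller restart, so the history for controller-less pods would be lost in that case.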