kubernetes-sigs / scheduler-plugins

Repository for out-of-tree scheduler plugins based on the scheduler framework.

[Coscheduling] pending podgroup might lead to starvation by persistently assuming pods

lianghao208 opened this issue

Area

  • Scheduler
  • Controller
  • Helm Chart
  • Documents

Other components

No response

What happened?

  • A podgroup has a min member of 10; 9 pods have passed the filter and are already assumed, stuck in the permit stage waiting for the 10th pod to pass the filter.
  • Unfortunately, the 10th pod fails the filter, but the 9 assumed pods have already consumed resources (e.g. CPU, memory) in the scheduler cache due to the assume stage.

In this case, none of the 10 pods can be scheduled successfully, yet 9 of them persistently consume node resources without ever running on them, which wastes resources and starves other pods.
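For illustration, here is a minimal sketch of the kind of PodGroup involved, assuming the v1alpha1 PodGroup API of this repo; the import path and field names below are assumptions and vary across releases:

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	// Assumed import path; older releases expose the API under pkg/apis/scheduling/v1alpha1.
	"sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

func main() {
	// A gang of 10: scheduling only succeeds if all 10 members can be placed together.
	pg := v1alpha1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{Name: "demo-gang", Namespace: "default"},
		Spec: v1alpha1.PodGroupSpec{
			MinMember: 10, // the 10-member quorum from the scenario above
		},
	}
	fmt.Printf("PodGroup %s requires at least %d pods to be scheduled together\n", pg.Name, pg.Spec.MinMember)
}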

What did you expect to happen?

Can we set a timeout for the coscheduling permit plugin? After the maximum timeout, the whole podgroup would be rejected, releasing the resources for other pods, and all pods from this podgroup would become unschedulable (placed in the backoff queue), even those already in the permit wait status.
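For context, the scheduler framework's Permit extension point already supports returning a Wait status together with a timeout. The following is a minimal sketch of that shape only; the plugin name and wait duration are illustrative and this is not the actual coscheduling implementation:

package example

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// gangPermit is an illustrative Permit plugin: it parks each pod of a gang at the
// Permit stage until the gang is either allowed, rejected, or the wait times out.
type gangPermit struct {
	handle   framework.Handle
	waitTime time.Duration // an upper bound on how long an assumed pod may wait
}

var _ framework.PermitPlugin = &gangPermit{}

func (g *gangPermit) Name() string { return "GangPermitExample" }

func (g *gangPermit) Permit(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
	// Returning Wait with a duration makes the framework hold this assumed pod as a
	// "waiting pod"; if it is neither allowed nor rejected within waitTime, the
	// framework rejects it and the pod goes back to the scheduling queue.
	return framework.NewStatus(framework.Wait, ""), g.waitTime
}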

I checked the coscheduling KEP; it mentions the MaxScheduleTime design, but it doesn't seem to have been implemented yet.
https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/42-podgroup-coscheduling/README.md?plain=1#L168

Not sure if I'm missing something; as far as I can recall there was a MaxScheduleTime for the podgroup, but I can't find it in the code base now.

How can we reproduce it (as minimally and precisely as possible)?

No response

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# paste output here

Scheduler Plugins version

This has been somewhat considered:

// If the gap is less than/equal 10%, we may want to try subsequent Pods
// to see they can satisfy the PodGroup
notAssignedPercentage := float32(int(pg.Spec.MinMember)-assigned) / float32(pg.Spec.MinMember)
if notAssignedPercentage <= 0.1 {
	klog.V(4).InfoS("A small gap of pods to reach the quorum", "podGroup", klog.KObj(pg), "percentage", notAssignedPercentage)
	return &framework.PostFilterResult{}, framework.NewStatus(framework.Unschedulable)
}

In your case the gap happens to be exactly 10% (1 of 10), so the gang keeps waiting until the timeout; the timeout is calculated from the plugin's global timeout and the individual PodGroup's timeout.
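A minimal sketch of how such an effective timeout can be resolved, assuming a plugin-level default plus an optional ScheduleTimeoutSeconds field on the PodGroup spec; the helper name and field name here are illustrative, so check the release you run:

package example

import (
	"time"

	// Assumed import path for the PodGroup API; it differs across releases.
	"sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

// resolveWaitTimeout is an illustrative helper (not the repo's actual function):
// prefer the PodGroup's own ScheduleTimeoutSeconds when it is set, otherwise
// fall back to the plugin-wide default wait time.
func resolveWaitTimeout(pg *v1alpha1.PodGroup, defaultTimeout time.Duration) time.Duration {
	if pg != nil && pg.Spec.ScheduleTimeoutSeconds != nil {
		return time.Duration(*pg.Spec.ScheduleTimeoutSeconds) * time.Second
	}
	return defaultTimeout
}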

If the PodGroup instead required 9 as min member and the 9th pod failed, then the gap (1/9, roughly 11%) would be greater than 10%, so it would proactively reject the previous 8 waiting pods:

// It's based on an implicit assumption: if the nth Pod failed,
// it's inferrable other Pods belonging to the same PodGroup would be very likely to fail.
cs.frameworkHandler.IterateOverWaitingPods(func(waitingPod framework.WaitingPod) {
	if waitingPod.GetPod().Namespace == pod.Namespace && util.GetPodGroupLabel(waitingPod.GetPod()) == pg.Name {
		klog.V(3).InfoS("PostFilter rejects the pod", "podGroup", klog.KObj(pg), "pod", klog.KObj(waitingPod.GetPod()))
		waitingPod.Reject(cs.Name(), "optimistic rejection in PostFilter")
	}
})
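For completeness: once a waiting pod is rejected at Permit, the framework unreserves it, the assumed resources are released from the scheduler cache, and the pod is re-enqueued as unschedulable, going through backoff before it is retried.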

Ah, I see. Thank you so much for the help :) @Huang-Wei