[Coscheduling] pending podgroup might lead to starvation by persistently assuming pods
lianghao208 opened this issue
Area
- Scheduler
- Controller
- Helm Chart
- Documents
Other components
No response
What happened?
- A PodGroup has a minMember of 10; 9 pods have passed Filter and are already assumed, stuck in the Permit stage waiting for the 10th pod to pass Filter.
- Unfortunately, the 10th pod fails Filter, but the 9 pods have already consumed resources (e.g. CPU, memory) in the scheduler cache due to the `assume` stage.

In this case none of the 10 pods can be scheduled successfully, yet the 9 assumed pods keep holding node resources without ever running, which results in wasted resources and starvation for other pods.
What did you expect to happen?
Can we set a timeout for the coscheduling Permit plugin? After the maximum timeout, the whole PodGroup would be rejected, releasing its resources for other pods, and all the pods from this PodGroup would become unschedulable (placed in the backoff queue), even if some of them are already in the Permit `wait` status.
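For what it's worth, the framework's Permit extension point already returns a wait timeout alongside the status; here is a minimal sketch, assuming a hypothetical gang plugin and timeout value (this is not the coscheduling plugin's actual code):

```go
// A minimal sketch, assuming a hypothetical gang plugin: Permit returns a
// Status plus a timeout, and the framework rejects the pod (releasing its
// assumed resources) if it is still waiting when the timeout elapses.
package gangsketch

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// GangPermit is a hypothetical Permit plugin, not the coscheduling plugin.
type GangPermit struct{}

func (g *GangPermit) Name() string { return "GangPermit" }

// Permit parks the pod in waiting status for at most maxScheduleTime.
// When the gang is complete, another pod's Permit (or a controller) would
// Allow all waiting siblings; otherwise the framework times the pod out,
// rejects it, and its assumed resources return to the scheduler cache.
func (g *GangPermit) Permit(ctx context.Context, state *framework.CycleState,
	pod *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
	const maxScheduleTime = 10 * time.Minute // assumed per-gang ceiling
	return framework.NewStatus(framework.Wait, ""), maxScheduleTime
}
```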
I checked the coscheduling KEP; it mentions the MaxScheduleTime design, but that doesn't seem to have been implemented yet:
https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/42-podgroup-coscheduling/README.md?plain=1#L168
Not sure if I'm missing something; AFAIK there used to be a MaxScheduleTime for PodGroup, but I can't find it in the code base now.
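For reference, the shape the KEP describes would look roughly like this (a sketch; field names and types are assumed from the KEP text, not necessarily what shipped in the v1alpha1 API):

```go
// A sketch of the PodGroup API shape the KEP describes.
package gangsketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PodGroupSpec mirrors the KEP's design at a high level.
type PodGroupSpec struct {
	// MinMember is the minimal number of pods that must be schedulable
	// together for any of them to be bound.
	MinMember int32 `json:"minMember"`
	// MaxScheduleTime is the proposed upper bound on how long a PodGroup
	// may wait in Permit before the whole group is rejected.
	MaxScheduleTime *metav1.Duration `json:"maxScheduleTime,omitempty"`
}
```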
How can we reproduce it (as minimally and precisely as possible)?
No response
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
# paste output here
Scheduler Plugins version
/cc @Huang-Wei
This has been considered to some extent:
scheduler-plugins/pkg/coscheduling/coscheduling.go, lines 168 to 174 at e3484e3
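Paraphrasing the check at those lines (a sketch assuming only the 10% threshold described below; the permalink has the authoritative code):

```go
// A paraphrased sketch of the PostFilter gap check.
package gangsketch

// shouldKeepWaiting reports whether the gang should keep its waiting pods
// after one member fails Filter: if the unassigned share of minMember is
// at most 10%, later scheduling attempts may still close the gap, so the
// plugin waits instead of rejecting the whole group.
func shouldKeepWaiting(minMember, assigned int) bool {
	notAssignedPercentage := float32(minMember-assigned) / float32(minMember)
	return notAssignedPercentage <= 0.1
}
```

So `shouldKeepWaiting(10, 9)` is true (gap exactly 10%), while `shouldKeepWaiting(9, 8)` is false (gap about 11%).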
In your case the gap (1/10) happens to be exactly 10%, so the gang waits until timeout; the timeout is calculated from the global plugin timeout and the individual PodGroup's timeout.
If the PodGroup required 9 as minMember and the 9th pod failed, the gap (1/9, about 11%) would be greater than 10%, so it would proactively reject the previous 8 waiting pods:
scheduler-plugins/pkg/coscheduling/coscheduling.go, lines 176 to 183 at e3484e3
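And a sketch of what that rejection path does (the label key and helper shape are assumptions; see the permalink for the real code):

```go
// A sketch of the proactive rejection path: every waiting pod belonging to
// the same PodGroup is rejected, releasing its assumed cache resources.
package gangsketch

import "k8s.io/kubernetes/pkg/scheduler/framework"

const podGroupLabel = "pod-group.scheduling.sigs.k8s.io" // assumed label key

func rejectWaitingSiblings(h framework.Handle, namespace, pgName, plugin string) {
	h.IterateOverWaitingPods(func(wp framework.WaitingPod) {
		p := wp.GetPod()
		if p.Namespace == namespace && p.Labels[podGroupLabel] == pgName {
			wp.Reject(plugin, "optimistic rejection in PostFilter")
		}
	})
}
```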
Ah, I see. Thank you so much for the help :) @Huang-Wei