kubernetes-sigs / scheduler-plugins

Repository for out-of-tree scheduler plugins based on the scheduler framework.

[Coscheduling] pending podgroup might lead to starvation by persistently assuming pods

lianghao208 opened this issue

Area

  • Scheduler
  • Controller
  • Helm Chart
  • Documents

Other components

No response

What happened?

  • A podgroup has a min member of 10; 9 pods have passed the filter and are already assumed, stuck in the permit stage waiting for the 10th pod to pass the filter.
  • Unfortunately, the 10th pod fails the filter, but the 9 assumed pods have already consumed resources (e.g. CPU, memory) in the scheduler cache due to the assume stage.

In this case, none of the 10 pods can be scheduled successfully, yet 9 of them persistently consume node resources without ever running on them, which wastes resources and starves other pods.
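For illustration, here is a minimal sketch of the kind of PodGroup involved, assuming the v1alpha1 PodGroup API of this repo; the import path and field names below are assumptions and vary across releases:

package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	// Assumed import path; older releases expose the API under pkg/apis/scheduling/v1alpha1.
	"sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

func main() {
	// A gang of 10: scheduling only succeeds if all 10 members can be placed together.
	pg := v1alpha1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{Name: "demo-gang", Namespace: "default"},
		Spec: v1alpha1.PodGroupSpec{
			MinMember: 10, // the 10-member quorum from the scenario above
		},
	}
	fmt.Printf("PodGroup %s requires at least %d pods to be scheduled together\n", pg.Name, pg.Spec.MinMember)
}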

What did you expect to happen?

Can we set a timeout for the coscheduling permit plugin? After the maximum timeout, the whole podgroup would be rejected, releasing the resources for other pods, and all pods from this podgroup would become unschedulable (placed in the backoff queue), even those already in the permit wait status.
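For context, the scheduler framework's Permit extension point already supports returning a Wait status together with a timeout. The following is a minimal sketch of that shape only; the plugin name and wait duration are illustrative and this is not the actual coscheduling implementation:

package example

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// gangPermit is an illustrative Permit plugin: it parks each pod of a gang at the
// Permit stage until the gang is either allowed, rejected, or the wait times out.
type gangPermit struct {
	handle   framework.Handle
	waitTime time.Duration // an upper bound on how long an assumed pod may wait
}

var _ framework.PermitPlugin = &gangPermit{}

func (g *gangPermit) Name() string { return "GangPermitExample" }

func (g *gangPermit) Permit(ctx context.Context, state *framework.CycleState, p *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
	// Returning Wait with a duration makes the framework hold this assumed pod as a
	// "waiting pod"; if it is neither allowed nor rejected within waitTime, the
	// framework rejects it and the pod goes back to the scheduling queue.
	return framework.NewStatus(framework.Wait, ""), g.waitTime
}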

I checked the coscheduling KEP; it mentions the MaxScheduleTime design, but it doesn't seem to have been implemented yet.
https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/kep/42-podgroup-coscheduling/README.md?plain=1#L168

Not sure if I'm missing something; as far as I can recall there was a MaxScheduleTime for the podgroup, but I can't find it in the code base now.

How can we reproduce it (as minimally and precisely as possible)?

No response

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# paste output here

Scheduler Plugins version

This has been somewhat considered:

// If the gap is less than/equal 10%, we may want to try subsequent Pods
// to see they can satisfy the PodGroup
notAssignedPercentage := float32(int(pg.Spec.MinMember)-assigned) / float32(pg.Spec.MinMember)
if notAssignedPercentage <= 0.1 {
	klog.V(4).InfoS("A small gap of pods to reach the quorum", "podGroup", klog.KObj(pg), "percentage", notAssignedPercentage)
	return &framework.PostFilterResult{}, framework.NewStatus(framework.Unschedulable)
}

In your case the gap happens to be exactly 10% (1 of 10), so the gang keeps waiting until the timeout; the timeout is calculated from the plugin's global timeout and the individual PodGroup's timeout.
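A minimal sketch of how such an effective timeout can be resolved, assuming a plugin-level default plus an optional ScheduleTimeoutSeconds field on the PodGroup spec; the helper name and field name here are illustrative, so check the release you run:

package example

import (
	"time"

	// Assumed import path for the PodGroup API; it differs across releases.
	"sigs.k8s.io/scheduler-plugins/apis/scheduling/v1alpha1"
)

// resolveWaitTimeout is an illustrative helper (not the repo's actual function):
// prefer the PodGroup's own ScheduleTimeoutSeconds when it is set, otherwise
// fall back to the plugin-wide default wait time.
func resolveWaitTimeout(pg *v1alpha1.PodGroup, defaultTimeout time.Duration) time.Duration {
	if pg != nil && pg.Spec.ScheduleTimeoutSeconds != nil {
		return time.Duration(*pg.Spec.ScheduleTimeoutSeconds) * time.Second
	}
	return defaultTimeout
}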

If the PodGroup instead required 9 as min member and the 9th pod failed, then the gap (1/9, roughly 11%) would be greater than 10%, so it would proactively reject the previous 8 waiting pods:

// It's based on an implicit assumption: if the nth Pod failed,
// it's inferrable other Pods belonging to the same PodGroup would be very likely to fail.
cs.frameworkHandler.IterateOverWaitingPods(func(waitingPod framework.WaitingPod) {
	if waitingPod.GetPod().Namespace == pod.Namespace && util.GetPodGroupLabel(waitingPod.GetPod()) == pg.Name {
		klog.V(3).InfoS("PostFilter rejects the pod", "podGroup", klog.KObj(pg), "pod", klog.KObj(waitingPod.GetPod()))
		waitingPod.Reject(cs.Name(), "optimistic rejection in PostFilter")
	}
})
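For completeness: once a waiting pod is rejected at Permit, the framework unreserves it, the assumed resources are released from the scheduler cache, and the pod is re-enqueued as unschedulable, going through backoff before it is retried.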

Ah, I see. Thank you so much for the help :) @Huang-Wei