Scheduler: inconsistent pod startup time
willgleich opened this issue · comments
Area
- Scheduler
- Controller
- Helm Chart
- [ ] Documents
Other components
No response
What happened?
We are leveraging Pod Groups to deploy large jobs and ensure that all of a job's pods land on the Kubernetes cluster at the same time. We currently run jobs of 370+ pods that sometimes schedule quickly (within 2 minutes), but at other times take 20+ minutes to schedule. During these delays we are positive that the cluster has adequate capacity for the job.
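For context, this is roughly how we gang-schedule, assuming the scheduler-plugins PodGroup CRD and its `pod-group.scheduling.x-k8s.io` label (names, namespace, and image below are illustrative, not our real manifests):

```yaml
# Illustrative PodGroup: all 370 pods must be schedulable before any start.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: large-job          # hypothetical name
  namespace: default
spec:
  minMember: 370           # gang size: schedule all-or-nothing
  scheduleTimeoutSeconds: 60
---
# Each pod in the job carries the pod-group label so the coscheduling
# plugin associates it with the PodGroup above.
apiVersion: v1
kind: Pod
metadata:
  name: large-job-worker-0
  namespace: default
  labels:
    pod-group.scheduling.x-k8s.io: large-job
spec:
  schedulerName: scheduler-plugins-scheduler  # assumed scheduler name
  containers:
    - name: worker
      image: example.com/worker:latest        # illustrative image
```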
During the 20+ minute scheduling delays, we often see pod events such as "rejection in Unreserve" or "optimistic rejection in PostFilter".
The scheduler logs are flooded with entries like:
listers.go:63] can not retrieve list of objects using index : Index with name namespace does not exist
and event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name
What did you expect to happen?
I would expect scheduling to be deterministic in its timing. Something seems wrong with the coscheduling plugin, and it is potentially related to our scale.
How can we reproduce it (as minimally and precisely as possible)?
🤷 We struggle to reproduce this, and given the nature of our jobs it is difficult for us to take downtime to experiment.
Anything else we need to know?
During the periods when scheduling should be happening but is stalled, we notice very high CPU usage: the scheduler suddenly balloons to 20+ vCPUs. The screenshots below cover the window when the scheduler was failing to schedule, from roughly 10:20 (pods and PodGroups created) until 10:45, when scheduling finally succeeded.
During that same window we also notice a drop-off in api-server requests.
As noted below, we are running on the hotfix 1.26 branch.
Is there anything we can tune or configure? We already tried increasing our scheduleTimeoutSeconds from 10 back to the default of 60 seconds.
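For reference, a sketch of the two places this timeout can be set, with illustrative values. The per-PodGroup field is `spec.scheduleTimeoutSeconds`; the scheduler-wide knob is assumed to be `permitWaitingTimeSeconds` in the Coscheduling plugin args (please correct us if that is not the right field for this branch):

```yaml
# 1) Per PodGroup: how long the gang may wait before being rejected.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: large-job          # hypothetical name
spec:
  minMember: 370
  scheduleTimeoutSeconds: 60   # we raised this from 10 back to 60
---
# 2) Globally, in the scheduler configuration (assumed Coscheduling
#    plugin args; profile name is illustrative).
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: scheduler-plugins-scheduler
    pluginConfig:
      - name: Coscheduling
        args:
          permitWaitingTimeSeconds: 60
```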
Kubernetes version
$ kubectl version
Client Version: v1.28.3
Server Version: v1.26.6