Throughput degradation when scheduling daemonset pods
alculquicondor opened this issue
What happened?
#119779 added some map queries and creations that add non-negligible latency.
The pprof profile reveals the following lines inside findNodesThatFitPod as too expensive:
- 22.2%: kubernetes/pkg/scheduler/schedule_one.go, line 492 (at 44bd04c)
- 21.7%: kubernetes/pkg/scheduler/schedule_one.go, line 489 (at 44bd04c)
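For context on why these inserts are expensive: a Go map created without a size hint grows and rehashes repeatedly as entries are added, and here an entry is added per filtered-out node, for every pod. Below is a minimal, illustrative benchmark of that effect; the key/value types are stand-ins, not the real *framework.Status map from schedule_one.go:

```go
package scheduler_test

import (
	"strconv"
	"testing"
)

// 5k node names, matching the cluster size in the reproduction steps.
var nodeNames = func() []string {
	names := make([]string, 5000)
	for i := range names {
		names[i] = "node-" + strconv.Itoa(i)
	}
	return names
}()

// A map that starts at the default size rehashes repeatedly as it grows.
func BenchmarkGrowingMap(b *testing.B) {
	for i := 0; i < b.N; i++ {
		m := map[string]struct{}{}
		for _, name := range nodeNames {
			m[name] = struct{}{}
		}
	}
}

// A map sized up front with make(map, n) never rehashes.
func BenchmarkPreallocatedMap(b *testing.B) {
	for i := 0; i < b.N; i++ {
		m := make(map[string]struct{}, len(nodeNames))
		for _, name := range nodeNames {
			m[name] = struct{}{}
		}
	}
}
```

At the 300 pods/s target this per-pod map construction sits directly on the scheduling hot path, which is why it shows up so prominently in the profile.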
What did you expect to happen?
The scheduler to sustain a throughput of 300 pods/s.
How can we reproduce it (as minimally and precisely as possible)?
Schedule 5k daemonset pods, giving 300 QPS to the scheduler.
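A minimal sketch of the load-generating side, assuming a reachable cluster and a local kubeconfig. The pod spec is simplified (real daemonset pods carry per-node required affinity, omitted here), so treat this as illustrative rather than the actual scalability-test harness:

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Cap pod creations at 300/s so the scheduler sees a steady 300 QPS.
	limiter := rate.NewLimiter(300, 1)
	ctx := context.Background()

	for i := 0; i < 5000; i++ {
		_ = limiter.Wait(ctx)
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("ds-like-%d", i)},
			Spec: corev1.PodSpec{
				// Real daemonset pods also pin each pod to one node via
				// required node affinity; omitted for brevity.
				Containers: []corev1.Container{{
					Name:  "pause",
					Image: "registry.k8s.io/pause:3.9",
				}},
			},
		}
		if _, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```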
Anything else we need to know?
No response
Kubernetes version
v1.30
/sig scheduling
Before #119779, there was this line:
https://github.com/sanposhiho/kubernetes/blob/5ab23179473b621e911511b03693234b60a38033/pkg/scheduler/schedule_one.go#L489
which also had a map access. So we can quite confidently say that the culprit is just the map insert at kubernetes/pkg/scheduler/schedule_one.go, line 492 (at 44bd04c).
@sanposhiho, @AxeZhan I have two alternatives:
- The obvious one is to preallocate the map to the number of nodes (see the sketch after this list). I suspect this might be enough, but it's still somewhat wasteful.
- Change the preemption logic to assume that if a Node is not in the status map, it's UnschedulableAndUnresolvable (sketched further below).
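For reference, a minimal sketch of what option 1 could look like, assuming the diagnosis map is built once per pod over all nodes. The names here (buildDiagnosis, the placeholder status type) are illustrative, not the actual identifiers in schedule_one.go:

```go
package sketch

// Placeholder standing in for *framework.Status from
// k8s.io/kubernetes/pkg/scheduler/framework; nil means Success,
// mirroring the real type's IsSuccess behavior.
type status struct{ code int }

func (s *status) isSuccess() bool { return s == nil || s.code == 0 }

// buildDiagnosis sketches option 1: size the per-pod status map for
// the worst case (one entry per node) so inserts never trigger a
// rehash. This trades a larger upfront allocation, wasteful when most
// nodes pass Filter, for a stable insert cost on the hot path.
func buildDiagnosis(nodeNames []string, filter func(string) *status) map[string]*status {
	nodeToStatus := make(map[string]*status, len(nodeNames)) // the size hint is the fix
	for _, name := range nodeNames {
		if st := filter(name); !st.isSuccess() {
			nodeToStatus[name] = st
		}
	}
	return nodeToStatus
}
```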
Option 2 also solves #124705. However, external PostFilter plugins would need to be aware of the change, though I'm not aware of any in the wild that implement behavior similar to preemption. We can leave an ACTION REQUIRED note. But was this path working in 1.28 and older?
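A minimal sketch of the consumer-side convention option 2 implies; statusForNode is a hypothetical helper, not an existing framework API:

```go
package sketch

// Stand-ins for framework.Code; the real constants are
// framework.Unschedulable and framework.UnschedulableAndUnresolvable.
type code int

const (
	unschedulable code = iota
	unschedulableAndUnresolvable
)

// statusForNode encodes option 2's convention: a node absent from the
// map is treated as UnschedulableAndUnresolvable. Filter then no
// longer needs to insert an entry for every skipped node, and
// preemption skips absent nodes as candidates. Existing PostFilter
// plugins that iterate the map expecting one entry per node would
// need to adopt this same convention.
func statusForNode(nodeToCode map[string]code, nodeName string) code {
	if c, ok := nodeToCode[nodeName]; ok {
		return c
	}
	return unschedulableAndUnresolvable // absent: not a preemption candidate
}
```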
Regardless of the decision, I think we should accompany the fix with an integration test for preemption by daemonset pods.
Can we try (1) first, ask Google for a scalability test again, and then consider (2) if (1) doesn't improve the throughput enough?
(2) could break existing PostFilter plugins. PostFilter is not only for preemption, so there may be custom PostFilter plugins in the wild that rely on the status map containing the UnschedulableAndUnresolvable nodes. It'd be best if we could avoid this breakage.
Since (1) is a really small fix, I'm also wondering whether we can run tests after #124714, and then consider (2) based on the test results.
> Regardless of the decision, I think we should accompany the fix with an integration test for preemption by daemonset pods.
Are these "integration tests" for performance or correctness?