kubernetes-sigs / scheduler-plugins

Repository for out-of-tree scheduler plugins based on scheduler framework.

Group of unschedulable pods causes scheduler (coscheduling) to be blocked

fg91 opened this issue

Area

  • Scheduler
  • Controller
  • Helm Chart
  • Documents

Other components

No response

What happened?

We use the coscheduling plugin for ML training workloads.

Due to quota and resource restrictions, it regularly happens that the required nodes cannot be scaled up and a group of pending pods remains in the cluster.

Once that happens, we sometimes observe that the scheduler fails to schedule even pods for which resources do exist. We don't see any events on the respective pods.

This behaviour coincides with the scheduler emitting millions of log lines within a short period of time saying:

"To-activate pod does not exist in unschedulablePods or backoffQ"

[Screenshots of the scheduler logs, 2023-12-04]

At the same time, we observe a high increase in CPU usage:

[Screenshot of scheduler CPU usage, 2023-12-04]

Sometimes restarting the scheduler is enough to make it schedule pods again. Sometimes, however, we need to manually delete the pods/pod groups referenced in those millions of log lines.


To us it seems as if the issue isn't just starvation but that the coscheduler generally stops doing anything meaningful.

What did you expect to happen?

We would expect the coscheduler to continue to schedule pods (especially those not part of a pod group) for which resources do exist.
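
For illustration (a hypothetical sketch, not an actual manifest from our cluster): a plain pod like the one below targets the same scheduler but carries no scheduling.x-k8s.io/pod-group label, and we would expect it to be scheduled as soon as a node has capacity for it.

apiVersion: v1
kind: Pod
metadata:
  name: standalone-pod            # hypothetical pod, not part of any PodGroup
  namespace: development
spec:
  schedulerName: scheduler-plugins-scheduler   # still handled by the coscheduler profile
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: "1"
        memory: 1Gi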

How can we reproduce it (as minimally and precisely as possible)?

Unfortunately we haven't been able to create a minimal working example.

We tried debugging this issue by running the scheduler locally with a visual debugger attached. We did hit the line that logs the error in the debugger but don't know how to proceed from there to debug this further. We would be very keen to get hints.

Anything else we need to know?

No response

Kubernetes version

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:44:59Z", GoVersion:"go1.19", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.10-gke.2700", GitCommit:"df2448d7739c44e41ee999b9065ce506cab86344", GitTreeState:"clean", BuildDate:"2023-06-22T09:25:37Z", GoVersion:"go1.19.9 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}

Scheduler Plugins version

`registry.k8s.io/scheduler-plugins/controller:v0.26.7`

May I know which coscheduler version you're running?

"To-activate pod does not exist in unschedulablePods or backoffQ"

May I know the PodGroup's minMember setting? And at the time you saw this log, how many pods were associated with its PodGroup?

And, did you use minResources for the PodGroup?

First of all, thanks a lot for the quick reply @Huang-Wei, much appreciated 🙏

May I know which coscheduler version you're running?

We currently use 0.26.7 for both the controller and the scheduler:

registry.k8s.io/scheduler-plugins/kube-scheduler:v0.26.7

"To-activate pod does not exist in unschedulablePods or backoffQ"

May I know the PodGroup's minMember setting?
And, did you use minResources for the PodGroup?

Unfortunately the state of the pod group from the logging screenshots above wasn't persisted. That being said, this is the pod group of a currently running distributed training of the "same kind":

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  creationTimestamp: "2023-12-04T15:18:19Z"
  generation: 1
  name: fd91ed718eb3747998fa-n0-0
  namespace: development
  ownerReferences:
  - apiVersion: kubeflow.org/v1
    blockOwnerDeletion: true
    controller: true
    kind: PyTorchJob
    name: fd91ed718eb3747998fa-n0-0
    uid: 4eb7fa19-7291-4d6b-b25a-3b60197529e3
  resourceVersion: "322041525"
  uid: 10600ea0-ed89-45bf-8ed8-c77b7670eb8e
spec:
  minMember: 16  # Was the same
  minResources:  # Was also set, exact resources might have been different
    cpu: "176"
    ephemeral-storage: 32Gi
    memory: 640Gi
    nvidia.com/gpu: "16"
status:
  occupiedBy: development/fd91ed718eb3747998fa-n0-0
  phase: Running
  running: 16  # Must have been 0

Our pod groups are created by the kubeflow training operator (see here and here). Our training jobs of this kind need exactly 16 replicas in order to run, meaning that the kubeflow training operator immediately creates 16 pods with the label scheduling.x-k8s.io/pod-group: fd91ed718eb3747998fa-n0-0. Does this answer your question?
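
For reference, each of these 16 worker pods looks roughly like the sketch below (trimmed and illustrative; the per-pod requests are simply the minResources above divided by 16):

apiVersion: v1
kind: Pod
metadata:
  name: fd91ed718eb3747998fa-n0-0-worker-0     # illustrative name
  namespace: development
  labels:
    scheduling.x-k8s.io/pod-group: fd91ed718eb3747998fa-n0-0   # ties the pod to the PodGroup
spec:
  schedulerName: scheduler-plugins-scheduler
  containers:
  - name: pytorch
    image: <training image>    # placeholder
    resources:
      requests:
        cpu: "11"              # ~176 / 16
        memory: 40Gi           # ~640Gi / 16
        nvidia.com/gpu: "1"    # 16 GPUs / 16 pods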

[...] and at the time you saw this log, how many pods were associated with its PodGroup?

In situations where we observe the above-mentioned logs, our GKE cluster typically can't scale up nodes because we have reached our GCE quota limits or GCE is out of resources in the respective zone for the GPU type required by this particular job. So running must have been 0.

After looking into the code, I think I know the root cause of the excessive "To-activate pod does not exist in unschedulablePods or backoffQ" log entries. It's because of suboptimal code that over-requeues a PodGroup's pods - it should avoid unnecessary requeue operations. It's certainly related to the CPU spike you observed, and quite likely related to the unschedulable symptom as well. I just need more time to dig into it and, if lucky, create a minimal example to reproduce this issue.

Could you also pass me the Coscheduling plugin's config arguments, and the spec of the PodGroup?

Once that happens, we sometimes observe that the scheduler fails to schedule even pods for which resources do exist.
To us it seems as if the issue isn't just starvation but that the coscheduler generally stops doing anything meaningful.

I think it might still be because of starvation. Due to the over-requeuing, under the hood each pod of the PodGroup requeues the other 15 pods once minResources is met, so it may end up in an endless loop. A simple verification step is to specify an explicit podGroupBackoffSeconds argument (like 60 or 90 secs; check #559 for more details) to see if the issue still surfaces. Could you help verify?

We already included podGroupBackoffSeconds in the config a few weeks ago and believe it improved the situation, in the sense that in most cases the scheduler now handles the starvation issue.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
# Compose all plugins in one profile
- schedulerName: scheduler-plugins-scheduler
  plugins:
    multiPoint:
      enabled:
      - name: Coscheduling
      - name: CapacityScheduling
      - name: NodeResourceTopologyMatch
      - name: NodeResourcesAllocatable
      disabled:
      - name: PrioritySort
  pluginConfig:
  - name: Coscheduling
    args:
      podGroupBackoffSeconds: 60

But to us it seems this bug is not related to starvation as we don't see the backoff between the log lines.

Could you also pass me the [...] spec of PodGroup?

Are you looking for something other than the spec shown here? If yes, could you please explain where to find it?

We have tried to reproduce such incidents in our staging cluster but unfortunately haven't succeeded in triggering this on purpose.
In case you have an idea you want to try, we'd be more than happy to build an image off from a branch, deploy it, and observe the behaviour.

@fg91 I created an experimental fix #684. It's based on the latest code of release-1.26; you can apply this patch and build an image to see if it helps. Testing in my local env shows it eliminates unnecessary requeue operations.

PS: you can switch to this branch by running git pr 684, where pr is a git alias that should be added to ~/.gitconfig:

[alias]
  pr = "!f() { git fetch -fu ${2:-$(git remote |grep ^upstream || echo origin)} refs/pull/$1/head:pr/$1 && git checkout pr/$1; }; f"

Thanks a lot again @Huang-Wei for looking into this issue so quickly 🙏
I deployed an image with your fix in our prod cluster and will observe over the next days whether we see fewer log lines/CPU spikes.
Complete hangs luckily/unfortunately happened only every few weeks after we set podGroupBackoffSeconds, so we might have to observe this for a while.

Reporting first observations:

The log line

"To-activate pod does not exist in unschedulablePods or backoffQ"

was logged only 8 times in the last 12 hours, so your PR seems to have fixed the large volume of this log line 👍

We also didn't observe any hanging of the scheduler (but as I said, since this happened only every 2-3 weeks on average, it's probably too early to say for sure it doesn't happen any more).


We still saw strong spikes in cpu utilization, much higher than what is requested by default in the helm chart. The scheduler continued to work though.

[Screenshot of scheduler CPU utilization, 2023-12-13]

In the last 12 hours, we also saw a lot of occurrences of the following logs:

312,697 log entries:

{
  "insertId": "...",
  "jsonPayload": {
    "pid": "1",
    "message": "can not retrieve list of objects using index : Index with name namespace does not exist"
  },
  "resource": {
    "type": "k8s_container"
  },
  "severity": "WARNING",
  "sourceLocation": {
    "file": "listers.go",
    "line": "63"
  },
  ...
}

21,775 log entries:

{
  "jsonPayload": {
    "pid": "1",
    "message": "Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"fa0ba30c0bba246c4b5d-n1-0-dn8-0.17a05a1306f9aaf7\", GenerateName:\"\", Namespace:\"development\", SelfLink:\"\", UID:\"a50f968a-c2aa-4a1b-837d-bbb720638520\", ResourceVersion:\"16159314\", Generation:0, CreationTimestamp:time.Date(2023, time.December, 13, 9, 28, 16, 0, time.Local), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry{v1.ManagedFieldsEntry{Manager:\"kube-scheduler\", Operation:\"Update\", APIVersion:\"events.k8s.io/v1\", Time:time.Date(2023, time.December, 13, 9, 28, 16, 0, time.Local), FieldsType:\"FieldsV1\", FieldsV1:(*v1.FieldsV1)(0xc007585b18), Subresource:\"\"}}}, EventTime:time.Date(2023, time.December, 13, 9, 28, 16, 913947000, time.Local), Series:(*v1.EventSeries)(0xc00a134d20), ReportingController:\"scheduler-plugins-scheduler\", ReportingInstance:\"scheduler-plugins-scheduler-scheduler-plugins-scheduler-7969dd7c64-dgnms\", Action:\"Scheduling\", Reason:\"FailedScheduling\", Regarding:v1.ObjectReference{Kind:\"Pod\", Namespace:\"development\", Name:\"fa0ba30c0bba246c4b5d-n1-0-dn8-0\", UID:\"911c41d1-0645-46db-9ac1-f98219817962\", APIVersion:\"v1\", ResourceVersion:\"337782296\", FieldPath:\"\"}, Related:(*v1.ObjectReference)(nil), Note:\"0/212 nodes are available: 137 node(s) had untolerated taint {...}, 19 Insufficient cpu, 19 Insufficient memory, 20 Insufficient nvidia.com/gpu, 22 node(s) had untolerated taint {cloud.google.com/gke-spot: true}, 28 node(s) had untolerated taint {...}, 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/212 nodes are available: 192 Preemption is not helpful for scheduling, 20 No preemption victims found for incoming pod., can not find pod group, 0/212 nodes are available: 1 No victims found on node ... for preemptor pod fa0ba30c0bba246c4b5d-n1-0-dn8-0, 1 No victims found on node ... for preemptor pod fa0ba30c0bba246c4b5d-n1-0-dn8-0, 1 No victims found on node ... for preemptor pod fa0ba30c0bba246c4b5d-n1-0-dn8-0, 1 No victims found on node ... for preemp ...\", Type:\"Warning\", DeprecatedSource:v1.EventSource{Component:\"\", Host:\"\"}, DeprecatedFirstTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeprecatedLastTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeprecatedCount:0}': 'Event \"fa0ba30c0bba246c4b5d-n1-0-dn8-0.17a05a1306f9aaf7\" is invalid: series.count: Invalid value: \"\": should be at least 2' (will not retry!)"
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
        ...
    }
  },
  "timestamp": "2023-12-13T09:28:35.846143132Z",
  "severity": "ERROR",
  "labels": {
    ...
  },
  "sourceLocation": {
    "file": "event_broadcaster.go",
    "line": "253"
  },
  ...
}

Are these expected?

Thanks 🙏

was logged only 8 times in the last 12 hours

Good to hear that. Yes, this log is still possible - a gang of pods may already have been moved to activeQ by a relevant system event by the time the co-scheduling plugin's activateSiblings function is triggered. But its frequency should be reduced a lot.

We still saw strong spikes in cpu utilization, much higher than what is requested by default in the helm chart.

I'm not sure if it's relevant atm.

In the last 12 hours, we also saw a lot of occurrences of the following logs

The first message is expected; #654 introduced the fix in v0.27.

The second message seems to be complaining about API incompatibility of Series:(*v1.EventSeries). Upgrading your API server to a version equal to or higher than the scheduler's should resolve it.