Incorrect resource accounting for pods that are scheduled (allocated resources) but not Running
ktarplee opened this issue · comments
Here is an example (trimmed down slightly) of a pod that is not counted by kubectl-view-allocations, but is counted by the kube-scheduler as consuming `nvidia.com/gpu` (and other) resources:
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2020-10-19T19:01:59Z"
  name: test
  namespace: data-ingest
spec:
  containers:
  - command:
    - sh
    - -exc
    - echo hello
    image: busybox:does-not-exist
    imagePullPolicy: Always
    name: msr
    resources:
      limits:
        cpu: "128"
        memory: 256Gi
        nvidia.com/gpu: "16"
      requests:
        cpu: "1"
        memory: 256Mi
        nvidia.com/gpu: "16"
  initContainers:
  - command:
    - sh
    - -exc
    - echo hello init
    image: minio/mc
    imagePullPolicy: Always
    name: get-input-data
    resources:
      limits:
        cpu: "1"
        memory: 512Mi
      requests:
        cpu: 256m
        memory: 256Mi
  nodeName: dgx-1
  priority: 100
  priorityClassName: free
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-10-23T20:29:18Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-10-23T20:29:02Z"
    message: 'containers with unready status: [msr]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-10-23T20:29:02Z"
    message: 'containers with unready status: [msr]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-10-23T20:29:02Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: busybox:does-not-exist
    imageID: ""
    lastState: {}
    name: msr
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "busybox:does-not-exist"
        reason: ImagePullBackOff
  hostIP: 10.1.4.5
  initContainerStatuses:
  - containerID: docker://cc4cc2745ae7a23d0f06d4879fab1b6207b301b44379186db0d172ded1af5956
    image: minio/mc:latest
    imageID: docker-pullable://minio/mc@sha256:ac82bb6219b60b662e28c6f0d642f36bbf7803fc74929c11319f4592203fa752
    lastState: {}
    name: get-input-data
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: docker://cc4cc2745ae7a23d0f06d4879fab1b6207b301b44379186db0d172ded1af5956
        exitCode: 0
        finishedAt: "2020-10-23T20:29:18Z"
        reason: Completed
        startedAt: "2020-10-23T20:29:04Z"
  phase: Pending
  podIP: 10.42.15.92
  podIPs:
  - ip: 10.42.15.92
  qosClass: Burstable
  startTime: "2020-10-23T20:29:02Z"
```
It looks like the issue might be with this line in `src/main.rs:183`:

```rust
.and_then(|ps| ps.node_name.as_ref().map(|s| s == "Running"))
```
Also, it looks like you look for running containers right after that; in my case the containers are not running (yet).

You only consider pods whose phase is Running. If you want kubectl-view-allocations to report truly what resources are available on a node with `kubectl-view-allocations -g node -g resource`, then we need to consider not just Running pods but Pending ones as well.

The condition we might want is that `nodeName` is set, or that there is an entry in the `status.conditions` array with type `PodScheduled` and status `"True"`.
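The proposed condition could be sketched roughly like this. This is a minimal, self-contained illustration: `PodSpec`, `PodStatus`, `PodCondition`, and `is_scheduled` are simplified stand-ins for the real k8s-openapi types and whatever helper the project would actually use, not the project's code.

```rust
// Hypothetical simplified versions of the Kubernetes API types.
#[derive(Default)]
struct PodCondition {
    r#type: String,
    status: String,
}

#[derive(Default)]
struct PodStatus {
    phase: Option<String>,
    conditions: Option<Vec<PodCondition>>,
}

#[derive(Default)]
struct PodSpec {
    node_name: Option<String>,
}

// A pod's requests count against a node once it is scheduled there,
// even if its containers are not Running yet (e.g. ImagePullBackOff),
// so check nodeName / the PodScheduled condition instead of the phase.
fn is_scheduled(spec: &PodSpec, status: &PodStatus) -> bool {
    spec.node_name.is_some()
        || status
            .conditions
            .as_ref()
            .map(|cs| {
                cs.iter()
                    .any(|c| c.r#type == "PodScheduled" && c.status == "True")
            })
            .unwrap_or(false)
}

fn main() {
    // Pending pod that is already scheduled, like the example above:
    let spec = PodSpec {
        node_name: Some("dgx-1".to_string()),
    };
    let status = PodStatus {
        phase: Some("Pending".to_string()),
        conditions: None,
    };
    assert!(is_scheduled(&spec, &status));

    // A pod with no node assignment and no conditions is not counted:
    assert!(!is_scheduled(&PodSpec::default(), &PodStatus::default()));
}
```

Checking either signal would cover both the example pod above (which has `nodeName: dgx-1` and a `PodScheduled: "True"` condition while still `phase: Pending`) and normal Running pods.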
This can be recreated by applying this (just need a bad image name):
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: kyle-test
spec:
  template:
    # This is the pod template
    spec:
      containers:
      - name: main
        image: nvidia/cuda:does-not-exist
        args: ['sleep', 'infinity']
        resources:
          limits:
            cpu: 1000m
            memory: 1Gi
            nvidia.com/gpu: 1
      restartPolicy: Never
```