kubernetes / kubernetes

Production-Grade Container Scheduling and Management

Home Page: https://kubernetes.io

1.20 regression: pods failing to terminate

howardjohn opened this issue · comments

What happened:

I am still unwinding the pieces, but what I know for certain:

after patching a deployment, the old pod sticks around for over a minute (or the test times out after a minute), despite terminationGracePeriodSeconds: 30. The pod's status contains "The container could not be located when the pod was deleted. The container used to be Running", a message newly added in #95364.

Controller manager shows this:

2020-12-14T01:34:59.362946898Z stderr F I1214 01:34:59.362445 1 event.go:291] "Event occurred" object="istio-system/istiod-646465db66" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulDelete" message="Deleted pod: istiod-646465db66-xzk4g"

One minute later, our tests call .List() on pods, and the removed pod still shows up (pod spec attached below).

These issues can be reproduced fairly reliably in our CI environment; my PR to update to 1.20 is here: istio/istio#29536. We were previously on 1.19.1 and are attempting to upgrade to 1.20. We already run 1.20 for a subset of our tests, which use just a single cluster. The tests that are failing run 5 kind clusters at once. It's possible this increased load is responsible, but the tests also do different things (for example, we don't patch the deployment in the single-cluster tests, which work on 1.20), so I cannot say for sure yet what the root cause is.
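
For reference, a minimal client-go sketch of the kind of check our tests do (the kubeconfig path is a made-up placeholder; the namespace and label come from the pod spec below, and this is not our actual test code):

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig path for one of the kind clusters.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kind-cluster.kubeconfig")
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Wait a minute after the deployment patch, then list pods and see
	// whether the old ReplicaSet's pod is still reported by the apiserver.
	time.Sleep(time.Minute)
	pods, err := cs.CoreV1().Pods("istio-system").List(context.Background(),
		metav1.ListOptions{LabelSelector: "app=istiod"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("%s deletionTimestamp=%v phase=%s\n", p.Name, p.DeletionTimestamp, p.Status.Phase)
	}
}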

Pod spec:

metadata:
  annotations:
    prometheus.io/port: '15014'
    prometheus.io/scrape: 'true'
    sidecar.istio.io/inject: 'false'
  creationTimestamp: '2020-12-14T01:34:41Z'
  deletionGracePeriodSeconds: '30'
  deletionTimestamp: '2020-12-14T01:35:29Z'
  generateName: istiod-646465db66-
  labels:
    app: istiod
    install.operator.istio.io/owning-resource: unknown
    istio: pilot
    istio.io/rev: default
    operator.istio.io/component: Pilot
    pod-template-hash: 646465db66
    sidecar.istio.io/inject: 'false'
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:prometheus.io/port: {}
          f:prometheus.io/scrape: {}
          f:sidecar.istio.io/inject: {}
        f:generateName: {}
        f:labels:
          .: {}
          f:app: {}
          f:install.operator.istio.io/owning-resource: {}
          f:istio: {}
          f:istio.io/rev: {}
          f:operator.istio.io/component: {}
          f:pod-template-hash: {}
          f:sidecar.istio.io/inject: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"d967b26d-4e4c-4c1f-bc4c-1f86e7fd3128"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:containers:
          k:{"name":"discovery"}:
            .: {}
            f:args: {}
            f:env:
              .: {}
              k:{"name":"CENTRAL_ISTIOD"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"CLUSTER_ID"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"ENABLE_ADMIN_ENDPOINTS"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"ENABLE_LEGACY_FSGROUP_INJECTION"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"EXTERNAL_ISTIOD"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"ISTIOD_ADDR"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"JWT_POLICY"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"KUBECONFIG"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"PILOT_CERT_PROVIDER"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"PILOT_ENABLED_SERVICE_APIS"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"PILOT_ENABLE_ANALYSIS"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_INBOUND"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_OUTBOUND"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"PILOT_ENABLE_WORKLOAD_ENTRY_AUTOREGISTRATION"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"PILOT_TRACE_SAMPLING"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"POD_NAME"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:fieldRef:
                    .: {}
                    f:apiVersion: {}
                    f:fieldPath: {}
              k:{"name":"POD_NAMESPACE"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:fieldRef:
                    .: {}
                    f:apiVersion: {}
                    f:fieldPath: {}
              k:{"name":"REVISION"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"SERVICE_ACCOUNT"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:fieldRef:
                    .: {}
                    f:apiVersion: {}
                    f:fieldPath: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":15010,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
              k:{"containerPort":15017,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
              k:{"containerPort":8080,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:protocol: {}
            f:readinessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:initialDelaySeconds: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:resources:
              .: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:securityContext:
              .: {}
              f:capabilities:
                .: {}
                f:drop: {}
              f:runAsGroup: {}
              f:runAsNonRoot: {}
              f:runAsUser: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
            f:volumeMounts:
              .: {}
              k:{"mountPath":"/etc/cacerts"}:
                .: {}
                f:mountPath: {}
                f:name: {}
                f:readOnly: {}
              k:{"mountPath":"/etc/istio/config"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/var/lib/istio/inject"}:
                .: {}
                f:mountPath: {}
                f:name: {}
                f:readOnly: {}
              k:{"mountPath":"/var/run/secrets/istio-dns"}:
                .: {}
                f:mountPath: {}
                f:name: {}
              k:{"mountPath":"/var/run/secrets/remote"}:
                .: {}
                f:mountPath: {}
                f:name: {}
                f:readOnly: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext:
          .: {}
          f:fsGroup: {}
        f:serviceAccount: {}
        f:serviceAccountName: {}
        f:terminationGracePeriodSeconds: {}
        f:volumes:
          .: {}
          k:{"name":"cacerts"}:
            .: {}
            f:name: {}
            f:secret:
              .: {}
              f:defaultMode: {}
              f:optional: {}
              f:secretName: {}
          k:{"name":"config-volume"}:
            .: {}
            f:configMap:
              .: {}
              f:defaultMode: {}
              f:name: {}
            f:name: {}
          k:{"name":"inject"}:
            .: {}
            f:configMap:
              .: {}
              f:defaultMode: {}
              f:name: {}
            f:name: {}
          k:{"name":"istio-kubeconfig"}:
            .: {}
            f:name: {}
            f:secret:
              .: {}
              f:defaultMode: {}
              f:optional: {}
              f:secretName: {}
          k:{"name":"local-certs"}:
            .: {}
            f:emptyDir:
              .: {}
              f:medium: {}
            f:name: {}
    manager: kube-controller-manager
    operation: Update
    time: '2020-12-14T01:34:41Z'
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"10.30.0.49"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: '2020-12-14T01:35:00Z'
  name: istiod-646465db66-xzk4g
  namespace: istio-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: istiod-646465db66
    uid: d967b26d-4e4c-4c1f-bc4c-1f86e7fd3128
  resourceVersion: '12379'
  uid: e47b3bbc-7596-403c-9dd3-5fa884823e9d
spec:
  containers:
  - args:
    - discovery
    - --monitoringAddr=:15014
    - --log_output_level=default:info
    - --domain
    - cluster.local
    - --keepaliveMaxServerConnectionAge
    - 30m
    env:
    - name: REVISION
      value: default
    - name: JWT_POLICY
      value: first-party-jwt
    - name: PILOT_CERT_PROVIDER
      value: istiod
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: SERVICE_ACCOUNT
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.serviceAccountName
    - name: KUBECONFIG
      value: /var/run/secrets/remote/config
    - name: ENABLE_ADMIN_ENDPOINTS
      value: 'true'
    - name: ENABLE_LEGACY_FSGROUP_INJECTION
      value: 'false'
    - name: PILOT_ENABLED_SERVICE_APIS
      value: 'true'
    - name: PILOT_ENABLE_WORKLOAD_ENTRY_AUTOREGISTRATION
      value: 'true'
    - name: PILOT_TRACE_SAMPLING
      value: '1'
    - name: PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_OUTBOUND
      value: 'true'
    - name: PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_INBOUND
      value: 'true'
    - name: ISTIOD_ADDR
      value: istiod.istio-system.svc:15012
    - name: PILOT_ENABLE_ANALYSIS
      value: 'false'
    - name: CLUSTER_ID
      value: cluster-2
    - name: EXTERNAL_ISTIOD
      value: 'true'
    - name: CENTRAL_ISTIOD
      value: 'false'
    image: localhost:5000/pilot:istio-testing
    imagePullPolicy: IfNotPresent
    name: discovery
    ports:
    - containerPort: 8080
      protocol: TCP
    - containerPort: 15010
      protocol: TCP
    - containerPort: 15017
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      handler:
        httpGet:
          path: /ready
          port: 8080
          scheme: HTTP
      initialDelaySeconds: 1
      periodSeconds: 3
      successThreshold: 1
      timeoutSeconds: 5
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
    securityContext:
      capabilities:
        drop:
        - ALL
      runAsGroup: '1337'
      runAsNonRoot: true
      runAsUser: '1337'
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/istio/config
      name: config-volume
    - mountPath: /var/run/secrets/istio-dns
      name: local-certs
    - mountPath: /etc/cacerts
      name: cacerts
      readOnly: true
    - mountPath: /var/run/secrets/remote
      name: istio-kubeconfig
      readOnly: true
    - mountPath: /var/lib/istio/inject
      name: inject
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: istiod-service-account-token-6wzt8
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: cluster3-control-plane
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: '1337'
  serviceAccount: istiod-service-account
  serviceAccountName: istiod-service-account
  terminationGracePeriodSeconds: '30'
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: '300'
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: '300'
  volumes:
  - name: local-certs
    volumeSource:
      emptyDir:
        medium: Memory
  - name: cacerts
    volumeSource:
      secret:
        defaultMode: 420
        optional: true
        secretName: cacerts
  - name: istio-kubeconfig
    volumeSource:
      secret:
        defaultMode: 420
        optional: true
        secretName: istio-kubeconfig
  - name: inject
    volumeSource:
      configMap:
        defaultMode: 420
        localObjectReference:
          name: istio-sidecar-injector
  - name: config-volume
    volumeSource:
      configMap:
        defaultMode: 420
        localObjectReference:
          name: istio
  - name: istiod-service-account-token-6wzt8
    volumeSource:
      secret:
        defaultMode: 420
        secretName: istiod-service-account-token-6wzt8
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: '2020-12-14T01:34:41Z'
    status: 'True'
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: '2020-12-14T01:35:00Z'
    message: 'containers with unready status: [discovery]'
    reason: ContainersNotReady
    status: 'False'
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: '2020-12-14T01:35:00Z'
    message: 'containers with unready status: [discovery]'
    reason: ContainersNotReady
    status: 'False'
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: '2020-12-14T01:34:41Z'
    status: 'True'
    type: PodScheduled
  containerStatuses:
  - image: localhost:5000/pilot:istio-testing
    lastState:
      terminated:
        exitCode: 137
        finishedAt: null
        message: The container could not be located when the pod was deleted.  The
          container used to be Running
        reason: ContainerStatusUnknown
        startedAt: null
    name: discovery
    started: false
    state:
      waiting:
        reason: ContainerCreating
  hostIP: 172.18.0.3
  phase: Running
  podIP: 10.30.0.49
  podIPs:
  - ip: 10.30.0.49
  qosClass: Burstable
  startTime: '2020-12-14T01:34:41Z'

What you expected to happen:

1.20 to perform as well as 1.19, or have a release note explaining any issues/changes required

How to reproduce it (as minimally and precisely as possible):

I can consistently reproduce it in https://github.com/istio/istio/pull/29536/files. The artifacts dump out a bunch of info, including the kind logs. I don't expect anyone to dig through all of those, but I am not sure where to look next, so if you let me know what info is needed I can capture it.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.20.0
  • Cloud provider or hardware configuration: Running 5 kind clusters, inside a pod running in GKE (prow)
  • OS (e.g: cat /etc/os-release): ubuntu

/sig node
cc @deads2k - I have no clue if your PR caused this or not since I don't understand this codepath well, but due to the timing and status message it seems possibly related

After bumping our 1-minute timeout up, it seems the pod does get terminated, but not for 100s, over 3x longer than the termination grace period. I am fairly sure the process actually exits within a couple of seconds of SIGTERM as well. So the issue may be more that there is (seemingly) a regression in the time it takes to terminate a pod, rather than some bug causing pods to never terminate.
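
For anyone trying to quantify this, a rough client-go sketch of how one might time the gap between asking for the delete and the pod actually disappearing from the apiserver (timePodDeletion is a hypothetical helper, not part of our tests):

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// timePodDeletion watches a single pod and returns how long it takes, from
// now, for the apiserver to report the pod as deleted.
func timePodDeletion(ctx context.Context, cs kubernetes.Interface, ns, name string) (time.Duration, error) {
	w, err := cs.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return 0, err
	}
	defer w.Stop()

	start := time.Now()
	for ev := range w.ResultChan() {
		if ev.Type == watch.Deleted {
			return time.Since(start), nil
		}
	}
	return 0, fmt.Errorf("watch closed before %s/%s was deleted", ns, name)
}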

Note that closing holes in pod status reporting has historically exposed actual bugs in the container runtime and volume subsystems. I may have seen evidence of similar behavior as described above in cri-o after the openshift 1.20 rebase landed in CI a couple of weeks ago, so it's possible the bug is in another subsystem. I will review some of the runs for symptoms of this.

/cc

#95364 and #95561 may be related. Could running 5 kind clusters at the same time cause the runtime to miss data? I will try to reproduce it in my environment.

kind create cluster --image=daocloud.io/gcr_containers/kindest-node:v1.20.0  --name cluster1
kind create cluster --image=daocloud.io/gcr_containers/kindest-node:v1.20.0  --name cluster2
kind create cluster --image=daocloud.io/gcr_containers/kindest-node:v1.20.0  --name cluster3

kubectl config use-context  kind-cluster1
kubectl create deploy app1 --image=daocloud.io/daocloud/nginx:0.1
kubectl scale deploy app1 --replicas=2
kubectl replace -f app1.yaml

kubectl config use-context  kind-cluster2
kubectl create deploy app1 --image=daocloud.io/daocloud/nginx:0.1
kubectl scale deploy app1 --replicas=2
kubectl replace -f app1.yaml

kubectl config use-context  kind-cluster3
kubectl create deploy app1 --image=daocloud.io/daocloud/nginx:0.1
kubectl scale deploy app1 --replicas=2
kubectl replace -f app1.yaml
...
kubectl config use-context  kind-cluster1
kubectl replace -f app2.yaml

kubectl config use-context  kind-cluster2
kubectl replace -f app2.yaml

kubectl config use-context  kind-cluster2
kubectl replace -f app2.yaml

app1.yaml uses the dao-2048 image
app2.yaml changes the image back to nginx

I will run the script above with 3 clusters on my server.

pod app1-5f6b449d85-tmhl9
RUNNING 01:30:59
Terminating 01:31:01
Terminating 01:31:44
Deleted 01:31:45
The pod was Terminating for about 45s in my env, despite terminationGracePeriodSeconds: 30.

32s to delete a pod on Kubernetes 1.18.6:

[root@dce-10-6-150-61 ~]# time kubectl delete pod dao-2048-dao-2048-657b7685f8-mm4hg -n paco
pod "dao-2048-dao-2048-657b7685f8-mm4hg" deleted

real	0m32.570s
user	0m0.110s
sys	0m0.041s

42s to delete a pod on kubeadm 1.20.1 on a similar VM:

[root@daocloud ~]# time kubectl delete pod app1-6dcdd8ccb6-257bb
pod "app1-6dcdd8ccb6-257bb" deleted

real	0m42.975s
user	0m0.100s
sys	0m0.015s

https://gist.github.com/pacoxu/c0662ec65e18470af1bc969c94a3f818#file-kubelet-and-controller-manager-log-L273

Debug log with kubelet and controller-manager logs (I am in China, UTC+0800, so 06:53:14 in the controller-manager log is the same as 14:53:14 local time):

  • 14:53:14 DELETE https://10.6.177.40:6443/api/v1/namespaces/default/pods/app1-859d7f4f9c-9wjtg 200
  • 14:53:14 SyncLoop (DELETE, "api")
  • 14:53:14 Ignoring inactive pod default/app1-859d7f4f9c-9wjtg in state Running, deletion time 2020-12-24 06:53:44 +0000 UTC(delete after 30s)
  • 14:53:18 is terminated, but some containers are still running(every 10s)
  • 14:53:44 cni.go:382] Deleting default_app1-859d7f4f9c-9wjtg/cdb662 from network loopback/cni-loopback netns "/proc/105454/ns/net"
  • 14:53:44 cni.go:390] Deleted default_app1-859d7f4f9c-9wjtg/cdb6623b2 from network loopback/cni-loopback
  • 14:53:44 dce-10-7-177-91 kubelet[107968]: I1224 14:53:44.880281 107968 cni.go:382] Deleting default_app1-859d7f4f9c-9wjtg/cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 from network calico/k8s-pod-network netns "/proc/105454/ns/net"
  • 14:53:44 generic.go:191] GenericPLEG: Relisting
  • 14:53:44 generic.go:155] GenericPLEG: 2d394ca3-4712-4a4c-991b-1ebffc3b3bf2/6edd553e60d5a927ae9: running -> exited
  • 14:53:44 kuberuntime_manager.go:958] getSandboxIDByPodUID got sandbox IDs ["cdb6623b209"] for pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)"
  • 14:53:44 config.go:278] Setting pods for source api
  • 14:53:44 kubelet.go:1901] SyncLoop (DELETE, "api"): "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)"
  • 14:53:45 k8s.go 571: Teardown processing complete. ContainerID="cdb6623b209"
  • 14:53:45 cni.go:390] Deleted default_app1-859d7f4f9c-9wjtg/cdb6623b20 from network calico/k8s-pod-network
  • 14:53:45 systemd[1]: docker-cdb6623b209022.scope: Consumed 326ms CPU time
  • 14:53:45 noop.go:30] No-op Destroy function called
  • 14:53:45 manager.go:1044] Destroyed container: "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod2d394ca3_4712_4a4c_991b_1ebffc3b3bf2.slice/docker-cdb6623b2090.scope" (aliases: [k8s_POD_app1-859d7f4f9c-9wjtg_default_2d394ca3-4712-4a4c-991b-1ebffc3b3bf2_0 cdb6623b20], namespace: "docker")
  • 14:53:45 handler.go:325] Added event &{/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod2d394ca3_4712_4a4c_991b_1ebffc3b3bf2.slice/docker-cdb6623b2090.scope 2020-12-24 14:53:45.232096597 +0800 CST m=+83.955676998 containerDeletion {}}
  • 14:53:45 containerd[1681]: msg="shim reaped" id=cdb6623b2090223e956
  • 14:53:46 dce-10-7-177-91 kubelet[107968]: I1224 14:53:46.167274 107968 generic.go:155] GenericPLEG: 2d394ca3-4712-4a4c-991b-1ebffc3b3bf2/6edd553e60d5a927ae99273fdecca29b6c8b91ba252a7934080a6be4591554d9: exited -> non-existent
  • 14:53:46 dce-10-7-177-91 kubelet[107968]: I1224 14:53:46.167319 107968 generic.go:155] GenericPLEG: 2d394ca3-4712-4a4c-991b-1ebffc3b3bf2/cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730: running -> exited
  • Dec 24 14:53:46 dce-10-7-177-91 kubelet[107968]: I1224 14:53:46.190739 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
  • 14:53:46 The container could not be located when the pod was deleted. The container used to be Running
  • 14:53:47 Status for pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is up-to-date: (4)
  • 14:53:48 is terminated, but some containers are still running(every 10s)
  • 14:53:48 manager.go:1044] Destroyed container:
  • 14:53:48 is terminated, but some containers are still running
  • 14:53:50 is terminated, but some containers have not been cleaned up
  • 14:53:52 systemd[1]: Removed slice libcontainer_109018_systemd_test_default.slice.
  • 14:53:53 /bin/calico-node -bird-ready -felix-ready' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  • 14:53:53 exec.go:62] Exec probe response: ""
  • 14:53:58 is up-to-date: (5)
  • 14:54:08 setters.go:795] Error getting volume limit for plugin kubernetes.io/aws-ebs
  • 14:54:08 setters.go:795] Error getting volume limit for plugin kubernetes.io/cinder
  • 14:54:08 setters.go:795] Error getting volume limit for plugin kubernetes.io/azure-disk
  • 14:54:08 setters.go:795] Error getting volume limit for plugin kubernetes.io/gce-pd
  • 14:54:08 is terminated, but some pod sandboxes have not been cleaned up
  • 14:54:18 is terminated, but some pod sandboxes have not been cleaned up
  • 14:54:28 is terminated, but some pod sandboxes have not been cleaned up
  • 14:54:38 controller_utils.go:916] Ignoring inactive pod default/app1-859d7f4f9c-9wjtg in state Running, deletion time 2020-12-24 06:53:14 +0000 UTC
  • 14:54:38 kubelet.go: Failed to delete pod "", err: pod not found
  • 14:54:38 status_manager.go: Pod "" does not exist on the server
  • 14:54:38.407711 1 deployment_controller.go:357] Pod app1-859d7f4f9c-9wjtg deleted.
  • 14:54:38 status_manager.go:570] Status for pod "" is up-to-date: (4)

The key point is at 14:53:46:

Dec 24 14:53:46 dce-10-7-177-91 kubelet[107968]: I1224 14:53:46.190739 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:53:46 dce-10-7-177-91 kubelet[107968]: I1224 14:53:46.389600 107968 kubelet_pods.go:1972] Orphaned pod "2d394ca3-4712-4a4c-991b-1ebffc3b3bf2" found, removing pod cgroups
Dec 24 14:53:47 dce-10-7-177-91 kubelet[107968]: I1224 14:53:47.177355 107968 kubelet_pods.go:1492] Generating status for "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)"
Dec 24 14:53:47 dce-10-7-177-91 kubelet[107968]: I1224 14:53:47.181971 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:53:48 dce-10-7-177-91 kubelet[107968]: I1224 14:53:48.375519 107968 kubelet_pods.go:952] Pod "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)" is terminated, but some containers are still running
Dec 24 14:53:48 dce-10-7-177-91 kubelet[107968]: I1224 14:53:48.375975 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:53:50 dce-10-7-177-91 kubelet[107968]: I1224 14:53:50.204880 107968 kubelet_pods.go:1492] Generating status for "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)"
Dec 24 14:53:50 dce-10-7-177-91 kubelet[107968]: I1224 14:53:50.204893 107968 kubelet_pods.go:1492] Generating status for "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)"
Dec 24 14:53:50 dce-10-7-177-91 kubelet[107968]: I1224 14:53:50.205233 107968 kubelet_pods.go:1492] Generating status for "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)"
Dec 24 14:53:50 dce-10-7-177-91 kubelet[107968]: I1224 14:53:50.225902 107968 kubelet_pods.go:966] Pod "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)" is terminated, but some containers have not been cleaned up: {ID:{Type:docker ID:29715c87bf229539696b5c290f2800a55bb39650ace5a59c838b898e6ddf8574} Name:nginx PodSandboxID:e9ffd8631a7c0d0a2e35d583ae2cf18de32817a89e25a0ae1e9f4317cd08c8a2 State:exited CreatedAt:2020-12-24 14:44:16.633875717 +0800 CST StartedAt:2020-12-24 14:44:16.990329556 +0800 CST FinishedAt:2020-12-24 14:53:48.896488906 +0800 CST ExitCode:137 Image:daocloud.io/daocloud/dao-2048:latest ImageID:docker-pullable://daocloud.io/daocloud/dao-2048@sha256:c07746fa071c6e47c0877b66acec9ea6edcc407a6fe4c162f03cd112e90b041d Hash:2003071402 RestartCount:0 Reason:Error Message:}
Dec 24 14:53:50 dce-10-7-177-91 kubelet[107968]: I1224 14:53:50.232166 107968 kubelet_pods.go:966] Pod "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)" is terminated, but some containers have not been cleaned up: {ID:{Type:docker ID:29715c87bf229539696b5c290f2800a55bb39650ace5a59c838b898e6ddf8574} Name:nginx PodSandboxID:e9ffd8631a7c0d0a2e35d583ae2cf18de32817a89e25a0ae1e9f4317cd08c8a2 State:exited CreatedAt:2020-12-24 14:44:16.633875717 +0800 CST StartedAt:2020-12-24 14:44:16.990329556 +0800 CST FinishedAt:2020-12-24 14:53:48.896488906 +0800 CST ExitCode:137 Image:daocloud.io/daocloud/dao-2048:latest ImageID:docker-pullable://daocloud.io/daocloud/dao-2048@sha256:c07746fa071c6e47c0877b66acec9ea6edcc407a6fe4c162f03cd112e90b041d Hash:2003071402 RestartCount:0 Reason:Error Message:}
Dec 24 14:53:50 dce-10-7-177-91 kubelet[107968]: I1224 14:53:50.393226 107968 kubelet_pods.go:1972] Orphaned pod "7c4aaa81-9834-4e52-8e5c-5c7094ad3f16" found, removing pod cgroups
Dec 24 14:53:51 dce-10-7-177-91 kubelet[107968]: I1224 14:53:51.216934 107968 kubelet_pods.go:1492] Generating status for "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)"
Dec 24 14:53:51 dce-10-7-177-91 kubelet[107968]: I1224 14:53:51.227951 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)" is terminated, but some pod sandboxes have not been cleaned up: {Id:e9ffd8631a7c0d0a2e35d583ae2cf18de32817a89e25a0ae1e9f4317cd08c8a2 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-xfphs,Uid:7c4aaa81-9834-4e52-8e5c-5c7094ad3f16,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792253773213387 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-xfphs io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:7c4aaa81-9834-4e52-8e5c-5c7094ad3f16 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:13.455555086+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:53:55 dce-10-7-177-91 kubelet[107968]: I1224 14:53:55.375589 107968 kubelet_pods.go:1492] Generating status for "calico-node-6pzps_calico-system(20485dcf-a13d-4a34-9f80-be1ae3a4ee6b)"
Dec 24 14:53:56 dce-10-7-177-91 kubelet[107968]: I1224 14:53:56.376054 107968 kubelet_pods.go:1492] Generating status for "dashboard-metrics-scraper-79c5968bdc-zrv6c_kubernetes-dashboard(77fcb67f-1d16-4b15-98be-92ac605463a7)"
Dec 24 14:53:56 dce-10-7-177-91 kubelet[107968]: I1224 14:53:56.376116 107968 kubelet_pods.go:1492] Generating status for "two-containers_default(34a83d44-9606-4f1d-9f14-0e6001c3209c)"
Dec 24 14:53:58 dce-10-7-177-91 kubelet[107968]: I1224 14:53:58.377376 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:54:08 dce-10-7-177-91 kubelet[107968]: I1224 14:54:08.375700 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:54:18 dce-10-7-177-91 kubelet[107968]: I1224 14:54:18.375644 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:54:24 dce-10-7-177-91 kubelet[107968]: I1224 14:54:24.375875 107968 kubelet_pods.go:1492] Generating status for "app1-6dcdd8ccb6-lh6nc_default(564e131d-89b7-4f9e-a9d1-0f8e984497c4)"
Dec 24 14:54:28 dce-10-7-177-91 kubelet[107968]: I1224 14:54:28.375927 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:54:30 dce-10-7-177-91 kubelet[107968]: I1224 14:54:30.376357 107968 kubelet_pods.go:1492] Generating status for "app1-6dcdd8ccb6-7cth9_default(f35e05ee-1cbf-4a12-ab97-c3db0c4a75aa)"
Dec 24 14:54:48 dce-10-7-177-91 kubelet[107968]: I1224 14:54:48.375648 107968 kubelet_pods.go:1492] Generating status for "kube-proxy-z4mxb_kube-system(92d21a00-752a-4b07-a104-c3083f519b3c)"

At that time the pod status looks like the snippet below. I'm not quite sure, but I think https://github.com/kubernetes/kubernetes/pull/95364/files#diff-e81aa7518bebe9f4412cb375a9008b3481b19ec3e851d3187b3021ee94148f0dR1721-R1728 may be wrong.

ContainerStatuses:[]v1.ContainerStatus{
		v1.ContainerStatus{
			Name:"nginx", 
			State:v1.ContainerState{
				Waiting:(*v1.ContainerStateWaiting)(0xc002c03b80), 
				Running:(*v1.ContainerStateRunning)(nil), 
				Terminated:(*v1.ContainerStateTerminated)(nil)
			}, 
			LastTerminationState: v1.ContainerState{
				Waiting:(*v1.ContainerStateWaiting)(nil), 
				Running:(*v1.ContainerStateRunning)(nil), 
				Terminated:(*v1.ContainerStateTerminated)(0xc000560620)
			}, 
			Ready:false, 
			RestartCount:0, 
			Image:"daocloud.io/daocloud/dao-2048", 
			ImageID:"", ContainerID:"", 
		
Dec 24 14:53:46 dce-10-7-177-91 kubelet[107968]: I1224 14:53:46.171523  107968 status_manager.go:443] Status Manager: adding pod: "2d394ca3-4712-4a4c-991b-1ebffc3b3bf2", with status: (3, {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-24 14:44:17 +0800 CST  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-12-24 14:53:46 +0800 CST ContainersNotReady containers with unready status: [nginx]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-12-24 14:53:46 +0800 CST ContainersNotReady containers with unready status: [nginx]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-24 14:44:17 +0800 CST  }]    10.7.177.91  [] 2020-12-24 14:44:17 +0800 CST [] [{nginx {&ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil &ContainerStateTerminated{ExitCode:137,Signal:0,Reason:ContainerStatusUnknown,Message:The container could not be located when the pod was deleted.  The container used to be Running,StartedAt:0001-01-01 00:00:00 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,ContainerID:,}} false 0 daocloud.io/daocloud/dao-2048   0xc0021132e9}] Burstable []}) to podStatusChannel
06:53:44.987159       1 replica_set.go:507] 
State:v1.ContainerState{Waiting:(*v1.ContainerStateWaiting)(nil), **Running:(*v1.ContainerStateRunning)(0xc002bb79a0)**, Terminated:(*v1.ContainerStateTerminated)(nil)},
LastTerminationState:v1.ContainerState{Waiting:(*v1.ContainerStateWaiting)(nil), Running:(*v1.ContainerStateRunning)(nil), Terminated:(*v1.ContainerStateTerminated)(nil)
06:53:46.190707       1 replica_set.go:507] 
State:v1.ContainerState{**Waiting:(*v1.ContainerStateWaiting)(0xc002c03b80)**, Running:(*v1.ContainerStateRunning)(nil), Terminated:(*v1.ContainerStateTerminated)(nil)},
LastTerminationState:v1.ContainerState{Waiting:(*v1.ContainerStateWaiting)(nil), Running:(*v1.ContainerStateRunning)(nil), **Terminated:(*v1.ContainerStateTerminated)(0xc000560620)**

pod name: app1-859d7f4f9c-9wjtg
key logs
14:53:14 DELETE https://10.6.177.40:6443/api/v1/namespaces/default/pods/app1-859d7f4f9c-9wjtg 200
14:53:14 SyncLoop (DELETE, "api")
14:53:44 config.go:278] Setting pods for source api
14:53:44 kubelet.go:1901] SyncLoop (DELETE, "api"): "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)"
14:53:45 k8s.go 571: Teardown processing complete. ContainerID="cdb6623b209"
14:53:46 the error occurs.
kubelet.log
controller-manager.log

Possible fix

		if oldStatus.State.Terminated != nil || status.State.Terminated != nil {
			// if the old container status was terminated, the lasttermination status is correct
			continue
		}

or here

		if status.LastTerminationState.Terminated != nil {
			// if we already have a termination state, nothing to do
			continue
		}
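
A standalone illustration of the guard semantics being proposed, using the public v1.ContainerStatus type; this is a simplified sketch (shouldFillLastTermination is an illustrative name), not the actual kubelet code path:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// shouldFillLastTermination sketches the combined effect of the two guards
// proposed above: the synthetic ContainerStatusUnknown entry would only be
// written when neither the old nor the new status already records a real
// termination, and no last-termination state exists yet.
func shouldFillLastTermination(oldStatus, newStatus v1.ContainerStatus) bool {
	if oldStatus.State.Terminated != nil || newStatus.State.Terminated != nil {
		// the old container status was terminated, the last termination status is correct
		return false
	}
	if newStatus.LastTerminationState.Terminated != nil {
		// we already have a termination state, nothing to do
		return false
	}
	return true
}

func main() {
	// The situation from the logs above: the old status was Running, the new
	// status is Waiting/ContainerCreating because the container is gone.
	old := v1.ContainerStatus{Name: "nginx", State: v1.ContainerState{Running: &v1.ContainerStateRunning{}}}
	cur := v1.ContainerStatus{Name: "nginx", State: v1.ContainerState{Waiting: &v1.ContainerStateWaiting{Reason: "ContainerCreating"}}}
	fmt.Println(shouldFillLastTermination(old, cur)) // true
}
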
 14:53:45
 ContainerStatuses:[]*container.Status{
 	(*container.Status)(0xc0019081e0)
 }, 
 SandboxStatuses:[]*v1alpha2.PodSandboxStatus{(*v1alpha2.PodSandboxStatus)(0xc001a72420)}} (err: <nil>
 14:53:46
 ContainerStatuses:[]*container.Status{
 
 }, 
 SandboxStatuses:[]*v1alpha2.PodSandboxStatus{(*v1alpha2.PodSandboxStatus)(0xc001ee4300)}} (err: <nil>)


Dec 24 14:53:44 dce-10-7-177-91 kubelet[107968]: I1224 14:53:44.874951 107968 kuberuntime_container.go:642] Container "docker://6edd553e60d5a927ae99273fdecca29b6c8b91ba252a7934080a6be4591554d9" exited normally
Dec 24 14:53:49 dce-10-7-177-91 kubelet[107968]: I1224 14:53:49.222296 107968 kuberuntime_container.go:642] Container "docker://29715c87bf229539696b5c290f2800a55bb39650ace5a59c838b898e6ddf8574" exited normally

I can reproduce it pretty easily by spinning up 10 pods, then doing kubectl delete deployment and timing it. There is often (although not always) one or more pods that get "stuck" and take noticeably longer (30s+).

On the good pods, status immediately transitions to:

    lastState: {}
    name: echo
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://44d5515c3f4e9fe5afe067f1ffc965327a02a6d19670d38ebcc26d195d37bf80
        exitCode: 0
        finishedAt: "2021-01-07T16:12:02Z"
        reason: Completed
        startedAt: "2021-01-07T16:11:22Z"

On the bad pods:

    lastState:
      terminated:
        exitCode: 137
        finishedAt: null
        message: The container could not be located when the pod was deleted.  The
          container used to be Running
        reason: ContainerStatusUnknown
        startedAt: null
    name: echo
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: ContainerCreating

I built 1.20 with #92817 reverted as gcr.io/howardjohn-istio/kindest/node:v1.20.0-revert-pod

With this, the problem is not reproducible. cc @SergeyKanzhelev @kmala

PR #92817 adds additional time before a pod is deleted from the apiserver, and that additional time depends on the performance of the container runtime. We need to look at the kubelet logs to understand why it took 3x as long, since the additional time is only meant to guarantee the removal of the pod sandbox before removing the pod from the apiserver.

The grace period is the duration in seconds between the processes running in the pod being sent a termination signal and the time when the processes are forcibly halted with a kill signal. So it shouldn't be used to judge how long it takes for a pod to be deleted from the apiserver, because after the pod is halted with the kill signal it can take some time for the storage and network resources attached to the pod to be cleaned up before it can be removed from the apiserver.
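
As a concrete worked example using the numbers from this issue (just arithmetic on the pod spec and event timestamps above, nothing more):

package main

import (
	"fmt"
	"time"
)

func main() {
	// From the pod spec above: deletionTimestamp is set to the time of the
	// DELETE plus the grace period.
	deletionTimestamp, _ := time.Parse(time.RFC3339, "2020-12-14T01:35:29Z")
	grace := 30 * time.Second

	deleteIssued := deletionTimestamp.Add(-grace)
	fmt.Println(deleteIssued) // 2020-12-14 01:34:59 +0000 UTC, matching the SuccessfulDelete event

	// The grace period only bounds SIGTERM -> SIGKILL for the pod's processes.
	// Removal of the pod object from the apiserver additionally waits for
	// sandbox, network, and storage cleanup, which is where the extra time
	// reported in this issue is being spent.
}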

@kmala there are kubelet logs in the original issue; is more info needed?

/triage accepted
/priority important-soon

On the repro from @howardjohn I see the following behavior. The test starts at 19:07:45.336759. containerd.log shows all but one of the sandboxes got removed before 2021-01-08T19:08:05.372949491Z. Then there is a 24-second gap where nothing happens. And finally, DeleteSandbox is called for the last sandbox (DeleteSandbox covers Stop, Teardown, and Remove):

DeleteSandbox (04bbff577b2d46a870010dedb6828996345d97d2ada5ebe0ace3ed5367fd6ce7)
19:08:29 kind-control-plane containerd[181]: time="2021-01-08T19:08:29.871006393Z" level=info msg="StopPodSandbox for \"04bbff577b2d46a870010dedb6828996345d97d2ada5ebe0ace3ed5367fd6ce7\""
19:08:29 kind-control-plane containerd[181]: time="2021-01-08T19:08:29.901476111Z" level=info msg="TearDown network for sandbox \"04bbff577b2d46a870010dedb6828996345d97d2ada5ebe0ace3ed5367fd6ce7\" successfully"
19:08:29 kind-control-plane containerd[181]: time="2021-01-08T19:08:29.901615727Z" level=info msg="StopPodSandbox for \"04bbff577b2d46a870010dedb6828996345d97d2ada5ebe0ace3ed5367fd6ce7\" returns successfully"
19:08:29 kind-control-plane containerd[181]: time="2021-01-08T19:08:29.903159396Z" level=info msg="RemovePodSandbox for \"04bbff577b2d46a870010dedb6828996345d97d2ada5ebe0ace3ed5367fd6ce7\""
19:08:29 kind-control-plane containerd[181]: time="2021-01-08T19:08:29.916319914Z" level=info msg="RemovePodSandbox \"04bbff577b2d46a870010dedb6828996345d97d2ada5ebe0ace3ed5367fd6ce7\" returns successfully"

This may indicate that the last sandbox was deleted by the GC. We confirmed this by logging from removeOldestNSandboxes: all delayed sandboxes are removed from there. The idea of #92817 was that pkg/kubelet/pod_sandbox_deleter.go would delete sandboxes faster, before the GC. So in the worst case, 1.20 adds up to a minute to pod termination.

DeleteSandbox is called on a PLEG event (pleg.ContainerRemoved). So either we didn't receive that event, or kl.IsPodDeleted(podID) returned false when we were assessing whether to delete the sandbox. Looking further.

I haven't gone and filed separate issues for each test, but I've noticed that a number of the node e2e tests that involve pod deletion and have low default timeouts (e.g. 1m for deletion wait time) have been flaking in 1.20 where they weren't in 1.19 and earlier. So this is reproduced in upstream k8s CI as well.

For example, see some of the taint tests:

https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=NoExecuteTaintManager
https://testgrid.k8s.io/sig-release-1.20-informing#gce-cos-k8sbeta-serial&include-filter-by-regex=NoExecuteTaint

@SergeyKanzhelev FYI

A few considerations.

First. Do we need to delete the sandbox from killPodWithSyncResult? The logs show a few calls to StopPodSandbox at the right timing, when kubelet is trying to kill the pod. But since we only stop the sandbox and don't delete it, the pod is kept alive according to the new way of calculating the status. So one fix would be to call RemovePodSandbox right after this code:

// Stop all sandboxes belongs to same pod
for _, podSandbox := range runningPod.Sandboxes {
	if err := m.runtimeService.StopPodSandbox(podSandbox.ID.ID); err != nil {
		killSandboxResult.Fail(kubecontainer.ErrKillPodSandbox, err.Error())
		klog.Errorf("Failed to stop sandbox %q", podSandbox.ID)
	}
}
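
A sketch of that suggestion, continuing the same fragment as above (illustrative only, not a tested patch; error handling and sync-result bookkeeping for the remove call are elided):

// Stop all sandboxes belongs to same pod
for _, podSandbox := range runningPod.Sandboxes {
	if err := m.runtimeService.StopPodSandbox(podSandbox.ID.ID); err != nil {
		killSandboxResult.Fail(kubecontainer.ErrKillPodSandbox, err.Error())
		klog.Errorf("Failed to stop sandbox %q", podSandbox.ID)
		continue
	}
	// Proposed addition: remove the (now stopped) sandbox right away, so the
	// new status calculation no longer sees a lingering sandbox for the pod.
	if err := m.runtimeService.RemovePodSandbox(podSandbox.ID.ID); err != nil {
		klog.Errorf("Failed to remove sandbox %q", podSandbox.ID)
	}
}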

Second. It doesn't feel correct that the sandbox is not being deleted from the pleg.ContainerDied PLEG event. Since the container is done, the sandbox is not needed, and perhaps we should schedule sandbox deletion there. I just don't know whether the pleg.ContainerRemoved event will be fired when the container is deleted from the pleg.ContainerDied callback.

Third. We can roll back the part of #92817 that keeps the pods alive, keep the immediate cleanup logic, and keep investigating. I'm not sure when 1.20.2 is scheduled, but it would be great to have this bug fixed there.

Small update. On my local repro taken from @howardjohn (deleting 10 pods), I see the following:

  1. Event of type pleg.ContainerRemoved is received for all 10 pods.

    if e.Type == pleg.ContainerRemoved {
        kl.deletePodSandbox(e.ID)
    }

  2. For the first few pods, the condition kl.IsPodDeleted returns false, indicating that the pod has not yet been deleted.

    func (kl *Kubelet) deletePodSandbox(podID types.UID) {
        if podStatus, err := kl.podCache.Get(podID); err == nil {
            toKeep := 1
            if kl.IsPodDeleted(podID) {
                toKeep = 0
            }
            kl.sandboxDeleter.deleteSandboxesInPod(podStatus, toKeep)
        }
    }

  3. Since there is only one sandbox in every pod in this example, it will not be deleted at this point; the for loop will not run (see the sketch after this list):

    for i := len(sandboxStatuses) - 1; i >= toKeep; i-- {

  4. Then, when the GC kicks in, those sandboxes are deleted and the pods are deleted from the API server.
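
To make step 3 concrete, a tiny standalone illustration of the loop bound (the slice and toKeep mirror the snippet above; this is not the kubelet code itself):

package main

import "fmt"

func main() {
	// One sandbox in the pod, and IsPodDeleted returned false, so toKeep == 1.
	sandboxStatuses := []string{"only-sandbox"}
	toKeep := 1

	// i starts at len-1 == 0, and 0 >= 1 is false, so the body never runs and
	// the sandbox is left for the periodic GC to remove much later.
	for i := len(sandboxStatuses) - 1; i >= toKeep; i-- {
		fmt.Println("removing", sandboxStatuses[i])
	}
	fmt.Println("no sandbox removed")
}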

So the issue is that the assumption made in #92817, namely that the pleg.ContainerRemoved PLEG event is handled after the pod has already been marked for deletion, is not correct.

Is #97502 an acceptable fix?
Could someone take a look or give it a try?

Is #97502 an acceptable fix?

@pacoxu my biggest concern is that for 1.20 the proposed fix introduces even more logic. So maybe the right fix for 1.20 would be to not account for sandbox status, and then bring it back in master with your fix or another one. Wdyt?

@SergeyKanzhelev OK
For master, feel free to ping me to investigate or test this.

Is there an actual solution or workaround for this?

I have the same problem with nodes running v1.20.5.

@rdxmb are you sure you have the same issue? The fix was backported into 1.20 quite some time ago.

As this is a very long issue with many comments, how can I reproduce or debug this to check whether mine is the same issue?

edit: In my case this is related to #42889 (comment). I am not quite sure which is cause and which is effect here.

@rdxmb your issue is different from this one. This issue is about an increase in the time taken for a pod to be deleted from the apiserver and is not related to the volume plugins.

@kmala ok thanks.