1.20 regression: pods failing to terminate
howardjohn opened this issue
What happened:
I am still unwinding the pieces, but what I know for certain: after patching a deployment, the old pod sticks around for over a minute (or the test times out after a minute), despite terminationGracePeriodSeconds: 30. The pod has
The container could not be located when the pod was deleted. The container used to be Running
in its status, which was newly added in #95364.
Controller manager shows this:
2020-12-14T01:34:59.362946898Z stderr F I1214 01:34:59.362445 1 event.go:291] "Event occurred" object="istio-system/istiod-646465db66" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulDelete" message="Deleted pod: istiod-646465db66-xzk4g"
One minute later, our tests call a .List() on pods, and the removed pod still shows up (pod spec attached below).
These issues can be fairly reliably reproduced in our CI environment; my PR to update to 1.20 is istio/istio#29536. We were previously on 1.19.1 and are attempting to upgrade to 1.20. We already run 1.20 for a subset of our tests, which have just a single cluster. The tests that are failing run 5 kind
clusters at once. It's possible this increased load is responsible, but the tests are also doing different things (for example, we don't patch the deployment in the single-cluster tests, which work on 1.20), so I cannot yet say for sure which is the root cause.
Pod spec:
metadata:
annotations:
prometheus.io/port: '15014'
prometheus.io/scrape: 'true'
sidecar.istio.io/inject: 'false'
creationTimestamp: '2020-12-14T01:34:41Z'
deletionGracePeriodSeconds: '30'
deletionTimestamp: '2020-12-14T01:35:29Z'
generateName: istiod-646465db66-
labels:
app: istiod
install.operator.istio.io/owning-resource: unknown
istio: pilot
istio.io/rev: default
operator.istio.io/component: Pilot
pod-template-hash: 646465db66
sidecar.istio.io/inject: 'false'
managedFields:
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:prometheus.io/port: {}
f:prometheus.io/scrape: {}
f:sidecar.istio.io/inject: {}
f:generateName: {}
f:labels:
.: {}
f:app: {}
f:install.operator.istio.io/owning-resource: {}
f:istio: {}
f:istio.io/rev: {}
f:operator.istio.io/component: {}
f:pod-template-hash: {}
f:sidecar.istio.io/inject: {}
f:ownerReferences:
.: {}
k:{"uid":"d967b26d-4e4c-4c1f-bc4c-1f86e7fd3128"}:
.: {}
f:apiVersion: {}
f:blockOwnerDeletion: {}
f:controller: {}
f:kind: {}
f:name: {}
f:uid: {}
f:spec:
f:containers:
k:{"name":"discovery"}:
.: {}
f:args: {}
f:env:
.: {}
k:{"name":"CENTRAL_ISTIOD"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"CLUSTER_ID"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"ENABLE_ADMIN_ENDPOINTS"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"ENABLE_LEGACY_FSGROUP_INJECTION"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"EXTERNAL_ISTIOD"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"ISTIOD_ADDR"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"JWT_POLICY"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"KUBECONFIG"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"PILOT_CERT_PROVIDER"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"PILOT_ENABLED_SERVICE_APIS"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"PILOT_ENABLE_ANALYSIS"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_INBOUND"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_OUTBOUND"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"PILOT_ENABLE_WORKLOAD_ENTRY_AUTOREGISTRATION"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"PILOT_TRACE_SAMPLING"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"POD_NAME"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:fieldRef:
.: {}
f:apiVersion: {}
f:fieldPath: {}
k:{"name":"POD_NAMESPACE"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:fieldRef:
.: {}
f:apiVersion: {}
f:fieldPath: {}
k:{"name":"REVISION"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"SERVICE_ACCOUNT"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:fieldRef:
.: {}
f:apiVersion: {}
f:fieldPath: {}
f:image: {}
f:imagePullPolicy: {}
f:name: {}
f:ports:
.: {}
k:{"containerPort":15010,"protocol":"TCP"}:
.: {}
f:containerPort: {}
f:protocol: {}
k:{"containerPort":15017,"protocol":"TCP"}:
.: {}
f:containerPort: {}
f:protocol: {}
k:{"containerPort":8080,"protocol":"TCP"}:
.: {}
f:containerPort: {}
f:protocol: {}
f:readinessProbe:
.: {}
f:failureThreshold: {}
f:httpGet:
.: {}
f:path: {}
f:port: {}
f:scheme: {}
f:initialDelaySeconds: {}
f:periodSeconds: {}
f:successThreshold: {}
f:timeoutSeconds: {}
f:resources:
.: {}
f:requests:
.: {}
f:cpu: {}
f:memory: {}
f:securityContext:
.: {}
f:capabilities:
.: {}
f:drop: {}
f:runAsGroup: {}
f:runAsNonRoot: {}
f:runAsUser: {}
f:terminationMessagePath: {}
f:terminationMessagePolicy: {}
f:volumeMounts:
.: {}
k:{"mountPath":"/etc/cacerts"}:
.: {}
f:mountPath: {}
f:name: {}
f:readOnly: {}
k:{"mountPath":"/etc/istio/config"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/var/lib/istio/inject"}:
.: {}
f:mountPath: {}
f:name: {}
f:readOnly: {}
k:{"mountPath":"/var/run/secrets/istio-dns"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/var/run/secrets/remote"}:
.: {}
f:mountPath: {}
f:name: {}
f:readOnly: {}
f:dnsPolicy: {}
f:enableServiceLinks: {}
f:restartPolicy: {}
f:schedulerName: {}
f:securityContext:
.: {}
f:fsGroup: {}
f:serviceAccount: {}
f:serviceAccountName: {}
f:terminationGracePeriodSeconds: {}
f:volumes:
.: {}
k:{"name":"cacerts"}:
.: {}
f:name: {}
f:secret:
.: {}
f:defaultMode: {}
f:optional: {}
f:secretName: {}
k:{"name":"config-volume"}:
.: {}
f:configMap:
.: {}
f:defaultMode: {}
f:name: {}
f:name: {}
k:{"name":"inject"}:
.: {}
f:configMap:
.: {}
f:defaultMode: {}
f:name: {}
f:name: {}
k:{"name":"istio-kubeconfig"}:
.: {}
f:name: {}
f:secret:
.: {}
f:defaultMode: {}
f:optional: {}
f:secretName: {}
k:{"name":"local-certs"}:
.: {}
f:emptyDir:
.: {}
f:medium: {}
f:name: {}
manager: kube-controller-manager
operation: Update
time: '2020-12-14T01:34:41Z'
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:status:
f:conditions:
k:{"type":"ContainersReady"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:message: {}
f:reason: {}
f:status: {}
f:type: {}
k:{"type":"Initialized"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:status: {}
f:type: {}
k:{"type":"Ready"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:message: {}
f:reason: {}
f:status: {}
f:type: {}
f:containerStatuses: {}
f:hostIP: {}
f:phase: {}
f:podIP: {}
f:podIPs:
.: {}
k:{"ip":"10.30.0.49"}:
.: {}
f:ip: {}
f:startTime: {}
manager: kubelet
operation: Update
time: '2020-12-14T01:35:00Z'
name: istiod-646465db66-xzk4g
namespace: istio-system
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: istiod-646465db66
uid: d967b26d-4e4c-4c1f-bc4c-1f86e7fd3128
resourceVersion: '12379'
uid: e47b3bbc-7596-403c-9dd3-5fa884823e9d
spec:
containers:
- args:
- discovery
- --monitoringAddr=:15014
- --log_output_level=default:info
- --domain
- cluster.local
- --keepaliveMaxServerConnectionAge
- 30m
env:
- name: REVISION
value: default
- name: JWT_POLICY
value: first-party-jwt
- name: PILOT_CERT_PROVIDER
value: istiod
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: SERVICE_ACCOUNT
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.serviceAccountName
- name: KUBECONFIG
value: /var/run/secrets/remote/config
- name: ENABLE_ADMIN_ENDPOINTS
value: 'true'
- name: ENABLE_LEGACY_FSGROUP_INJECTION
value: 'false'
- name: PILOT_ENABLED_SERVICE_APIS
value: 'true'
- name: PILOT_ENABLE_WORKLOAD_ENTRY_AUTOREGISTRATION
value: 'true'
- name: PILOT_TRACE_SAMPLING
value: '1'
- name: PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_OUTBOUND
value: 'true'
- name: PILOT_ENABLE_PROTOCOL_SNIFFING_FOR_INBOUND
value: 'true'
- name: ISTIOD_ADDR
value: istiod.istio-system.svc:15012
- name: PILOT_ENABLE_ANALYSIS
value: 'false'
- name: CLUSTER_ID
value: cluster-2
- name: EXTERNAL_ISTIOD
value: 'true'
- name: CENTRAL_ISTIOD
value: 'false'
image: localhost:5000/pilot:istio-testing
imagePullPolicy: IfNotPresent
name: discovery
ports:
- containerPort: 8080
protocol: TCP
- containerPort: 15010
protocol: TCP
- containerPort: 15017
protocol: TCP
readinessProbe:
failureThreshold: 3
handler:
httpGet:
path: /ready
port: 8080
scheme: HTTP
initialDelaySeconds: 1
periodSeconds: 3
successThreshold: 1
timeoutSeconds: 5
resources:
requests:
cpu: 500m
memory: 2Gi
securityContext:
capabilities:
drop:
- ALL
runAsGroup: '1337'
runAsNonRoot: true
runAsUser: '1337'
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/istio/config
name: config-volume
- mountPath: /var/run/secrets/istio-dns
name: local-certs
- mountPath: /etc/cacerts
name: cacerts
readOnly: true
- mountPath: /var/run/secrets/remote
name: istio-kubeconfig
readOnly: true
- mountPath: /var/lib/istio/inject
name: inject
readOnly: true
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: istiod-service-account-token-6wzt8
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: cluster3-control-plane
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: '1337'
serviceAccount: istiod-service-account
serviceAccountName: istiod-service-account
terminationGracePeriodSeconds: '30'
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: '300'
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: '300'
volumes:
- name: local-certs
volumeSource:
emptyDir:
medium: Memory
- name: cacerts
volumeSource:
secret:
defaultMode: 420
optional: true
secretName: cacerts
- name: istio-kubeconfig
volumeSource:
secret:
defaultMode: 420
optional: true
secretName: istio-kubeconfig
- name: inject
volumeSource:
configMap:
defaultMode: 420
localObjectReference:
name: istio-sidecar-injector
- name: config-volume
volumeSource:
configMap:
defaultMode: 420
localObjectReference:
name: istio
- name: istiod-service-account-token-6wzt8
volumeSource:
secret:
defaultMode: 420
secretName: istiod-service-account-token-6wzt8
status:
conditions:
- lastProbeTime: null
lastTransitionTime: '2020-12-14T01:34:41Z'
status: 'True'
type: Initialized
- lastProbeTime: null
lastTransitionTime: '2020-12-14T01:35:00Z'
message: 'containers with unready status: [discovery]'
reason: ContainersNotReady
status: 'False'
type: Ready
- lastProbeTime: null
lastTransitionTime: '2020-12-14T01:35:00Z'
message: 'containers with unready status: [discovery]'
reason: ContainersNotReady
status: 'False'
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: '2020-12-14T01:34:41Z'
status: 'True'
type: PodScheduled
containerStatuses:
- image: localhost:5000/pilot:istio-testing
lastState:
terminated:
exitCode: 137
finishedAt: null
message: The container could not be located when the pod was deleted. The
container used to be Running
reason: ContainerStatusUnknown
startedAt: null
name: discovery
started: false
state:
waiting:
reason: ContainerCreating
hostIP: 172.18.0.3
phase: Running
podIP: 10.30.0.49
podIPs:
- ip: 10.30.0.49
qosClass: Burstable
startTime: '2020-12-14T01:34:41Z'
What you expected to happen:
1.20 to perform as well as 1.19, or have a release note explaining any issues/changes required
How to reproduce it (as minimally and precisely as possible):
I can consistently reproduce it in https://github.com/istio/istio/pull/29536/files. The artifacts dump out a bunch of info, including the kind logs. I don't expect anyone to dig through all of those, but I am not sure where to look next, so if you let me know what info is needed I can capture it.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.20.0
- Cloud provider or hardware configuration: running 5 kind clusters, inside a pod running in GKE (prow)
- OS (e.g. cat /etc/os-release): Ubuntu
/sig node
cc @deads2k - I have no clue whether your PR caused this, since I don't understand this codepath well, but given the timing and the status message it seems possibly related
After bumping our 1-minute timeout up, it seems the pod does get terminated, but not for 100s, over 3x the termination grace period. I am fairly sure the process actually exits within a couple of seconds of SIGTERM as well. So the issue may be a regression in the time it takes to terminate a pod, rather than a bug causing pods to never terminate.
Note that closing holes in pod status reporting has historically exposed actual bugs in the container runtime and volume subsystems. I may have seen evidence of similar behavior in cri-o CI after the OpenShift 1.20 rebase landed a couple of weeks ago, so it's possible the bug is in another subsystem. I will review some of the runs for symptoms of this.
/cc
#95364 and #95561 may be related.
Could running 5 kind clusters at the same time cause the runtime to miss data? I will try to reproduce it in my environment.
kind create cluster --image=daocloud.io/gcr_containers/kindest-node:v1.20.0 --name cluster1
kind create cluster --image=daocloud.io/gcr_containers/kindest-node:v1.20.0 --name cluster2
kind create cluster --image=daocloud.io/gcr_containers/kindest-node:v1.20.0 --name cluster3
kubectl config use-context kind-cluster1
kubectl create deploy app1 --image=daocloud.io/daocloud/nginx:0.1
kubectl scale deploy app1 --replicas=2
kubectl replace -f app1.yaml
kubectl config use-context kind-cluster2
kubectl create deploy app1 --image=daocloud.io/daocloud/nginx:0.1
kubectl scale deploy app1 --replicas=2
kubectl replace -f app1.yaml
kubectl config use-context kind-cluster3
kubectl create deploy app1 --image=daocloud.io/daocloud/nginx:0.1
kubectl scale deploy app1 --replicas=2
kubectl replace -f app1.yaml
...
kubectl config use-context kind-cluster1
kubectl replace -f app2.yaml
kubectl config use-context kind-cluster2
kubectl replace -f app2.yaml
kubectl config use-context kind-cluster3
kubectl replace -f app2.yaml
app1.yaml switches to the dao-2048 image; app2.yaml changes back to nginx.
I will run the script above with 3 clusters on my server.
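The referenced app1.yaml / app2.yaml are not shown; a minimal sketch of what they could look like, assuming plain single-container Deployments that differ only in image (names and images follow the commands above, everything else is assumed):

```yaml
# app1.yaml (sketch): same Deployment, image switched to dao-2048
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app1
  template:
    metadata:
      labels:
        app: app1
    spec:
      containers:
      - name: nginx
        image: daocloud.io/daocloud/dao-2048:latest
---
# app2.yaml (sketch) is identical except the image reverts to:
#   image: daocloud.io/daocloud/nginx:0.1
```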
pod app1-5f6b449d85-tmhl9
RUNNING 01:30:59
Terminating 01:31:01
Terminating 01:31:44
Deleted 01:31:45
Terminating for 45s in my env, with terminationGracePeriodSeconds: 30.
32s to delete the pod on Kubernetes 1.18.6:
[root@dce-10-6-150-61 ~]# time kubectl delete pod dao-2048-dao-2048-657b7685f8-mm4hg -n paco
pod "dao-2048-dao-2048-657b7685f8-mm4hg" deleted
real 0m32.570s
user 0m0.110s
sys 0m0.041s
42s to delete the pod on kubeadm 1.20.1 on a similar VM:
[root@daocloud ~]# time kubectl delete pod app1-6dcdd8ccb6-257bb
pod "app1-6dcdd8ccb6-257bb" deleted
real 0m42.975s
user 0m0.100s
sys 0m0.015s
Debug log with kubelet and controller-manager logs (I am in China, UTC+0800, so 06:53:14 UTC is the same as 14:53:14 local time):
- 14:53:14 DELETE https://10.6.177.40:6443/api/v1/namespaces/default/pods/app1-859d7f4f9c-9wjtg 200
- 14:53:14 SyncLoop (DELETE, "api")
- 14:53:14 Ignoring inactive pod default/app1-859d7f4f9c-9wjtg in state Running, deletion time 2020-12-24 06:53:44 +0000 UTC(delete after 30s)
- 14:53:18 is terminated, but some containers are still running(every 10s)
- 14:53:44 cni.go:382] Deleting default_app1-859d7f4f9c-9wjtg/cdb662 from network loopback/cni-loopback netns "/proc/105454/ns/net"
- 14:53:44 cni.go:390] Deleted default_app1-859d7f4f9c-9wjtg/cdb6623b2 from network loopback/cni-loopback
- 14:53:44 dce-10-7-177-91 kubelet[107968]: I1224 14:53:44.880281 107968 cni.go:382] Deleting default_app1-859d7f4f9c-9wjtg/cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 from network calico/k8s-pod-network netns "/proc/105454/ns/net"
- 14:53:44 generic.go:191] GenericPLEG: Relisting
- 14:53:44 generic.go:155] GenericPLEG: 2d394ca3-4712-4a4c-991b-1ebffc3b3bf2/6edd553e60d5a927ae9: running -> exited
- 14:53:44 kuberuntime_manager.go:958] getSandboxIDByPodUID got sandbox IDs ["cdb6623b209"] for pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)"
- 14:53:44 config.go:278] Setting pods for source api
- 14:53:44 kubelet.go:1901] SyncLoop (DELETE, "api"): "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)"
- 14:53:45 k8s.go 571: Teardown processing complete. ContainerID="cdb6623b209"
- 14:53:45 cni.go:390] Deleted default_app1-859d7f4f9c-9wjtg/cdb6623b20 from network calico/k8s-pod-network
- 14:53:45 systemd[1]: docker-cdb6623b209022.scope: Consumed 326ms CPU time
- 14:53:45 noop.go:30] No-op Destroy function called
- 14:53:45 manager.go:1044] Destroyed container: "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod2d394ca3_4712_4a4c_991b_1ebffc3b3bf2.slice/docker-cdb6623b2090.scope" (aliases: [k8s_POD_app1-859d7f4f9c-9wjtg_default_2d394ca3-4712-4a4c-991b-1ebffc3b3bf2_0 cdb6623b20], namespace: "docker")
- 14:53:45 handler.go:325] Added event &{/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod2d394ca3_4712_4a4c_991b_1ebffc3b3bf2.slice/docker-cdb6623b2090.scope 2020-12-24 14:53:45.232096597 +0800 CST m=+83.955676998 containerDeletion {}}
- 14:53:45 containerd[1681]: msg="shim reaped" id=cdb6623b2090223e956
- 14:53:46 dce-10-7-177-91 kubelet[107968]: I1224 14:53:46.167274 107968 generic.go:155] GenericPLEG: 2d394ca3-4712-4a4c-991b-1ebffc3b3bf2/6edd553e60d5a927ae99273fdecca29b6c8b91ba252a7934080a6be4591554d9: exited -> non-existent
- 14:53:46 dce-10-7-177-91 kubelet[107968]: I1224 14:53:46.167319 107968 generic.go:155] GenericPLEG: 2d394ca3-4712-4a4c-991b-1ebffc3b3bf2/cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730: running -> exited
- Dec 24 14:53:46 dce-10-7-177-91 kubelet[107968]: I1224 14:53:46.190739 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
- 14:53:46 The container could not be located when the pod was deleted. The container used to be Running
- 14:53:47 Status for pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is up-to-date: (4)
- 14:53:48 is terminated, but some containers are still running(every 10s)
- 14:53:48 manager.go:1044] Destroyed container:
- 14:53:48 is terminated, but some containers are still running
- 14:53:50 is terminated, but some containers have not been cleaned up
- 14:53:52 systemd[1]: Removed slice libcontainer_109018_systemd_test_default.slice.
- 14:53:53 /bin/calico-node -bird-ready -felix-ready' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
- 14:53:53 exec.go:62] Exec probe response: ""
- 14:53:58 is up-to-date: (5)
- 14:54:08 setters.go:795] Error getting volume limit for plugin kubernetes.io/aws-ebs
- 14:54:08 setters.go:795] Error getting volume limit for plugin kubernetes.io/cinder
- 14:54:08 setters.go:795] Error getting volume limit for plugin kubernetes.io/azure-disk
- 14:54:08 setters.go:795] Error getting volume limit for plugin kubernetes.io/gce-pd
- 14:54:08 is terminated, but some pod sandboxes have not been cleaned up
- 14:54:18 is terminated, but some pod sandboxes have not been cleaned up
- 14:54:28 is terminated, but some pod sandboxes have not been cleaned up
- 14:54:38 controller_utils.go:916] Ignoring inactive pod default/app1-859d7f4f9c-9wjtg in state Running, deletion time 2020-12-24 06:53:14 +0000 UTC
- 14:54:38 kubelet.go: Failed to delete pod "", err: pod not found
- 14:54:38 status_manager.go: Pod "" does not exist on the server
- 14:54:38.407711 1 deployment_controller.go:357] Pod app1-859d7f4f9c-9wjtg deleted.
- 14:54:38 status_manager.go:570] Status for pod "" is up-to-date: (4)
The key point is at 14:53:46:
Dec 24 14:53:46 dce-10-7-177-91 kubelet[107968]: I1224 14:53:46.190739 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:53:46 dce-10-7-177-91 kubelet[107968]: I1224 14:53:46.389600 107968 kubelet_pods.go:1972] Orphaned pod "2d394ca3-4712-4a4c-991b-1ebffc3b3bf2" found, removing pod cgroups
Dec 24 14:53:47 dce-10-7-177-91 kubelet[107968]: I1224 14:53:47.177355 107968 kubelet_pods.go:1492] Generating status for "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)"
Dec 24 14:53:47 dce-10-7-177-91 kubelet[107968]: I1224 14:53:47.181971 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:53:48 dce-10-7-177-91 kubelet[107968]: I1224 14:53:48.375519 107968 kubelet_pods.go:952] Pod "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)" is terminated, but some containers are still running
Dec 24 14:53:48 dce-10-7-177-91 kubelet[107968]: I1224 14:53:48.375975 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:53:50 dce-10-7-177-91 kubelet[107968]: I1224 14:53:50.204880 107968 kubelet_pods.go:1492] Generating status for "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)"
Dec 24 14:53:50 dce-10-7-177-91 kubelet[107968]: I1224 14:53:50.204893 107968 kubelet_pods.go:1492] Generating status for "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)"
Dec 24 14:53:50 dce-10-7-177-91 kubelet[107968]: I1224 14:53:50.205233 107968 kubelet_pods.go:1492] Generating status for "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)"
Dec 24 14:53:50 dce-10-7-177-91 kubelet[107968]: I1224 14:53:50.225902 107968 kubelet_pods.go:966] Pod "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)" is terminated, but some containers have not been cleaned up: {ID:{Type:docker ID:29715c87bf229539696b5c290f2800a55bb39650ace5a59c838b898e6ddf8574} Name:nginx PodSandboxID:e9ffd8631a7c0d0a2e35d583ae2cf18de32817a89e25a0ae1e9f4317cd08c8a2 State:exited CreatedAt:2020-12-24 14:44:16.633875717 +0800 CST StartedAt:2020-12-24 14:44:16.990329556 +0800 CST FinishedAt:2020-12-24 14:53:48.896488906 +0800 CST ExitCode:137 Image:daocloud.io/daocloud/dao-2048:latest ImageID:docker-pullable://daocloud.io/daocloud/dao-2048@sha256:c07746fa071c6e47c0877b66acec9ea6edcc407a6fe4c162f03cd112e90b041d Hash:2003071402 RestartCount:0 Reason:Error Message:}
Dec 24 14:53:50 dce-10-7-177-91 kubelet[107968]: I1224 14:53:50.232166 107968 kubelet_pods.go:966] Pod "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)" is terminated, but some containers have not been cleaned up: {ID:{Type:docker ID:29715c87bf229539696b5c290f2800a55bb39650ace5a59c838b898e6ddf8574} Name:nginx PodSandboxID:e9ffd8631a7c0d0a2e35d583ae2cf18de32817a89e25a0ae1e9f4317cd08c8a2 State:exited CreatedAt:2020-12-24 14:44:16.633875717 +0800 CST StartedAt:2020-12-24 14:44:16.990329556 +0800 CST FinishedAt:2020-12-24 14:53:48.896488906 +0800 CST ExitCode:137 Image:daocloud.io/daocloud/dao-2048:latest ImageID:docker-pullable://daocloud.io/daocloud/dao-2048@sha256:c07746fa071c6e47c0877b66acec9ea6edcc407a6fe4c162f03cd112e90b041d Hash:2003071402 RestartCount:0 Reason:Error Message:}
Dec 24 14:53:50 dce-10-7-177-91 kubelet[107968]: I1224 14:53:50.393226 107968 kubelet_pods.go:1972] Orphaned pod "7c4aaa81-9834-4e52-8e5c-5c7094ad3f16" found, removing pod cgroups
Dec 24 14:53:51 dce-10-7-177-91 kubelet[107968]: I1224 14:53:51.216934 107968 kubelet_pods.go:1492] Generating status for "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)"
Dec 24 14:53:51 dce-10-7-177-91 kubelet[107968]: I1224 14:53:51.227951 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-xfphs_default(7c4aaa81-9834-4e52-8e5c-5c7094ad3f16)" is terminated, but some pod sandboxes have not been cleaned up: {Id:e9ffd8631a7c0d0a2e35d583ae2cf18de32817a89e25a0ae1e9f4317cd08c8a2 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-xfphs,Uid:7c4aaa81-9834-4e52-8e5c-5c7094ad3f16,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792253773213387 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-xfphs io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:7c4aaa81-9834-4e52-8e5c-5c7094ad3f16 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:13.455555086+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:53:55 dce-10-7-177-91 kubelet[107968]: I1224 14:53:55.375589 107968 kubelet_pods.go:1492] Generating status for "calico-node-6pzps_calico-system(20485dcf-a13d-4a34-9f80-be1ae3a4ee6b)"
Dec 24 14:53:56 dce-10-7-177-91 kubelet[107968]: I1224 14:53:56.376054 107968 kubelet_pods.go:1492] Generating status for "dashboard-metrics-scraper-79c5968bdc-zrv6c_kubernetes-dashboard(77fcb67f-1d16-4b15-98be-92ac605463a7)"
Dec 24 14:53:56 dce-10-7-177-91 kubelet[107968]: I1224 14:53:56.376116 107968 kubelet_pods.go:1492] Generating status for "two-containers_default(34a83d44-9606-4f1d-9f14-0e6001c3209c)"
Dec 24 14:53:58 dce-10-7-177-91 kubelet[107968]: I1224 14:53:58.377376 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:54:08 dce-10-7-177-91 kubelet[107968]: I1224 14:54:08.375700 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:54:18 dce-10-7-177-91 kubelet[107968]: I1224 14:54:18.375644 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:54:24 dce-10-7-177-91 kubelet[107968]: I1224 14:54:24.375875 107968 kubelet_pods.go:1492] Generating status for "app1-6dcdd8ccb6-lh6nc_default(564e131d-89b7-4f9e-a9d1-0f8e984497c4)"
Dec 24 14:54:28 dce-10-7-177-91 kubelet[107968]: I1224 14:54:28.375927 107968 kubelet_pods.go:975] Pod "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)" is terminated, but some pod sandboxes have not been cleaned up: {Id:cdb6623b2090223e9567a68c0b5d4f5fb4a5514fd1cf23bac6d7dad323f00730 Metadata:&PodSandboxMetadata{Name:app1-859d7f4f9c-9wjtg,Uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2,Namespace:default,Attempt:0,} State:SANDBOX_NOTREADY CreatedAt:1608792257814360459 Network:&PodSandboxNetworkStatus{Ip:,AdditionalIps:[]*PodIP{},} Linux:&LinuxPodSandboxStatus{Namespaces:&Namespace{Options:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,},},} Labels:map[app:app1 io.kubernetes.pod.name:app1-859d7f4f9c-9wjtg io.kubernetes.pod.namespace:default io.kubernetes.pod.uid:2d394ca3-4712-4a4c-991b-1ebffc3b3bf2 pod-template-hash:859d7f4f9c] Annotations:map[kubernetes.io/config.seen:2020-12-24T14:44:17.500144486+08:00 kubernetes.io/config.source:api kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container nginx; cpu, memory limit for container nginx] RuntimeHandler: XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
Dec 24 14:54:30 dce-10-7-177-91 kubelet[107968]: I1224 14:54:30.376357 107968 kubelet_pods.go:1492] Generating status for "app1-6dcdd8ccb6-7cth9_default(f35e05ee-1cbf-4a12-ab97-c3db0c4a75aa)"
Dec 24 14:54:48 dce-10-7-177-91 kubelet[107968]: I1224 14:54:48.375648 107968 kubelet_pods.go:1492] Generating status for "kube-proxy-z4mxb_kube-system(92d21a00-752a-4b07-a104-c3083f519b3c)"
At that point the pod status looks like the excerpt below. I'm not certain yet, but I suspect https://github.com/kubernetes/kubernetes/pull/95364/files#diff-e81aa7518bebe9f4412cb375a9008b3481b19ec3e851d3187b3021ee94148f0dR1721-R1728 may be wrong.
ContainerStatuses:[]v1.ContainerStatus{
v1.ContainerStatus{
Name:"nginx",
State:v1.ContainerState{
Waiting:(*v1.ContainerStateWaiting)(0xc002c03b80),
Running:(*v1.ContainerStateRunning)(nil),
Terminated:(*v1.ContainerStateTerminated)(nil)
},
LastTerminationState: v1.ContainerState{
Waiting:(*v1.ContainerStateWaiting)(nil),
Running:(*v1.ContainerStateRunning)(nil),
Terminated:(*v1.ContainerStateTerminated)(0xc000560620)
},
Ready:false,
RestartCount:0,
Image:"daocloud.io/daocloud/dao-2048",
ImageID:"", ContainerID:"",
},
}
Dec 24 14:53:46 dce-10-7-177-91 kubelet[107968]: I1224 14:53:46.171523 107968 status_manager.go:443] Status Manager: adding pod: "2d394ca3-4712-4a4c-991b-1ebffc3b3bf2", with status: (3, {Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-24 14:44:17 +0800 CST } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-12-24 14:53:46 +0800 CST ContainersNotReady containers with unready status: [nginx]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-12-24 14:53:46 +0800 CST ContainersNotReady containers with unready status: [nginx]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-24 14:44:17 +0800 CST }] 10.7.177.91 [] 2020-12-24 14:44:17 +0800 CST [] [{nginx {&ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil &ContainerStateTerminated{ExitCode:137,Signal:0,Reason:ContainerStatusUnknown,Message:The container could not be located when the pod was deleted. The container used to be Running,StartedAt:0001-01-01 00:00:00 +0000 UTC,FinishedAt:0001-01-01 00:00:00 +0000 UTC,ContainerID:,}} false 0 daocloud.io/daocloud/dao-2048 0xc0021132e9}] Burstable []}) to podStatusChannel
06:53:44.987159 1 replica_set.go:507]
State:v1.ContainerState{Waiting:(*v1.ContainerStateWaiting)(nil), **Running:(*v1.ContainerStateRunning)(0xc002bb79a0)**, Terminated:(*v1.ContainerStateTerminated)(nil)},
LastTerminationState:v1.ContainerState{Waiting:(*v1.ContainerStateWaiting)(nil), Running:(*v1.ContainerStateRunning)(nil), Terminated:(*v1.ContainerStateTerminated)(nil)
06:53:46.190707 1 replica_set.go:507]
State:v1.ContainerState{**Waiting:(*v1.ContainerStateWaiting)(0xc002c03b80)**, Running:(*v1.ContainerStateRunning)(nil), Terminated:(*v1.ContainerStateTerminated)(nil)},
LastTerminationState:v1.ContainerState{Waiting:(*v1.ContainerStateWaiting)(nil), Running:(*v1.ContainerStateRunning)(nil), **Terminated:(*v1.ContainerStateTerminated)(0xc000560620)**
Pod name: app1-859d7f4f9c-9wjtg
Key logs:
14:53:14 DELETE https://10.6.177.40:6443/api/v1/namespaces/default/pods/app1-859d7f4f9c-9wjtg 200
14:53:14 SyncLoop (DELETE, "api")
14:53:44 config.go:278] Setting pods for source api
14:53:44 kubelet.go:1901] SyncLoop (DELETE, "api"): "app1-859d7f4f9c-9wjtg_default(2d394ca3-4712-4a4c-991b-1ebffc3b3bf2)"
14:53:45 k8s.go 571: Teardown processing complete. ContainerID="cdb6623b209"
14:53:46 the error occurs.
kubelet.log
controller-manager.log
Possible fix
if oldStatus.State.Terminated != nil || status.State.Terminated != nil {
// if the old container status was terminated, the lasttermination status is correct
continue
}
or here
if status.LastTerminationState.Terminated != nil {
// if we already have a termination state, nothing to do
continue
}
14:53:45
ContainerStatuses:[]*container.Status{
(*container.Status)(0xc0019081e0)
},
SandboxStatuses:[]*v1alpha2.PodSandboxStatus{(*v1alpha2.PodSandboxStatus)(0xc001a72420)}} (err: <nil>)
14:53:46
ContainerStatuses:[]*container.Status{
},
SandboxStatuses:[]*v1alpha2.PodSandboxStatus{(*v1alpha2.PodSandboxStatus)(0xc001ee4300)}} (err: <nil>)
Dec 24 14:53:44 dce-10-7-177-91 kubelet[107968]: I1224 14:53:44.874951 107968 kuberuntime_container.go:642] Container "docker://6edd553e60d5a927ae99273fdecca29b6c8b91ba252a7934080a6be4591554d9" exited normally
Dec 24 14:53:49 dce-10-7-177-91 kubelet[107968]: I1224 14:53:49.222296 107968 kuberuntime_container.go:642] Container "docker://29715c87bf229539696b5c290f2800a55bb39650ace5a59c838b898e6ddf8574" exited normally
I can reproduce it pretty easily by spinning up 10 pods, then doing kubectl delete deployment and timing it. There is often (although not always) one or more pods that get "stuck" and take noticeably long (30s+).
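A minimal sketch of that repro (the deployment name, image, and replica count here are illustrative; it requires a test cluster such as kind):

```shell
# Spin up 10 pods, then delete the deployment and time how long the pods
# take to disappear. On 1.19 this finishes well within the grace period;
# on 1.20 one or more pods often linger 30s+ past it.
kubectl create deployment echo --image=k8s.gcr.io/echoserver:1.4 --replicas=10
kubectl rollout status deployment/echo
time kubectl delete deployment echo --wait=true
```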
On the good pods, status immediately transitions to:
lastState: {}
name: echo
ready: false
restartCount: 0
started: false
state:
terminated:
containerID: containerd://44d5515c3f4e9fe5afe067f1ffc965327a02a6d19670d38ebcc26d195d37bf80
exitCode: 0
finishedAt: "2021-01-07T16:12:02Z"
reason: Completed
startedAt: "2021-01-07T16:11:22Z"
On the bad pods:
lastState:
terminated:
exitCode: 137
finishedAt: null
message: The container could not be located when the pod was deleted. The
container used to be Running
reason: ContainerStatusUnknown
startedAt: null
name: echo
ready: false
restartCount: 0
started: false
state:
waiting:
reason: ContainerCreating
I built 1.20 with #92817 reverted as gcr.io/howardjohn-istio/kindest/node:v1.20.0-revert-pod
With this, the problem is not reproducible. cc @SergeyKanzhelev @kmala
PR #92817 adds extra time to pod deletion from the apiserver, and that extra time depends on the performance of the container runtime. We need to look at the kubelet logs to understand why it took 3x as long, since the additional time is only meant to guarantee removal of the pod sandbox before the pod is removed from the apiserver.
The grace period is the duration in seconds between the processes in the pod being sent a termination signal and the processes being forcibly halted with a kill signal. So it shouldn't be used to judge how long a pod takes to be deleted from the apiserver: after the pod is halted with the kill signal, it can take some time for the storage and network resources attached to the pod to be cleaned up before the pod can be removed from the apiserver.
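For context, the grace period in question comes from the pod spec. A minimal illustrative fragment (names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example        # illustrative name
spec:
  # Window between SIGTERM and SIGKILL for the pod's processes; not an
  # upper bound on how long the pod object may remain in the apiserver.
  terminationGracePeriodSeconds: 30
  containers:
  - name: app
    image: nginx
```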
@kmala there are kubelet logs in the original issue, is there more info needed?
/triage accepted
/priority important-soon
On the repro from @howardjohn I see the following behavior. The test starts at 19:07:45.336759. containerd.log shows that all but one sandbox got removed before 2021-01-08T19:08:05.372949491Z. Then there is a 24-second gap in which nothing happens. Finally, DeleteSandbox is called for the last sandbox (DeleteSandbox comprises Stop, Teardown, and Remove):
DeleteSandbox (04bbff577b2d46a870010dedb6828996345d97d2ada5ebe0ace3ed5367fd6ce7)
19:08:29 kind-control-plane containerd[181]: time="2021-01-08T19:08:29.871006393Z" level=info msg="StopPodSandbox for \"04bbff577b2d46a870010dedb6828996345d97d2ada5ebe0ace3ed5367fd6ce7\""
19:08:29 kind-control-plane containerd[181]: time="2021-01-08T19:08:29.901476111Z" level=info msg="TearDown network for sandbox \"04bbff577b2d46a870010dedb6828996345d97d2ada5ebe0ace3ed5367fd6ce7\" successfully"
19:08:29 kind-control-plane containerd[181]: time="2021-01-08T19:08:29.901615727Z" level=info msg="StopPodSandbox for \"04bbff577b2d46a870010dedb6828996345d97d2ada5ebe0ace3ed5367fd6ce7\" returns successfully"
19:08:29 kind-control-plane containerd[181]: time="2021-01-08T19:08:29.903159396Z" level=info msg="RemovePodSandbox for \"04bbff577b2d46a870010dedb6828996345d97d2ada5ebe0ace3ed5367fd6ce7\""
19:08:29 kind-control-plane containerd[181]: time="2021-01-08T19:08:29.916319914Z" level=info msg="RemovePodSandbox \"04bbff577b2d46a870010dedb6828996345d97d2ada5ebe0ace3ed5367fd6ce7\" returns successfully"
This may indicate that the last sandbox was deleted by the GC. We confirmed it by adding logging to removeOldestNSandboxes: all the delayed sandboxes are removed from there. The idea of #92817 was that pkg/kubelet/pod_sandbox_deleter.go would delete sandboxes faster, before the GC runs. So in the worst case, 1.20 adds up to a minute to pod termination.
DeleteSandbox is called on a PLEG event (pleg.ContainerRemoved). So either we didn't receive that event, or kl.IsPodDeleted(podID) returned false when we were assessing whether to delete the sandbox. Looking further.
I haven't gone and filed separate issues for each test, but I've noticed a number of the node e2e tests that involve pod deletion and low default timeouts (e.g. 1m for deletion wait time) have been flaking in 1.20 where they weren't in 1.19 and earlier. So this is repro'd in upstream k8s CI as well.
For example, see some of the taint tests:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=NoExecuteTaintManager
https://testgrid.k8s.io/sig-release-1.20-informing#gce-cos-k8sbeta-serial&include-filter-by-regex=NoExecuteTaint
@SergeyKanzhelev FYI
A few considerations.
First. Do we need to delete the sandbox from killPodWithSyncResult? The logs have a few calls to StopPodSandbox at the right time, when the kubelet is trying to kill the pod. But since we only stop the sandbox and never delete it, the pod is kept alive under the new way of calculating status. So one fix would be to call RemovePodSandbox right after this code:
kubernetes/pkg/kubelet/kuberuntime/kuberuntime_manager.go
Lines 919 to 925 in d233111
Second. It doesn't feel correct that the sandbox is not deleted from the pleg.ContainerDied PLEG event. Since the container is done, the sandbox is no longer needed, and perhaps we should schedule sandbox deletion there. I just don't know whether a pleg.ContainerRemoved event would fire if the container were deleted from the pleg.ContainerDied callback.
Third. We can roll back the part of #92817 that keeps the pods alive, while keeping the immediate-cleanup logic and continuing to investigate. I'm not sure when 1.20.2 is scheduled, but it would be great to have this bug fixed there.
Small update. On my local repro taken from @howardjohn (deleting 10 pods), I see the following:
- An event of type pleg.ContainerRemoved is received for all 10 pods.
kubernetes/pkg/kubelet/kubelet.go
Lines 1936 to 1938 in 7511523
- For the first few pods, the condition kl.IsPodDeleted returns false, indicating that the pod has not yet been marked deleted.
kubernetes/pkg/kubelet/kubelet.go
Lines 2262 to 2270 in 7511523
- Since there is only one sandbox per pod in this example, it will not be deleted at this point: the for loop will not run.
- Then, when GC kicks in, those sandboxes are deleted and the pods are removed from the API server.
So the issue is that the assumption made in #92817, that the PLEG event pleg.ContainerRemoved is executed after the pod is already marked for deletion, is not correct.
@SergeyKanzhelev OK
For master, feel free to ping me for investigating or testing on this.
Is there an actual solution or workaround to deal with this?
I have the same problem with nodes running v.1.20.5
@rdxmb are you sure you have the same issue? The fix was backported into 1.20 quite some time ago.
As this is a very long issue with many comments: how can I reproduce or debug to check whether mine is the same?
edit: In my case this is related to #42889 (comment). I am not quite sure what is cause and what is effect here.
@rdxmb your issue is different from this one. This issue is about an increase in the time taken for a pod to be deleted from the apiserver; it is not related to volume plugins.
@kmala ok thanks.