coreos / coreos-kubernetes

CoreOS Container Linux+Kubernetes documentation & Vagrant installers

Home page: https://coreos.com/kubernetes/docs/latest/

rkt deployment pod's network namespace quickly deleted when instantiated after a failure

paulmcgoldrick opened this issue · comments

Container Linux: stable (latest), 1235.9.0, and 1185.3.0
rkt: 1.18.0, 1.21, 1.14
matchbox: 0.5.0
FLANNEL_IMAGE_TAG=v0.6.2
Kubernetes: 1.5.2

When a container in a deployment is restarted, the container loses its flannel IP after a short but arbitrary amount of time.

The log appears as follows:

kubelet-wrapper[1559]: [… ] 1559 rkt.go:2341] rkt: Failed to get pod network status for pod […]: Unexpected command output nsenter: cannot open /var/run/netns/[…]: No such file or directory

Upon investigation, it appears that the systemd unit file related to the pod is not updated with the new netns for the new pod. The disappearance of the IP address then coincides with the next kubelet-wrapper maintenance log entries, such as:
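For checking which namespace a pod's unit still points at, the /var/run/netns/ path can be pulled out of the unit's text and compared with what actually exists on the node. This is only a sketch: the unit fragment below is hypothetical (inspect the real pod unit with `systemctl cat` on the node); only the /var/run/netns/ path format comes from the kubelet-wrapper error above.

```python
import re

# Matches network-namespace paths like the one in the nsenter error above.
NETNS_RE = re.compile(r"/var/run/netns/([\w.-]+)")

def netns_refs(unit_text):
    """Return namespace names referenced anywhere in a systemd unit's text."""
    return NETNS_RE.findall(unit_text)

# Hypothetical unit fragment for illustration only; the real rkt pod units
# may reference the namespace differently.
sample = """\
[Service]
ExecStart=/usr/lib/coreos/kubelet-wrapper ...
Environment=NETNS_PATH=/var/run/netns/k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
"""
print(netns_refs(sample))  # → ['k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90']
```

If the extracted name is missing from `ip netns list`, the unit is referencing a namespace that no longer exists, which matches the nsenter failure above.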

MountVolume.SetUp succeeded for volume [...]

To reproduce, create a deployment with a single container.

watch -n5 kubectl describe po -l run=<your deployment label>

SSH into the host the deployment's pod is running on and run

rkt stop <pod id>

After the dead pod is replaced, watch the logs for the kubelet-wrapper error and the maintenance tasks. The IP will be unset when the latter appear.

I'm not quite sure if I've misconfigured something, nor where to go from here. Please let me know if you need more information.

I've been able to confirm the same behavior using 1.5.3_coreos.0 on rkt 1.14 (going to start bumping Container Linux versions again with 1.5.3 instead of 1.5.2).
My test pod is as follows:

---
apiVersion: v1
kind: Pod
metadata:
  name: busybox-pod
  namespace: default
  labels:
    run: busybox
spec:
  containers:
  - image: busybox
    command:
      - sleep
      - "300"
    imagePullPolicy: IfNotPresent
    name: busybox-container
  restartPolicy: Always

It seems I was in error above in assuming the new pod would use a new network namespace: the namespace referenced in the unit file is created every time the pod is restarted, but subsequently goes away. It also seems the reported IP is cached, which explains the correlation between the kubelet-wrapper logs and the IP "disappearing" from the kubectl describe pod output. Instead, based on my observation of netns monitor, the namespace is removed much more quickly.

Here are the logs of netns monitor as I either manually kill the sleep or allow it to expire. Note that the deletion occurs less than a minute after the pod is instantiated, and this behavior is reproducible with any pod on my current systems. The namespace-less pod continues to exist until it completes or is killed.

add k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
delete k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
add k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
delete k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
add k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
delete k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
add k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
delete k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
add k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
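The pattern above (each delete pairing with the next restart's add, so the namespace never survives) can be tallied mechanically from the monitor output. A minimal sketch, assuming the two-word `add <name>` / `delete <name>` line format shown above:

```python
from collections import Counter

def tally_netns_events(lines):
    """Count add/delete events per namespace from `ip netns monitor` output."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("add", "delete"):
            counts.setdefault(parts[1], Counter())[parts[0]] += 1
    return counts

# Excerpt of the monitor output captured above.
log = """\
add k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
delete k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
add k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
delete k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
add k8s_8c8afe0d-fee5-11e6-a0c7-080027c23b90
""".splitlines()

for ns, events in tally_netns_events(log).items():
    # Adds always lead deletes by exactly one: the namespace never survives.
    print(ns, dict(events))
```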

I have confirmed this with master of this repo as well, on a single node with:
CONTAINER_RUNTIME=rkt

paul$ kubectl version
Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.3", GitCommit:"4957b090e9a4f6a68b4a40375408fdc74a212260", GitTreeState:"clean", BuildDate:"2016-10-16T06:36:33Z", GoVersion:"go1.7.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.3+coreos.0", GitCommit:"8fc95b64d0fe1608d0f6c788eaad2c004f31e7b7", GitTreeState:"clean", BuildDate:"2017-02-15T19:52:15Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
paul$ vagrant ssh default
Last login: Thu Mar  2 19:49:12 UTC 2017 from 10.0.2.2 on pts/9
Container Linux by CoreOS alpha (1325.1.0)
Failed Units: 1
  update-engine.service
core@localhost ~ $ rkt version
rkt Version: 1.23.0
appc Version: 0.8.9
Go Version: go1.7.3
Go OS/Arch: linux/amd64
Features: -TPM +SDJOURNAL

Here are the logs from the CoreOS node at the moment the network namespace is deleted:

Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to read the pod manifest's mtime for pod "0cc32a7a-701d-4b40-943f-99a23f971122": stat /var/lib/rkt/pods/exited-garbage/0cc32a7a-701d-4b40-943f-99a23f971122/pod: no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the pod manifest for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the creation time for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the PID for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to read the pod manifest's mtime for pod "0cc32a7a-701d-4b40-943f-99a23f971122": stat /var/lib/rkt/pods/exited-garbage/0cc32a7a-701d-4b40-943f-99a23f971122/pod: no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the pod manifest for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the creation time for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the PID for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to read the pod manifest's mtime for pod "0cc32a7a-701d-4b40-943f-99a23f971122": stat /var/lib/rkt/pods/exited-garbage/0cc32a7a-701d-4b40-943f-99a23f971122/pod: no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the pod manifest for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the creation time for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the PID for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to read the pod manifest's mtime for pod "0cc32a7a-701d-4b40-943f-99a23f971122": stat /var/lib/rkt/pods/exited-garbage/0cc32a7a-701d-4b40-943f-99a23f971122/pod: no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the pod manifest for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the creation time for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the PID for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to read the pod manifest's mtime for pod "0cc32a7a-701d-4b40-943f-99a23f971122": stat /var/lib/rkt/pods/exited-garbage/0cc32a7a-701d-4b40-943f-99a23f971122/pod: no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the pod manifest for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the creation time for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost rkt[1831]: api-service: failed to get the PID for pod "0cc32a7a-701d-4b40-943f-99a23f971122": no such file or directory
Mar 02 19:50:51 localhost systemd-networkd[1256]: veth0a08ef27: Lost carrier
Mar 02 19:50:51 localhost systemd-timesyncd[705]: Network configuration changed, trying to establish connection.
Mar 02 19:50:51 localhost kernel: cni0: port 5(veth0a08ef27) entered disabled state
Mar 02 19:50:51 localhost kernel: device veth0a08ef27 left promiscuous mode
Mar 02 19:50:51 localhost kernel: cni0: port 5(veth0a08ef27) entered disabled state
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd-timesyncd[705]: Synchronized to time server 72.14.183.239:123 (0.coreos.pool.ntp.org).
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to open directory /run/systemd/transient/var-lib-rkt-pods-run-1df53a44\x2d7cd3\x2d432c\x2d9d04\x2df02120d675d9-stage1-rootfs-opt-stage2-hyperkube-rootfs-var-lib-rkt-pods-exited\x2dgarbage-08f09a8e\x2de23b\x2d442e\x2dbbff\x2d330a37b36791-stage1-rootfs-opt-stage2-busybox\x2dcontainer-rootfs.mount.wants: File name too long
Mar 02 19:50:51 localhost systemd[1]: Failed to open directory /run/systemd/transient/var-lib-rkt-pods-run-1df53a44\x2d7cd3\x2d432c\x2d9d04\x2df02120d675d9-stage1-rootfs-opt-stage2-hyperkube-rootfs-var-lib-rkt-pods-exited\x2dgarbage-08f09a8e\x2de23b\x2d442e\x2dbbff\x2d330a37b36791-stage1-rootfs-opt-stage2-busybox\x2dcontainer-rootfs.mount.requires: File name too long
Mar 02 19:50:51 localhost systemd[1]: Failed to open directory /etc/systemd/system/var-lib-rkt-pods-run-1df53a44\x2d7cd3\x2d432c\x2d9d04\x2df02120d675d9-stage1-rootfs-opt-stage2-hyperkube-rootfs-var-lib-rkt-pods-exited\x2dgarbage-08f09a8e\x2de23b\x2d442e\x2dbbff\x2d330a37b36791-stage1-rootfs-opt-stage2-busybox\x2dcontainer-rootfs.mount.wants: File name too long
Mar 02 19:50:51 localhost systemd[1]: Failed to open directory /etc/systemd/system/var-lib-rkt-pods-run-1df53a44\x2d7cd3\x2d432c\x2d9d04\x2df02120d675d9-stage1-rootfs-opt-stage2-hyperkube-rootfs-var-lib-rkt-pods-exited\x2dgarbage-08f09a8e\x2de23b\x2d442e\x2dbbff\x2d330a37b36791-stage1-rootfs-opt-stage2-busybox\x2dcontainer-rootfs.mount.requires: File name too long
Mar 02 19:50:51 localhost systemd[1]: Failed to open directory /run/systemd/system/var-lib-rkt-pods-run-1df53a44\x2d7cd3\x2d432c\x2d9d04\x2df02120d675d9-stage1-rootfs-opt-stage2-hyperkube-rootfs-var-lib-rkt-pods-exited\x2dgarbage-08f09a8e\x2de23b\x2d442e\x2dbbff\x2d330a37b36791-stage1-rootfs-opt-stage2-busybox\x2dcontainer-rootfs.mount.wants: File name too long
Mar 02 19:50:51 localhost systemd[1]: Failed to open directory /run/systemd/system/var-lib-rkt-pods-run-1df53a44\x2d7cd3\x2d432c\x2d9d04\x2df02120d675d9-stage1-rootfs-opt-stage2-hyperkube-rootfs-var-lib-rkt-pods-exited\x2dgarbage-08f09a8e\x2de23b\x2d442e\x2dbbff\x2d330a37b36791-stage1-rootfs-opt-stage2-busybox\x2dcontainer-rootfs.mount.requires: File name too long
Mar 02 19:50:51 localhost systemd[1]: Failed to open directory /run/systemd/generator/var-lib-rkt-pods-run-1df53a44\x2d7cd3\x2d432c\x2d9d04\x2df02120d675d9-stage1-rootfs-opt-stage2-hyperkube-rootfs-var-lib-rkt-pods-exited\x2dgarbage-08f09a8e\x2de23b\x2d442e\x2dbbff\x2d330a37b36791-stage1-rootfs-opt-stage2-busybox\x2dcontainer-rootfs.mount.wants: File name too long
Mar 02 19:50:51 localhost systemd[1]: Failed to open directory /run/systemd/generator/var-lib-rkt-pods-run-1df53a44\x2d7cd3\x2d432c\x2d9d04\x2df02120d675d9-stage1-rootfs-opt-stage2-hyperkube-rootfs-var-lib-rkt-pods-exited\x2dgarbage-08f09a8e\x2de23b\x2d442e\x2dbbff\x2d330a37b36791-stage1-rootfs-opt-stage2-busybox\x2dcontainer-rootfs.mount.requires: File name too long
Mar 02 19:50:51 localhost systemd[1]: Failed to open directory /usr/lib/systemd/system/var-lib-rkt-pods-run-1df53a44\x2d7cd3\x2d432c\x2d9d04\x2df02120d675d9-stage1-rootfs-opt-stage2-hyperkube-rootfs-var-lib-rkt-pods-exited\x2dgarbage-08f09a8e\x2de23b\x2d442e\x2dbbff\x2d330a37b36791-stage1-rootfs-opt-stage2-busybox\x2dcontainer-rootfs.mount.wants: File name too long
Mar 02 19:50:51 localhost systemd[1]: Failed to open directory /usr/lib/systemd/system/var-lib-rkt-pods-run-1df53a44\x2d7cd3\x2d432c\x2d9d04\x2df02120d675d9-stage1-rootfs-opt-stage2-hyperkube-rootfs-var-lib-rkt-pods-exited\x2dgarbage-08f09a8e\x2de23b\x2d442e\x2dbbff\x2d330a37b36791-stage1-rootfs-opt-stage2-busybox\x2dcontainer-rootfs.mount.requires: File name too long
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost systemd[1]: Failed to set up mount unit: Invalid argument
Mar 02 19:50:51 localhost kubelet-wrapper[1875]: E0302 19:50:51.669193    1875 kubelet.go:1128] Container garbage collection failed: rkt: Failed to clean up rkt pod "08f09a8e-e23b-442e-bbff-330a37b36791": rkt: Failed to remove pod "08f09a8e-e23b-442e-bbff-330a37b36791": failed to run [rm 08f09a8e-e23b-442e-bbff-330a37b36791]: exit status 254
Mar 02 19:50:51 localhost kubelet-wrapper[1875]: stdout:
Mar 02 19:50:51 localhost kubelet-wrapper[1875]: stderr: rm: unable to remove pod "08f09a8e-e23b-442e-bbff-330a37b36791": remove /var/lib/rkt/pods/exited-garbage/08f09a8e-e23b-442e-bbff-330a37b36791/stage1/rootfs: device or resource busy
Mar 02 19:50:51 localhost kubelet-wrapper[1875]: rm: failed to remove one or more pods
Mar 02 19:50:51 localhost kubelet-wrapper[1875]: W0302 19:50:51.802398    1875 container.go:352] Failed to create summary reader for "/system.slice/k8s_08f09a8e-e23b-442e-bbff-330a37b36791.service": none of the resources are being tracked.
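The "File name too long" lines above have a mechanical explanation: systemd derives the mount unit's name by escaping the mount path, and because the exited-garbage pod's rootfs path is nested inside the running pod's rootfs, the escaped name lands at roughly the kernel's NAME_MAX (255 bytes), so the derived .wants/.requires directory names cannot be created. A simplified sketch of the escaping (the authoritative rules are in systemd-escape(1); this handles only the cases seen in the log):

```python
def systemd_escape_path(path):
    """Simplified version of systemd's path escaping: '/' becomes '-',
    other unsafe characters (including literal '-') become \\xXX escapes."""
    out = []
    for ch in path.strip("/"):
        if ch == "/":
            out.append("-")
        elif ch.isalnum() or ch in "_.":
            out.append(ch)
        else:
            out.append("\\x%02x" % ord(ch))
    return "".join(out)

# The mount path from the log: one pod's rootfs nested under another's.
mount_path = ("/var/lib/rkt/pods/run/1df53a44-7cd3-432c-9d04-f02120d675d9"
              "/stage1/rootfs/opt/stage2/hyperkube/rootfs"
              "/var/lib/rkt/pods/exited-garbage/08f09a8e-e23b-442e-bbff-330a37b36791"
              "/stage1/rootfs/opt/stage2/busybox-container/rootfs")

unit = systemd_escape_path(mount_path) + ".mount"
# The .wants/.requires directory names exceed NAME_MAX (255 bytes).
print(len(unit), len(unit + ".wants"))
```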

Able to reproduce this with the latest master and rkt. Cannot reproduce with Docker.

Also running into this.

Reformatted the root filesystem with ext4 (it was btrfs); that seems to have resolved it.