openebs-archive / dynamic-nfs-provisioner

Operator for dynamically provisioning an NFS server on any Kubernetes Persistent Volume. Also creates an NFS volume on the dynamically provisioned server for enabling Kubernetes RWX volumes.

Hard rebooting Kubernetes nodes leads to "volume already mounted at more than one place"

keskival opened this issue

The NFS provisioner binds to a backing persistent volume claim in ReadWriteOnce mode.
This is otherwise all well and good, but after a hard reboot of a node, these NFS volume pods fail to start because they cannot get their volume mounts.

Specifically with these events:

Warning  FailedMount  93s (x25 over 37m)  kubelet  MountVolume.MountDevice failed for volume "pvc-542bf63c-575a-4a82-ab4d-96d319e58179" : rpc error: code = FailedPrecondition desc = volume {pvc-542bf63c-575a-4a82-ab4d-96d319e58179} is already mounted at more than one place: {{/var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/jiva.csi.openebs.io/0f4d4b7188975f990ed572ae7bdb4f2f1c07aa967d6460d2a8472343e7c110e1/globalmount  ext4 /dev/disk/by-path/ip-10.152.183.138:3260-iscsi-iqn.2016-09.com.openebs.jiva:pvc-542bf63c-575a-4a82-ab4d-96d319e58179-lun-0}}

At least on MicroK8s I have found no way to determine what exactly is mounting the volume behind the scenes, or maybe the accounting is simply wrong. I suppose some weird ghost container could in principle be keeping the volume reserved, but I haven't managed to find out what or how.

What I have tried:

  • Going through pods and PVCs to make sure nothing else is bound to that volume.
  • Going through the node's mounts (see the sketch below); nothing special there.
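
For anyone hitting the same thing, a minimal sketch of those node-level checks, run directly on the affected node (the volume name is the one from the event above; a leftover container process would show up as a PID that still has the volume in its own mount namespace):

# Look for the volume in the node's own mount table
findmnt | grep pvc-542bf63c-575a-4a82-ab4d-96d319e58179

# Look for leftover processes that still hold the mount in a private mount namespace
grep -l pvc-542bf63c-575a-4a82-ab4d-96d319e58179 /proc/*/mounts 2>/dev/null

# Map any matching PID back to a process/command
ps -o pid,comm,args -p <PID>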

Steps to reproduce the bug:
Have several NFS persistent volume claims active that are backed by ReadWriteOnce volumes, then hard-reboot a Kubernetes node.
Expected:

  • The pods restart without problems.

What happens:

  • The pods get stuck because Kubernetes is convinced something is still reserving the mounts.

I have no clue how to investigate further, and after the manual surgery needed to get the cluster up and running again after this problem, the whole cluster is now past the point of no return and I need to rebuild it from scratch.

Environment details:

  • OpenEBS version: openebs.io/version=3.3.0
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.5", GitCommit:"804d6167111f6858541cef440ccc53887fbbc96a", GitTreeState:"clean", BuildDate:"2022-12-19T15:26:36Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.5", GitCommit:"804d6167111f6858541cef440ccc53887fbbc96a", GitTreeState:"clean", BuildDate:"2022-12-19T15:27:17Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}
  • OS: Ubuntu 22.04.1 LTS
  • kernel (e.g: uname -a): Linux curie 5.15.0-58-generic #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I'm not sure whether this is an NFS Provisioner bug, an OpenEBS Jiva bug, or a MicroK8s bug.

This happens to me about weekly; if anyone has suggestions on how to debug it, I'd be glad to hear them.
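
For reference, the "already mounted at more than one place" error above is returned by the Jiva CSI node plugin, so its logs on the affected node are one obvious place to start; a rough sketch, assuming a default Jiva install in the openebs namespace (pod names and labels may differ):

# Find the Jiva CSI node plugin pod running on the affected node
kubectl -n openebs get pods -o wide | grep jiva-csi-node

# Check its logs around the time of the failed mount
kubectl -n openebs logs <jiva-csi-node-pod> --all-containers --since=1h | grep -i "already mounted"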

Another hard reboot, another stuck pod, now with a slightly different error:

Name:             nfs-pvc-e97aa0c3-d9ed-4d3c-a83e-aceacdb3d2fb-7bd8685545-pzkhq
Namespace:        openebs
Priority:         0
Service Account:  default
Node:             curie/192.168.68.57
Start Time:       Tue, 17 Jan 2023 22:44:15 +0100
Labels:           openebs.io/nfs-server=nfs-pvc-e97aa0c3-d9ed-4d3c-a83e-aceacdb3d2fb
                  pod-template-hash=7bd8685545
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/nfs-pvc-e97aa0c3-d9ed-4d3c-a83e-aceacdb3d2fb-7bd8685545
Containers:
  nfs-server:
    Container ID:
    Image:          openebs/nfs-server-alpine:0.9.0
    Image ID:
    Ports:          2049/TCP, 111/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      SHARED_DIRECTORY:       /nfsshare
      CUSTOM_EXPORTS_CONFIG:
      NFS_LEASE_TIME:         90
      NFS_GRACE_TIME:         90
      FILEPERMISSIONS_UID:    1000
      FILEPERMISSIONS_GID:    2000
      FILEPERMISSIONS_MODE:   0777
    Mounts:
      /nfsshare from exports-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vjqk6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  exports-dir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  nfs-pvc-e97aa0c3-d9ed-4d3c-a83e-aceacdb3d2fb
    ReadOnly:   false
  kube-api-access-vjqk6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  3m8s (x858 over 23m)  kubelet  MountVolume.SetUp failed for volume "pvc-5dc013e1-e51f-4eff-aeb3-2f05ca73e241" : rpc error: code = Internal desc = Could not mount "/var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/jiva.csi.openebs.io/0ac11ee4870718c6acac849e3fca5091569fa2af335fb1cd7cbe57510a28b00c/globalmount" at "/var/snap/microk8s/common/var/lib/kubelet/pods/e5a7647f-75cb-4e49-9ad4-dacee02907f1/volumes/kubernetes.io~csi/pvc-5dc013e1-e51f-4eff-aeb3-2f05ca73e241/mount": mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o bind /var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/jiva.csi.openebs.io/0ac11ee4870718c6acac849e3fca5091569fa2af335fb1cd7cbe57510a28b00c/globalmount /var/snap/microk8s/common/var/lib/kubelet/pods/e5a7647f-75cb-4e49-9ad4-dacee02907f1/volumes/kubernetes.io~csi/pvc-5dc013e1-e51f-4eff-aeb3-2f05ca73e241/mount
Output: mount: /var/snap/microk8s/common/var/lib/kubelet/pods/e5a7647f-75cb-4e49-9ad4-dacee02907f1/volumes/kubernetes.io~csi/pvc-5dc013e1-e51f-4eff-aeb3-2f05ca73e241/mount: special device /var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/jiva.csi.openebs.io/0ac11ee4870718c6acac849e3fca5091569fa2af335fb1cd7cbe57510a28b00c/globalmount does not exist.

The "special file" it complains as not existing is there, which makes this even more mysterious:

# ls -ld /var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/jiva.csi.openebs.io/0ac11ee4870718c6acac849e3fca5091569fa2af335fb1cd7cbe57510a28b00c/globalmount
drwxr-x--- 2 root root 4096 Jan 16 16:44 /var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/jiva.csi.openebs.io/0ac11ee4870718c6acac849e3fca5091569fa2af335fb1cd7cbe57510a28b00c/globalmount
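
One more thing worth checking on the node at this point: whether that globalmount directory is really a staged mount (the ext4 filesystem from the iSCSI device) or just an empty directory left behind by the reboot. A minimal sketch (the iscsiadm check only works if open-iscsi is installed on the node):

# Is globalmount actually a mountpoint, or just a leftover directory?
mountpoint /var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/jiva.csi.openebs.io/0ac11ee4870718c6acac849e3fca5091569fa2af335fb1cd7cbe57510a28b00c/globalmount

# Show what, if anything, is mounted there and from which device
findmnt /var/snap/microk8s/common/var/lib/kubelet/plugins/kubernetes.io/csi/jiva.csi.openebs.io/0ac11ee4870718c6acac849e3fca5091569fa2af335fb1cd7cbe57510a28b00c/globalmount

# Confirm the iSCSI session backing the Jiva volume came back after the reboot
iscsiadm -m session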

I happen to be evaluating OpenEBS's dynamic-nfs-provisioner against other options right now, and I was curious about this problem you ran into.

A few questions:

Can you clarify where the workload pod and the control plane pods were before the reboot and where they ended up afterwards?

A table like the following:

Type                           | Node (pre-reboot) | Node (post-reboot) | Failing?
workload pod                   | ???               | ???                | ???
NFS server pod                 | ???               | ???                | ???
OpenEBS NFS pod                | ???               | ???                | ???
OpenEBS JIVA control plane pod | ???               | ???                | ???
OpenEBS JIVA data plane pod    | ???               | ???                | ???

The control plane pods ("OpenEBS pod") aren't strictly necessary, but I'm curious.

It's unclear exactly which pods are failing and how wide the failure is -- when you say "NFS volume pods", I assume you mean your workloads, but you must also mean the NFS server pod(s) as well, correct?

What exactly happened in the node failure?

Did the node crash and come back up? Was Jiva control plane running on the node that went down? Was the Jiva data plane running on the node that went down?


It seems like you had a Jiva failure which left the NFS server pod unable to access its own PVC, which means it can't serve the drive for your actual workload.

What's weird is that if the node went down but came back up (so the identical mount was available, which you saw), then maybe the Jiva data plane pod went to a different node in the meantime? That shouldn't be possible (it's been a while since I ran Jiva but it should pin controllers/data-plane managers to nodes)...

Since Jiva is Longhorn underneath, you should be able to check the Longhorn dashboard/UI.

I can't remember how hard the Longhorn UI was to get to, but you should be able to find where Longhorn is running in the Jiva control plane pods and port-forward to get at the UI (assuming the port is exposed; you may have to edit the pod). That will tell you if the drive is failing at the Jiva level.
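
A rough sketch of the port-forward step, with the pod name and port left as placeholders since the actual values depend on the Jiva install:

# Find the Jiva controller pod that serves the stuck volume
kubectl -n openebs get pods -o wide | grep <pvc-name>

# Forward a local port to it; <port> is whatever the UI/API listens on in your install
kubectl -n openebs port-forward pod/<jiva-controller-pod> 8080:<port>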

I don't think I can reproduce this problem anymore in my cluster, because I forced all the Dynamic NFS Provisioner server pods onto the same node. They no longer switch between nodes after hard crashes, and since they stay on the same node and mount their Jiva volumes ReadWriteOnce, it doesn't matter if some ghost containers are left over on that node; those won't prevent a remount on the same node.
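
Roughly, the pinning amounts to labelling one node and giving each nfs-pvc-* server deployment a matching nodeSelector; a hedged sketch (the label is made up, and newer provisioner releases may instead offer a node-affinity setting on the provisioner itself, which would survive the server deployments being recreated):

# Label the node that should host every NFS server pod (example label)
kubectl label node curie openebs.io/nfs-server-node=true

# Pin an NFS server deployment to that node with a matching nodeSelector
kubectl -n openebs patch deployment nfs-pvc-e97aa0c3-d9ed-4d3c-a83e-aceacdb3d2fb \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"openebs.io/nfs-server-node":"true"}}}}}'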

However, it would be nice if there were a way to diagnose what exactly is keeping a Jiva volume mounted, or for the Dynamic NFS Provisioner to somehow recover from this on its own.

Ah OK, well if the problem is gone with Jiva constrained that way, it does seem like a Jiva-level problem -- maybe this ticket is worth closing then?

What else do you think would make it an NFS provisioner specific issue?

No matter what backing storage is used for the NFS provisioner, it would generally be ReadWriteOnce. That means that if the NFS server pod moves to another node, it should continue to be able to use these volumes (if they aren't node-specific like hostpath), as long as it is the only one using them.

However, in some hard-reboot cases something is left hanging around, and the zombie NFS server pods keep the volumes reserved so they cannot be remounted from another node.

It would be nice if there were a way to see what is keeping these volumes tangled up, and I wonder if there might be some way the NFS provisioner could unmount the volume automatically in this error case.

I don't know if that is possible, as I don't quite understand what exactly is left hanging on to the volume: it's not visible at the Kubernetes level, and I couldn't find it at the node mount level either.
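
One thing that is visible at the Kubernetes level, for what it's worth, is the VolumeAttachment object, which records which node the CSI driver believes each volume is attached to; a stale attachment still pointing at the old node would at least narrow things down:

# List CSI volume attachments and the node each volume is attached to
kubectl get volumeattachments

# Inspect the attachment for the stuck PV in detail
kubectl describe volumeattachment <attachment-name>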

Perhaps you're right that this is a Jiva-level problem. Maybe Jiva is miscounting the mount locations somehow, still imagining that some already-terminated container holds a mount. OK, this can be closed: I'm personally no longer affected by the problem thanks to the workaround, and if someone hits the same issue they can find the workaround here in this closed issue.

openebs/openebs#3632 seems like possibly the same issue.