kubernetes-csi / csi-driver-smb

This driver allows Kubernetes to access SMB Server on both Linux and Windows nodes.

Issue while mounting PVs on Azure AKS - Error while dialing dial unix /var/lib/kubelet/plugins/smb.csi.k8s.io/csi.sock

achaikaJH opened this issue

What happened:
I'm having intermittent issues connecting to SMB PVs using the csi-smb driver. I see the following error when describing the pod:

Warning  FailedMount  119s (x104 over 4h6m)  kubelet  MountVolume.SetUp failed for volume 'pv-samba-collateral' : kubernetes.io/csi: mounter.SetUpAt failed to check for STAGE_UNSTAGE_VOLUME capability: rpc error: code = Unavailable desc = connection error: desc = 'transport: Error while dialing dial unix /var/lib/kubelet/plugins/smb.csi.k8s.io/csi.sock: connect: connection refused'
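
To find the driver pod whose logs are shown below, I match the failing pod's node to the csi-smb-node DaemonSet pod running on it (label and namespace assumed from the default csi-driver-smb install):

kubectl get pod <failing-pod> -o wide                          # note the NODE column
kubectl get pods -n kube-system -l app=csi-smb-node -o wide    # find the csi-smb-node pod on that node
kubectl describe pod <csi-smb-node-pod> -n kube-system         # container states, restart counts, recent events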

Driver pod logs show the following:
liveness-probe container:

W0705 15:13:06.963083       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:06.963109       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:06.964188       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:06.964215       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:16.963221       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:16.963245       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:16.963230       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:16.964337       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:26.963252       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:26.963260       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:26.963274       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:26.964375       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:36.963231       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:36.963269       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:36.963269       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:36.964364       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:46.968396       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:46.969181       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:46.971674       1 connection.go:173] Still connecting to unix:///csi/csi.sock
W0705 15:13:46.971722       1 connection.go:173] Still connecting to unix:///csi/csi.sock

"smb" container:

I0627 20:47:31.895779       1 nodeserver.go:100] NodeUnpublishVolume: unmounting volume samba-mim-id on /var/lib/kubelet/pods/89023955-5103-4e04-9f5b-a4526fd6e5ca/volumes/kubernetes.io~csi/pv-samba-mim/mount
I0627 20:47:31.895789       1 mount_linux.go:294] Unmounting /var/lib/kubelet/pods/89023955-5103-4e04-9f5b-a4526fd6e5ca/volumes/kubernetes.io~csi/pv-samba-mim/mount
I0627 20:51:33.933268       1 utils.go:76] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0627 20:51:33.933584       1 utils.go:77] GRPC request: {"target_path":"/var/lib/kubelet/pods/89023955-5103-4e04-9f5b-a4526fd6e5ca/volumes/kubernetes.io~csi/pv-samba-mim/mount","volume_id":"samba-mim-id"}
I0627 20:51:33.937102       1 nodeserver.go:100] NodeUnpublishVolume: unmounting volume samba-mim-id on /var/lib/kubelet/pods/89023955-5103-4e04-9f5b-a4526fd6e5ca/volumes/kubernetes.io~csi/pv-samba-mim/mount
I0627 20:51:33.937325       1 mount_linux.go:294] Unmounting /var/lib/kubelet/pods/89023955-5103-4e04-9f5b-a4526fd6e5ca/volumes/kubernetes.io~csi/pv-samba-mim/mount
I0627 20:55:35.933562       1 utils.go:76] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0627 20:55:35.933592       1 utils.go:77] GRPC request: {"target_path":"/var/lib/kubelet/pods/89023955-5103-4e04-9f5b-a4526fd6e5ca/volumes/kubernetes.io~csi/pv-samba-mim/mount","volume_id":"samba-mim-id"}
I0627 20:55:35.933678       1 nodeserver.go:100] NodeUnpublishVolume: unmounting volume samba-mim-id on /var/lib/kubelet/pods/89023955-5103-4e04-9f5b-a4526fd6e5ca/volumes/kubernetes.io~csi/pv-samba-mim/mount
I0627 20:55:35.933686       1 mount_linux.go:294] Unmounting /var/lib/kubelet/pods/89023955-5103-4e04-9f5b-a4526fd6e5ca/volumes/kubernetes.io~csi/pv-samba-mim/mount

"node-driver-registrar" container

I0421 13:24:36.013122       1 main.go:166] Version: v2.4.0
I0421 13:24:36.013158       1 main.go:167] Running node-driver-registrar in mode=registration
I0421 13:24:36.013730       1 main.go:191] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0421 13:24:43.023531       1 main.go:198] Calling CSI driver to discover driver name
I0421 13:24:43.028370       1 main.go:208] CSI driver name: "smb.csi.k8s.io"
I0421 13:24:43.028407       1 node_register.go:53] Starting Registration Server at: /registration/smb.csi.k8s.io-reg.sock
I0421 13:24:43.028578       1 node_register.go:62] Registration Server started at: /registration/smb.csi.k8s.io-reg.sock
I0421 13:24:43.069212       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0421 13:24:43.791694       1 main.go:102] Received GetInfo call: &InfoRequest{}
I0421 13:24:43.792003       1 main.go:109] "Kubelet registration probe created" path="/var/lib/kubelet/plugins/smb.csi.k8s.io/registration"
I0421 13:24:43.871716       1 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
E0627 20:55:57.002123       1 connection.go:132] Lost connection to unix:///csi/csi.sock.
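
The last line above ("Lost connection to unix:///csi/csi.sock") makes me suspect the smb container briefly stopped serving the socket. A quick way to check for that (pod name taken from the logs above):

kubectl get pod csi-smb-node-hl7pp -n kube-system -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
kubectl logs csi-smb-node-hl7pp -n kube-system -c smb --previous    # log of the previous instance, if the container restarted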

If I SSH to the driver pod I can see that the PV's remote filesystem is mounted and I'm able to access it, so it is not a network connectivity issue.
It isn't happening all the time and isn't affecting all pods and nodes. If I restart the pod several times, it might eventually work.
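
For reference, this is roughly how I verify that (socket path taken from the kubelet error above, the container-internal path from the liveness-probe logs):

kubectl exec -n kube-system csi-smb-node-hl7pp -c smb -- ls -l /csi/csi.sock    # socket as seen inside the container
ls -l /var/lib/kubelet/plugins/smb.csi.k8s.io/csi.sock                          # same socket on the host, via SSH to the node
mount -t cifs                                                                   # on the node: list the mounted SMB shares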

What you expected to happen:
The PV to be mounted on the first try, without errors.

How to reproduce it:
I'm not able to reproduce it on demand; it happens to seemingly random PVs.

Anything else we need to know?:

I've seen this happening on AKS nodes running Ubuntu 22.04.2 LTS with kernels 5.15.0-1035-azure and 5.15.0-1037-azure. I haven't seen it on nodes with 5.15.0-1036-azure or 5.15.0-1038-azure.
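
A quick way to correlate the affected nodes with their kernel versions:

kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion,OS:.status.nodeInfo.osImage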

Environment:

  • CSI Driver version: 1.6.0
  • Kubernetes version (use kubectl version): 1.25.6
  • OS (e.g. from /etc/os-release): Ubuntu 22.04.2 LTS
  • Kernel (e.g. uname -a): 5.15.0-1035-azure
  • Install tools:
  • Others:

Can you check whether there is a CPU, memory, or disk throttling issue on the node in question? Make sure the driver has enough resources to run on that node.
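
For example, something like this (kubectl top requires metrics-server):

kubectl describe node <node-name>    # look at Conditions (MemoryPressure/DiskPressure/PIDPressure) and Allocated resources
kubectl top node <node-name>
kubectl top pod csi-smb-node-hl7pp -n kube-system --containers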

I checked the nodes and I don't see any CPU, memory, or disk pressure. I will try to upgrade to the latest version of Kubernetes and then upgrade the csi-smb driver.
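
To confirm which driver version is actually running before and after the upgrade, I'd check the image on the node DaemonSet (DaemonSet and container names taken from the pod above):

kubectl get ds csi-smb-node -n kube-system -o jsonpath='{.spec.template.spec.containers[?(@.name=="smb")].image}'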

Any update? Is this due to a Linux kernel issue?