kubernetes-csi / csi-driver-smb

This driver allows Kubernetes to access SMB Server on both Linux and Windows nodes.

On an OpenShift 4.13 cluster, the cifsd process causes CPU hangs

berendiwema opened this issue

I'm not sure whether cifsd is part of this driver or is supplied by the host OS. I also wasn't able to locate the cifsd process on the file system of the affected hosts.
Furthermore, reading the source code did not make it clear to me whether cifsd is part of this driver.

I hope someone is familiar with issues like this and knows a way to mitigate them.

What happened:
On several nodes within an OpenShift 4.13 cluster we see CPUs hanging, apparently due to cifsd issues.

What you expected to happen:
cifsd does not cause hanging CPUs.

How to reproduce it:
Difficult. It looks like network issues cause the share to hang or a lock to time out, but we haven't been able to pinpoint the cause.

Anything else we need to know?:
System logs show:

[920285.608500] watchdog: BUG: soft lockup - CPU#1 stuck for 3703s! [.NET ThreadPool:2647689]
[920289.653493] watchdog: BUG: soft lockup - CPU#15 stuck for 3707s! [cifsd:17906]
[920301.643461] watchdog: BUG: soft lockup - CPU#12 stuck for 3595s! [cifsd:18471]
[920305.624468] watchdog: BUG: soft lockup - CPU#6 stuck for 2295s! [cifsd:18190]
[920313.608432] watchdog: BUG: soft lockup - CPU#1 stuck for 3729s! [.NET ThreadPool:2647689]
[920317.653421] watchdog: BUG: soft lockup - CPU#15 stuck for 3733s! [cifsd:17906]
[920322.866740] systemd[1]: Failed to start Journal Service.
[920329.643386] watchdog: BUG: soft lockup - CPU#12 stuck for 3621s! [cifsd:18471]
[920331.397393] rcu: INFO: rcu_preempt self-detected stall on CPU
[920331.402183] rcu:     15-....: (4019573 ticks this GP) idle=93d/1/0x4000000000000000 softirq=100792278/100814899 fqs=928269 
[920332.922398] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... 15-... } 4019285 jiffies s: 4880525 root: 0x8002/.
[920332.924796] rcu: blocking rcu_node structures (internal RCU debug):
[920333.624382] watchdog: BUG: soft lockup - CPU#6 stuck for 2321s! [cifsd:18190]
[920341.608365] watchdog: BUG: soft lockup - CPU#1 stuck for 3755s! [.NET ThreadPool:2647689]
[920357.643318] watchdog: BUG: soft lockup - CPU#12 stuck for 3647s! [cifsd:18471]
[920357.653318] watchdog: BUG: soft lockup - CPU#15 stuck for 3770s! [cifsd:17906]
[920361.624311] watchdog: BUG: soft lockup - CPU#6 stuck for 2347s! [cifsd:18190]
[920369.608292] watchdog: BUG: soft lockup - CPU#1 stuck for 3781s! [.NET ThreadPool:2647689]
[920385.643249] watchdog: BUG: soft lockup - CPU#12 stuck for 3673s! [cifsd:18471]
[920385.653254] watchdog: BUG: soft lockup - CPU#15 stuck for 3796s! [cifsd:17906]
[920389.624241] watchdog: BUG: soft lockup - CPU#6 stuck for 2373s! [cifsd:18190]
[920397.608226] watchdog: BUG: soft lockup - CPU#1 stuck for 3808s! [.NET ThreadPool:2647689]
[920398.458234] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-... 15-... } 4084821 jiffies s: 4880525 root: 0x8002/.
[920398.460628] rcu: blocking rcu_node structures (internal RCU debug):
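
The soft lockup lines above can be tallied per CPU and task to see which threads are stuck and for how long. What follows is only a minimal Python sketch, assuming the kernel log has been saved as text (for example from `dmesg`). The function name `summarize_soft_lockups` is hypothetical and not part of any tool mentioned here.

```python
import re

# Matches watchdog lines such as:
# [920289.653493] watchdog: BUG: soft lockup - CPU#15 stuck for 3707s! [cifsd:17906]
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(log_text):
    """Return {(cpu, task, pid): longest reported stuck time in seconds}."""
    worst = {}
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m is None:
            continue  # not a soft lockup line (e.g. rcu or systemd messages)
        key = (int(m["cpu"]), m["task"], int(m["pid"]))
        worst[key] = max(worst.get(key, 0), int(m["secs"]))
    return worst
```

Running it over the log excerpt groups the reports into four stuck tasks: three cifsd threads (CPUs 6, 12 and 15) and one .NET ThreadPool thread (CPU 1).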

Environment:

  • CSI Driver version: registry.k8s.io/sig-storage/smbplugin:v1.14.0
  • Kubernetes version (use kubectl version): Kubernetes Version: v1.26.13+8f85140
  • OS (e.g. from /etc/os-release): Red Hat Enterprise Linux CoreOS 413.92.202402131523-0 (Plow)
  • Kernel (e.g. uname -a): 5.14.0-284.52.1.el9_2.x86_64

cifsd is NOT part of this driver; it's supplied by the host.
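
One way to check this on an affected node is to test whether the task is a kernel thread, which would also explain why no matching binary is found on the file system. What follows is a hedged Python sketch, not something from this driver. It assumes the Linux `/proc/<pid>/stat` layout, where the ninth field holds the task flags, and the kernel's `PF_KTHREAD` flag value `0x00200000`. The helper names are hypothetical.

```python
PF_KTHREAD = 0x00200000  # kernel task flag: "I am a kernel thread"


def is_kernel_thread_stat(stat_line):
    """Decide from one /proc/<pid>/stat line whether the task is a kernel thread."""
    # Field 2 (comm) is parenthesized and may contain spaces,
    # so split on the last ')' before tokenizing the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    flags = int(rest[6])  # rest[0] is field 3 (state), so rest[6] is field 9 (flags)
    return bool(flags & PF_KTHREAD)


def is_kernel_thread(pid):
    """Read /proc/<pid>/stat on the local host and apply the check above."""
    with open(f"/proc/{pid}/stat") as f:
        return is_kernel_thread_stat(f.read())
```

On the node itself, `is_kernel_thread(17906)` would be called with one of the cifsd PIDs from the log while that thread still exists.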