cncf / cnf-testbed

ARCHIVED: 🧪🛏️ Cloud-native Network Function (CNF) Testbed --> See the LFN Cloud Native Telecom Initiative: https://wiki.lfnetworking.org/pages/viewpage.action?pageId=113213592


Cluster starts misbehaving after a while

nicoekkart opened this issue

I'm working on GoEPC using NSM, and for this I've set up my own cluster on a virtualized testbed at my university. Everything works fine at first, and all use cases seem to work. After a while, though, some strange behaviour starts, and I have no idea what its root cause is. This might not be related to the cnf-testbed, but maybe someone here knows where to look. Whenever this happens, my only option right now is to set up the cluster again, which costs me about an hour (not ideal).

Setup
My setup looks like this:
[Image: diagram of the testbed setup]
4 nodes, a master, and a controlling PC. From the controlling PC, I run an adapted version of `make k8s` followed by an adapted version of `make vswitch`; I've removed some tasks from these scripts that didn't seem necessary for my setup. The adapted code can be found in my fork, and the workflow is sketched below.
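For reference, the workflow from the controlling PC looks roughly like this (the `make` targets are the testbed's own; the helm release and chart names are placeholders, and I'm using Helm 2 syntax):

```sh
# Provision the Kubernetes cluster across the master and the 4 nodes
make k8s

# Deploy the vSwitch networking layer on the worker nodes
make vswitch

# Typical development loop afterwards: install a chart, test, tear it down
helm install --name goepc ./goepc-chart   # placeholder release/chart names
helm delete --purge goepc                 # full cleanup between runs
```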

Problem
I'm constantly installing and uninstalling helm charts (roughly one full install/uninstall cycle every 10 minutes). After maybe 3 hours of working on one cluster, one or two nodes suddenly start misbehaving. I first notice it when a pod on the problematic node fails with an `UnexpectedAdmissionError`. Looking at the kubelet in more detail, it says: `Update plugin resources failed due to no containers return in allocation response`. This looks like something from the device plugin.
When I then try to delete the pods, the ones on that node are never actually removed, although all containers in those pods are stopped. The commands I use to inspect this are shown below.
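For concreteness, this is roughly how I inspect the failure when it happens (pod and node names are placeholders):

```sh
# Find pods stuck in UnexpectedAdmissionError and see which node they hit
kubectl get pods -o wide | grep -i admission

# The pod's events repeat the admission failure message
kubectl describe pod <failing-pod>

# On the problematic node, the kubelet log has the device-plugin detail
# (assuming the kubelet runs under systemd)
journalctl -u kubelet | grep -i "plugin resources"
```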

I have looked at possible OOM kills, running out of open file handles, ...
I'm running out of ideas for where to look for the cause of this problem. The weird thing is that everything is always fine at the beginning, and then it suddenly starts to crash. My guess is that something is filling up, but I have no idea what. I have not found a way to recover a crashed cluster, so I end up creating a new one every time.
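In case it helps, this is a minimal sketch of what I'm checking on the suspect node, on the assumption that some per-process or system-wide counter is being exhausted:

```sh
# Rank processes by number of open file descriptors (run as root so that
# /proc/<pid>/fd is readable for all processes)
for d in /proc/[0-9]*; do
  n=$(ls "$d/fd" 2>/dev/null | wc -l)
  cmd=$(tr '\0' ' ' < "$d/cmdline" 2>/dev/null | cut -c1-60)
  echo "$n $cmd"
done | sort -rn | head

# System-wide file handle usage: allocated, free, maximum
cat /proc/sys/fs/file-nr
```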

I've never had that issue before. Without knowing the details, it sounds like it could be an issue with the NSM device plugin (DP). I can see there were some issues with the number of socket devices a while ago (networkservicemesh/networkservicemesh#801), but I am not sure whether this could be related.
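If you want to rule the DP out, it might be worth checking whether it is still advertising devices on the bad node; something like the following should show it (the extended resource name and the DP's namespace depend on your NSM version, so treat these as placeholders):

```sh
# Extended resources advertised by device plugins show up under
# Allocatable; look for the NSM socket resource and whether its count
# has dropped to 0 on the misbehaving node
kubectl describe node <node-name> | grep -A8 "Allocatable:"

# Check that the NSM device plugin pod on that node is still healthy
kubectl get pods --all-namespaces -o wide | grep -i nsm
```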

Either way it might be worth checking with the NSM team to see if they have any ideas.

Thanks @michaelspedersen. I've asked the question there.

I'm continuing this issue on the NSM repo; it seems to be a problem with the nsmd file descriptors.