purestorage / pso-csi

PSO CSI helm chart

[Bug] PSO DB scheduler schedules pods which will never join the cluster.

Cellebyte opened this issue

Hey, I've got an issue with PSO on

  • K8s: v1.19.4
  • PSO: v6.0.4

The DB scheduler goes wild and creates a bunch of StatefulSets.
The issue is that their pods are not able to join the existing cluster and stay in a not-ready state.
Is that intended behavior?

pso-db-0-0                                   1/1     Running   0          6d18h
pso-db-1-0                                   1/1     Running   0          6d18h
pso-db-10-0                                  0/1     Running   0          6d18h
pso-db-11-0                                  0/1     Running   0          6d18h
pso-db-12-0                                  0/1     Running   0          6d18h
pso-db-13-0                                  0/1     Running   0          6d18h
pso-db-14-0                                  0/1     Running   0          6d18h
pso-db-15-0                                  0/1     Running   0          6d17h
pso-db-16-0                                  0/1     Running   0          6d17h
pso-db-17-0                                  0/1     Running   0          6d17h
pso-db-18-0                                  0/1     Running   0          6d17h
pso-db-19-0                                  0/1     Running   0          6d17h
pso-db-2-0                                   0/1     Running   0          6d18h
pso-db-20-0                                  0/1     Running   0          6d16h
pso-db-21-0                                  0/1     Running   0          6d16h
pso-db-22-0                                  0/1     Running   0          6d16h
pso-db-23-0                                  0/1     Running   0          6d16h
pso-db-24-0                                  0/1     Running   0          6d16h
pso-db-25-0                                  0/1     Running   0          6d16h
pso-db-26-0                                  0/1     Running   0          6d16h
pso-db-27-0                                  0/1     Running   0          6d16h
pso-db-28-0                                  0/1     Running   0          6d16h
pso-db-29-0                                  0/1     Running   0          6d16h

@Cellebyte can you confirm what deployment of Kubernetes you used, the CNI in place, and that all the PSO prerequisites have been correctly installed and configured on all nodes in the cluster?
Do you see pso-db volumes being created on your backend devices?

@sdodsley
I can confirm that a PV is created for every pod and is also available on the backend. We use Calico and the default installation from your repository.

@Cellebyte - what I meant was, are you using OpenShift, Rancher, etc to deploy k8s?
The usual reason you get this is that IPv4 forwarding is not enabled on the worker nodes - Rancher deployments are definitely affected by this.
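
For anyone verifying that prerequisite, a quick check on a worker node might look like the following (standard Linux sysctl, nothing PSO-specific is assumed):

# Show whether IPv4 forwarding is enabled (1 = enabled)
sysctl net.ipv4.ip_forward

# Enable it persistently if it is off
echo "net.ipv4.ip_forward = 1" | sudo tee /etc/sysctl.d/99-ipv4-forward.conf
sudo sysctl --system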

@sdodsley we use neither Rancher nor OpenShift; we run vanilla K8s. IPv4 forwarding is enabled.

Understood. Would it be possible to try and diagnose this live via a Zoom call? I can be free for the next hour to look at this.

@sdodsley I am currently booked up with meetings.
Is there a time slot tomorrow or on Wednesday?

@sdodsley it happens if the StatefulSet pods get rescheduled, or the containers are deleted and rescheduled.

@Cellebyte thanks for the additional information.
If you have a clean install and none of the StatefulSets get rescheduled or deleted, does everything look correct, and can you provision PVs and PVCs?
I can be available tomorrow from 0900 - 1100 EDT (GMT -4)

@sdodsley provisioning is not blocked. The issue is that the DB scheduler keeps adding unnecessary StatefulSets, and I was not able to delete them because they are recreated after deletion ^^

@Cellebyte got it - are you able to do tomorrow at the times specified?

I am available. 👍

@Cellebyte I sent you a Zoom invite for tomorrow

@sdodsley will join at 16:00 GMT+1

Hey,

so, to recreate the issue:

kubectl delete pods -n <namespace-of-pso> --all

Delete all pods in the namespace where PSO is deployed
and wait for them to reschedule. It will take some time, and then the DB is no longer satisfied.
Some DB pods will reschedule onto the nodes, but the scheduler keeps
creating new StatefulSets and therefore pods.
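
A simple way to watch the extra StatefulSets accumulate while reproducing this (assuming only that PSO lives in the placeholder namespace) is something like:

# Watch StatefulSets in the PSO namespace grow beyond the expected count
kubectl get statefulsets -n <namespace-of-pso> --watch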

When you are in that state and want to clean up the namespace, it is not possible, because everything waits for the finalizer

unpublish-volume

This will not work until the database is up and running again, which will never happen.
It keeps the messy namespace in the cluster until I start editing the resources and removing the finalizers myself, since the unpublish will never complete on its own.
There should be a force-deletion option which removes the finalizers after some time, once it sees that volume-unpublish is not working or not even starting.
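
For reference, manually stripping a finalizer usually looks like the sketch below; the resource kind and name are placeholders for whichever objects are stuck, and this is a last-resort workaround rather than anything PSO documents:

# Remove all finalizers from a stuck resource so deletion can proceed
# (replace the kind and name with the actual stuck object, e.g. a PVC)
kubectl patch <kind> <name> -n <namespace-of-pso> --type merge -p '{"metadata":{"finalizers":null}}'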

The issue was actually caused by environmental errors - a time-synchronization service was not present on the nodes. This caused CockroachDB to have connectivity issues, and the cockroach operator goes bad once the maximum of 20 DB pods end up in a bad state.
Ensuring a time-sync process is running on all nodes of the cluster will stop this issue from happening again.
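
As a quick sanity check, something along these lines on each worker node should confirm whether a time-sync service is running (service names vary by distribution, so chronyd here is only one common example):

# Check that the system clock is reported as synchronized
timedatectl status

# Check the time-sync daemon itself
systemctl status chronyd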