purestorage / pso-csi

PSO CSI helm chart

[Bug] PSO DB scheduler schedules pods which will never join the cluster.

Cellebyte opened this issue

Hey, I've got an issue with PSO on

  • K8s: v1.19.4
  • PSO: v6.0.4

The DB scheduler goes wild and creates a bunch of StatefulSets.
The issue is that their pods are not able to join the existing cluster and stay in a not-ready state.
Is that intended behavior?

pso-db-0-0                                   1/1     Running   0          6d18h
pso-db-1-0                                   1/1     Running   0          6d18h
pso-db-10-0                                  0/1     Running   0          6d18h
pso-db-11-0                                  0/1     Running   0          6d18h
pso-db-12-0                                  0/1     Running   0          6d18h
pso-db-13-0                                  0/1     Running   0          6d18h
pso-db-14-0                                  0/1     Running   0          6d18h
pso-db-15-0                                  0/1     Running   0          6d17h
pso-db-16-0                                  0/1     Running   0          6d17h
pso-db-17-0                                  0/1     Running   0          6d17h
pso-db-18-0                                  0/1     Running   0          6d17h
pso-db-19-0                                  0/1     Running   0          6d17h
pso-db-2-0                                   0/1     Running   0          6d18h
pso-db-20-0                                  0/1     Running   0          6d16h
pso-db-21-0                                  0/1     Running   0          6d16h
pso-db-22-0                                  0/1     Running   0          6d16h
pso-db-23-0                                  0/1     Running   0          6d16h
pso-db-24-0                                  0/1     Running   0          6d16h
pso-db-25-0                                  0/1     Running   0          6d16h
pso-db-26-0                                  0/1     Running   0          6d16h
pso-db-27-0                                  0/1     Running   0          6d16h
pso-db-28-0                                  0/1     Running   0          6d16h
pso-db-29-0                                  0/1     Running   0          6d16h

@Cellebyte can you confirm what deployment of Kubernetes you used, the CNI in place, and that all the PSO prerequisites have been correctly installed and configured on all nodes in the cluster?
Do you see pso-db volumes being created on your backend devices?

@sdodsley
I can confirm that a PV is created for every pod and is also available on the backend. We use Calico and the default installation from your repository.

@Cellebyte - what I meant was, are you using OpenShift, Rancher, etc to deploy k8s?
The usual reason you get this is that IPv4 forwarding is not enabled on the worker nodes - Rancher deployments are definitely affected by this.
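
For anyone verifying that prerequisite, a quick check on a worker node might look like the following (standard Linux sysctl, nothing PSO-specific is assumed):

# Show whether IPv4 forwarding is enabled (1 = enabled)
sysctl net.ipv4.ip_forward

# Enable it persistently if it is off
echo "net.ipv4.ip_forward = 1" | sudo tee /etc/sysctl.d/99-ipv4-forward.conf
sudo sysctl --system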

@sdodsley we use neither Rancher nor OpenShift; we run vanilla K8s. IPv4 forwarding is enabled.

Understood. Would it be possible to try and diagnose this live via a Zoom call? I can be free for the next hour to look at this.

@sdodsley I am currently booked up with meetings.
Is there a time slot tomorrow or on Wednesday?

@sdodsley it happens if the StatefulSet pods get rescheduled, or the containers are deleted and rescheduled.

@Cellebyte thanks for the additional information.
If you have a clean install and none of the StatefulSets get rescheduled or deleted, does everything look correct, and can you provision PVs and PVCs?
I can be available tomorrow from 0900 - 1100 EDT (GMT -4)

@sdodsley provisioning is not blocked. The issue is that the DB scheduler keeps adding unnecessary StatefulSets, and I was not able to delete them because they are recreated after deletion ^^

@Cellebyte got it - are you able to do tomorrow at the times specified?

I am available. 👍

@Cellebyte I sent you a Zoom invite for tomorrow

@sdodsley will join at 16:00 GMT+1

Hey,

so, to recreate the issue:

kubectl delete pods -n <namespace-of-pso> --all

Delete all pods in the namespace where PSO is deployed
and wait for them to reschedule. It will take some time, and then the DB is no longer satisfied.
Some DB pods will reschedule onto the nodes, but the scheduler keeps
creating new StatefulSets and therefore pods.
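
A simple way to watch the extra StatefulSets accumulate while reproducing this (assuming only that PSO lives in the placeholder namespace) is something like:

# Watch StatefulSets in the PSO namespace grow beyond the expected count
kubectl get statefulsets -n <namespace-of-pso> --watch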

When you are in that state and want to clean up the namespace, it is not possible, because everything waits for the finalizer

unpublish-volume

This will not work until the database is up and running again, which will never happen.
It keeps the messy namespace in the cluster until I start editing the resources and removing the finalizers myself, since the unpublish will never complete on its own.
There should be a force-deletion option which removes the finalizers after some time, once it sees that volume-unpublish is not working or not even starting.
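
For reference, manually stripping a finalizer usually looks like the sketch below; the resource kind and name are placeholders for whichever objects are stuck, and this is a last-resort workaround rather than anything PSO documents:

# Remove all finalizers from a stuck resource so deletion can proceed
# (replace the kind and name with the actual stuck object, e.g. a PVC)
kubectl patch <kind> <name> -n <namespace-of-pso> --type merge -p '{"metadata":{"finalizers":null}}'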

The issue was actually caused by environmental errors - a time-synchronization service was not present on the nodes. This caused CockroachDB to have connectivity issues, and the cockroach operator goes bad once the maximum of 20 DB pods end up in a bad state.
Ensuring a time-sync process is running on all nodes of the cluster will stop this issue from happening again.
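
As a quick sanity check, something along these lines on each worker node should confirm whether a time-sync service is running (service names vary by distribution, so chronyd here is only one common example):

# Check that the system clock is reported as synchronized
timedatectl status

# Check the time-sync daemon itself
systemctl status chronyd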