reactive-tech / kubegres

Kubegres is a Kubernetes operator that deploys one or many clusters of PostgreSQL instances and manages database replication, failover and backup.

Home Page: https://www.kubegres.io

Unable to promote last remaining pod

2fst4u opened this issue

My cluster somehow broke silently overnight, and only one of my 3 Kubegres pods is still running, but it isn't the primary. I have gone through the instructions to promote it forcefully, but it won't promote; it keeps looking for the current primary.

This seems like an oversight, since the only reason you'd want to force a failover is when the primary isn't available.

Is there anything I can do to make the remaining instance the primary? I can't retrieve the lost data, but that's fine; I need a working database more than I need that data.

This is what the log of the remaining pod is showing over and over:

could not connect to the primary server: could not translate host name "postgres" to address: no address associated with hostname

So some sort of name resolution isn't working. It's also saying it's trying to connect to the primary, when what I want is for this pod to become the primary itself.

I have added the following to my Kubegres resource:

failover:
  promotePod: "postgres-12-0"

(The correct name of the remaining pod).
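
For context, the relevant part of my Kubegres resource then looks roughly like this (the resource name and namespace match the manager logs further down; every other spec field is left as it was):

apiVersion: kubegres.reactive-tech.io/v1
kind: Kubegres
metadata:
  name: postgres
  namespace: nextcloud
spec:
  replicas: 3
  # ...image, database, env and the rest of the existing spec stay as they are...
  failover:
    promotePod: "postgres-12-0"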

The "status" part of my kubegres CRD states the following:

status:
  blockingOperation:
    operationId: Primary DB count spec enforcement
    statefulSetOperation:
      instanceIndex: 11
      name: postgres-11
    statefulSetSpecUpdateOperation: {}
    stepId: Failing over by promoting a Replica DB as a Primary DB
    timeOutEpocInSeconds: 1657330727
  enforcedReplicas: 8
  lastCreatedInstanceIndex: 12
  previousBlockingOperation:
    hasTimedOut: true
    operationId: Primary DB count spec enforcement
    statefulSetOperation:
      instanceIndex: 11
      name: postgres-11
    statefulSetSpecUpdateOperation: {}
    stepId: Waiting few seconds before failing over by promoting a Replica DB as a
      Primary DB
    timeOutEpocInSeconds: 1657330426

So it looks like it's attempting the failover but something is stopping it.

Anyone? I'm stuck here with a replica that won't promote.

Is the old master not up and running? In that case the cluster is not in a healthy state, and promotePod likely requires the cluster to be healthy, since it reaches out to the old (previous) master when promoting.

Kubegres should try to heal itself by failing over to a replica. Have you enabled failover?
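
As far as I remember the docs, failover is on unless it is explicitly disabled via an isDisabled flag under the failover block; roughly:

spec:
  failover:
    isDisabled: false                   # default; set to true only to switch automatic failover off
    # promotePod: "<replica pod name>"  # optional: force promotion of a specific replica

If that flag was never set to true in your resource, automatic failover should be active.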

I have, or rather, to my understanding failover is enabled by default, is it not?

It seems that when I have a catastrophic failure and 2 of my 3 pods go down (the primary and a replica), the last replica has no idea what to do, I guess because it can't form a quorum and decide to promote. But I thought that was the point of the force-promote setting?

I'm going to try with 6 replicas. Will that help with quorum in the event of a failure like this?
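
For anyone following along, the change is just the replicas field in the spec, everything else untouched:

spec:
  replicas: 6   # was 3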

I have no idea how, but increasing replicas to 6 made the last replica kick into gear and the cluster repaired itself. I feel like the issue still exists in general, but it no longer affects my situation.

For that reason I'll leave this open in case the dev gets a chance to look at and address it.

I'm having this issue again. I have some unrelated problems with my cluster that I'm working on, but in the meantime my inability to promote a replica to primary is blocking me from bringing Postgres back online. In the logs, the controller keeps referring to an old, since-killed pod as the primary.

My Kubegres manager just keeps saying this over and over:

2022-08-12T20:19:08.018Z	DEBUG	controller-runtime.manager.events	Warning	{"object": {"kind":"Kubegres","namespace":"nextcloud","name":"postgres","uid":"6bbfacb2-999a-4064-9125-08cf5441e7c0","apiVersion":"kubegres.reactive-tech.io/v1","resourceVersion":"96337781"}, "reason": "FailOverTimedOutErr", "message": "Last FailOver attempt has timed-out after 300 seconds. The new Primary DB is still NOT ready. It must be fixed manually. Until the PrimaryDB is ready, most of the features of Kubegres are disabled for safety reason.  'Primary DB StatefulSet to fix': postgres-76 - FailOver timed-out"}

postgres-76 doesn't exist anymore; it was killed. Why is this not triggering a failover?

By adding the promote_replica_to_primary.log file I managed to make a replica promote itself, but the manager doesn't know about it, and the service endpoint isn't pointing at the new primary, so other apps can't reach the database.

I need the manager to know the replica it is trying to promote is dead.