EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)

Home Page: https://repmgr.org/

--dry-run does not catch parameter mismatches between servers (e.g. max_worker_processes), so followers are unable to attach after a switchover

NielsKSchjoedt opened this issue

We were just carrying out a switchover of our primary using repmgr 5.3.3, starting with a dry run:

sudo -u postgres repmgr standby switchover --siblings-follow --dry-run

postgres@psql-09:/root$ repmgr standby switchover --siblings-follow --dry-run
NOTICE: checking switchover on node "psql-09" (ID: 9) in --dry-run mode
INFO: SSH connection to host "10.10.10.7" succeeded
INFO: able to execute "repmgr" on remote host "10.10.10.7"
INFO: all sibling nodes are reachable via SSH
INFO: 4 walsenders required, 20 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: archive mode is "off"
INFO: replication lag on this standby is 2 seconds
INFO: 4 replication slots required, 20 available
NOTICE: attempting to pause repmgrd on 5 nodes
NOTICE: local node "psql-09" (ID: 9) would be promoted to primary; current primary "psql-07" (ID: 7) would be demoted to standby
INFO: following shutdown command would be run on node "psql-07":
  "sudo /usr/bin/pg_ctlcluster 15 main stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met

However, psql-09 (which is a more powerful server) was configured with max_worker_processes=64, while psql-07 had only max_worker_processes=32. So when we actually did the switchover, we ended up in a limbo state where none of the replicas could join: they could not restart, because their setting for that parameter was lower than the new primary's:

Aug 21 22:11:47 psql-08 postgres[4082218]: [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
Aug 21 22:11:47 psql-08 postgres[4082221]: [1] LOG:  database system was interrupted while in recovery at log time 2023-08-21 21:49:15 UTC
Aug 21 22:11:47 psql-08 postgres[4082221]: [2] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
Aug 21 22:11:48 psql-08 postgres[4082221]: [1] LOG:  entering standby mode
Aug 21 22:11:48 psql-08 postgres[4082221]: [1] FATAL:  recovery aborted because of insufficient parameter settings
Aug 21 22:11:48 psql-08 postgres[4082221]: [2] DETAIL:  max_worker_processes = 32 is a lower setting than on the primary server, where its value was 64.
Aug 21 22:11:48 psql-08 postgres[4082221]: [3] HINT:  You can restart the server after making the necessary configuration changes.
Aug 21 22:11:48 psql-08 postgres[4082218]: [1] LOG:  startup process (PID 4082221) exited with exit code 1
Aug 21 22:11:48 psql-08 postgres[4082218]: [1] LOG:  aborting startup due to startup process failure
Aug 21 22:11:48 psql-08 postgres[4082218]: [1] LOG:  database system is shut down

It's unexpected that the dry run did not catch this 😬
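
For illustration, here is a minimal sketch of the kind of comparison that could be done by hand before a switchover. This is not something repmgr does, and the function and variable names are hypothetical. It assumes the settings have already been collected from each node (for example with SHOW or from pg_settings) into plain dictionaries. The parameter list is the set that the PostgreSQL hot standby documentation says must be at least as high on a standby as on the primary.

```python
# Parameters a hot standby must set to a value >= the primary's,
# according to the PostgreSQL hot standby documentation.
STANDBY_SENSITIVE_PARAMS = (
    "max_connections",
    "max_prepared_transactions",
    "max_locks_per_transaction",
    "max_wal_senders",
    "max_worker_processes",
)


def find_parameter_conflicts(candidate, others):
    """Compare the promotion candidate's settings with every other node.

    candidate: dict of parameter name -> int for the node to be promoted.
    others:    dict of node name -> dict of parameter name -> int.

    Returns a list of (node, parameter, node_value, candidate_value) for
    each node whose value is lower than the candidate's, i.e. each node
    that would refuse to start as a standby of the new primary.
    """
    conflicts = []
    for node, settings in sorted(others.items()):
        for param in STANDBY_SENSITIVE_PARAMS:
            if param not in settings or param not in candidate:
                continue  # value was not collected, nothing to compare
            if settings[param] < candidate[param]:
                conflicts.append((node, param, settings[param], candidate[param]))
    return conflicts


if __name__ == "__main__":
    # Values taken from this issue; only max_worker_processes is known.
    candidate = {"max_worker_processes": 64}  # psql-09
    others = {
        "psql-07": {"max_worker_processes": 32},
        "psql-08": {"max_worker_processes": 32},
    }
    for node, param, have, need in find_parameter_conflicts(candidate, others):
        print(f"{node}: {param} = {have} is lower than {need} on the promotion candidate")
```

With the values from this issue, both psql-07 and psql-08 are reported, which matches the FATAL message in the psql-08 log above.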