Manual failover candidates
elodiefb opened this issue · comments
What happened?
Manual failover now lists all members as candidates, including the current leader.
[postgres@pghost5 patroni]$ patronictl -c patroni.yml failover
Current cluster topology
+ Cluster: postgres-cluster (7303378219179270632) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 123.0.0.1 | Replica | streaming | 16 | 0 |
| postgres_02 | 123.0.0.2 | Leader | running | 16 | |
| postgres_03 | 123.0.0.3 | Replica | streaming | 16 | 0 |
+-------------+----------------+---------+-----------+----+-----------+
Candidate ['postgres_01', 'postgres_02', 'postgres_03'] []: postgres_02
Are you sure you want to failover cluster postgres-cluster, demoting current leader postgres_02? [y/N]: y
Failover failed, details: 503, Failover failed
How can we reproduce it (as minimally and precisely as possible)?
patronictl -c patroni.yml failover
What did you expect to happen?
I think it would be more reasonable to remove the current leader from the failover candidates list, since it cannot actually be a failover target: even if it is selected as the candidate, the request is refused and an error is reported. Listing it as a failover candidate only causes confusion from the user's point of view - you told me it could be selected, but when I selected it, you refused...
Moreover, in previous versions the current leader was NOT listed as a failover candidate, which I think is more acceptable from the user's point of view.
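The expected filtering could be sketched like this (a minimal illustration, not Patroni's actual code; the member dicts are made up to mirror the cluster topology shown above):

```python
# Minimal sketch: build the failover candidate list by excluding the
# member that currently holds the leader role, so patronictl never
# offers a candidate that the server will refuse anyway.
members = [
    {"name": "postgres_01", "role": "replica"},
    {"name": "postgres_02", "role": "leader"},
    {"name": "postgres_03", "role": "replica"},
]

candidates = [m["name"] for m in members if m["role"] != "leader"]
print(candidates)  # ['postgres_01', 'postgres_03']
```

With this filtering, the prompt would read `Candidate ['postgres_01', 'postgres_03'] []:`, matching the pre-3.2.0 behavior described above.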
Patroni/PostgreSQL/DCS version
- Patroni version: 3.2.0
- PostgreSQL version: 14.0
- DCS (and its version): etcd 3.5.9
Patroni configuration file
scope: postgres-cluster
namespace: /service/
name: postgres_01

restapi:
  listen: 123.0.0.1:8008
  connect_address: 123.0.0.1:8008

etcd:
  hosts: 123.0.0.1:2379,123.0.0.2:2379,123.0.0.3:2379

log:
  level: INFO
  traceback_level: INFO
  dir: /home/postgres/patroni
  file_num: 10
  file_size: 104857600

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 300
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      parameters:
        listen_addresses: "*"
        port: 5432
        wal_level: replica
        hot_standby: "on"
        wal_keep_size: 100
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
        archive_mode: "off"
        archive_timeout: 1800s
        #------------log---------------------#
        logging_collector: on
        log_destination: 'stderr'
        log_truncate_on_rotation: on
        log_checkpoints: on
        log_connections: on
        log_disconnections: on
        log_error_verbosity: default
        log_lock_waits: on
        log_temp_files: 0
        log_autovacuum_min_duration: 0
        log_min_duration_statement: 50
        log_timezone: 'PRC'
        log_filename: postgresql-%Y-%m-%d_%H.log
        log_line_prefix: '%t [%p]: db=%d,user=%u,app=%a,client=%h '
        #-----------------------------------

postgresql:
  database: postgres
  listen: 0.0.0.0:5432
  connect_address: 123.0.0.1:5432
  bin_dir: /usr/local/pgsql/bin
  data_dir: /usr/local/pgsql/data
  pgpass: /home/postgres/tmp/.pgpass
  authentication:
    replication:
      username: postgres
      password: postgres
    superuser:
      username: postgres
      password: postgres
    rewind:
      username: postgres
      password: postgres
  pg_hba:
    - local all all trust
    - host all all 0.0.0.0/0 trust
    - host all all ::1/128 trust
    - local replication all trust
    - host replication all 0.0.0.0/0 trust
    - host replication all ::1/128 trust

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
patronictl show-config
loop_wait: 10
master_start_timeout: 300
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    archive_mode: 'off'
    archive_timeout: 1800s
    hot_standby: 'on'
    listen_addresses: '*'
    log_autovacuum_min_duration: 0
    log_checkpoints: true
    log_connections: true
    log_destination: stderr
    log_disconnections: true
    log_error_verbosity: default
    log_filename: postgresql-%Y-%m-%d_%H.log
    log_line_prefix: '%t [%p]: db=%d,user=%u,app=%a,client=%h '
    log_lock_waits: true
    log_min_duration_statement: 50
    log_temp_files: 0
    log_timezone: PRC
    log_truncate_on_rotation: true
    logging_collector: true
    max_replication_slots: 10
    max_wal_senders: 10
    port: 5432
    wal_keep_size: 100
    wal_level: replica
    wal_log_hints: 'on'
  use_pg_rewind: true
retry_timeout: 10
synchronous_mode: false
ttl: 30
Patroni log files
2023-12-12 17:07:08,054 INFO: no action. I am (postgres_02), the leader with the lock
2023-12-12 17:07:18,046 INFO: no action. I am (postgres_02), the leader with the lock
2023-12-12 17:07:18,775 INFO: received failover request with leader=None candidate=postgres_02 scheduled_at=None
2023-12-12 17:07:18,778 INFO: Got response from postgres_02 http://123.0.0.2:8008/patroni: {"state": "running", "postmaster_start_time": "2023-12-12 16:09:53.667173+08:00", "role": "master", "server_version": 140000, "xlog": {"location": 268435456}, "timeline": 16, "replication": [{"usename": "postgres", "application_name": "postgres_01", "client_addr": "123.0.0.1", "state": "streaming", "sync_state": "async", "sync_priority": 0}, {"usename": "postgres", "application_name": "postgres_03", "client_addr": "123.0.0.3", "state": "streaming", "sync_state": "async", "sync_priority": 0}], "dcs_last_seen": 1702372038, "database_system_identifier": "7303378219179270632", "patroni": {"version": "3.2.0", "scope": "postgres-cluster", "name": "postgres_02"}}
2023-12-12 17:07:18,783 INFO: Lock owner: postgres_02; I am postgres_02
2023-12-12 17:07:18,786 WARNING: manual failover: I am already the leader, no need to failover
2023-12-12 17:07:18,786 INFO: Cleaning up failover key
2023-12-12 17:07:18,791 INFO: no action. I am (postgres_02), the leader with the lock
2023-12-12 17:07:28,791 INFO: no action. I am (postgres_02), the leader with the lock
2023-12-12 17:07:38,791 INFO: no action. I am (postgres_02), the leader with the lock
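The rejection visible in the log above (leader receives a failover request naming itself, logs a warning, and cleans up the failover key) could be sketched roughly like this - a simplified illustration only, not Patroni's real implementation:

```python
# Simplified sketch of the server-side behavior seen in the log: a manual
# failover request whose candidate is already the lock owner is rejected,
# which surfaces on the patronictl side as "503, Failover failed".
def handle_failover_request(lock_owner: str, candidate: str) -> tuple[bool, str]:
    if candidate == lock_owner:
        # corresponds to the logged warning and "Cleaning up failover key"
        return False, "manual failover: I am already the leader, no need to failover"
    return True, f"failing over to {candidate}"

print(handle_failover_request("postgres_02", "postgres_02"))
# (False, 'manual failover: I am already the leader, no need to failover')
```

Since this check always rejects the current leader, offering it in the candidate prompt serves no purpose.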
PostgreSQL log files
N/A
Have you tried to use GitHub issue search?
- Yes
Anything else we need to know?
No response