zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

List order of sync_standby is inconsistent with synchronous_standby_names

XiuhuaRuan opened this issue

What happened?

We want to bind a standby VIP to the sync standby with the highest sync_priority, based on the sync_standby list in the DCS. However, the order of the sync_standby list is inconsistent with synchronous_standby_names: it is sorted by application name, not by sync_priority. In addition, synchronous_standby_names itself is not stable when synchronous_node_count is changed.

To make the order of sync_standby and synchronous_standby_names more stable, I suggest considering sync_priority as a third sort condition, after sync_state and LSN, when collecting the ReplicaList from pg_stat_replication. Please help evaluate this proposal. Thanks.
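
To illustrate the proposal, here is a minimal, self-contained sketch of the suggested ordering. It is not Patroni's actual ReplicaList code; the Replica record, the STATE_RANK mapping and the sample values are made up for the example. The idea is simply that ties on sync_state and LSN are broken by sync_priority, so the result matches the priority order of synchronous_standby_names:

# Hypothetical sketch of the proposed sort order; not Patroni's real implementation.
from typing import List, NamedTuple

class Replica(NamedTuple):
    application_name: str
    sync_state: str      # 'sync', 'quorum', 'potential' or 'async', as in pg_stat_replication
    lsn: int             # replayed/flushed LSN; higher is better
    sync_priority: int   # 1 is the highest priority, 0 means async

STATE_RANK = {'async': 0, 'potential': 1, 'quorum': 2, 'sync': 3}

def sort_replicas(replicas: List[Replica]) -> List[Replica]:
    # Sort by sync_state, then LSN, then (the proposed third condition) sync_priority,
    # so that replicas with equal state and LSN come out in priority order.
    return sorted(
        replicas,
        key=lambda r: (STATE_RANK.get(r.sync_state, 0), r.lsn, -r.sync_priority),
        reverse=True,
    )

# The situation from this issue: equal state and LSN, different priorities.
replicas = [
    Replica('ee_02', 'sync', 0xF3000000, 2),
    Replica('ee_03', 'sync', 0xF3000000, 1),
]
print([r.application_name for r in sort_replicas(replicas)])  # ['ee_03', 'ee_02']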

How can we reproduce it (as minimally and precisely as possible)?

  1. Set up a cluster with 3 nodes
  2. Set synchronous_mode to true via patronictl edit-config; ee_03 was selected as the sync standby
  3. Set synchronous_node_count to 2 via patronictl edit-config; ee_03 and ee_02 were selected as sync standbys
  4. Check the sync standby list (a sketch for reading both values programmatically follows this list):
    postgresql.conf:
    synchronous_standby_names = '2 (ee_03,ee_02)'
    /sync key in etcd:
    {"leader":"ee_01","sync_standby":"ee_02,ee_03"}
  5. Set synchronous_node_count back to 1; sometimes ee_02 was kept as the sync standby and sometimes ee_03 was
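
For step 4, a small sketch of how both values can be read programmatically is shown below. It assumes the third-party psycopg2 and etcd3 Python packages and reuses the host addresses, scope and namespace from the configuration in this report; none of this tooling is part of the original issue:

# Hypothetical helper to compare synchronous_standby_names with Patroni's /sync key.
# Assumes the psycopg2 and etcd3 packages and the hosts/credentials from this issue.
import json

import etcd3
import psycopg2

# Read synchronous_standby_names from the primary.
conn = psycopg2.connect(host='192.168.61.105', port=5432, user='postgres',
                        password='postgres', dbname='postgres')
with conn, conn.cursor() as cur:
    cur.execute('SHOW synchronous_standby_names')
    ssn = cur.fetchone()[0]

# Read the /sync key written by Patroni (namespace + scope + '/sync').
client = etcd3.client(host='192.168.61.105', port=2379)
value, _ = client.get('/service/postgres-cluster/sync')
sync_key = json.loads(value) if value else {}

print('synchronous_standby_names:', ssn)                           # e.g. 2 (ee_03,ee_02)
print('sync_standby in DCS:      ', sync_key.get('sync_standby'))  # e.g. ee_02,ee_03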

What did you expect to happen?

  1. The sync_standby list order should be consistent with synchronous_standby_names
  2. synchronous_standby_names should change consistently when synchronous_node_count is changed. For example, when decreasing synchronous_node_count, the sync standby with the higher priority (the lower sync_priority value) is expected to remain in the list.

Patroni/PostgreSQL/DCS version

  • Patroni version: 3.1.0
  • PostgreSQL version: 13.0
  • DCS (and its version): etcd 3.5.9

Patroni configuration file

scope: postgres-cluster
namespace: /service/
name: ee_01

restapi:
  listen: 192.168.61.105:8008
  connect_address: 192.168.61.105:8008

etcd:
  hosts: 192.168.61.105:2379,192.168.61.106:2379,192.168.61.107:2379

log:
  level: INFO
  traceback_level: INFO
  dir: /home/postgres/patroni
  file_num: 10
  file_size: 104857600

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 300
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        wal_keep_size: 100
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
        archive_mode: "off"
        archive_timeout: 1800s
        logging_collector: "on"

postgresql:
  database: postgres
  listen: 0.0.0.0:5432
  connect_address: 192.168.61.105:5432
  bin_dir: /usr/local/pgsql/bin
  data_dir: /usr/local/pgsql/data
  pgpass: /home/postgres/tmp/.pgpass

  authentication:
    replication:
      username: postgres
      password: postgres
    superuser:
      username: postgres
      password: postgres
    rewind:
      username: postgres
      password: postgres

  pg_hba:
  - local   all             all                                     trust
  - host    all             all             0.0.0.0/0               trust
  - host    all             all             ::1/128                 trust
  - local   replication     all                                     trust
  - host    replication     all             0.0.0.0/0               trust
  - host    replication     all             ::1/128                 trust

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

patronictl show-config

failsafe_mode: true
loop_wait: 10
master_start_timeout: 300
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    archive_command: test ! -f /home/postgres/wal_archive/%f && cp %p /home/postgres/wal_archive/%f
    archive_mode: 'on'
    archive_timeout: 1800s
    hot_standby: 'on'
    logging_collector: 'on'
    max_replication_slots: 10
    wal_keep_size: 100
    wal_level: logical
    wal_log_hints: 'on'
  use_pg_rewind: true
retry_timeout: 10
synchronous_mode: true
synchronous_node_count: 1
ttl: '30'

Patroni log files

In the Patroni log, the list order changed when synchronous_node_count was changed from 1 to 2, which caused some confusion. At the beginning it showed ['ee_03', 'ee_02'], because list(picked) was ['ee_03', 'ee_02'] as candidates. Then it showed ['ee_02', 'ee_03'], because list(allow_promote) had changed to ['ee_02', 'ee_03'] as sync_nodes. When sync_state and lsn are the same, using sync_priority as the third sort condition of ReplicaList would make list(allow_promote) ['ee_03', 'ee_02'], the same as list(picked).
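
As a small illustration of the mismatch described above (made-up rows, not Patroni code): when sync_state and LSN are equal, ordering by application name gives ['ee_02', 'ee_03'], while the proposed sync_priority tie-breaker gives ['ee_03', 'ee_02'], matching list(picked):

# Made-up rows with equal sync_state and LSN but different priorities.
rows = [
    {'application_name': 'ee_02', 'sync_state': 'sync', 'lsn': 100, 'sync_priority': 2},
    {'application_name': 'ee_03', 'sync_state': 'sync', 'lsn': 100, 'sync_priority': 1},
]

by_name = sorted(rows, key=lambda r: r['application_name'])
by_priority = sorted(rows, key=lambda r: (r['sync_state'], r['lsn'], -r['sync_priority']), reverse=True)

print([r['application_name'] for r in by_name])      # ['ee_02', 'ee_03'] -- order seen in allow_promote / the DCS sync_standby key
print([r['application_name'] for r in by_priority])  # ['ee_03', 'ee_02'] -- order of picked / synchronous_standby_names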

2024-05-09 16:12:23,737 INFO: Assigning synchronous standby status to ['ee_03']
2024-05-09 16:12:26,024 INFO: Synchronous standby status assigned to ['ee_03']
2024-05-09 16:12:26,080 INFO: no action. I am (ee_01), the leader with the lock
2024-05-09 16:12:33,723 INFO: no action. I am (ee_01), the leader with the lock
2024-05-09 16:12:43,765 INFO: no action. I am (ee_01), the leader with the lock
2024-05-09 16:12:48,987 INFO: no action. I am (ee_01), the leader with the lock
2024-05-09 16:12:59,085 INFO: no action. I am (ee_01), the leader with the lock
2024-05-09 16:13:08,985 INFO: no action. I am (ee_01), the leader with the lock
2024-05-09 16:13:19,042 INFO: no action. I am (ee_01), the leader with the lock
2024-05-09 16:13:28,938 INFO: Lock owner: ee_01; I am ee_01
2024-05-09 16:13:28,992 INFO: Assigning synchronous standby status to ['ee_03', 'ee_02']
2024-05-09 16:13:31,303 INFO: Synchronous standby status assigned to ['ee_02', 'ee_03']
2024-05-09 16:13:31,356 INFO: no action. I am (ee_01), the leader with the lock
2024-05-09 16:13:39,031 INFO: no action. I am (ee_01), the leader with the lock
2024-05-09 16:13:49,091 INFO: no action. I am (ee_01), the leader with the lock
2024-05-09 16:13:59,009 INFO: no action. I am (ee_01), the leader with the lock
2024-05-09 16:13:59,010 INFO: Lock owner: ee_01; I am ee_01
2024-05-09 16:13:59,059 INFO: Updating synchronous privilege temporarily from ['ee_02', 'ee_03'] to ['ee_02']
2024-05-09 16:13:59,106 INFO: Assigning synchronous standby status to ['ee_02']
2024-05-09 16:13:59,431 INFO: no action. I am (ee_01), the leader with the lock

PostgreSQL log files

In the PostgreSQL log, ee_02, the standby with the lower priority, remained when synchronous_node_count was changed from 2 to 1. With priority-based synchronous replication, the standbys with the highest priorities are considered sync and the remaining standbys are considered potential. Although we set an exact node count in synchronous_standby_names, it seems more reasonable to keep the original higher-priority node in the list.
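
To make the expectation concrete, here is a small made-up sketch (not PostgreSQL or Patroni code; the parsing only handles the 'N (a,b,...)' form used in this issue) of how priorities follow list position in a priority-based synchronous_standby_names value, and which standby one would therefore expect to keep when the count drops to 1:

# Made-up sketch: in priority-based (FIRST) replication, priority follows list
# position in synchronous_standby_names, with 1 being the highest priority.
def priorities(standby_names: str) -> dict:
    # '2 (ee_03,ee_02)' -> {'ee_03': 1, 'ee_02': 2}
    names = standby_names.split('(', 1)[1].rstrip(')').split(',')
    return {name.strip(): pos for pos, name in enumerate(names, start=1)}

prio = priorities('2 (ee_03,ee_02)')
print(prio)                    # {'ee_03': 1, 'ee_02': 2}, matching the PostgreSQL log below

# When synchronous_node_count drops to 1, the expectation in this report is that
# the highest-priority standby (lowest number) remains:
keep = min(prio, key=prio.get)
print(keep)                    # ee_03 -- yet the log shows ee_02 was kept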

2024-05-09 16:12:23.865 CST [6977] LOG:  parameter "synchronous_standby_names" changed to "ee_03"
2024-05-09 16:12:23.975 CST [7013] LOG:  standby "ee_03" is now a synchronous standby with priority 1
2024-05-09 16:12:23.975 CST [7013] STATEMENT:  START_REPLICATION SLOT "ee_03" 0/ED000000 TIMELINE 50
2024-05-09 16:12:49.111 CST [6977] LOG:  received SIGHUP, reloading configuration files
2024-05-09 16:13:29.116 CST [6977] LOG:  parameter "synchronous_standby_names" changed to "2 (ee_03,ee_02)"
2024-05-09 16:13:29.235 CST [17473] LOG:  standby "ee_02" is now a synchronous standby with priority 2
2024-05-09 16:13:29.235 CST [17473] STATEMENT:  START_REPLICATION SLOT "ee_02" 0/F3000000 TIMELINE 50
2024-05-09 16:13:59.235 CST [6977] LOG:  received SIGHUP, reloading configuration files
2024-05-09 16:13:59.236 CST [6977] LOG:  parameter "synchronous_standby_names" changed to "ee_02"
2024-05-09 16:13:59.550 CST [6977] LOG:  received SIGHUP, reloading configuration files

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

No response