Restarting a replica fails if Patroni joined an already running Postgres
What happened?
If Patroni joined a running PostgreSQL instance as a replica, executing patronictl restart on the replica always returned status code=503.
How can we reproduce it (as minimally and precisely as possible)?
- Set up PostgreSQL streaming replication without Patroni (a verification sketch follows the output below).
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 2016
usesysid         | 10
usename          | postgres
application_name | walreceiver
client_addr      | 192.168.61.106
client_hostname  |
client_port      | 42156
backend_start    | 2023-12-26 09:23:34.19185+08
...
flush_lsn        | 0/9000060
replay_lsn       | 0/9000060
write_lag        | 00:00:00.036591
flush_lag        | 00:00:00.036897
replay_lag       | 00:00:00.036908
sync_priority    | 0
sync_state       | async
reply_time       | 2023-12-26 09:23:44.599
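The streaming state can also be checked programmatically before Patroni enters the picture. A minimal sketch using psycopg2 (connection parameters are taken from this setup, and it assumes the primary accepts these connections):

import psycopg2

# Connect to the primary and confirm the standalone standby is streaming.
conn = psycopg2.connect(host='192.168.61.105', dbname='postgres', user='postgres')
with conn.cursor() as cur:
    cur.execute("SELECT application_name, state, sync_state FROM pg_stat_replication")
    print(cur.fetchall())  # expect something like [('walreceiver', 'streaming', 'async')]
conn.close()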
- Start Patroni on the primary and replica nodes in sequence, and check patronictl list.
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml list
+ Cluster: postgres-cluster (7316468372200812230) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader | running | 1 | |
| postgres_02 | 192.168.61.106 | Replica | streaming | 1 | 0 |
+-------------+----------------+---------+-----------+----+-----------+
- patronictl restart on the leader returned success.
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml restart postgres-cluster postgres_01
+ Cluster: postgres-cluster (7316468372200812230) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader | running | 1 | |
| postgres_02 | 192.168.61.106 | Replica | streaming | 1 | 0 |
+-------------+----------------+---------+-----------+----+-----------+
When should the restart take place (e.g. 2023-12-26T10:55) [now]:
Are you sure you want to restart members postgres_01? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Success: restart on member postgres_01
- But patronictl restart on the replica node failed.
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml restart postgres-cluster postgres_02
+ Cluster: postgres-cluster (7316468372200812230) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader | running | 1 | |
| postgres_02 | 192.168.61.106 | Replica | streaming | 1 | 0 |
+-------------+----------------+---------+-----------+----+-----------+
When should the restart take place (e.g. 2023-12-26T11:24) [now]:
Are you sure you want to restart members postgres_02? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Failed: restart for member postgres_02, status code=503, (postgres is still starting)
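For what it's worth, the same 503 can be reproduced by calling the replica's REST API directly, which is what patronictl restart does under the hood. A minimal sketch using the requests library (an empty JSON body asks for an immediate, unconditional restart):

import requests

# POST /restart against the replica's REST API (patronictl uses this endpoint).
r = requests.post('http://192.168.61.106:8008/restart', json={})
print(r.status_code, r.text)  # per this report: 503, postgres is still starting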
What did you expect to happen?
Restarting the replica node should return success.
Patroni/PostgreSQL/DCS version
- Patroni version: 3.2.0
- PostgreSQL version: 14.0
- DCS (and its version): etcd 3.5.9
Patroni configuration file
scope: postgres-cluster
namespace: /service/
name: postgres_01

restapi:
  listen: 192.168.61.105:8008
  connect_address: 192.168.61.105:8008

etcd:
  hosts: 192.168.61.105:2379,192.168.61.106:2379,192.168.61.107:2379

log:
  level: INFO
  traceback_level: INFO
  dir: /home/postgres/patroni
  file_num: 10
  file_size: 104857600

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 300
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      parameters:
        listen_addresses: "*"
        port: 5432
        wal_level: replica
        hot_standby: "on"
        wal_keep_size: 100
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
        archive_mode: "off"
        archive_timeout: 1800s
        logging_collector: on
        log_destination: 'stderr'
        log_truncate_on_rotation: on
        log_checkpoints: on
        log_connections: on
        log_disconnections: on
        log_error_verbosity: default
        log_lock_waits: on
        log_temp_files: 0
        log_autovacuum_min_duration: 0
        log_min_duration_statement: 50
        log_timezone: 'PRC'
        log_filename: postgresql-%Y-%m-%d_%H.log
        log_line_prefix: '%t [%p]: db=%d,user=%u,app=%a,client=%h '

postgresql:
  database: postgres
  listen: 0.0.0.0:5432
  connect_address: 192.168.61.105:5432
  bin_dir: /usr/local/pgsql/bin
  data_dir: /usr/local/pgsql/data
  pgpass: /home/postgres/tmp/.pgpass
  authentication:
    replication:
      username: postgres
      password: postgres
    superuser:
      username: postgres
      password: postgres
    rewind:
      username: postgres
      password: postgres
  pg_hba:
    - local all all trust
    - host all all 0.0.0.0/0 trust
    - host all all ::1/128 trust
    - local replication all trust
    - host replication all 0.0.0.0/0 trust
    - host replication all ::1/128 trust

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
patronictl show-config
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml show-config
loop_wait: 10
master_start_timeout: 300
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    archive_mode: 'off'
    archive_timeout: 1800s
    hot_standby: 'on'
    listen_addresses: '*'
    log_autovacuum_min_duration: 0
    log_checkpoints: true
    log_connections: true
    log_destination: stderr
    log_disconnections: true
    log_error_verbosity: default
    log_filename: postgresql-%Y-%m-%d_%H.log
    log_line_prefix: '%t [%p]: db=%d,user=%u,app=%a,client=%h '
    log_lock_waits: true
    log_min_duration_statement: 50
    log_temp_files: 0
    log_timezone: PRC
    log_truncate_on_rotation: true
    logging_collector: true
    max_replication_slots: 10
    max_wal_senders: 10
    port: 5432
    wal_keep_size: 100
    wal_level: replica
    wal_log_hints: 'on'
  use_pg_rewind: true
retry_timeout: 10
synchronous_mode: false
ttl: 30
Patroni log files
patroni.log on replica:
2023-12-26 10:23:33,507 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:23:33,988 INFO: establishing a new patroni heartbeat connection to postgres
2023-12-26 10:23:34,000 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:23:43,508 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:23:53,515 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:03,515 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:13,506 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:23,514 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:33,512 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:43,514 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:50,629 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:24:50,641 INFO: closed patroni connections to postgres
2023-12-26 10:24:53,513 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:24:53,516 INFO: failed to start postgres
2023-12-26 10:25:03,513 WARNING: Postgresql is not running.
2023-12-26 10:25:03,514 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:03,515 INFO: pg_controldata:
pg_control version number: 1300
Catalog version number: 202107181
Database system identifier: 7316468372200812230
Database cluster state: shut down in recovery
pg_control last modified: Tue Dec 26 10:24:50 2023
Latest checkpoint location: 0/90003C8
Latest checkpoint's REDO location: 0/9000390
Latest checkpoint's REDO WAL file: 000000010000000000000009
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:733
Latest checkpoint's NextOID: 13893
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 726
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 733
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Tue Dec 26 10:00:32 2023
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/9000478
Min recovery ending loc's timeline: 1
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: f455e96399d6e8a07ea03b663e2c7f203b2730c6e7a734722a254c2384f04b92
2023-12-26 10:25:03,516 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:03,520 INFO: Local timeline=1 lsn=0/9000478
2023-12-26 10:25:03,525 INFO: primary_timeline=1
2023-12-26 10:25:03,526 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:03,529 INFO: starting as a secondary
2023-12-26 10:25:13,519 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:13,522 INFO: failed to start postgres
2023-12-26 10:25:23,510 WARNING: Postgresql is not running.
2023-12-26 10:25:23,510 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:23,511 INFO: pg_controldata:
PostgreSQL log files
postgresql-2023-12-26_10.log on replica:
2023-12-26 10:23:24.085 CST [3604] LOG: database system is ready to accept read-only connections
2023-12-26 10:23:24.089 CST [3610] LOG: started streaming WAL from primary at 0/9000000 on timeline 1
2023-12-26 10:24:50.632 CST [3604] LOG: received fast shutdown request
2023-12-26 10:24:50.633 CST [3604] LOG: aborting any active transactions
2023-12-26 10:24:50.633 CST [3613] FATAL: terminating connection due to administrator command
2023-12-26 10:24:50.633 CST [3610] FATAL: terminating walreceiver process due to administrator command
2023-12-26 10:24:50.635 CST [3607] LOG: shutting down
2023-12-26 10:24:50.638 CST [3604] LOG: database system is shut down
Comment: there are no further PostgreSQL log entries after the database shutdown because Patroni had not actually sent the start command to PostgreSQL when the exception occurred.
Have you tried to use GitHub issue search?
- Yes
Anything else we need to know?
I did some initial debugging of the source code and found that in patroni/postgresql/__init__.py, load_current_server_parameters() was introduced in Patroni 3.2.0 for the case where Patroni is "joining" an already running Postgres. However, it does not actually populate the full set of server_parameters. That caused a KeyError inside self.config.effective_configuration when Patroni tried to start Postgres.
if self.state == 'running':  # we are "joining" already running postgres
    # we know that PostgreSQL is accepting connections and can read some GUC's from pg_settings
    self.config.load_current_server_parameters()
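To illustrate the suspected failure mode, here is a deliberately simplified sketch (not Patroni's actual code; the function bodies and the parameter name are stand-ins):

# Hypothetical sketch of the failure mode; the names below are stand-ins,
# not Patroni's real internals.

def load_current_server_parameters(pg_settings_rows):
    # Only the GUCs visible in pg_settings end up in server_parameters.
    return {name: setting for name, setting in pg_settings_rows}

def effective_configuration(server_parameters):
    # A consumer that indexes a key pg_settings did not supply raises
    # KeyError, so Patroni never reaches the point of issuing the start command.
    return server_parameters['some_expected_parameter']  # hypothetical key

params = load_current_server_parameters([('listen_addresses', '*')])
effective_configuration(params)  # raises KeyError: 'some_expected_parameter'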