zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

Restarting replica failed if patroni joined running postgres

XiuhuaRuan opened this issue · comments

What happened?

If patroni joins an already running postgres as a replica, running patronictl restart against that replica always returns status code=503.

How can we reproduce it (as minimally and precisely as possible)?

  1. Set up postgres streaming replication without patroni.
postgres=# select * from pg_stat_replication;
 pid  | usesysid | usename  | application_name |  client_addr   | client_hostname | client_port |         backend_start         | backen
sn | flush_lsn | replay_lsn |    write_lag    |    flush_lag    |   replay_lag    | sync_priority | sync_state |          reply_time
------+----------+----------+------------------+----------------+-----------------+-------------+-------------------------------+-------
---+-----------+------------+-----------------+-----------------+-----------------+---------------+------------+------------------------
 2016 |       10 | postgres | walreceiver      | 192.168.61.106 |                 |       42156 | 2023-12-26 09:23:34.19185+08  |
60 | 0/9000060 | 0/9000060  | 00:00:00.036591 | 00:00:00.036897 | 00:00:00.036908 |             0 | async      | 2023-12-26 09:23:44.599
  2. Start patroni on the primary node and then on the replica node, and run patronictl list.
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml list
+ Cluster: postgres-cluster (7316468372200812230) ---+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader  | running   |  1 |           |
| postgres_02 | 192.168.61.106 | Replica | streaming |  1 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
  3. patronictl restart on the leader returned success.
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml restart postgres-cluster postgres_01
+ Cluster: postgres-cluster (7316468372200812230) ---+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader  | running   |  1 |           |
| postgres_02 | 192.168.61.106 | Replica | streaming |  1 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
When should the restart take place (e.g. 2023-12-26T10:55)  [now]:
Are you sure you want to restart members postgres_01? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2)  []:
Success: restart on member postgres_01
  4. But patronictl restart on the replica node failed.
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml restart postgres-cluster postgres_02
+ Cluster: postgres-cluster (7316468372200812230) ---+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader  | running   |  1 |           |
| postgres_02 | 192.168.61.106 | Replica | streaming |  1 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
When should the restart take place (e.g. 2023-12-26T11:24)  [now]:
Are you sure you want to restart members postgres_02? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2)  []:
Failed: restart for member postgres_02, status code=503, (postgres is still starting)
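
The same restart can presumably be triggered without patronictl by calling the member's REST API directly, which makes the 503 easier to inspect. Below is a minimal sketch using only the Python standard library. It assumes the replica's REST API listens on 192.168.61.106:8008 (the replica's own patroni.yml is not shown here) and that no REST API authentication is configured; the helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def build_restart_request(host, port=8008):
    """Build (but do not send) a POST /restart request for one member."""
    url = f"http://{host}:{port}/restart"
    # Empty JSON object: restart immediately, with no extra conditions.
    body = json.dumps({}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def restart_member(host, port=8008, timeout=30):
    """Send the restart request and return (status_code, response_text)."""
    request = build_restart_request(host, port)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # Non-2xx answers (such as a 503) are raised as HTTPError.
        return error.code, error.read().decode("utf-8")
```

Calling restart_member("192.168.61.106") against the replica should then return the status code and message that patronictl reports in step 4.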

What did you expect to happen?

Restarting the replica node should return success.

Patroni/PostgreSQL/DCS version

  • Patroni version: 3.2.0
  • PostgreSQL version: 14.0
  • DCS (and its version): etcd 3.5.9

Patroni configuration file

scope: postgres-cluster
namespace: /service/
name: postgres_01

restapi:
  listen: 192.168.61.105:8008
  connect_address: 192.168.61.105:8008

etcd:
  hosts: 192.168.61.105:2379,192.168.61.106:2379,192.168.61.107:2379

log:
  level: INFO
  traceback_level: INFO
  dir: /home/postgres/patroni
  file_num: 10
  file_size: 104857600

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 300
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      parameters:
        listen_addresses: "*"
        port: 5432
        wal_level: replica
        hot_standby: "on"
        wal_keep_size: 100
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
        archive_mode: "off"
        archive_timeout: 1800s
        logging_collector: on
        log_destination: 'stderr'
        log_truncate_on_rotation: on
        log_checkpoints: on
        log_connections: on
        log_disconnections: on
        log_error_verbosity: default
        log_lock_waits: on
        log_temp_files: 0
        log_autovacuum_min_duration: 0
        log_min_duration_statement: 50
        log_timezone: 'PRC'
        log_filename: postgresql-%Y-%m-%d_%H.log
        log_line_prefix: '%t [%p]: db=%d,user=%u,app=%a,client=%h '


postgresql:
  database: postgres
  listen: 0.0.0.0:5432
  connect_address: 192.168.61.105:5432
  bin_dir: /usr/local/pgsql/bin
  data_dir: /usr/local/pgsql/data
  pgpass: /home/postgres/tmp/.pgpass

  authentication:
    replication:
      username: postgres
      password: postgres
    superuser:
      username: postgres
      password: postgres
    rewind:
      username: postgres
      password: postgres

  pg_hba:
  - local   all             all                                     trust
  - host    all             all             0.0.0.0/0               trust
  - host    all             all             ::1/128                 trust
  - local   replication     all                                     trust
  - host    replication     all             0.0.0.0/0               trust
  - host    replication     all             ::1/128                 trust

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

patronictl show-config

[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml show-config
loop_wait: 10
master_start_timeout: 300
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    archive_mode: 'off'
    archive_timeout: 1800s
    hot_standby: 'on'
    listen_addresses: '*'
    log_autovacuum_min_duration: 0
    log_checkpoints: true
    log_connections: true
    log_destination: stderr
    log_disconnections: true
    log_error_verbosity: default
    log_filename: postgresql-%Y-%m-%d_%H.log
    log_line_prefix: '%t [%p]: db=%d,user=%u,app=%a,client=%h '
    log_lock_waits: true
    log_min_duration_statement: 50
    log_temp_files: 0
    log_timezone: PRC
    log_truncate_on_rotation: true
    logging_collector: true
    max_replication_slots: 10
    max_wal_senders: 10
    port: 5432
    wal_keep_size: 100
    wal_level: replica
    wal_log_hints: 'on'
  use_pg_rewind: true
retry_timeout: 10
synchronous_mode: false
ttl: 30

Patroni log files

patroni.log on replica:
2023-12-26 10:23:33,507 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:23:33,988 INFO: establishing a new patroni heartbeat connection to postgres
2023-12-26 10:23:34,000 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:23:43,508 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:23:53,515 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:03,515 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:13,506 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:23,514 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:33,512 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:43,514 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:50,629 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:24:50,641 INFO: closed patroni connections to postgres
2023-12-26 10:24:53,513 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:24:53,516 INFO: failed to start postgres
2023-12-26 10:25:03,513 WARNING: Postgresql is not running.
2023-12-26 10:25:03,514 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:03,515 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202107181
  Database system identifier: 7316468372200812230
  Database cluster state: shut down in recovery
  pg_control last modified: Tue Dec 26 10:24:50 2023
  Latest checkpoint location: 0/90003C8
  Latest checkpoint's REDO location: 0/9000390
  Latest checkpoint's REDO WAL file: 000000010000000000000009
  Latest checkpoint's TimeLineID: 1
  Latest checkpoint's PrevTimeLineID: 1
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:733
  Latest checkpoint's NextOID: 13893
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 726
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 733
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Tue Dec 26 10:00:32 2023
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/9000478
  Min recovery ending loc's timeline: 1
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: f455e96399d6e8a07ea03b663e2c7f203b2730c6e7a734722a254c2384f04b92

2023-12-26 10:25:03,516 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:03,520 INFO: Local timeline=1 lsn=0/9000478
2023-12-26 10:25:03,525 INFO: primary_timeline=1
2023-12-26 10:25:03,526 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:03,529 INFO: starting as a secondary
2023-12-26 10:25:13,519 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:13,522 INFO: failed to start postgres
2023-12-26 10:25:23,510 WARNING: Postgresql is not running.
2023-12-26 10:25:23,510 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:23,511 INFO: pg_controldata:

PostgreSQL log files

postgresql-2023-12-26_10.log on replica:
2023-12-26 10:23:24.085 CST [3604] LOG:  database system is ready to accept read-only connections
2023-12-26 10:23:24.089 CST [3610] LOG:  started streaming WAL from primary at 0/9000000 on timeline 1
2023-12-26 10:24:50.632 CST [3604] LOG:  received fast shutdown request
2023-12-26 10:24:50.633 CST [3604] LOG:  aborting any active transactions
2023-12-26 10:24:50.633 CST [3613] FATAL:  terminating connection due to administrator command
2023-12-26 10:24:50.633 CST [3610] FATAL:  terminating walreceiver process due to administrator command
2023-12-26 10:24:50.635 CST [3607] LOG:  shutting down
2023-12-26 10:24:50.638 CST [3604] LOG:  database system is shut down
Comments: There are no further postgresql log entries after the database shutdown because patroni had not actually sent the start command to postgresql when the exception occurred.

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

I did some initial debugging of the source code and found that in patroni/postgresql/__init__.py, a call to load_current_server_parameters() was introduced in patroni 3.2.0 for the case where we are "joining" an already running postgres. However, it does not actually load the full set of server parameters. That causes a KeyError when self.config.effective_configuration is evaluated while patroni is trying to start postgres.

        if self.state == 'running':  # we are "joining" already running postgres
            # we know that PostgreSQL is accepting connections and can read some GUC's from pg_settings
            self.config.load_current_server_parameters()
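
To illustrate the failure mode I suspect, here is a minimal, self-contained sketch. It is a simplified model, not Patroni's actual code: the function bodies, the list of required parameters and the loaded subset are all hypothetical and only show how a partially filled parameter dict can lead to a KeyError at start time.

```python
# Hypothetical, simplified model of the suspected failure; NOT Patroni's code.

# Illustrative set of parameters the start-up path expects to find.
REQUIRED_AT_START = ("max_connections", "max_wal_senders", "wal_level", "hot_standby")


def load_partial_parameters(pg_settings):
    """Mimic a loader that copies only a subset of GUCs from a running server."""
    wanted = ("max_connections", "wal_level")  # assumption: incomplete subset
    return {name: pg_settings[name] for name in wanted if name in pg_settings}


def effective_configuration(server_parameters):
    """Build the start-up configuration; raises KeyError if a parameter is missing."""
    return {name: server_parameters[name] for name in REQUIRED_AT_START}


def safe_effective_configuration(server_parameters, defaults):
    """Same lookup, but fall back to defaults for anything that was not loaded."""
    merged = {**defaults, **server_parameters}
    return {name: merged[name] for name in REQUIRED_AT_START}
```

In this model, passing the result of load_partial_parameters() to effective_configuration() raises KeyError for max_wal_senders, while the variant that merges in defaults succeeds. That variant is only one possible guard and is not meant as the proper fix.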