Restarting a replica fails if Patroni joined an already running Postgres
What happened?
If Patroni joined a running PostgreSQL instance as a replica, executing patronictl restart on the replica always returned status code=503.
How can we reproduce it (as minimally and precisely as possible)?
- Set up PostgreSQL streaming replication without Patroni (a verification sketch follows the output below).
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 2016
usesysid         | 10
usename          | postgres
application_name | walreceiver
client_addr      | 192.168.61.106
client_hostname  |
client_port      | 42156
backend_start    | 2023-12-26 09:23:34.19185+08
...
flush_lsn        | 0/9000060
replay_lsn       | 0/9000060
write_lag        | 00:00:00.036591
flush_lag        | 00:00:00.036897
replay_lag       | 00:00:00.036908
sync_priority    | 0
sync_state       | async
reply_time       | 2023-12-26 09:23:44.599
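The streaming state can also be checked programmatically before Patroni enters the picture. A minimal sketch using psycopg2 (connection parameters are taken from this setup, and it assumes the primary accepts these connections):

import psycopg2

# Connect to the primary and confirm the standalone standby is streaming.
conn = psycopg2.connect(host='192.168.61.105', dbname='postgres', user='postgres')
with conn.cursor() as cur:
    cur.execute("SELECT application_name, state, sync_state FROM pg_stat_replication")
    print(cur.fetchall())  # expect something like [('walreceiver', 'streaming', 'async')]
conn.close()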
- Start Patroni on the primary and replica nodes in sequence, and check patronictl list.
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml list
+ Cluster: postgres-cluster (7316468372200812230) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader | running | 1 | |
| postgres_02 | 192.168.61.106 | Replica | streaming | 1 | 0 |
+-------------+----------------+---------+-----------+----+-----------+
- patronictl restart on the leader returned success.
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml restart postgres-cluster postgres_01
+ Cluster: postgres-cluster (7316468372200812230) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader | running | 1 | |
| postgres_02 | 192.168.61.106 | Replica | streaming | 1 | 0 |
+-------------+----------------+---------+-----------+----+-----------+
When should the restart take place (e.g. 2023-12-26T10:55) [now]:
Are you sure you want to restart members postgres_01? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Success: restart on member postgres_01
- But patronictl restart on the replica node failed.
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml restart postgres-cluster postgres_02
+ Cluster: postgres-cluster (7316468372200812230) ---+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader | running | 1 | |
| postgres_02 | 192.168.61.106 | Replica | streaming | 1 | 0 |
+-------------+----------------+---------+-----------+----+-----------+
When should the restart take place (e.g. 2023-12-26T11:24) [now]:
Are you sure you want to restart members postgres_02? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Failed: restart for member postgres_02, status code=503, (postgres is still starting)
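For what it's worth, the same 503 can be reproduced by calling the replica's REST API directly, which is what patronictl restart does under the hood. A minimal sketch using the requests library (an empty JSON body asks for an immediate, unconditional restart):

import requests

# POST /restart against the replica's REST API (patronictl uses this endpoint).
r = requests.post('http://192.168.61.106:8008/restart', json={})
print(r.status_code, r.text)  # per this report: 503, postgres is still starting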
What did you expect to happen?
Restarting the replica node should return success.
Patroni/PostgreSQL/DCS version
- Patroni version: 3.2.0
- PostgreSQL version: 14.0
- DCS (and its version): etcd 3.5.9
Patroni configuration file
scope: postgres-cluster
namespace: /service/
name: postgres_01

restapi:
  listen: 192.168.61.105:8008
  connect_address: 192.168.61.105:8008

etcd:
  hosts: 192.168.61.105:2379,192.168.61.106:2379,192.168.61.107:2379

log:
  level: INFO
  traceback_level: INFO
  dir: /home/postgres/patroni
  file_num: 10
  file_size: 104857600

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 300
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      parameters:
        listen_addresses: "*"
        port: 5432
        wal_level: replica
        hot_standby: "on"
        wal_keep_size: 100
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
        archive_mode: "off"
        archive_timeout: 1800s
        logging_collector: on
        log_destination: 'stderr'
        log_truncate_on_rotation: on
        log_checkpoints: on
        log_connections: on
        log_disconnections: on
        log_error_verbosity: default
        log_lock_waits: on
        log_temp_files: 0
        log_autovacuum_min_duration: 0
        log_min_duration_statement: 50
        log_timezone: 'PRC'
        log_filename: postgresql-%Y-%m-%d_%H.log
        log_line_prefix: '%t [%p]: db=%d,user=%u,app=%a,client=%h '

postgresql:
  database: postgres
  listen: 0.0.0.0:5432
  connect_address: 192.168.61.105:5432
  bin_dir: /usr/local/pgsql/bin
  data_dir: /usr/local/pgsql/data
  pgpass: /home/postgres/tmp/.pgpass
  authentication:
    replication:
      username: postgres
      password: postgres
    superuser:
      username: postgres
      password: postgres
    rewind:
      username: postgres
      password: postgres
  pg_hba:
    - local all all trust
    - host all all 0.0.0.0/0 trust
    - host all all ::1/128 trust
    - local replication all trust
    - host replication all 0.0.0.0/0 trust
    - host replication all ::1/128 trust

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
patronictl show-config
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml show-config
loop_wait: 10
master_start_timeout: 300
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    archive_mode: 'off'
    archive_timeout: 1800s
    hot_standby: 'on'
    listen_addresses: '*'
    log_autovacuum_min_duration: 0
    log_checkpoints: true
    log_connections: true
    log_destination: stderr
    log_disconnections: true
    log_error_verbosity: default
    log_filename: postgresql-%Y-%m-%d_%H.log
    log_line_prefix: '%t [%p]: db=%d,user=%u,app=%a,client=%h '
    log_lock_waits: true
    log_min_duration_statement: 50
    log_temp_files: 0
    log_timezone: PRC
    log_truncate_on_rotation: true
    logging_collector: true
    max_replication_slots: 10
    max_wal_senders: 10
    port: 5432
    wal_keep_size: 100
    wal_level: replica
    wal_log_hints: 'on'
  use_pg_rewind: true
retry_timeout: 10
synchronous_mode: false
ttl: 30
Patroni log files
patroni.log on replica:
2023-12-26 10:23:33,507 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:23:33,988 INFO: establishing a new patroni heartbeat connection to postgres
2023-12-26 10:23:34,000 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:23:43,508 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:23:53,515 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:03,515 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:13,506 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:23,514 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:33,512 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:43,514 INFO: no action. I am (postgres_02), a secondary, and following a leader (postgres_01)
2023-12-26 10:24:50,629 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:24:50,641 INFO: closed patroni connections to postgres
2023-12-26 10:24:53,513 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:24:53,516 INFO: failed to start postgres
2023-12-26 10:25:03,513 WARNING: Postgresql is not running.
2023-12-26 10:25:03,514 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:03,515 INFO: pg_controldata:
pg_control version number: 1300
Catalog version number: 202107181
Database system identifier: 7316468372200812230
Database cluster state: shut down in recovery
pg_control last modified: Tue Dec 26 10:24:50 2023
Latest checkpoint location: 0/90003C8
Latest checkpoint's REDO location: 0/9000390
Latest checkpoint's REDO WAL file: 000000010000000000000009
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:733
Latest checkpoint's NextOID: 13893
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 726
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 733
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Tue Dec 26 10:00:32 2023
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/9000478
Min recovery ending loc's timeline: 1
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: f455e96399d6e8a07ea03b663e2c7f203b2730c6e7a734722a254c2384f04b92
2023-12-26 10:25:03,516 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:03,520 INFO: Local timeline=1 lsn=0/9000478
2023-12-26 10:25:03,525 INFO: primary_timeline=1
2023-12-26 10:25:03,526 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:03,529 INFO: starting as a secondary
2023-12-26 10:25:13,519 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:13,522 INFO: failed to start postgres
2023-12-26 10:25:23,510 WARNING: Postgresql is not running.
2023-12-26 10:25:23,510 INFO: Lock owner: postgres_01; I am postgres_02
2023-12-26 10:25:23,511 INFO: pg_controldata:
PostgreSQL log files
postgresql-2023-12-26_10.log on replica:
2023-12-26 10:23:24.085 CST [3604] LOG: database system is ready to accept read-only connections
2023-12-26 10:23:24.089 CST [3610] LOG: started streaming WAL from primary at 0/9000000 on timeline 1
2023-12-26 10:24:50.632 CST [3604] LOG: received fast shutdown request
2023-12-26 10:24:50.633 CST [3604] LOG: aborting any active transactions
2023-12-26 10:24:50.633 CST [3613] FATAL: terminating connection due to administrator command
2023-12-26 10:24:50.633 CST [3610] FATAL: terminating walreceiver process due to administrator command
2023-12-26 10:24:50.635 CST [3607] LOG: shutting down
2023-12-26 10:24:50.638 CST [3604] LOG: database system is shut down
Comment: there are no further PostgreSQL log entries after the database shutdown because Patroni had not actually sent the start command to PostgreSQL when the exception occurred.
Have you tried to use GitHub issue search?
- Yes
Anything else we need to know?
I did some initial debugging of the source code and found that in patroni/postgresql/__init__.py, load_current_server_parameters() was introduced in Patroni 3.2.0 for the case where Patroni is "joining" an already running Postgres. However, it does not actually populate the full set of server_parameters. That caused a KeyError inside self.config.effective_configuration when Patroni tried to start Postgres.
if self.state == 'running':  # we are "joining" already running postgres
    # we know that PostgreSQL is accepting connections and can read some GUC's from pg_settings
    self.config.load_current_server_parameters()
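To illustrate the suspected failure mode, here is a deliberately simplified sketch (not Patroni's actual code; the function bodies and the parameter name are stand-ins):

# Hypothetical sketch of the failure mode; the names below are stand-ins,
# not Patroni's real internals.

def load_current_server_parameters(pg_settings_rows):
    # Only the GUCs visible in pg_settings end up in server_parameters.
    return {name: setting for name, setting in pg_settings_rows}

def effective_configuration(server_parameters):
    # A consumer that indexes a key pg_settings did not supply raises
    # KeyError, so Patroni never reaches the point of issuing the start command.
    return server_parameters['some_expected_parameter']  # hypothetical key

params = load_current_server_parameters([('listen_addresses', '*')])
effective_configuration(params)  # raises KeyError: 'some_expected_parameter'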