strange behavior of the restore_command
glushakov opened this issue · comments
What happened?
Hi!
I ran some tests and accidentally discovered some strange behavior when using the restore_command.
I did not find any mention of it in the documentation or on GitHub, other than this:
#1870
"All recovery settings that you put to the postgresql.recovery_conf are ignored, because they potentially could break replication."
So I tried setting the restore_command parameter via the DCS.
+ Cluster: server01+server02 (7301273363016975757) ----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------------------+----------------------+---------+-----------+----+-----------+
| server01 | server01 | Leader | running | 13 | |
| server02 | server02 | Replica | streaming | 13 | 0 |
+----------------------+----------------------+---------+-----------+----+-----------+
patronictl edit-config
---
+++
@@ -15,6 +15,7 @@
wal_keep_size: 2000
wal_level: replica
wal_log_hints: true
+ restore_command: '/usr/bin/cp /tmp/%f %p'
use_pg_rewind: true
use_slots: true
primary_start_timeout: 0
Apply these changes? [y/N]: y
Configuration changed
(patroni_venv) [09:30:17 postgres@server01:]:~$ psql -c "show restore_command"
restore_command
-----------------
(1 row)
(patroni_venv) [09:30:52 postgres@server02:]:~$ psql -c "show restore_command"
restore_command
------------------------
/usr/bin/cp /tmp/%f %p
(1 row)
As a result, the parameter was applied only on the replica.
The first strange thing is that Patroni did not mention this anywhere in the logs.
server01
Patroni logs
2024-01-25 21:28:30,868 INFO: no action. I am (server01), the leader with the lock
2024-01-25 21:28:30,870 INFO: No PostgreSQL configuration items changed, nothing to reload.
2024-01-25 21:28:40,869 INFO: no action. I am (server01), the leader with the lock
2024-01-25 21:28:50,833 INFO: No local configuration items changed.
2024-01-25 21:28:50,835 INFO: Reloading PostgreSQL configuration.
2024-01-25 21:28:51,889 INFO: no action. I am (server01), the leader with the lock
2024-01-25 21:29:00,868 INFO: no action. I am (server01), the leader with the lock
PG logs
2024-01-25 21:28:50.839 MSK 2052875 0 LOG: received SIGHUP, reloading configuration files
<no message about changed parameter>
server02
Patroni logs
2024-01-25 21:28:30,864 INFO: no action. I am (server02), a secondary, and following a leader (server01)
2024-01-25 21:28:30,866 INFO: No PostgreSQL configuration items changed, nothing to reload.
2024-01-25 21:28:41,423 INFO: no action. I am (server02), a secondary, and following a leader (server01)
2024-01-25 21:28:51,370 INFO: No local configuration items changed.
2024-01-25 21:28:51,373 INFO: Reloading PostgreSQL configuration.
2024-01-25 21:28:52,382 INFO: Lock owner: server01; I am server02
2024-01-25 21:28:52,398 INFO: Local timeline=13 lsn=5D/FE00E360
2024-01-25 21:28:52,444 INFO: primary_timeline=13
2024-01-25 21:28:52,492 INFO: no action. I am (server02), a secondary, and following a leader (server01)
2024-01-25 21:29:01,363 INFO: no action. I am (server02), a secondary, and following a leader (server01)
PG logs
2024-01-25 21:28:41.369 MSK 1685495 0 LOG: received SIGHUP, reloading configuration files
2024-01-25 21:28:41.370 MSK 1685495 0 LOG: parameter "restore_command" changed to "/usr/bin/cp /tmp/%f %p"
I assumed this was normal behavior: on the master node this parameter is not needed during normal operation, so Patroni ignored it (though it would be better if this were reported in the log).
Then I performed a switchover (from 01 to 02). After the roles changed, the parameter was in effect on both nodes, including the new master.
(patroni_venv) [09:39:50 postgres@server01:]:~$ patronictl switchover --force
<...>
2024-01-25 21:40:01.39245 Successfully switched over to "server02"
(patroni_venv) [09:40:31 postgres@server01:]:~$ psql -c "show restore_command"
restore_command
------------------------
/usr/bin/cp /tmp/%f %p
(1 row)
(patroni_venv) [09:41:14 postgres@server02:]:~$ psql -c "show restore_command"
restore_command
------------------------
/usr/bin/cp /tmp/%f %p
(1 row)
Next, I removed the parameter from the DCS, and everything worked the other way around: the parameter remained on the master (server02) but disappeared on the replica (server01).
(patroni_venv) [09:55:54 postgres@server01:]:~$ patronictl edit-config
---
+++
@@ -11,7 +11,6 @@
max_replication_slots: 10
max_wal_senders: 10
max_worker_processes: 8
- restore_command: /usr/bin/cp /tmp/%f %p
track_commit_timestamp: false
wal_keep_size: 2000
wal_level: replica
Apply these changes? [y/N]: y
Configuration changed
(patroni_venv) [09:57:07 postgres@server01:]:~$ psql -c "show restore_command"
restore_command
-----------------
(1 row)
(patroni_venv) [09:57:01 postgres@server02:]:~$ psql -c "show restore_command"
restore_command
------------------------
/usr/bin/cp /tmp/%f %p
(1 row)
How can we reproduce it (as minimally and precisely as possible)?
Set restore_command via patronictl edit-config, make sure that it is applied only on the replica, then perform a switchover.
What did you expect to happen?
- What is the correct behavior when setting this parameter? How does Patroni handle it?
- Setting or ignoring a parameter should be reported in the logs.
Patroni/PostgreSQL/DCS version
- Patroni version: 3.2.2
- PostgreSQL version: 15.5
- DCS (and its version): etcd 3.5.4
Patroni configuration file
scope: server01+server02
namespace: /service/
name: server01
log:
level: INFO
format: '%(asctime)s %(levelname)s: %(message)s'
dateformat: ''
max_queue_size: 1000
dir: /var/log/patroni
file_num: 4
file_size: 25000000
restapi:
listen: 0.0.0.0:8008
connect_address: server01:8008
authentication:
username: user
password: pass
etcd3:
hosts: etcd-server1:2379
protocol: http
username: user
password: pass
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
primary_start_timeout: 0
primary_stop_timeout: 30
synchronous_mode: true
synchronous_mode_strict: false
check_timeline: false
failsafe_mode: false
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
max_connections: 300
max_locks_per_transaction: 64
max_worker_processes: 8
max_prepared_transactions: 0
wal_level: replica
wal_log_hints: on
track_commit_timestamp: off
max_wal_senders: 10
max_replication_slots: 10
wal_keep_size: 2000
postgresql:
use_unix_socket: true
listen: 0.0.0.0:5432
connect_address: server01:5432
data_dir: /var/lib/pgsql/15/data
bin_dir: /usr/pgsql-15/bin/
config_dir: /var/lib/pgsql/15/data
pgpass: /var/lib/pgsql/.pgpass
authentication:
superuser:
username: postgres
password: pass
replication:
username: replicator
password: pass
rewind:
username: postgres
password: pass
create_replica_methods:
- basebackup
basebackup:
checkpoint: fast
parameters:
unix_socket_directories: /var/run/postgresql
pg_hba:
- local all postgres peer
- host all all 0.0.0.0/0 scram-sha-256
- host replication replicator 0.0.0.0/0 scram-sha-256
pg_ctl_timeout: 60
remove_data_directory_on_rewind_failure: false
remove_data_directory_on_diverged_timelines: false
watchdog:
mode: off
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
patronictl show-config
check_timeline: false
failsafe_mode: false
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
parameters:
archive_command: /usr/bin/cp %p /tmp/postgresql_archives/%f
max_connections: 300
max_locks_per_transaction: 64
max_prepared_transactions: 0
max_replication_slots: 10
max_wal_senders: 10
max_worker_processes: 8
track_commit_timestamp: false
wal_keep_size: 2000
wal_level: replica
wal_log_hints: true
use_pg_rewind: true
use_slots: true
primary_start_timeout: 0
primary_stop_timeout: 30
retry_timeout: 10
synchronous_mode: false
synchronous_mode_strict: false
ttl: 30
Patroni log files
relevant part already included
PostgreSQL log files
relevant part already included
Have you tried to use GitHub issue search?
- Yes
Anything else we need to know?
No response
The restore_command on the primary is a no-op, so when you add it to the config, Patroni doesn't even try to change it on the primary.
If the primary crashes, or after a switchover, postgresql.conf is always rewritten, and restore_command will be added there right on time.
Removing restore_command from the config on a promoted standby doesn't change anything either, so it is left as is.
I don't see any good reason to add additional code just to report something that works as expected.
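The behavior described in this reply can be summarized as a small toy model. This is only a sketch of what the report observed, not Patroni's actual code: the names `apply_config` and `replay_issue` and the role strings are hypothetical and were chosen for illustration.

```python
from typing import List, Optional, Tuple

CMD = "/usr/bin/cp /tmp/%f %p"


def apply_config(role: str, current: str, configured: Optional[str]) -> str:
    """Return the restore_command a node reports after a config update.

    Hypothetical model of the observed behavior, not Patroni's implementation.
    role       -- "primary" or "replica" when the update is processed
    current    -- value the node currently reports ("" if unset)
    configured -- value in the DCS config, or None if the key was removed
    """
    if role == "primary":
        # restore_command is a no-op on a running primary,
        # so the existing value is left untouched.
        return current
    # A replica follows the DCS config; a removed key means "unset".
    return configured or ""


def replay_issue() -> List[Tuple[str, str]]:
    """Replay the three steps from the report.

    Returns the (server01, server02) values after each step.
    """
    s1, s2 = "", ""
    history = []
    # 1. Add restore_command: server01 is primary, server02 is replica.
    s1, s2 = apply_config("primary", s1, CMD), apply_config("replica", s2, CMD)
    history.append((s1, s2))
    # 2. Switchover: server01 becomes a replica, server02 is promoted.
    s1, s2 = apply_config("replica", s1, CMD), apply_config("primary", s2, CMD)
    history.append((s1, s2))
    # 3. Remove restore_command from the DCS config.
    s1, s2 = apply_config("replica", s1, None), apply_config("primary", s2, None)
    history.append((s1, s2))
    return history
```

Calling `replay_issue()` gives the same three states as the `psql` outputs in the report: set only on the replica, set on both nodes after the switchover, and left only on the promoted node after removal.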