Broken replication
Pluggi opened this issue
What happened?
Our Postgres cluster is deployed using CrunchyData's PGO operator which uses Patroni 2.1.7 under the hood.
We deploy it on an AKS Kubernetes cluster v1.26.3.
We have one primary and one replica in two different Availability Zones. We know Azure's network is not great.
We are using synchronous_commit: on
for PG, synchronous_mode: true
and synchronous_mode_strict: false
for Patroni.
At 10:30 am UTC, we saw a huge increase in query times, exceeding 10s for queries that usually take less than 1s.
Note: our monitoring graphs are in UTC+1, so the times shown there correspond to around 10:30 am UTC.
Commits were also taking a very long time to succeed. We think that the connection between the primary and the replica was lost, because we see at 10:36 am that a new replication connection is opened and all operations resumed after this.
There were absolutely no error logs in Patroni, just the usual "I am the leader" and "Following a leader".
We found out that our retry_timeout configuration does not follow the recently added warning that loop_wait + 2 * retry_timeout <= ttl must hold.
Could it be that Patroni did not realize the replication was broken because of this misconfiguration?
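For reference, our show-config below has loop_wait: 10, retry_timeout: 20 and ttl: 30, so loop_wait + 2 * retry_timeout = 10 + 2 * 20 = 50, which is well above the ttl of 30.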
Also, SELECT * FROM pg_replication_slots on either the primary or the replica returns nothing. However, SELECT * FROM pg_stat_replication works. We are running both queries as the postgres user.
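For reference, these are the queries we ran (hedged examples against stock PostgreSQL 14; note the view name is the plural pg_replication_slots):

-- on either node: replication slots (returned no rows for us)
SELECT slot_name, slot_type, active, restart_lsn FROM pg_replication_slots;

-- on the primary: streaming replication connections (this is the one that worked)
SELECT application_name, client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;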
How can we reproduce it (as minimally and precisely as possible)?
N/A
We don't have a reproduction setup yet.
What did you expect to happen?
We'd expect Patroni and/or Postgres to detect the replication failure and remove the replica from synchronous_standby_names.
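For completeness, this is what we would check on the primary to confirm the replica had been dropped from synchronous replication (hedged example queries):

SHOW synchronous_standby_names;
SELECT application_name, sync_state FROM pg_stat_replication;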
Patroni/PostgreSQL/DCS version
- Patroni version: 2.1.7
- PostgreSQL version: 14.7
- DCS (and its version): Kubernetes (AKS) 1.26.3
Patroni configuration file
N/A
patronictl show-config
failsafe_mode: true
loop_wait: 10
postgresql:
  parameters:
    archive_command: pgbackrest --stanza=db archive-push "%p"
    archive_mode: 'on'
    archive_timeout: 60s
    checkpoint_timeout: 600
    effective_cache_size: 87859MB
    effective_io_concurrency: 200
    fsync: 'on'
    full_page_writes: 'on'
    idle_in_transaction_session_timeout: 300000
    jit: 'off'
    lock_timeout: 90000
    log_connections: 'on'
    log_destination: stderr
    log_duration: 'off'
    log_filename: postgresql-%Y-%m-%d_%H%M.log
    log_line_prefix: '%m [%p] %r %a %u@%d '
    log_lock_waits: 'on'
    log_min_duration_statement: '500'
    log_rotation_age: '60'
    log_rotation_size: 256MB
    logging_collector: 'on'
    maintenance_work_mem: 256MB
    max_connections: 1000
    max_parallel_maintenance_workers: 4
    max_parallel_workers: 12
    max_parallel_workers_per_gather: 4
    max_wal_size: 4096
    max_worker_processes: 12
    password_encryption: scram-sha-256
    pgnodemx.kdapi_path: /etc/database-containerinfo
    random_page_cost: '1.1'
    restore_command: pgbackrest --stanza=db archive-get %f "%p"
    shared_buffers: 26624MB
    shared_preload_libraries: pg_stat_statements,pgnodemx,pgaudit,anon
    ssl: 'on'
    ssl_ca_file: /pgconf/tls/ca.crt
    ssl_cert_file: /pgconf/tls/tls.crt
    ssl_key_file: /pgconf/tls/tls.key
    synchronous_commit: 'on'
    synchronous_standby_names: '*'
    unix_socket_directories: /tmp/postgres
    wal_level: logical
    work_mem: 8MB
  pg_hba:
  - local all "postgres" peer
  - hostssl replication "_crunchyrepl" all cert
  - hostssl "postgres" "_crunchyrepl" all cert
  - host all "_crunchyrepl" all reject
  - host all "ccp_monitoring" "127.0.0.0/8" scram-sha-256
  - host all "ccp_monitoring" "::1/128" scram-sha-256
  - host all "ccp_monitoring" all reject
  - hostssl all "_crunchypgbouncer" all scram-sha-256
  - host all "_crunchypgbouncer" all reject
  - hostssl all vault 51.138.203.230/32 md5
  - hostssl all all 10.0.0.0/8 md5
  - local all "postgres" peer
  - local all all md5
  - hostssl replication "_crunchyrepl" all cert
  - hostssl "postgres" "_crunchyrepl" all cert
  - host all "_crunchyrepl" all reject
  - host all "ccp_monitoring" "127.0.0.0/8" md5
  - host all "ccp_monitoring" "::1/128" md5
  - hostssl all "_crunchypgbouncer" all scram-sha-256
  - host all "_crunchypgbouncer" all reject
  - host all all all reject
  use_pg_rewind: true
  use_slots: false
retry_timeout: 20
synchronous_mode: true
synchronous_mode_strict: false
ttl: 30
Patroni log files
Just a bunch of
2023-11-30 10:30:03,441 INFO: no action. I am (REDACTED), a secondary, and following a leader (REDACTED)
PostgreSQL log files
2023-11-30 10:36:05.440 UTC [309903] @ FATAL: terminating walreceiver due to timeout
Replication resumed:
2023-11-30 10:36:06.607 UTC [2070242] 10.250.2.125(54542) [unknown] _crunchyrepl@[unknown] LOG: connection authenticated: identity="CN=_crunchyrepl" method=cert (/pgdata/pg14/pg_hba.conf:4)
2023-11-30 10:36:06.607 UTC [2070242] 10.250.2.125(54542) [unknown] _crunchyrepl@[unknown] LOG: replication connection authorized: user=_crunchyrepl application_name=REDACTED SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256)
2023-11-30 10:36:06.618 UTC [2070242] 10.250.2.125(54542) <hostname redacted> _crunchyrepl@[unknown] LOG: standby "<hostname redacted>" is now a synchronous standby with priority 1
2023-11-30 10:36:06.618 UTC [2070242] 10.250.2.125(54542) <hostname redacted> _crunchyrepl@[unknown] STATEMENT: START_REPLICATION 954/32000000 TIMELINE 16
Commits took a long time:
2023-11-30 10:36:06.625 UTC [2069483] 10.250.9.207(56372) index.js v-root-medical--OgFerMsqdLkGziZS3GUj-1700764129@REDACTED LOG: duration: 160255.965 ms statement: COMMIT;
2023-11-30 10:36:06.624 UTC [2044979] 10.250.9.207(41158) index.js v-root-medical--OgFerMsqdLkGziZS3GUj-1700764129@REDACTED LOG: duration: 170740.256 ms statement: COMMIT;
2023-11-30 10:36:06.624 UTC [2068130] 10.250.9.207(42234) index.js v-root-medical--OgFerMsqdLkGziZS3GUj-1700764129@REDACTED LOG: duration: 162124.700 ms statement: COMMIT;
2023-11-30 10:36:06.624 UTC [2069488] 10.250.5.114(45622) index.js v-root-medical--OgFerMsqdLkGziZS3GUj-1700764129@REDACTED LOG: duration: 165413.920 ms statement: COMMIT;
NOTE: 10.250.2.125 is the replica's IP
Have you tried to use GitHub issue search?
- Yes
Anything else we need to know?
No response
Welp, I just realized that CrunchyData PGO sets use_slots: false, so that probably explains the empty SELECT * FROM pg_replication_slots:
https://github.com/CrunchyData/postgres-operator/blob/master/internal/patroni/config.go#L206
We found out that our retry_timeout configuration does not follow the recently added warning that loop_wait + 2 * retry_timeout <= ttl must hold.
This rule has existed for ages; we just never enforced it.
Could it be that Patroni did not realize the replication was broken because of this misconfiguration?
If you don't see anything suspicious in the Patroni logs and they look totally normal, then not following this rule didn't cause any issues.
And regarding the replication connection... The only thing Patroni can do is set primary_conninfo to point to the primary; the rest is up to Postgres. Postgres uses restore_command when the standby has just been restarted (or, for example, when the walreceiver got terminated/broken) and will not attempt to start streaming while restore_command is successfully fetching files from the archive.
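Just to illustrate (placeholder values, not taken from your cluster), the standby-side settings that Patroni manages look roughly like this:

# written by Patroni on the standby (illustrative placeholders)
primary_conninfo = 'host=10.250.x.x port=5432 user=_crunchyrepl application_name=REDACTED sslmode=verify-ca'
restore_command = 'pgbackrest --stanza=db archive-get %f "%p"'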
PostgreSQL log files
2023-11-30 10:36:05.440 UTC [309903] @ FATAL: terminating walreceiver due to timeout
Indeed, something was wrong with the replication connection, which explains the slow commits.
When there is a network involved, it takes some time (wal_sender_timeout / wal_receiver_timeout) until the primary and the replica realize that the walsender/walreceiver are broken and terminate them.
You have synchronous_mode_strict: false, which means that Patroni will disable sync replication if it sees that the walsender isn't alive (empty pg_stat_replication). But that will happen only after the Postgres primary has realized that fact itself. This time can be reduced by tuning the wal_sender_timeout / wal_receiver_timeout GUCs. IIRC, the default values are 60s.
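For example (just a sketch, pick values that fit your latency budget), something like this via patronictl edit-config:

postgresql:
  parameters:
    wal_sender_timeout: 30s    # default 60s, checked on the primary
    wal_receiver_timeout: 30s  # default 60s, checked on the standby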
Regarding reproduction... I think it should be fairly simple to do by sending SIGSTOP to the walreceiver process on the replica.
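Something along these lines (just a sketch; the exact process title may vary between platforms):

# on the replica: pause the walreceiver so it stops receiving/flushing WAL
kill -STOP "$(pgrep -f walreceiver)"
# observe pg_stat_replication and the Patroni logs on the primary, then resume it
kill -CONT "$(pgrep -f walreceiver)"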
Hi @CyberDem0n, I'm a co-worker of @Pluggi also working on this case.
You have synchronous_mode_strict: false, which means that Patroni will disable sync replication if it sees that the walsender isn't alive (empty pg_stat_replication). But that will happen only after the Postgres primary has realized that fact itself.
This is indeed the behavior we expected, but it does not seem to be what actually happened during our incident.
We tried to reproduce as suggested (by sending a SIGSTOP to the replica’s walreceiver process). Patroni detected it correctly and its logs tell us that replication was interrupted:
2023-12-01 10:41:03,993 INFO: Lock owner: REDACTED; I am REDACTED
2023-12-01 10:41:04,005 INFO: Updating synchronous privilege temporarily from ['REDACTED'] to []
2023-12-01 10:41:04,015 INFO: Assigning synchronous standby status to []
server signaled
We have no trace of this in the logs from the time of the incident, so this seems to confirm that the WAL stream between the replica and the primary was disrupted but that Patroni didn't realize it. Moreover, the incident lasted almost 3 minutes, which is longer than the WAL send/receive timeout we have configured (1min by default).
Moreover, the incident lasted almost 3 minutes, which is longer than the WAL send/receive timeout we have configured (1min by default).
Well, it might have started with the walreceiver being slow to flush the WAL stream to disk. That immediately impacts the primary, because with synchronous_commit=on the primary waits until the flush has happened on the standby before reporting the transaction as committed to the client. That is, the walreceiver could have been blocked on IO. It may also cause cancellation on timeout (but not immediately).
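If it helps, the walreceiver state can also be inspected from the standby side (a hedged example; column names as of PostgreSQL 13+):

-- run on the standby: walreceiver status and how far it has written/flushed WAL
SELECT status, written_lsn, flushed_lsn, last_msg_send_time, last_msg_receipt_time
FROM pg_stat_wal_receiver;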
Do you have IO metrics (IOPS / latency / throughput) from the standby?
Nothing interesting on these graphs...
What about the replication lag (from the pg_stat_replication view)? More specifically, the flush_lag values.
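I.e., something like this on the primary (a hedged example):

-- per-standby lag as measured by the primary
SELECT application_name, write_lag, flush_lag, replay_lag, sync_state
FROM pg_stat_replication;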
It seems that the blue line is the lag time, and it was indeed exceeding 3 minutes.