zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

Broken replication

Pluggi opened this issue

What happened?

Our Postgres cluster is deployed using CrunchyData's PGO operator, which uses Patroni 2.1.7 under the hood.
We deploy it on an AKS Kubernetes cluster v1.26.3.
We have one primary and one replica in two different Availability Zones. We know Azure's network is not great.
We are using synchronous_commit: on for PostgreSQL, and synchronous_mode: true with synchronous_mode_strict: false for Patroni.

At 10:30 am UTC, we saw a huge increase in query times, exceeding 10s for queries that usually take less than 1s.

Note: the screenshot's time axis is in UTC+1, so the spike shown corresponds to roughly 10:30 am UTC.
(screenshot: 2023-12-01-105622_1541x297_scrot)

Commits were also taking a very long time to succeed. We think the connection between the primary and the replica was lost, because at 10:36 am we see a new replication connection being opened, and all operations resumed after that.

There were absolutely no error logs in Patroni, just the usual "I am the leader" and "Following a leader".

We found out that our retry_timeout configuration does not satisfy the recently added warning that we must ensure loop_wait + 2 * retry_timeout <= ttl (see the numbers below). Could it be that Patroni did not realize the replication was broken because of this misconfiguration?
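For reference, with the values from the patronictl show-config output further down (loop_wait: 10, retry_timeout: 20, ttl: 30):

loop_wait + 2 * retry_timeout = 10 + 2 * 20 = 50 > 30 = ttl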

Also, SELECT * FROM pg_replication_slots on either the primary or the replica returns nothing.
However, SELECT * FROM pg_stat_replication works. We are running both queries as the postgres user.

How can we reproduce it (as minimally and precisely as possible)?

N/A

We don't even have a reproduction setup yet.

What did you expect to happen?

We'd expect Patroni and/or Postgres to detect the broken replication and remove the replica from synchronous_standby_names.

Patroni/PostgreSQL/DCS version

  • Patroni version: 2.1.7
  • PostgreSQL version: 14.7
  • DCS (and its version): Kubernetes (AKS) 1.26.3

Patroni configuration file

N/A

patronictl show-config

failsafe_mode: true
loop_wait: 10
postgresql:
  parameters:
    archive_command: pgbackrest --stanza=db archive-push "%p"
    archive_mode: 'on'
    archive_timeout: 60s
    checkpoint_timeout: 600
    effective_cache_size: 87859MB
    effective_io_concurrency: 200
    fsync: 'on'
    full_page_writes: 'on'
    idle_in_transaction_session_timeout: 300000
    jit: 'off'
    lock_timeout: 90000
    log_connections: 'on'
    log_destination: stderr
    log_duration: 'off'
    log_filename: postgresql-%Y-%m-%d_%H%M.log
    log_line_prefix: '%m [%p] %r %a %u@%d '
    log_lock_waits: 'on'
    log_min_duration_statement: '500'
    log_rotation_age: '60'
    log_rotation_size: 256MB
    logging_collector: 'on'
    maintenance_work_mem: 256MB
    max_connections: 1000
    max_parallel_maintenance_workers: 4
    max_parallel_workers: 12
    max_parallel_workers_per_gather: 4
    max_wal_size: 4096
    max_worker_processes: 12
    password_encryption: scram-sha-256
    pgnodemx.kdapi_path: /etc/database-containerinfo
    random_page_cost: '1.1'
    restore_command: pgbackrest --stanza=db archive-get %f "%p"
    shared_buffers: 26624MB
    shared_preload_libraries: pg_stat_statements,pgnodemx,pgaudit,anon
    ssl: 'on'
    ssl_ca_file: /pgconf/tls/ca.crt
    ssl_cert_file: /pgconf/tls/tls.crt
    ssl_key_file: /pgconf/tls/tls.key
    synchronous_commit: 'on'
    synchronous_standby_names: '*'
    unix_socket_directories: /tmp/postgres
    wal_level: logical
    work_mem: 8MB
  pg_hba:
  - local all "postgres" peer
  - hostssl replication "_crunchyrepl" all cert
  - hostssl "postgres" "_crunchyrepl" all cert
  - host all "_crunchyrepl" all reject
  - host all "ccp_monitoring" "127.0.0.0/8" scram-sha-256
  - host all "ccp_monitoring" "::1/128" scram-sha-256
  - host all "ccp_monitoring" all reject
  - hostssl all "_crunchypgbouncer" all scram-sha-256
  - host all "_crunchypgbouncer" all reject
  - hostssl all vault 51.138.203.230/32 md5
  - hostssl all all 10.0.0.0/8 md5
  - local all "postgres" peer
  - local all all md5
  - hostssl replication "_crunchyrepl" all cert
  - hostssl "postgres" "_crunchyrepl" all cert
  - host all "_crunchyrepl" all reject
  - host all "ccp_monitoring" "127.0.0.0/8" md5
  - host all "ccp_monitoring" "::1/128" md5
  - hostssl all "_crunchypgbouncer" all scram-sha-256
  - host all "_crunchypgbouncer" all reject
  - host all all all reject
  use_pg_rewind: true
  use_slots: false
retry_timeout: 20
synchronous_mode: true
synchronous_mode_strict: false
ttl: 30

Patroni log files

Just a bunch of

2023-11-30 10:30:03,441 INFO: no action. I am (REDACTED), a secondary, and following a leader (REDACTED)

PostgreSQL log files

2023-11-30 10:36:05.440 UTC [309903]   @ FATAL:  terminating walreceiver due to timeout

Replication resumed

2023-11-30 10:36:06.607 UTC [2070242] 10.250.2.125(54542) [unknown] _crunchyrepl@[unknown] LOG:  connection authenticated: identity="CN=_crunchyrepl" method=cert (/pgdata/pg14/pg_hba.conf:4)
2023-11-30 10:36:06.607 UTC [2070242] 10.250.2.125(54542) [unknown] _crunchyrepl@[unknown] LOG:  replication connection authorized: user=_crunchyrepl application_name=REDACTED SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256)
2023-11-30 10:36:06.618 UTC [2070242] 10.250.2.125(54542) <hostname redacted> _crunchyrepl@[unknown] LOG:  standby "<hostname redacted>" is now a synchronous standby with priority 1
2023-11-30 10:36:06.618 UTC [2070242] 10.250.2.125(54542) <hostname redacted> _crunchyrepl@[unknown] STATEMENT:  START_REPLICATION 954/32000000 TIMELINE 16


Commits took a long time:

2023-11-30 10:36:06.625 UTC [2069483] 10.250.9.207(56372) index.js v-root-medical--OgFerMsqdLkGziZS3GUj-1700764129@REDACTED LOG:  duration: 160255.965 ms  statement: COMMIT;
2023-11-30 10:36:06.624 UTC [2044979] 10.250.9.207(41158) index.js v-root-medical--OgFerMsqdLkGziZS3GUj-1700764129@REDACTED LOG:  duration: 170740.256 ms  statement: COMMIT;
2023-11-30 10:36:06.624 UTC [2068130] 10.250.9.207(42234) index.js v-root-medical--OgFerMsqdLkGziZS3GUj-1700764129@REDACTED LOG:  duration: 162124.700 ms  statement: COMMIT;
2023-11-30 10:36:06.624 UTC [2069488] 10.250.5.114(45622) index.js v-root-medical--OgFerMsqdLkGziZS3GUj-1700764129@REDACTED LOG:  duration: 165413.920 ms  statement: COMMIT;


NOTE:  10.250.2.125 is the replica's IP

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

No response

Welp, I just realized that CrunchyData PGO sets use_slots: false, so that probably explains the empty SELECT * FROM pg_replication_slots.

https://github.com/CrunchyData/postgres-operator/blob/master/internal/patroni/config.go#L206
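A quick way to double-check that (just a sketch; run the first query on the primary and the second on the replica — with use_slots: false both are expected to come back empty):

psql -c "SELECT slot_name, active FROM pg_replication_slots;"   # no rows when Patroni is not managing slots
psql -c "SHOW primary_slot_name;"                               # empty, so streaming runs without a slot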

We found out that our retry_timeout configuration does not satisfy the recently added warning that we must ensure loop_wait + 2 * retry_timeout <= ttl.

This rule has existed for ages; we just never enforced it.

Could it be that Patroni did not realize the replication was broken because of this misconfiguration?

If you don't see anything suspicious in the Patroni logs and they look totally normal, then not following this rule didn't cause any issues.
And regarding the replication connection... the only thing Patroni can do is set primary_conninfo to point to the primary; the rest is up to Postgres. It uses restore_command when the postgres on the standby has just been restarted (or, for example, when the walreceiver got terminated/broken) and will not attempt to start streaming while restore_command keeps successfully fetching files from the archive.
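If you want to see what the standby is actually doing, a query along these lines on the replica would show the walreceiver state and the conninfo Patroni has set (just a sketch; add your usual connection options to psql):

psql -c "SELECT status, sender_host, conninfo FROM pg_stat_wal_receiver;"
# zero rows while restore_command is still fetching WAL from the archive,
# a 'streaming' row once the walreceiver has connected to the primary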

PostgreSQL log files
2023-11-30 10:36:05.440 UTC [309903] @ FATAL: terminating walreceiver due to timeout

Indeed, something was wrong with the replication connection, which explains the slow commits.
When a network is involved, it takes some time (wal_sender_timeout/wal_receiver_timeout) until the primary and the replica realize that the walsender/walreceiver are broken and terminate them.

You have synchronous_mode_strict: false, which means Patroni will disable synchronous replication once it realizes that the walsender isn't alive (empty pg_stat_replication). But that happens only after the Postgres primary has realized that fact itself. This time can be reduced by tuning the wal_sender_timeout/wal_receiver_timeout GUCs; IIRC, the default values are 60s.
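As a rough sketch (the 30s values are only illustrative, and with PGO you would normally push such changes through the operator rather than calling patronictl directly), lowering the timeouts could look like:

patronictl edit-config -p 'wal_sender_timeout=30s' -p 'wal_receiver_timeout=30s'
# Patroni then rolls the changed GUCs out to both the primary and the replica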

Regarding reproduction... I think it should be fairly simple to do by sending SIGSTOP to the walreceiver process on the replica.
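Roughly like this on the replica (the pkill pattern depends on the process title, so treat it as a sketch):

pkill -STOP -f walreceiver    # freeze the walreceiver; the primary stops getting flush feedback
# wait for wal_sender_timeout to expire and watch Patroni empty synchronous_standby_names
pkill -CONT -f walreceiver    # let replication resume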

Hi @CyberDem0n, I'm a co-worker of @Pluggi also working on this case.

You have synchronous_mode_strict: false, which means Patroni will disable synchronous replication once it realizes that the walsender isn't alive (empty pg_stat_replication). But that happens only after the Postgres primary has realized that fact itself.

This is indeed the behavior we expected, but it doesn't seem to be what actually happened during our incident.

We tried to reproduce as suggested (by sending a SIGSTOP to the replica’s walreceiver process). Patroni detected it correctly and its logs tell us that replication was interrupted:

2023-12-01 10:41:03,993 INFO: Lock owner: REDACTED; I am REDACTED
2023-12-01 10:41:04,005 INFO: Updating synchronous privilege temporarily from ['REDACTED'] to []
2023-12-01 10:41:04,015 INFO: Assigning synchronous standby status to []
server signaled

We have no trace of this in the logs from the time of the incident, which seems to confirm that the WAL stream between the replica and the primary was disrupted but that Patroni didn't realize it. Moreover, the incident lasted almost 3 minutes, which is longer than the WAL send/receive timeouts we have configured (1 min by default).

Moreover, the incident lasted almost 3 minutes, which is longer than the WAL send/receive timeouts we have configured (1 min by default).

Well, it might have started with the walreceiver being slow to flush the WAL stream to disk. That immediately impacts the primary, because with synchronous_commit=on the primary waits until the flush has happened on the standby before acknowledging the transaction to the client. That is, the walreceiver could have been blocked on IO, which may also lead to a cancellation on timeout (but not immediately).
Do you have IO metrics (IOPS/latency/throughput) from the standby?
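If nothing is being scraped already, even a quick sample taken on the standby during a busy period would help (a sketch, assuming the sysstat tools are available on the node/container):

iostat -dxm 5
# watch %util and w_await for the disk backing the PGDATA volume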

We did have a spike in IO because of a pgbackrest differential backup that started at 10:00 and ended at 10:25.

Note: time is again in UTC+1.

Leader: (screenshot: 2023-12-04-093658_2502x618_scrot)

Replica: (screenshot: 2023-12-04-093735_2511x619_scrot)

Nothing interesting on these graphs...
What about the replication lag (from the pg_stat_replication view)? More specifically, the flush_lag values.
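Something like this on the primary would show them (just a sketch of the relevant columns):

psql -c "SELECT application_name, state, sync_state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"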

Unfortunately, all I have is this:

(screenshot: 2023-12-04-095935_1021x472_scrot)

The yellow line in the top graph is the leader.

I don't think we have any metrics with flush_lag, but I'll check.

It seems that the blue line is the lag time, and it was indeed exceeding 3 minutes.