zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

Demotion of primary hangs when logical replication connection is established

jasper-d opened this issue

What happened?

We are running a two-member Postgres cluster on Kubernetes using the operator and Patroni 3.0.1.
A client application has a logical replication connection open to the primary.
Patroni failed to reach the DCS and initiated demotion of the primary.
Because of the open logical replication connection, the demotion hung indefinitely until the replication connection was closed manually by killing the client.

How can we reproduce it (as minimally and precisely as possible)?

  1. Have a cluster with two or more members
  2. Establish a logical replication connection to the primary (see the sketch below)
  3. Trigger demotion of the primary by making the DCS unavailable
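
A minimal sketch of step 2, assuming psycopg2 is available; the host, database, user and slot names are illustrative, not taken from our setup. The consumer deliberately never sends feedback, mimicking a client that does not confirm received changes and therefore keeps the walsender waiting during shutdown:

# repro_logical_consumer.py -- illustrative only
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

conn = psycopg2.connect(
    "host=primary.example dbname=mydb user=postgres",
    connection_factory=LogicalReplicationConnection,
)
cur = conn.cursor()

# test_decoding ships with PostgreSQL and is sufficient for a repro
cur.create_replication_slot("repro_slot", output_plugin="test_decoding")
cur.start_replication(slot_name="repro_slot", decode=True)

def consume(msg):
    # Intentionally no msg.cursor.send_feedback(flush_lsn=msg.data_start):
    # the server is never told that the changes were flushed by the client.
    print(msg.payload)

cur.consume_stream(consume)  # blocks; trigger the demotion while this is running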

What did you expect to happen?

Demotion should succeed, and the logical replication connection should be closed forcibly.

Patroni/PostgreSQL/DCS version

  • ghcr.io/zalando/spilo-15:3.0-p1
  • Patroni version: 3.0.1
  • PostgreSQL version: 15
  • DCS (and its version): K8S etcd

Patroni configuration file

bootstrap:
  dcs:
    failsafe_mode: false
    loop_wait: 10
    maximum_lag_on_failover: 33554432
    postgresql:
      parameters:
        archive_mode: 'on'
        archive_timeout: 1800s
        autovacuum_analyze_scale_factor: 0.02
        autovacuum_max_workers: 5
        autovacuum_vacuum_scale_factor: 0.05
        checkpoint_completion_target: 0.9
        hot_standby: 'on'
        log_autovacuum_min_duration: 0
        log_checkpoints: 'on'
        log_connections: 'on'
        log_disconnections: 'on'
        log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
        log_lock_waits: 'on'
        log_min_duration_statement: 500
        log_statement: ddl
        log_temp_files: 0
        max_connections: 100
        max_replication_slots: 10
        max_wal_senders: 10
        tcp_keepalives_idle: 900
        tcp_keepalives_interval: 100
        track_functions: all
        wal_compression: 'on'
        wal_level: logical
        wal_log_hints: 'on'
      use_pg_rewind: true
      use_slots: true
    retry_timeout: 10
    ttl: 30
  initdb:
  - auth-host: md5
  - auth-local: trust
  post_init: /scripts/post_init.sh "zalandos"
  users:
    zalandos:
      options:
      - CREATEDB
      - NOLOGIN
      password: ''
kubernetes:
  bypass_api_service: true
  labels:
    application: spilo
  port: tcp://10.233.0.1:443
  port_443_tcp: tcp://10.233.0.1:443
  port_443_tcp_addr: 10.233.0.1
  port_443_tcp_port: '443'
  port_443_tcp_proto: tcp
  ports:
  - name: postgresql
    port: 5432
  role_label: spilo-role
  scope_label: cluster-name
  service_host: 10.233.0.1
  service_port: '443'
  service_port_https: '443'
  use_endpoints: true
namespace: REDACTED
postgresql:
  authentication:
    replication:
      password: REDACTED
      username: standby
    superuser:
      password: REDACTED
      username: postgres
  basebackup_fast_xlog:
    command: /scripts/basebackup.sh
    retries: 2
  bin_dir: /usr/lib/postgresql/15/bin
  callbacks:
    on_role_change: /scripts/on_role_change.sh zalandos true
  connect_address: 10.233.71.212:5432
  create_replica_method:
  - wal_e
  - basebackup_fast_xlog
  data_dir: /home/postgres/pgdata/pgroot/data
  listen: '*:5432'
  name: REDACTED
  parameters:
    archive_command: envdir "/run/etc/wal-e.d/env" wal-g wal-push "%p"
    bg_mon.history_buckets: 120
    bg_mon.listen_address: '::'
    extwlist.custom_path: /scripts
    extwlist.extensions: btree_gin,btree_gist,citext,extra_window_functions,first_last_agg,hll,hstore,hypopg,intarray,ltree,pgcrypto,pgq,pgq_node,pg_trgm,postgres_fdw,tablefunc,uuid-ossp,pg_partman
    log_destination: csvlog
    log_directory: ../pg_log
    log_file_mode: '0644'
    log_filename: postgresql-%u.log
    log_rotation_age: 1d
    log_truncate_on_rotation: 'on'
    logging_collector: 'on'
    pg_stat_statements.track_utility: 'off'
    shared_buffers: 409MB
    shared_preload_libraries: bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,pg_cron,pg_stat_kcache
    ssl: 'on'
    ssl_cert_file: /run/certs/server.crt
    ssl_key_file: /run/certs/server.key
  pg_hba:
  - local      all             all            trust
  - host       all             all            all                md5
  - host       all             +zalandos      all                pam
  - local      replication     standby        trust
  - hostssl    replication     standby        all                md5
  pgpass: /run/postgresql/pgpass
  recovery_conf:
    restore_command: envdir "/run/etc/wal-e.d/env" timeout "0" /scripts/restore_command.sh "%f" "%p"
  use_unix_socket: true
  use_unix_socket_repl: true
  wal_e:
    command: envdir /run/etc/wal-e.d/env bash /scripts/wale_restore.sh
    no_master: 1
    retries: 2
    threshold_backup_size_percentage: 30
    threshold_megabytes: 102400
restapi:
  connect_address: 10.233.71.212:8008
  listen: :8008
scope: REDACTED

patronictl show-config

failsafe_mode: false
loop_wait: 10
maximum_lag_on_failover: 33554432
pg_hba:
- local      all             all            trust
- host       all             all            all                md5
- host       all             +zalandos      all                pam
- local      replication     standby        trust
- hostssl    replication     standby        all                md5
postgresql:
  parameters:
    archive_mode: 'on'
    archive_timeout: 1800s
    autovacuum_analyze_scale_factor: 0.02
    autovacuum_max_workers: 5
    autovacuum_vacuum_scale_factor: 0.05
    checkpoint_completion_target: 0.9
    hot_standby: 'on'
    log_autovacuum_min_duration: 0
    log_checkpoints: 'on'
    log_connections: 'on'
    log_disconnections: 'on'
    log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
    log_lock_waits: 'on'
    log_min_duration_statement: 500
    log_statement: ddl
    log_temp_files: 0
    max_connections: 100
    max_replication_slots: 10
    max_wal_senders: 10
    tcp_keepalives_idle: 900
    tcp_keepalives_interval: 100
    track_functions: all
    wal_compression: 'on'
    wal_level: logical
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30

Patroni log files

2024-02-14 06:25:06,976 ERROR: Error communicating with DCS
2024-02-14 06:25:06,976 INFO: demoting self because DCS is not accessible and I was a leader
2024-02-14 06:25:06,976 INFO: Demoting self (offline)
2024-02-14 06:25:06,977 WARNING: Loop time exceeded, rescheduling immediately.
2024-02-14 06:25:06,978 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:25:07,083 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:25:17,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:25:17,487 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:25:22.673 UTC [40] LOG {ticks: 0, maint: 0, retry: 0}
2024-02-14 06:25:22.675 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:25:22.675 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is shutting down
2024-02-14 06:25:22.675 UTC [40] WARNING [postgres]: default timeout
2024-02-14 06:25:22.676 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:25:22.676 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is shutting down
2024-02-14 06:25:22.677 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:25:22.677 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is shutting down
2024-02-14 06:25:22.677 UTC [40] WARNING [ccc]: default timeout
2024-02-14 06:25:27,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:25:27,489 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:25:37,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:25:37,488 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:25:47,481 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:25:47,495 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:25:52.701 UTC [40] LOG {ticks: 0, maint: 0, retry: 0}
2024-02-14 06:25:57,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:25:57,487 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:26:07,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:26:07,487 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:26:17,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:26:22,481 ERROR: Request to server https://10.233.0.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=4.999785443767905)")
2024-02-14 06:26:22.701 UTC [40] LOG {ticks: 0, maint: 0, retry: 0}
2024-02-14 06:26:22.703 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:26:22.703 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is shutting down
2024-02-14 06:26:22.704 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:26:22.704 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is shutting down
2024-02-14 06:26:22.704 UTC [40] WARNING [postgres]: default timeout
2024-02-14 06:26:22.705 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:26:22.705 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is shutting down
2024-02-14 06:26:22.705 UTC [40] WARNING [ccc]: default timeout
2024-02-14 06:26:23,374 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:26:27,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:26:27,492 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:26:37,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:26:39,992 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:26:47,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:26:47,489 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:26:47,489 ERROR: ObjectCache.run ProtocolError("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
2024-02-14 06:26:52.729 UTC [40] LOG {ticks: 0, maint: 0, retry: 0}
2024-02-14 06:26:57,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:27:02,481 ERROR: Request to server https://10.233.0.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=4.99343760125339)")
2024-02-14 06:27:05,301 ERROR: Request to server https://10.233.0.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.233.0.1', port=443): Read timed out. (read timeout=2.1720608193427324)")
2024-02-14 06:27:05,752 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:27:07,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:27:07,488 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:27:17,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:27:17,489 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:27:22.733 UTC [40] LOG {ticks: 0, maint: 0, retry: 0}
2024-02-14 06:27:22.735 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:27:22.735 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is shutting down
2024-02-14 06:27:22.735 UTC [40] WARNING [postgres]: default timeout
2024-02-14 06:27:22.737 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:27:22.737 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is shutting down
2024-02-14 06:27:22.737 UTC [40] ERROR connection error: PQconnectPoll
2024-02-14 06:27:22.737 UTC [40] ERROR libpq: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  the database system is shutting down
2024-02-14 06:27:22.737 UTC [40] WARNING [ccc]: default timeout
2024-02-14 06:27:27,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:27:27,487 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:27:37,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:27:37,488 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:27:47,477 INFO: Lock owner: aa-bb-ccc-postgres-1; I am aa-bb-ccc-postgres-1
2024-02-14 06:27:47,488 INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader
2024-02-14 06:27:52.761 UTC [40] LOG {ticks: 0, maint: 0, retry: 0}

PostgreSQL log files

Please see the Patroni log files section above; the PostgreSQL messages are interleaved there.

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

We enabled failsafe_mode as a mitigation.
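
failsafe_mode is part of Patroni's dynamic configuration, so it can be turned on with patronictl edit-config or through the REST API of any member. A minimal sketch of the REST API route, assuming the API is reachable on localhost:8008 over plain HTTP without authentication (as in the restapi section above):

# enable_failsafe.py -- illustrative only; host, port and auth depend on the setup
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8008/config",             # Patroni REST API of any member
    data=json.dumps({"failsafe_mode": True}).encode(),
    headers={"Content-Type": "application/json"},
    method="PATCH",                              # PATCH merges into the dynamic config
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())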

Patroni currently only forces termination when synchronous mode is enabled (see the primary_stop_timeout parameter).
Regarding the original issue, it is really a problem of client libraries not implementing the replication protocol correctly. Patroni just waits for a clean shutdown, which is not possible while at least one walsender has not received confirmation that all changes were received by the client.
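
As an illustration only (a psycopg2 sketch; the connection details and slot name are made up and may not match your client at all): a consumer that implements the protocol correctly confirms the LSNs it has processed and answers keepalives, so the walsender can see that all changes were received and the clean shutdown can complete.

# well_behaved_consumer.py -- illustrative sketch, not the reporter's application
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

conn = psycopg2.connect(
    "host=primary.example dbname=mydb user=postgres",
    connection_factory=LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="repro_slot", decode=True)

def consume(msg):
    print(msg.payload)                         # application-specific processing
    # Confirm that everything up to this LSN has been received and flushed,
    # so the walsender is not left waiting for acknowledgement at shutdown.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

# keepalive_interval makes the library send status updates regularly,
# including replies to keepalive requests from the server.
cur.consume_stream(consume, keepalive_interval=10)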

JFYI, the problem of the endless shutdown should have been fixed in pgjdbc v42.7.0 ("Use KeepAlive to confirm LSNs"),
although I don't know what you are actually using in the client application.

And please make sure you actually check previously opened issues - this has been discussed many times.