zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

loop after experiencing DCS (clickhouse-keeper) connection issues

Girgitt opened this issue

What happened?

Patroni did not properly renew its connection to the DCS and stayed in a loop after the DCS node serving the Patroni node became detached (network partition) from the rest of the ZK cluster and temporarily stopped serving clients. The affected node was not present in the "patronictl topology" output. Two clusters sharing the same VM and the same DCS were affected. Restarting Patroni fixed the issue on one cluster node's Patroni instance, but the other instance had to be re-initialized because after the restart its pg_hba.conf file was empty (probably due to some unrelated issue; fixed by re-initializing the Patroni node via "patronictl reinit").

Despite the ZK cluster being brought back to a normal condition, Patroni kept logging the entries attached to this report non-stop (every couple of milliseconds).

How can we reproduce it (as minimally and precisely as possible)?

Configure a ZK cluster using clickhouse-keeper. Put one ZK node on a local network and the other nodes possibly further away (cloud deployment etc.) so the local Patroni node will connect to the local ZK node. The other Patroni nodes can be remote as well (e.g. sharing hosts with the ZK nodes).

Configure a Patroni cluster using ZK as the DCS.

Drop the network between the local nodes (Patroni, ZK) and the rest of the ZK nodes, effectively creating a network partition between the local nodes and the other ZK and Patroni nodes.

When Patroni starts reporting errors related to the DCS connection, re-establish the connection between the local and remote nodes.

What did you expect to happen?

Patroni successfully renews its connection to the DCS - not necessarily to the same ZK node as before.

Patroni/PostgreSQL/DCS version

  • Patroni version: 3.1.2
  • PostgreSQL version: 15
  • DCS (and its version): ClickHouse Keeper version: v23.8.4.69-stable-d4d1e7b9dedd6666f1c621fee6204b798ca185f1

Patroni configuration file

(patroni-venv) [root@db-1 ~]# cat /opt/patroni_pgdo1.yml 
scope: patroni_pgdo1
namespace: /patroni_pgdo1/
name: pg_1

restapi:
  listen: 10.11.1.111:8008
  connect_address: 10.11.1.111:8008
#  certfile: /etc/ssl/certs/ssl-cert-snakeoil.pem
#  keyfile: /etc/ssl/private/ssl-cert-snakeoil.key
#  authentication:
#    username: username
#    password: password

# ctl:
#   insecure: false # Allow connections to SSL sites without certs
#   certfile: /etc/ssl/certs/ssl-cert-snakeoil.pem
#   cacert: /etc/ssl/certs/ssl-cacert-snakeoil.pem

#raft:
#  data_dir: .
#  self_addr: 127.0.0.1:2222
#  partner_addrs:
#  - 127.0.0.1:2223
#  - 127.0.0.1:2224

zookeeper:
  hosts: [ 'zk-1.tac:2181', 'zk-2.tac:2181', 'zk-3.tac:2181' ]

bootstrap:
  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  # and all other cluster members will use it as a global configuration
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
#    master_start_timeout: 300
#    synchronous_mode: false
    #standby_cluster:
      #host: 127.0.0.1
      #port: 1111
      #primary_slot_name: patroni
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
         wal_level: hot_standby
#        hot_standby: "on"
#        max_connections: 100
#        max_worker_processes: 8
#        wal_keep_segments: 8
#        max_wal_senders: 10
#        max_replication_slots: 10
#        max_prepared_transactions: 0
#        max_locks_per_transaction: 64
         wal_log_hints: "on"
#        track_commit_timestamp: "off"
#        archive_mode: "on"
#        archive_timeout: 1800s
#        archive_command: mkdir -p ../wal_archive && test ! -f ../wal_archive/%f && cp %p ../wal_archive/%f
#      recovery_conf:
#        restore_command: cp ../wal_archive/%f %p

  # some desired options for 'initdb'
  initdb:  # Note: It needs to be a list (some options need values, others are switches)
  - encoding: UTF8
  - data-checksums

  pg_hba:  # Add following lines to pg_hba.conf after running 'initdb'
  # For kerberos gss based connectivity (discard @.*$)
  #- host replication replicator 127.0.0.1/32 gss include_realm=0
  #- host all all 0.0.0.0/0 gss include_realm=0
  - host replication patroni_replicator 10.11.1.0/24 md5
  - host all all 0.0.0.0/0 md5
  - host all all 10.11.1.0/24 trust
  - host    replication     patroni_replicator    10.11.1.0/24   md5

#  - hostssl all all 0.0.0.0/0 md5

  # Additional script to be launched after initial cluster creation (will be passed the connection URL as parameter)
# post_init: /usr/local/bin/setup_cluster.sh

  # Some additional users which need to be created after initializing new cluster
  users:
    admin:
      password: admin%
      options:
        - createrole
        - createdb

postgresql:
  use_pg_rewind: true
  listen: 10.11.1.111:5432
  connect_address: 10.11.1.111:5432
  data_dir: /mnt/pg_vol_1/pgdo1
  # for some reason the PGSQL installation does not create a global symlink to the pg_rewind binary; forcing bin_dir should fix the problem
  bin_dir: /usr/pgsql-15/bin/
#  config_dir:
  pgpass: /tmp/pgpass_pgdo1
  authentication:
    superuser:
      username: patroni_superuser
      password: "!secret"
    replication:
      username: patroni_replicator
      password: "!secret"
    rewind:  # Has no effect on postgres 10 and lower
      username: patroni_rewind
      password: "!secret"
  # Server side kerberos spn
#  krbsrvname: postgres
  parameters:
    # Fully qualified kerberos ticket file for the running user
    # same as KRB5CCNAME used by the GSS
#   krb_server_keyfile: /var/spool/keytabs/postgres
    unix_socket_directories: '.'  # data_dir
  # Additional fencing script executed after acquiring the leader lock but before promoting the replica
  #pre_promote: /path/to/pre_promote.sh

#watchdog:
#  mode: automatic # Allowed values: off, automatic, required
#  device: /dev/watchdog
#  safety_margin: 5

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

patronictl show-config

(patroni-venv) [root@db-1 ~]# patronictl -c /opt/patroni_pgdo1.yml show-config
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    wal_level: hot_standby
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30

Patroni log files

2023-11-22 15:24:03,371 ERROR: Error communicating with DCS
2023-11-22 15:24:03,372 INFO: DCS is not accessible
2023-11-22 15:24:03,373 ERROR: get_cluster
Traceback (most recent call last):
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 310, in _load_cluster
    cluster = self._client.retry(loader, path)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 347, in _retry
    return self._retry.copy()(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/retry.py", line 126, in __call__
    return func(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 248, in _cluster_loader
    nodes = set(self.get_children(path, self.cluster_watcher))
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 233, in get_children
    return self._client.get_children(key, watch)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1218, in get_children
    return self.get_children_async(path, watch=watch,
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
    raise self._exception
kazoo.exceptions.ConnectionClosedError: Connection has been closed
2023-11-22 15:24:03,373 ERROR: Error communicating with DCS
2023-11-22 15:24:03,374 INFO: DCS is not accessible
2023-11-22 15:24:03,375 ERROR: get_cluster
Traceback (most recent call last):
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 310, in _load_cluster
    cluster = self._client.retry(loader, path)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 347, in _retry
    return self._retry.copy()(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/retry.py", line 126, in __call__
    return func(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 248, in _cluster_loader
    nodes = set(self.get_children(path, self.cluster_watcher))
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 233, in get_children
    return self._client.get_children(key, watch)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1218, in get_children
    return self.get_children_async(path, watch=watch,
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
    raise self._exception
kazoo.exceptions.ConnectionClosedError: Connection has been closed
2023-11-22 15:24:03,375 ERROR: Error communicating with DCS

PostgreSQL log files

2023-11-22 15:45:16.500 CET [730753] LOG:  waiting for WAL to become available at 0/C002000
2023-11-22 15:45:21.501 CET [730753] LOG:  waiting for WAL to become available at 0/C002000
2023-11-22 15:45:26.505 CET [730753] LOG:  waiting for WAL to become available at 0/C002000
2023-11-22 15:45:31.506 CET [730753] LOG:  waiting for WAL to become available at 0/C002000
2023-11-22 15:45:36.510 CET [730753] LOG:  waiting for WAL to become available at 0/C002000
2023-11-22 15:45:41.511 CET [730753] LOG:  waiting for WAL to become available at 0/C002000
2023-11-22 15:45:46.515 CET [730753] LOG:  waiting for WAL to become available at 0/C002000
2023-11-22 15:45:51.516 CET [730753] LOG:  waiting for WAL to become available at 0/C002000
2023-11-22 15:45:56.520 CET [730753] LOG:  waiting for WAL to become available at 0/C002000
2023-11-22 15:46:01.521 CET [730753] LOG:  waiting for WAL to become available at 0/C002000
2023-11-22 15:46:06.525 CET [730753] LOG:  waiting for WAL to become available at 0/C002000
2023-11-22 15:46:11.526 CET [730753] LOG:  waiting for WAL to become available at 0/C002000

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

Patroni runs from source in a virtual env via supervisord, using the following config:

[program:patroni_pgdo1]
command=/opt/patroni-venv/bin/python -m patroni /opt/patroni_pgdo1.yml
user=postgres
stopsignal=INT
directory=/opt/
environment=PATH="/usr/pgsql-15/bin:/opt/:%(ENV_PATH)s"
autostart=true                ; start at supervisord start (default: true)
autorestart=true              ; restart at unexpected quit (default: true)
startsecs=10                  ; number of secs prog must stay running (def. 1)
startretries=3                ; max # of serial start failures (default 3)
stdout_logfile=/var/log/patroni_pgdo1.log        ; stdout log path, NONE for none; default AUTO
stderr_logfile=/var/log/patroni_pgdo1.log        ; stderr log path, NONE for none; default AUTO

virtual env details:

(patroni-venv) [root@db-1 ~]# pip freeze
click==8.1.7
kazoo==2.9.0
patroni==3.1.2
prettytable==3.9.0
psutil==5.9.6
psycopg==3.1.12
psycopg-binary==3.1.12
python-dateutil==2.8.2
PyYAML==6.0.1
six==1.16.0
typing_extensions==4.8.0
urllib3==2.0.7
wcwidth==0.2.8
ydiff==1.2

Maybe the issue is with kazoo: python-zk/kazoo#428

Patroni resumed normal operation after a restart.

After recovery, a different node (the former leader, to which Patroni had failed over) got into the same DCS-related loop after a switchover:

The original reproduction scenario is therefore either incorrect, or the issue manifests itself in different ways.

(the log below was copied from /var/log/ to avoid rollover, as the DCS-related exception is logged very frequently)

[root@db-2 ~]# sed -n -e 245241,245341p /tmp/patroni/patroni_pgvm1.log.2 
2023-11-22 16:22:41,623 INFO: received switchover request with leader=pg_2 candidate=pg_1 scheduled_at=None
2023-11-22 16:22:41,679 INFO: Got response from pg_1 http://10.11.1.91:8008/patroni: {"state": "running", "postmaster_start_time": "2023-11-22 16:22:07.755635+01:00", "role": "replica", "server_version": 150004, "xlog": {"received_location": 318768408, "replayed_location": 318768408, "replayed_timestamp": null, "paused": false}, "timeline": 22, "replication_state": "streaming", "dcs_last_seen": 1700666559, "database_system_identifier": "7293861610560159297", "patroni": {"version": "3.1.2", "scope": "patroni_pgvm1"}}
2023-11-22 16:22:42,104 INFO: Got response from pg_1 http://10.11.1.91:8008/patroni: {"state": "running", "postmaster_start_time": "2023-11-22 16:22:07.755635+01:00", "role": "replica", "server_version": 150004, "xlog": {"received_location": 318768408, "replayed_location": 318768408, "replayed_timestamp": null, "paused": false}, "timeline": 22, "replication_state": "streaming", "dcs_last_seen": 1700666559, "database_system_identifier": "7293861610560159297", "patroni": {"version": "3.1.2", "scope": "patroni_pgvm1"}}
2023-11-22 16:22:42,049 INFO: Lock owner: pg_2; I am pg_2
2023-11-22 16:22:42,162 INFO: manual failover: demoting myself
2023-11-22 16:22:42,453 INFO: Lock owner: pg_2; I am pg_2
2023-11-22 16:22:42,453 INFO: updated leader lock during manual failover: demote
2023-11-22 16:22:42,454 INFO: Demoting self (graceful)
2023-11-22 16:22:43,663 WARNING: Connection dropped: socket connection broken
2023-11-22 16:22:43,663 WARNING: Transition to CONNECTING
2023-11-22 16:22:43,663 INFO: Zookeeper connection lost
2023-11-22 16:22:43,664 INFO: Zookeeper session closed, state: CLOSED
2023-11-22 16:22:43,668 INFO: Connecting to zk-1.tac(10.11.1.11):2181, use_ssl: False
2023-11-22 16:22:43,800 INFO: Zookeeper connection established, state: CONNECTED
2023-11-22 16:22:43,888 ERROR: Unhandled exception in connection loop
Traceback (most recent call last):
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 602, in _connect_attempt
    response = self._read_socket(read_timeout)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 453, in _read_socket
    return self._read_response(header, buffer, offset)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 383, in _read_response
    raise exc
RuntimeError: ('xids do not match, expected %r received %r', 1, 2)
2023-11-22 16:22:43,890 ERROR: touch_member
Traceback (most recent call last):
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 425, in touch_member
    self._client.create_async(self.member_path, encoded_data, makepath=True, ephemeral=True).get(timeout=1)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 86, in get
    raise self._exception
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 292, in captured_function
    return function(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 313, in captured_function
    value = function(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1022, in create_completion
    return self.unchroot(result.get())
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
    raise self._exception
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 602, in _connect_attempt
    response = self._read_socket(read_timeout)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 453, in _read_socket
    return self._read_response(header, buffer, offset)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 383, in _read_response
    raise exc
RuntimeError: ('xids do not match, expected %r received %r', 1, 2)
2023-11-22 16:22:43,897 INFO: Leader key released
Exception in thread Thread-1194052:
2023-11-22 16:22:43,898 INFO: Zookeeper session closed, state: CLOSED
Traceback (most recent call last):
  File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 512, in zk_loop
    if retry(self._connect_loop, retry) is STOP_CONNECTING:
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/retry.py", line 126, in __call__
    return func(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 552, in _connect_loop
    status = self._connect_attempt(host, hostip, port, retry)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 602, in _connect_attempt
    response = self._read_socket(read_timeout)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 453, in _read_socket
    return self._read_response(header, buffer, offset)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 383, in _read_response
    raise exc
RuntimeError: ('xids do not match, expected %r received %r', 1, 2)
2023-11-22 16:22:44,013 ERROR: get_cluster
Traceback (most recent call last):
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 310, in _load_cluster
    cluster = self._client.retry(loader, path)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 347, in _retry
    return self._retry.copy()(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/retry.py", line 126, in __call__
    return func(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 248, in _cluster_loader
    nodes = set(self.get_children(path, self.cluster_watcher))
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 233, in get_children
    return self._client.get_children(key, watch)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1218, in get_children
    return self.get_children_async(path, watch=watch,
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
    raise self._exception
kazoo.exceptions.ConnectionClosedError: Connection has been closed
2023-11-22 16:22:44,014 ERROR: Error communicating with DCS
2023-11-22 16:22:44,015 INFO: DCS is not accessible
2023-11-22 16:22:44,015 ERROR: get_cluster
Traceback (most recent call last):
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 310, in _load_cluster
    cluster = self._client.retry(loader, path)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 347, in _retry
    return self._retry.copy()(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/retry.py", line 126, in __call__
    return func(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 248, in _cluster_loader
    nodes = set(self.get_children(path, self.cluster_watcher))
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 233, in get_children
    return self._client.get_children(key, watch)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1218, in get_children
    return self.get_children_async(path, watch=watch,
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
    raise self._exception
kazoo.exceptions.ConnectionClosedError: Connection has been closed

Again, restarting Patroni fixed the DCS connection issue:

kazoo.exceptions.ConnectionClosedError: Connection has been closed
2023-11-22 16:37:24,838 ERROR: Error communicating with DCS
2023-11-22 16:37:24,842 INFO: DCS is not accessible
2023-11-22 16:37:24,843 ERROR: get_cluster
Traceback (most recent call last):
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 310, in _load_cluster
    cluster = self._client.retry(loader, path)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 347, in _retry
    return self._retry.copy()(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/retry.py", line 126, in __call__
    return func(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 248, in _cluster_loader
    nodes = set(self.get_children(path, self.cluster_watcher))
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 233, in get_children
    return self._client.get_children(key, watch)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1218, in get_children
    return self.get_children_async(path, watch=watch,
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
    raise self._exception
kazoo.exceptions.ConnectionClosedError: Connection has been closed
2023-11-22 16:37:24,843 ERROR: Error communicating with DCS
2023-11-22 16:37:24,923 ERROR: touch_member
Traceback (most recent call last):
  File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 425, in touch_member
    self._client.create_async(self.member_path, encoded_data, makepath=True, ephemeral=True).get(timeout=1)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
    raise self._exception
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 292, in captured_function
    return function(*args, **kwargs)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1004, in do_create
    result = self._create_async_inner(
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1052, in _create_async_inner
    raise async_result.exception
kazoo.exceptions.ConnectionClosedError: Connection has been closed
2023-11-22 16:37:29,428 INFO: Connecting to zk-1.tac(10.11.1.11):2181, use_ssl: False
2023-11-22 16:37:29,560 INFO: Zookeeper connection established, state: CONNECTED
2023-11-22 16:37:29,814 INFO: No PostgreSQL configuration items changed, nothing to reload.
2023-11-22 16:37:29,936 WARNING: Postgresql is not running.
2023-11-22 16:37:29,937 INFO: Lock owner: pg_1; I am pg_2
2023-11-22 16:37:29,941 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202209061
  Database system identifier: 7293861610560159297
  Database cluster state: shut down in recovery
  pg_control last modified: Wed Nov 22 16:37:24 2023
  Latest checkpoint location: 0/130005C8
  Latest checkpoint's REDO location: 0/130005C8
  Latest checkpoint's REDO WAL file: 000000160000000000000013
  Latest checkpoint's TimeLineID: 22
  Latest checkpoint's PrevTimeLineID: 22
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:51758
  Latest checkpoint's NextOID: 23098
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 717
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Wed Nov 22 16:22:42 2023
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/13000640
  Min recovery ending loc's timeline: 22
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: 904aa8b010e238e03d68379679602e07eeb104589b5dfe7667501d1aaa08be82

2023-11-22 16:37:30,216 INFO: Lock owner: pg_1; I am pg_2
2023-11-22 16:37:30,241 INFO: Local timeline=22 lsn=0/13000640
2023-11-22 16:37:30,579 INFO: primary_timeline=23
2023-11-22 16:37:30,605 INFO: primary: history=19	0/11E4E0F8	no recovery target specified
20	0/13000270	no recovery target specified
21	0/13000400	no recovery target specified
22	0/13000640	no recovery target specified
2023-11-22 16:37:30,605 INFO: Lock owner: pg_1; I am pg_2
2023-11-22 16:37:30,607 INFO: starting as a secondary
2023-11-22 16:37:31.550 CET [688084] LOG:  redirecting log output to logging collector process
2023-11-22 16:37:31.550 CET [688084] HINT:  Future log output will appear in directory "log".
2023-11-22 16:37:31,621 INFO: postmaster pid=688084
10.11.1.92:5432 - accepting connections
10.11.1.92:5432 - accepting connections
2023-11-22 16:37:32,064 INFO: Lock owner: pg_1; I am pg_2
2023-11-22 16:37:32,064 INFO: establishing a new patroni connection to the postgres cluster
2023-11-22 16:37:32,132 INFO: Dropped unknown replication slot 'pg_1'
2023-11-22 16:37:32,144 INFO: Dropped unknown replication slot 'pg_3'
2023-11-22 16:37:32,373 INFO: no action. I am (pg_2), a secondary, and following a leader (pg_1)
2023-11-22 16:37:32,753 INFO: no action. I am (pg_2), a secondary, and following a leader (pg_1)
2023-11-22 16:37:42,877 INFO: no action. I am (pg_2), a secondary, and following a leader (pg_1)
2023-11-22 16:37:52,877 INFO: no action. I am (pg_2), a secondary, and following a leader (pg_1)

Patroni has no control over which ZooKeeper node it connects to; that is handled by kazoo.
Also, you are not using a real ZooKeeper.

Thank you for the quick reply. I have another setup with a real ZooKeeper which has never hit such an issue in ~2 years, but it uses an older Patroni version (2.1.4) and has never experienced network partitioning.

Even if this problem is not related to Patroni but to kazoo or clickhouse-keeper, the connection state handling could be improved to explicitly handle the kazoo.exceptions.ConnectionClosedError exception, at least to avoid calling the "_load_cluster" method every couple of milliseconds (probably as a result of the call made in the daemon's main loop: main.run_cycle -> ha._run_cycle -> ha.load_cluster_from_dcs -> ha.dcs.get_cluster -> zookeeper.__get_patroni_cluster -> zookeeper._load_cluster), or better, to re-initialize the connection to the DCS. However, I know too little of the Patroni codebase to suggest a feasible solution, especially given other functionality that could easily be broken (like DCS Failsafe Mode or the "fast" shutdown of Postgres when the leader loses the DCS connection, mentioned in #1346 (comment)).
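
As a rough, hypothetical illustration of that idea (this is not the patch linked a few comments below), the ZooKeeper cluster loader could be wrapped so that a ConnectionClosedError triggers an explicit client restart and a short back-off instead of an immediate retry. Only KazooClient.retry()/restart() and the kazoo exception are real APIs here; all other names are made up for the sketch:

# Hypothetical sketch only - not Patroni code and not the linked patch.
import logging
import time

from kazoo.client import KazooClient
from kazoo.exceptions import ConnectionClosedError

logger = logging.getLogger(__name__)


def load_cluster_with_recovery(client: KazooClient, loader, path, backoff=5.0):
    """Call the cluster loader; if the connection is already closed,
    restart the kazoo client once instead of retrying in a tight loop."""
    try:
        return client.retry(loader, path)
    except ConnectionClosedError:
        logger.error("ZooKeeper connection is closed; restarting the kazoo client")
        client.restart()      # stop() + start(): spawns a fresh connection thread
        time.sleep(backoff)   # back off instead of looping every few milliseconds
        return client.retry(loader, path)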

The failover worked, there was no data loss, and in 2 out of 3 cases a simple restart of the Patroni service fixed the lost replicas, which is good enough. Also, a couple of additional switchovers did not recreate the problem from my last comment.
The issue is minor.

@CyberDem0n could you please take a look at a naive attempt to mitigate the problem:
https://github.com/zalando/patroni/compare/REL_3_1...Girgitt:patroni:%232957-zk-conn-lost-loop?expand=1
and comment on whether restarting kazoo, when the connection to ZooKeeper is closed anyway and the cluster data cannot be fetched, could possibly do more harm than good?
Unit tests for ha and zookeeper pass, but I do not see kazoo connection or retry-related exceptions being tested.

@Girgitt what caught my eye is an exception that happened in the thread that is responsible for communicating with ZooKeeper:

2023-11-22 16:22:43,888 ERROR: Unhandled exception in connection loop
Traceback (most recent call last):
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 602, in _connect_attempt
    response = self._read_socket(read_timeout)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 453, in _read_socket
    return self._read_response(header, buffer, offset)
  File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 383, in _read_response
    raise exc
RuntimeError: ('xids do not match, expected %r received %r', 1, 2)

And as a result:

Exception in thread Thread-1194052:

From this moment, the thread that is started here (self._client.start()) or here (self._client.restart()) is dead and will not resurrect.
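
For context, kazoo exposes a connection-state listener and a restart() call on the client. The following standalone sketch (hypothetical, not Patroni code) shows those hooks; add_listener(), KazooState, connected and restart() are real kazoo APIs, but whether the listener still fires after the connection thread dies from an unhandled exception is exactly the open question here:

# Standalone, hypothetical sketch of kazoo's public recovery hooks.
from kazoo.client import KazooClient, KazooState

client = KazooClient(hosts="zk-1.tac:2181,zk-2.tac:2181,zk-3.tac:2181")


def on_state_change(state):
    # Invoked by kazoo on CONNECTED / SUSPENDED / LOST transitions.
    if state == KazooState.LOST:
        print("ZooKeeper session lost")


client.add_listener(on_state_change)
client.start()

# If the connection thread is gone, the client never reconnects on its own;
# restart() tears the client down and spawns a brand-new connection thread.
if not client.connected:
    client.restart()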

Now we need to understand what brought kazoo into this state. For that you need to enable DEBUG level logs and configure the log format so that it includes the thread name: https://patroni.readthedocs.io/en/latest/yaml_configuration.html#log
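
For reference, this goes into the top-level log section of patroni.yml; the level and format keys follow the linked documentation page, while the concrete format string below is only an illustration:

# Illustrative log settings: DEBUG level plus %(threadName)s in the format,
# so messages coming from kazoo's connection thread can be told apart.
log:
  level: DEBUG
  format: '%(asctime)s %(threadName)s %(levelname)s: %(message)s'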

@CyberDem0n I will enable the DEBUG logging level and try to reproduce the issue.

Regarding the RuntimeError, I suspect possible packet duplication. The hosts are interconnected via Linux bridges connected via multiple VXLANs over WireGuard, and mstpd handles RSTP for each bridge. During a spanning tree topology change there is a chance of receiving duplicated packets (or even one's own packets), and I am not sure whether such packets are dropped by the kernel.

Based on your last comment regarding the ZK client's thread termination: apart from trying to recreate the xids mismatch error, and since my setup is not using the real ZooKeeper, the zk client is created in a constructor (which makes re-creating the client problematic), and the zookeeper module has been heavily refactored since REL_3_1, I will probably extend, for my own purposes, Patroni's zookeeper module and the Ha._run_cycle or Ha._handle_dcs_error methods with a new type of "DCSFatalError" exception raised on kazoo.exceptions.ConnectionClosedError, and try to gracefully terminate Patroni to trigger a restart by the supervisord watchdog. This is just in case the original problem turns out to be too difficult to reproduce - it has happened only once in a couple of months.
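
A bare-bones sketch of that fail-fast fallback; DCSFatalError and all the wiring below are hypothetical and only mirror the naming used above, while supervisord's autorestart=true (from the config shared earlier) would do the actual restart:

# Hypothetical fail-fast fallback: turn a permanently closed kazoo connection
# into a process exit so supervisord restarts Patroni with a fresh client.
# Only ConnectionClosedError is a real kazoo exception; the rest is invented.
import logging
import sys

from kazoo.exceptions import ConnectionClosedError

logger = logging.getLogger(__name__)


class DCSFatalError(Exception):
    """Marker for DCS errors considered unrecoverable in-process."""


def run_cycle_guarded(run_cycle):
    """Run one HA cycle and escalate a closed DCS connection."""
    try:
        run_cycle()
    except ConnectionClosedError as exc:
        raise DCSFatalError("kazoo connection closed and not recovering") from exc


def main_loop(run_cycle):
    try:
        while True:
            run_cycle_guarded(run_cycle)
    except DCSFatalError:
        logger.critical("Unrecoverable DCS error; exiting so that supervisord "
                        "(autorestart=true) restarts Patroni")
        sys.exit(1)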

@Girgitt the RuntimeError: ('xids do not match, expected %r received %r', 1, 2) isn't happening in the Patroni code.

The thing is, the KazooClient.start() method starts a thread that is supposed to iterate over the ZooKeeper servers, connect to them, authenticate, run client commands, and handle all kinds of expected exceptions. That is, if something is wrong with one of the ZooKeeper hosts, it is supposed to switch to another one.

But the RuntimeError for some reason breaks this thread. It would be good to understand the reason for that. It is very unlikely that your network configuration is to blame.

Could you repeat the same test with a real ZooKeeper? If everything is fine, that would mean that ClickHouse Keeper isn't as compatible with ZooKeeper as expected.

Thank you, I understand that the issue is not in the Patroni code and that the kazoo thread terminates. I will try to reproduce the problem. Setting up a second, real ZK cluster and another Patroni cluster for testing is not a problem.