loop after experiencing DCS (clickhouse-keeper) connection issues
Girgitt opened this issue · comments
What happened?
Patroni did not properly renew its connection to the DCS and stayed in a loop after the DCS node serving the Patroni node got detached (network partition) from the rest of the ZK cluster and temporarily stopped serving clients. The affected node was not present in the "patronictl topology" output. Two clusters sharing the same VM and the same DCS were affected. Restarting Patroni fixed the issue on one cluster node's Patroni instance, but the other instance had to be re-initialized, because after the restart its pg_hba.conf file was empty (probably due to some unrelated issue; fixed by re-initializing the Patroni node via "patronictl reinit").
Despite the ZK cluster returning to a normal condition, Patroni kept logging the entries attached to this report non-stop (every couple of milliseconds).
How can we reproduce it (as minimally and precisely as possible)?
Configure a ZK cluster using clickhouse-keeper. Put one ZK node on the local network and the other nodes farther away (cloud deployment etc.) so the local Patroni node will connect to the local ZK node. Other Patroni nodes can be remote as well (e.g. sharing hosts with ZK nodes).
Configure a Patroni cluster using ZK as the DCS.
Drop the network between the local nodes (Patroni, ZK) and the rest of the ZK nodes, effectively creating a network partition between the local nodes and the other ZK and Patroni nodes.
When Patroni starts reporting errors related to the DCS connection, re-establish the connection between the local and remote nodes.
What did you expect to happen?
Patroni successfully renews its connection to the DCS - not necessarily to the same ZK node as before.
Patroni/PostgreSQL/DCS version
- Patroni version: 3.1.2
- PostgreSQL version: 15
- DCS (and its version): ClickHouse Keeper version: v23.8.4.69-stable-d4d1e7b9dedd6666f1c621fee6204b798ca185f1
Patroni configuration file
(patroni-venv) [root@db-1 ~]# cat /opt/patroni_pgdo1.yml
scope: patroni_pgdo1
namespace: /patroni_pgdo1/
name: pg_1

restapi:
  listen: 10.11.1.111:8008
  connect_address: 10.11.1.111:8008
#  certfile: /etc/ssl/certs/ssl-cert-snakeoil.pem
#  keyfile: /etc/ssl/private/ssl-cert-snakeoil.key
#  authentication:
#    username: username
#    password: password

# ctl:
#   insecure: false  # Allow connections to SSL sites without certs
#   certfile: /etc/ssl/certs/ssl-cert-snakeoil.pem
#   cacert: /etc/ssl/certs/ssl-cacert-snakeoil.pem

#raft:
#  data_dir: .
#  self_addr: 127.0.0.1:2222
#  partner_addrs:
#  - 127.0.0.1:2223
#  - 127.0.0.1:2224

zookeeper:
  hosts: [ 'zk-1.tac:2181', 'zk-2.tac:2181', 'zk-3.tac:2181' ]

bootstrap:
  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  # and all other cluster members will use it as a `global configuration`
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
#    master_start_timeout: 300
#    synchronous_mode: false
    #standby_cluster:
      #host: 127.0.0.1
      #port: 1111
      #primary_slot_name: patroni
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: hot_standby
#        hot_standby: "on"
#        max_connections: 100
#        max_worker_processes: 8
#        wal_keep_segments: 8
#        max_wal_senders: 10
#        max_replication_slots: 10
#        max_prepared_transactions: 0
#        max_locks_per_transaction: 64
        wal_log_hints: "on"
#        track_commit_timestamp: "off"
#        archive_mode: "on"
#        archive_timeout: 1800s
#        archive_command: mkdir -p ../wal_archive && test ! -f ../wal_archive/%f && cp %p ../wal_archive/%f
#      recovery_conf:
#        restore_command: cp ../wal_archive/%f %p

  # some desired options for 'initdb'
  initdb:  # Note: It needs to be a list (some options need values, others are switches)
  - encoding: UTF8
  - data-checksums

  pg_hba:  # Add following lines to pg_hba.conf after running 'initdb'
  # For kerberos gss based connectivity (discard @.*$)
  #- host replication replicator 127.0.0.1/32 gss include_realm=0
  #- host all all 0.0.0.0/0 gss include_realm=0
  - host replication patroni_replicator 10.11.1.0/24 md5
  - host all all 0.0.0.0/0 md5
  - host all all 10.11.1.0/24 trust
  - host replication patroni_replicator 10.11.1.0/24 md5
#  - hostssl all all 0.0.0.0/0 md5

  # Additional script to be launched after initial cluster creation (will be passed the connection URL as parameter)
#  post_init: /usr/local/bin/setup_cluster.sh

  # Some additional users which need to be created after initializing new cluster
  users:
    admin:
      password: admin%
      options:
        - createrole
        - createdb

postgresql:
  use_pg_rewind: true
  listen: 10.11.1.111:5432
  connect_address: 10.11.1.111:5432
  data_dir: /mnt/pg_vol_1/pgdo1
  # for some reason the PGSQL installation does not create a global symlink to the pg_rewind binary; forcing bin_dir should fix the problem
  bin_dir: /usr/pgsql-15/bin/
#  config_dir:
  pgpass: /tmp/pgpass_pgdo1
  authentication:
    superuser:
      username: patroni_superuser
      password: "!secret"
    replication:
      username: patroni_replicator
      password: "!secret"
    rewind:  # Has no effect on postgres 10 and lower
      username: patroni_rewind
      password: "!secret"
  # Server side kerberos spn
#  krbsrvname: postgres
  parameters:
    # Fully qualified kerberos ticket file for the running user
    # same as KRB5CCNAME used by the GSS
#    krb_server_keyfile: /var/spool/keytabs/postgres
    unix_socket_directories: '.'  # data_dir

  # Additional fencing script executed after acquiring the leader lock but before promoting the replica
  #pre_promote: /path/to/pre_promote.sh

#watchdog:
#  mode: automatic  # Allowed values: off, automatic, required
#  device: /dev/watchdog
#  safety_margin: 5

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
patronictl show-config
(patroni-venv) [root@db-1 ~]# patronictl -c /opt/patroni_pgdo1.yml show-config
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    wal_level: hot_standby
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30
Patroni log files
2023-11-22 15:24:03,371 ERROR: Error communicating with DCS
2023-11-22 15:24:03,372 INFO: DCS is not accessible
2023-11-22 15:24:03,373 ERROR: get_cluster
Traceback (most recent call last):
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 310, in _load_cluster
cluster = self._client.retry(loader, path)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 347, in _retry
return self._retry.copy()(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/retry.py", line 126, in __call__
return func(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 248, in _cluster_loader
nodes = set(self.get_children(path, self.cluster_watcher))
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 233, in get_children
return self._client.get_children(key, watch)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1218, in get_children
return self.get_children_async(path, watch=watch,
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
raise self._exception
kazoo.exceptions.ConnectionClosedError: Connection has been closed
2023-11-22 15:24:03,373 ERROR: Error communicating with DCS
2023-11-22 15:24:03,374 INFO: DCS is not accessible
2023-11-22 15:24:03,375 ERROR: get_cluster
Traceback (most recent call last):
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 310, in _load_cluster
cluster = self._client.retry(loader, path)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 347, in _retry
return self._retry.copy()(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/retry.py", line 126, in __call__
return func(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 248, in _cluster_loader
nodes = set(self.get_children(path, self.cluster_watcher))
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 233, in get_children
return self._client.get_children(key, watch)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1218, in get_children
return self.get_children_async(path, watch=watch,
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
raise self._exception
kazoo.exceptions.ConnectionClosedError: Connection has been closed
2023-11-22 15:24:03,375 ERROR: Error communicating with DCS
PostgreSQL log files
2023-11-22 15:45:16.500 CET [730753] LOG: waiting for WAL to become available at 0/C002000
2023-11-22 15:45:21.501 CET [730753] LOG: waiting for WAL to become available at 0/C002000
2023-11-22 15:45:26.505 CET [730753] LOG: waiting for WAL to become available at 0/C002000
2023-11-22 15:45:31.506 CET [730753] LOG: waiting for WAL to become available at 0/C002000
2023-11-22 15:45:36.510 CET [730753] LOG: waiting for WAL to become available at 0/C002000
2023-11-22 15:45:41.511 CET [730753] LOG: waiting for WAL to become available at 0/C002000
2023-11-22 15:45:46.515 CET [730753] LOG: waiting for WAL to become available at 0/C002000
2023-11-22 15:45:51.516 CET [730753] LOG: waiting for WAL to become available at 0/C002000
2023-11-22 15:45:56.520 CET [730753] LOG: waiting for WAL to become available at 0/C002000
2023-11-22 15:46:01.521 CET [730753] LOG: waiting for WAL to become available at 0/C002000
2023-11-22 15:46:06.525 CET [730753] LOG: waiting for WAL to become available at 0/C002000
2023-11-22 15:46:11.526 CET [730753] LOG: waiting for WAL to become available at 0/C002000
Have you tried to use GitHub issue search?
- Yes
Anything else we need to know?
Patroni runs in a virtual env from source via supervisord using the following config:
[program:patroni_pgdo1]
command=/opt/patroni-venv/bin/python -m patroni /opt/patroni_pgdo1.yml
user=postgres
stopsignal=INT
directory=/opt/
environment=PATH="/usr/pgsql-15/bin:/opt/:%(ENV_PATH)s"
autostart=true ; start at supervisord start (default: true)
autorestart=true ; restart at unexpected quit (default: true)
startsecs=10 ; number of secs prog must stay running (def. 1)
startretries=3 ; max # of serial start failures (default 3)
stdout_logfile=/var/log/patroni_pgdo1.log ; stdout log path, NONE for none; default AUTO
stderr_logfile=/var/log/patroni_pgdo1.log ; stderr log path, NONE for none; default AUTO
virtual env details:
(patroni-venv) [root@db-1 ~]# pip freeze
click==8.1.7
kazoo==2.9.0
patroni==3.1.2
prettytable==3.9.0
psutil==5.9.6
psycopg==3.1.12
psycopg-binary==3.1.12
python-dateutil==2.8.2
PyYAML==6.0.1
six==1.16.0
typing_extensions==4.8.0
urllib3==2.0.7
wcwidth==0.2.8
ydiff==1.2
maybe the issue is with kazoo: python-zk/kazoo#428
Patroni resumed normal operation after restart.
After recovery, a different node (the former leader, to which Patroni failed over) got into the same DCS-related loop after a switchover:
The original scenario is either incorrect or the issue manifests in different ways.
(the log below was copied from /var/log/ to avoid rollover, as the DCS-related exception is logged very frequently)
[root@db-2 ~]# sed -n -e 245241,245341p /tmp/patroni/patroni_pgvm1.log.2
2023-11-22 16:22:41,623 INFO: received switchover request with leader=pg_2 candidate=pg_1 scheduled_at=None
2023-11-22 16:22:41,679 INFO: Got response from pg_1 http://10.11.1.91:8008/patroni: {"state": "running", "postmaster_start_time": "2023-11-22 16:22:07.755635+01:00", "role": "replica", "server_version": 150004, "xlog": {"received_location": 318768408, "replayed_location": 318768408, "replayed_timestamp": null, "paused": false}, "timeline": 22, "replication_state": "streaming", "dcs_last_seen": 1700666559, "database_system_identifier": "7293861610560159297", "patroni": {"version": "3.1.2", "scope": "patroni_pgvm1"}}
2023-11-22 16:22:42,104 INFO: Got response from pg_1 http://10.11.1.91:8008/patroni: {"state": "running", "postmaster_start_time": "2023-11-22 16:22:07.755635+01:00", "role": "replica", "server_version": 150004, "xlog": {"received_location": 318768408, "replayed_location": 318768408, "replayed_timestamp": null, "paused": false}, "timeline": 22, "replication_state": "streaming", "dcs_last_seen": 1700666559, "database_system_identifier": "7293861610560159297", "patroni": {"version": "3.1.2", "scope": "patroni_pgvm1"}}
2023-11-22 16:22:42,049 INFO: Lock owner: pg_2; I am pg_2
2023-11-22 16:22:42,162 INFO: manual failover: demoting myself
2023-11-22 16:22:42,453 INFO: Lock owner: pg_2; I am pg_2
2023-11-22 16:22:42,453 INFO: updated leader lock during manual failover: demote
2023-11-22 16:22:42,454 INFO: Demoting self (graceful)
2023-11-22 16:22:43,663 WARNING: Connection dropped: socket connection broken
2023-11-22 16:22:43,663 WARNING: Transition to CONNECTING
2023-11-22 16:22:43,663 INFO: Zookeeper connection lost
2023-11-22 16:22:43,664 INFO: Zookeeper session closed, state: CLOSED
2023-11-22 16:22:43,668 INFO: Connecting to zk-1.tac(10.11.1.11):2181, use_ssl: False
2023-11-22 16:22:43,800 INFO: Zookeeper connection established, state: CONNECTED
2023-11-22 16:22:43,888 ERROR: Unhandled exception in connection loop
Traceback (most recent call last):
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 602, in _connect_attempt
response = self._read_socket(read_timeout)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 453, in _read_socket
return self._read_response(header, buffer, offset)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 383, in _read_response
raise exc
RuntimeError: ('xids do not match, expected %r received %r', 1, 2)
2023-11-22 16:22:43,890 ERROR: touch_member
Traceback (most recent call last):
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 425, in touch_member
self._client.create_async(self.member_path, encoded_data, makepath=True, ephemeral=True).get(timeout=1)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 86, in get
raise self._exception
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 292, in captured_function
return function(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 313, in captured_function
value = function(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1022, in create_completion
return self.unchroot(result.get())
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
raise self._exception
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 602, in _connect_attempt
response = self._read_socket(read_timeout)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 453, in _read_socket
return self._read_response(header, buffer, offset)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 383, in _read_response
raise exc
RuntimeError: ('xids do not match, expected %r received %r', 1, 2)
2023-11-22 16:22:43,897 INFO: Leader key released
Exception in thread Thread-1194052:
2023-11-22 16:22:43,898 INFO: Zookeeper session closed, state: CLOSED
Traceback (most recent call last):
File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/lib64/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 512, in zk_loop
if retry(self._connect_loop, retry) is STOP_CONNECTING:
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/retry.py", line 126, in __call__
return func(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 552, in _connect_loop
status = self._connect_attempt(host, hostip, port, retry)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 602, in _connect_attempt
response = self._read_socket(read_timeout)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 453, in _read_socket
return self._read_response(header, buffer, offset)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 383, in _read_response
raise exc
RuntimeError: ('xids do not match, expected %r received %r', 1, 2)
2023-11-22 16:22:44,013 ERROR: get_cluster
Traceback (most recent call last):
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 310, in _load_cluster
cluster = self._client.retry(loader, path)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 347, in _retry
return self._retry.copy()(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/retry.py", line 126, in __call__
return func(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 248, in _cluster_loader
nodes = set(self.get_children(path, self.cluster_watcher))
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 233, in get_children
return self._client.get_children(key, watch)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1218, in get_children
return self.get_children_async(path, watch=watch,
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
raise self._exception
kazoo.exceptions.ConnectionClosedError: Connection has been closed
2023-11-22 16:22:44,014 ERROR: Error communicating with DCS
2023-11-22 16:22:44,015 INFO: DCS is not accessible
2023-11-22 16:22:44,015 ERROR: get_cluster
Traceback (most recent call last):
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 310, in _load_cluster
cluster = self._client.retry(loader, path)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 347, in _retry
return self._retry.copy()(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/retry.py", line 126, in __call__
return func(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 248, in _cluster_loader
nodes = set(self.get_children(path, self.cluster_watcher))
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 233, in get_children
return self._client.get_children(key, watch)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1218, in get_children
return self.get_children_async(path, watch=watch,
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
raise self._exception
kazoo.exceptions.ConnectionClosedError: Connection has been closed
Again, restarting Patroni fixed the DCS connection issue:
kazoo.exceptions.ConnectionClosedError: Connection has been closed
2023-11-22 16:37:24,838 ERROR: Error communicating with DCS
2023-11-22 16:37:24,842 INFO: DCS is not accessible
2023-11-22 16:37:24,843 ERROR: get_cluster
Traceback (most recent call last):
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 310, in _load_cluster
cluster = self._client.retry(loader, path)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 347, in _retry
return self._retry.copy()(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/retry.py", line 126, in __call__
return func(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 248, in _cluster_loader
nodes = set(self.get_children(path, self.cluster_watcher))
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 233, in get_children
return self._client.get_children(key, watch)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1218, in get_children
return self.get_children_async(path, watch=watch,
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
raise self._exception
kazoo.exceptions.ConnectionClosedError: Connection has been closed
2023-11-22 16:37:24,843 ERROR: Error communicating with DCS
2023-11-22 16:37:24,923 ERROR: touch_member
Traceback (most recent call last):
File "/opt/patroni-venv/lib64/python3.9/site-packages/patroni/dcs/zookeeper.py", line 425, in touch_member
self._client.create_async(self.member_path, encoded_data, makepath=True, ephemeral=True).get(timeout=1)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 80, in get
raise self._exception
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/handlers/utils.py", line 292, in captured_function
return function(*args, **kwargs)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1004, in do_create
result = self._create_async_inner(
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/client.py", line 1052, in _create_async_inner
raise async_result.exception
kazoo.exceptions.ConnectionClosedError: Connection has been closed
2023-11-22 16:37:29,428 INFO: Connecting to zk-1.tac(10.11.1.11):2181, use_ssl: False
2023-11-22 16:37:29,560 INFO: Zookeeper connection established, state: CONNECTED
2023-11-22 16:37:29,814 INFO: No PostgreSQL configuration items changed, nothing to reload.
2023-11-22 16:37:29,936 WARNING: Postgresql is not running.
2023-11-22 16:37:29,937 INFO: Lock owner: pg_1; I am pg_2
2023-11-22 16:37:29,941 INFO: pg_controldata:
pg_control version number: 1300
Catalog version number: 202209061
Database system identifier: 7293861610560159297
Database cluster state: shut down in recovery
pg_control last modified: Wed Nov 22 16:37:24 2023
Latest checkpoint location: 0/130005C8
Latest checkpoint's REDO location: 0/130005C8
Latest checkpoint's REDO WAL file: 000000160000000000000013
Latest checkpoint's TimeLineID: 22
Latest checkpoint's PrevTimeLineID: 22
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:51758
Latest checkpoint's NextOID: 23098
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 717
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Wed Nov 22 16:22:42 2023
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/13000640
Min recovery ending loc's timeline: 22
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float8 argument passing: by value
Data page checksum version: 1
Mock authentication nonce: 904aa8b010e238e03d68379679602e07eeb104589b5dfe7667501d1aaa08be82
2023-11-22 16:37:30,216 INFO: Lock owner: pg_1; I am pg_2
2023-11-22 16:37:30,241 INFO: Local timeline=22 lsn=0/13000640
2023-11-22 16:37:30,579 INFO: primary_timeline=23
2023-11-22 16:37:30,605 INFO: primary: history=19 0/11E4E0F8 no recovery target specified
20 0/13000270 no recovery target specified
21 0/13000400 no recovery target specified
22 0/13000640 no recovery target specified
2023-11-22 16:37:30,605 INFO: Lock owner: pg_1; I am pg_2
2023-11-22 16:37:30,607 INFO: starting as a secondary
2023-11-22 16:37:31.550 CET [688084] LOG: redirecting log output to logging collector process
2023-11-22 16:37:31.550 CET [688084] HINT: Future log output will appear in directory "log".
2023-11-22 16:37:31,621 INFO: postmaster pid=688084
10.11.1.92:5432 - accepting connections
10.11.1.92:5432 - accepting connections
2023-11-22 16:37:32,064 INFO: Lock owner: pg_1; I am pg_2
2023-11-22 16:37:32,064 INFO: establishing a new patroni connection to the postgres cluster
2023-11-22 16:37:32,132 INFO: Dropped unknown replication slot 'pg_1'
2023-11-22 16:37:32,144 INFO: Dropped unknown replication slot 'pg_3'
2023-11-22 16:37:32,373 INFO: no action. I am (pg_2), a secondary, and following a leader (pg_1)
2023-11-22 16:37:32,753 INFO: no action. I am (pg_2), a secondary, and following a leader (pg_1)
2023-11-22 16:37:42,877 INFO: no action. I am (pg_2), a secondary, and following a leader (pg_1)
2023-11-22 16:37:52,877 INFO: no action. I am (pg_2), a secondary, and following a leader (pg_1)
Patroni has no control over which ZooKeeper node it connects to; that is handled by kazoo.
Also, you are not using a real ZooKeeper.
Thank you for the quick reply. I have another setup with a real ZooKeeper which has never hit such an issue in ~2 years, but it uses an older Patroni version (2.1.4) and has never experienced network partitioning.
Even if this problem is not related to Patroni but to kazoo or clickhouse-keeper, the connection state handling could be improved to explicitly handle the kazoo.exceptions.ConnectionClosedError exception - at least to avoid calling the "_load_cluster" method every couple of ms (probably as a result of a call made in the daemon main loop: main.run_cycle -> ha._run_cycle -> ha.load_cluster_from_dcs -> ha.dcs.get_cluster -> zookeeper.__get_patroni_cluster -> zookeeper._load_cluster) or, better, to re-initialize the connection to the DCS. But I know too little of the Patroni codebase to suggest a feasible solution, especially given the other functionality that could easily be broken (like DCS Failsafe Mode, or the "fast" shutdown of Postgres if the leader loses its DCS connection mentioned here: #1346 (comment)).
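For illustration only, a minimal sketch of the "re-initialize the connection to the DCS" idea. All names here are hypothetical (this is not Patroni code), and ConnectionClosedError is a local stand-in for kazoo.exceptions.ConnectionClosedError; the point is to tear down the dead client and build a fresh one instead of retrying against a closed session:

```python
class ConnectionClosedError(Exception):
    """Local stand-in for kazoo.exceptions.ConnectionClosedError."""


class ReconnectingDCS:
    """Hypothetical wrapper: recreates the DCS client after its
    connection is reported closed, instead of looping on a dead session."""

    def __init__(self, client_factory):
        self._factory = client_factory
        self._client = client_factory()

    def get_cluster(self):
        try:
            return self._client.load_cluster()
        except ConnectionClosedError:
            # The session is gone for good: stop the old client and
            # start a new one, which opens a fresh connection loop.
            self._client.stop()
            self._client = self._factory()
            return self._client.load_cluster()
```

Whether restarting the kazoo client at this point is actually safe inside Patroni's HA loop is exactly the open question above.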
The failover worked, there was no data loss, and in 2 out of 3 cases a simple restart of the Patroni service fixed the lost replicas, which is good enough. Also, a couple of additional switchovers did not recreate the problem from my last comment.
The issue is minor.
@CyberDem0n could you please take a look at a naive attempt to mitigate the problem:
https://github.com/zalando/patroni/compare/REL_3_1...Girgitt:patroni:%232957-zk-conn-lost-loop?expand=1
and comment on whether restarting kazoo, when the connection to ZooKeeper is closed anyway and cluster data cannot be fetched, could possibly do more harm than good?
Unit tests for ha and zookeeper pass, but I do not see kazoo connection- or retry-related exceptions being tested.
@Girgitt what caught my eye is an exception that happened in the thread responsible for communicating with ZooKeeper:
2023-11-22 16:22:43,888 ERROR: Unhandled exception in connection loop
Traceback (most recent call last):
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 602, in _connect_attempt
response = self._read_socket(read_timeout)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 453, in _read_socket
return self._read_response(header, buffer, offset)
File "/opt/patroni-venv/lib64/python3.9/site-packages/kazoo/protocol/connection.py", line 383, in _read_response
raise exc
RuntimeError: ('xids do not match, expected %r received %r', 1, 2)
And as a result:
Exception in thread Thread-1194052:
From this moment the thread that is started here:
patroni/patroni/dcs/zookeeper.py, line 126 in ac6f6ae
patroni/patroni/dcs/zookeeper.py, line 407 in ac6f6ae
Now we need to understand what brought kazoo into this state. For that you need to enable DEBUG level
logs and configure the format
so that it includes the thread name: https://patroni.readthedocs.io/en/latest/yaml_configuration.html#log
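Something along these lines should do it (a sketch based on the linked documentation; `log.level` and `log.format` are the Patroni keys, and `%(threadName)s` is the standard Python logging attribute that adds the thread name):

```yaml
log:
  level: DEBUG
  format: '%(asctime)s %(threadName)s %(levelname)s: %(message)s'
```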
@CyberDem0n I will enable DEBUG logging level and try to reproduce the issue.
Regarding the RuntimeError, I suspect possible packet duplication. The hosts are interconnected via Linux bridges connected via multiple VXLANs over WireGuard, and mstpd handles RSTP for each bridge. During a spanning tree topology change there is a chance of receiving duplicated packets (or even one's own packets), and I am not sure whether such packets are dropped by the kernel.
Based on your last comment regarding the ZK client's thread termination: apart from trying to recreate the xids mismatch error, and since my setup is not using the real ZooKeeper, the ZK client is created in a constructor (making the client's re-creation problematic), and the zookeeper module has been heavily refactored since REL_3_1, I will probably extend, for my own purposes, Patroni's zookeeper module and the Ha._run_cycle or Ha._handle_dcs_error methods with a new type of "DCSFatalError" exception raised on kazoo.exceptions.ConnectionClosedError, and try to gracefully terminate Patroni to trigger a restart by the supervisord watchdog. This is just in case the original problem turns out to be too difficult to reproduce - it happened only once in a couple of months.
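A minimal sketch of that workaround. DCSFatalError and both function names are hypothetical, and ConnectionClosedError is again a local stand-in for kazoo.exceptions.ConnectionClosedError; exiting non-zero relies on supervisord's autorestart=true (from the config above) to start a fresh process with a brand-new kazoo client:

```python
import sys


class ConnectionClosedError(Exception):
    """Local stand-in for kazoo.exceptions.ConnectionClosedError."""


class DCSFatalError(Exception):
    """Hypothetical marker for DCS failures considered unrecoverable in-process."""


def handle_dcs_error(exc):
    """Map an unrecoverable DCS error to DCSFatalError; re-raise everything else."""
    if isinstance(exc, ConnectionClosedError):
        raise DCSFatalError from exc
    raise exc


def run_cycle(load_cluster):
    """One HA cycle: on a fatal DCS error, terminate instead of looping."""
    try:
        return load_cluster()
    except Exception as exc:
        try:
            handle_dcs_error(exc)
        except DCSFatalError:
            # Graceful termination; supervisord restarts Patroni.
            sys.exit(1)
```

The trade-off is that a full process restart also resets state that Patroni might otherwise preserve across DCS hiccups, which is why the question above about doing "more harm than good" matters.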
@Girgitt the RuntimeError: ('xids do not match, expected %r received %r', 1, 2) isn't happening in the Patroni code.
The thing is, once the KazooClient.start() method is called, it starts a thread that is supposed to iterate over ZooKeeper servers, connect to them, authenticate, run client commands, and handle all kinds of expected exceptions. That is, if something is wrong with one of the ZooKeeper hosts, it is supposed to switch to another one.
But the RuntimeError for some reason breaks this thread. It would be good to understand the reason for that. It is very unlikely that your network configuration is to blame.
Could you repeat the same test with a real ZooKeeper? If everything is fine, that would mean ClickHouse Keeper isn't fully compatible with ZooKeeper.
Thank you, I understand the issue is not in the Patroni code and that the kazoo thread terminates. I will try to reproduce the problem. Setting up a second, real ZK cluster and another Patroni cluster for testing is not a problem.