Python errors from patronictl after migrating from Patroni 2.1.1 / patroni-etcd 2.1.1 to Patroni 3.2.1 / patroni-etcd 3.2.1
laurentnadot45 opened this issue · comments
What happened?
Hello everyone,
I have an etcd cluster with 3 nodes and a Patroni cluster with 2 nodes.
When I simulate the loss of quorum on etcd, I get Python errors from the command "patronictl -c /etc/patroni/patroni.yml list".
Are you aware of this bug?
[root] # etcdctl -w table --endpoints=etcd1:2379,etcd2:2379,etcd3:2379 endpoint status
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| etcd1:2379 | 6c4bbf766909a6df | 3.5.0 | 37 kB | true | false | 2 | 33 | 33 | |
| etcd2:2379 | 14f66fec485f4fe | 3.5.0 | 37 kB | false | false | 2 | 33 | 33 | |
| etcd3:2379 | e0214760976b7075 | 3.5.0 | 37 kB | false | false | 2 | 33 | 33 | |
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root] # patronictl -c /etc/patroni/patroni.yml list
- Cluster: patroni (7329482298158406281) -------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------+--------------+---------+-----------+----+-----------+
| patroni1 | 172.16.17.95 | Leader | running | 3 | |
| patroni2 | 172.16.17.96 | Replica | streaming | 3 | 0 |
+----------+--------------+---------+-----------+----+-----------+
1/ Stop two etcd services
-(lun. janv. 29 13:59:04)--(etcd152_socle2022_01.lnadot:/var/lib/etcd)-
[root] # etcdctl -w table --endpoints=etcd1:2379,etcd2:2379,etcd3:2379 endpoint status
{"level":"warn","ts":"2024-01-29T14:01:12.889+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002daa80/#initially=[etcd1:2379;etcd2:2379;etcd3:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = "transport: Error while dialing dial tcp 172.16.17.93:2379: connect: connection refused""}
Failed to get the status of endpoint etcd2:2379 (context deadline exceeded)
{"level":"warn","ts":"2024-01-29T14:01:17.889+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002daa80/#initially=[etcd1:2379;etcd2:2379;etcd3:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = "transport: Error while dialing dial tcp 172.16.17.94:2379: connect: connection refused""}
Failed to get the status of endpoint etcd3:2379 (context deadline exceeded)
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
| etcd1:2379 | 6c4bbf766909a6df | 3.5.0 | 41 kB | false | false | 2 | 55 | 55 | etcdserver: no leader |
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
2/ Display Patroni cluster information
[root] # patronictl -c /etc/patroni/patroni.yml list
2024-01-29 09:56:23,183 - ERROR - Failed to get list of machines from http://172.16.17.94:2379/v3beta: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b8956940>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:23,231 - ERROR - Request to server http://172.16.17.92:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.92', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b8956cf8>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:26,569 - ERROR - Request to server http://172.16.17.93:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='172.16.17.93', port=2379): Read timed out. (read timeout=3.332846157330399)",)
2024-01-29 09:56:26,570 - ERROR - Request to server http://172.16.17.94:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88eee10>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:26,572 - ERROR - Failed to get list of machines from http://172.16.17.94:2379/v3: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88ee0f0>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:27,357 - ERROR - Request to server http://172.16.17.92:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.92', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88ee1d0>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:29,318 - ERROR - Request to server http://172.16.17.93:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='172.16.17.93', port=2379): Read timed out. (read timeout=1.9578881942822288)",)
2024-01-29 09:56:29,320 - ERROR - Request to server http://172.16.17.94:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b8956dd8>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:29,321 - ERROR - Failed to get list of machines from http://172.16.17.94:2379/v3: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b8956780>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:29,726 - ERROR - Request to server http://172.16.17.92:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.92', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88ee4a8>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:30,897 - ERROR - Request to server http://172.16.17.93:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='172.16.17.93', port=2379): Read timed out. (read timeout=1.1678187757109602)",)
2024-01-29 09:56:30,899 - ERROR - Request to server http://172.16.17.94:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88ee3c8>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:30,901 - ERROR - Failed to get list of machines from http://172.16.17.94:2379/v3: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88eee10>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:31,665 - ERROR - Request to server http://172.16.17.92:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.92', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88ee748>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:31,667 - ERROR - Failed to get list of machines from http://172.16.17.94:2379/v3: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88ee7f0>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:31,669 - ERROR - get_cluster
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 808, in _load_cluster
cluster = loader(path)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 790, in _cluster_loader
for node in self._client.get_cluster(path)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 626, in get_cluster
ret = self._etcd3.retry(self.prefix, path, serializable).get('kvs', [])
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 495, in retry
return retry(method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/utils.py", line 612, in call
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 363, in prefix
return self.range(key, prefix_range_end(key), serializable, retry=retry)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 208, in wrapper
return self.handle_auth_errors(func, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 334, in handle_auth_errors
return func(self, *args, retry=retry, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 360, in range
return self.call_rpc('/kv/range', params, retry)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 634, in call_rpc
ret = super(PatroniEtcd3Client, self).call_rpc(method, fields, retry)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 294, in call_rpc
return self.api_execute(self.version_prefix + method, self._MPOST, fields)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 301, in api_execute
raise ex
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 282, in api_execute
response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 257, in _do_http_request
raise etcd.EtcdConnectionFailed('No more machines in the cluster')
etcd.EtcdConnectionFailed: No more machines in the cluster
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 808, in _load_cluster
cluster = loader(path)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 790, in _cluster_loader
for node in self._client.get_cluster(path)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 626, in get_cluster
ret = self._etcd3.retry(self.prefix, path, serializable).get('kvs', [])
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 495, in retry
return retry(method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/utils.py", line 612, in call
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 363, in prefix
return self.range(key, prefix_range_end(key), serializable, retry=retry)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 208, in wrapper
return self.handle_auth_errors(func, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 334, in handle_auth_errors
return func(self, *args, retry=retry, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 360, in range
return self.call_rpc('/kv/range', params, retry)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 634, in call_rpc
ret = super(PatroniEtcd3Client, self).call_rpc(method, fields, retry)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 294, in call_rpc
return self.api_execute(self.version_prefix + method, self._MPOST, fields)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 301, in api_execute
raise ex
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 282, in api_execute
response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 257, in _do_http_request
raise etcd.EtcdConnectionFailed('No more machines in the cluster')
etcd.EtcdConnectionFailed: No more machines in the cluster
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/bin/patronictl", line 11, in
load_entry_point('patroni==3.2.1', 'console_scripts', 'patronictl')()
File "/usr/lib/python3.6/site-packages/click/core.py", line 721, in call
return self.main(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/usr/lib/python3.6/site-packages/click/core.py", line 1065, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/lib/python3.6/site-packages/click/core.py", line 894, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/lib/python3.6/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/decorators.py", line 27, in new_func
return f(get_current_context().obj, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/ctl.py", line 1658, in members
cluster = dcs.get_cluster()
File "/usr/lib/python3.6/site-packages/patroni/dcs/init.py", line 1654, in get_cluster
cluster = self._get_citus_cluster() if self.is_citus_coordinator() else self.__get_patroni_cluster()
File "/usr/lib/python3.6/site-packages/patroni/dcs/init.py", line 1603, in __get_patroni_cluster
cluster = self._load_cluster(path, self._cluster_loader)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 812, in _load_cluster
self._handle_exception(e, 'get_cluster', raise_ex=Etcd3Error('Etcd is not responding properly'))
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 507, in _handle_exception
raise raise_ex
patroni.dcs.etcd3.Etcd3Error: Etcd is not responding properly
How can we reproduce it (as minimally and precisely as possible)?
Rocky Linux release 8.9 (Green Obsidian)
PG15.5
etcd-3.5.0-1
patroni-3.2.1
patroni-etcd-3.2.1
Simulation with two etcd nodes stopped:
What did you expect to happen?
[root] # patronictl -c /etc/patroni/patroni.yml list
- Cluster: patroni (7329482298158406281) -------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+----------+--------------+---------+-----------+----+-----------+
| patroni1 | 172.16.17.95 | Leader | running | 3 | |
| patroni2 | 172.16.17.96 | Replica | streaming | 3 | 0 |
+----------+--------------+---------+-----------+----+-----------+
Patroni/PostgreSQL/DCS version
- Patroni version:
- PostgreSQL version:
- DCS (and its version):
Patroni configuration file
[root] # cat /etc/patroni/patroni.yml
scope: patroni
name: patroni1
restapi:
  listen: 172.16.17.95:8008
  connect_address: 172.16.17.95:8008
log:
  # level: INFO
  level: DEBUG
  dir: /var/log/postgresql/patroni
etcd3:
  hosts:
    - 172.16.17.92:2379
    - 172.16.17.93:2379
    - 172.16.17.94:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        archive_mode: true
        archive_command: 'test ! -f "/var/lib/pgsql/PATRONI/archives/%f" && cp "%p" "/var/lib/pgsql/PATRONI/archives/%f"'
        wal_level: replica
        wal_log_hints: 'on'
        max_wal_senders: 10
        hot_standby: on
        password_encryption: 'scram-sha-256'
        log_directory: '/var/log/postgresql/PATRONI'
        log_filename: 'PATRONI.log'
        log_file_mode: 0640
        log_truncate_on_rotation: off
        log_rotation_age: 0
        syslog_facility: 'local3'
        log_checkpoints: on
        log_line_prefix: 'user=%u,db=%d,time=%t,pid=%p'
        log_lock_waits: on
        log_statement: 'ddl'
        log_temp_files: 0
        idle_in_transaction_session_timeout: 604800000
        lc_messages: 'C'
        lc_monetary: 'C'
        lc_numeric: 'C'
        lc_time: 'C'
        default_text_search_config: 'pg_catalog.french'
        shared_preload_libraries: 'pg_stat_statements,auto_explain'
        max_connections: 100
        shared_buffers: 1GB
        effective_cache_size: 3GB
        maintenance_work_mem: 256MB
        checkpoint_completion_target: 0.9
        wal_buffers: 16MB
        default_statistics_target: 100
        random_page_cost: 1.1
        effective_io_concurrency: 200
        work_mem: 26MB
        min_wal_size: 2GB
        max_wal_size: 8GB
        max_worker_processes: 4
        max_parallel_workers_per_gather: 2
        max_parallel_workers: 4
  initdb:
    - encoding: UTF8
    - data-checksums
  pg_hba:
    - local all postgres peer
    - local all all scram-sha-256
    - host all all 127.0.0.1/32 scram-sha-256
    - host replication replicator all scram-sha-256
    - host all rewind_user 172.16.14.51/24 scram-sha-256
    - host all rewind_user 172.16.14.52/24 scram-sha-256
    - host all sinaps 10.154.58.240/32 scram-sha-256
    - host all sinaps 172.19.226.0/27 scram-sha-256
    - host all sinaps 172.19.227.0/27 scram-sha-256
    - host all sinaps 172.19.228.0/27 scram-sha-256
    - host all sinaps 172.19.224.0/24 scram-sha-256
    - host all sinaps 192.168.144.0/24 scram-sha-256
    - host testina testina 172.16.7.72/32 scram-sha-256
    - host hr hr 172.16.17.89/32 scram-sha-256
    - host pgbench pgbench 172.16.17.89/32 scram-sha-256
  users:
    postgres:
      password: admin
      options:
        - createrole
        - createdb
    replicator:
      password: admin
      options:
        - replication
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 172.16.17.95:5432
  data_dir: /var/lib/pgsql/PATRONI/data
  bin_dir: /usr/pgsql-15/bin
  authentication:
    replication:
      username: replicator
      password: admin
    superuser:
      username: postgres
      password: admin
    rewind:
      username: rewind_user
      password: rewind_password
  parameters:
    unix_socket_directories: '/var/run/postgresql'
#watchdog:
#  mode: required # Allowed values: off, automatic, required
#  device: /dev/watchdog
#  safety_margin: 5
tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
patronictl show-config
[root] # patronictl -c /etc/patroni/patroni.yml show-config
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    archive_command: test ! -f "/var/lib/pgsql/PATRONI/archives/%f" && cp "%p" "/var/lib/pgsql/PATRONI/archives/%f"
    archive_mode: true
    checkpoint_completion_target: 0.9
    default_statistics_target: 100
    default_text_search_config: pg_catalog.french
    effective_cache_size: 3GB
    effective_io_concurrency: 200
    hot_standby: true
    idle_in_transaction_session_timeout: 604800000
    lc_messages: C
    lc_monetary: C
    lc_numeric: C
    lc_time: C
    log_checkpoints: true
    log_directory: /var/log/postgresql/PATRONI
    log_file_mode: 416
    log_filename: PATRONI.log
    log_line_prefix: user=%u,db=%d,time=%t,pid=%p
    log_lock_waits: true
    log_rotation_age: 0
    log_statement: ddl
    log_temp_files: 0
    log_truncate_on_rotation: false
    maintenance_work_mem: 256MB
    max_connections: 100
    max_parallel_workers: 4
    max_parallel_workers_per_gather: 2
    max_wal_senders: 10
    max_wal_size: 8GB
    max_worker_processes: 4
    min_wal_size: 2GB
    password_encryption: scram-sha-256
    random_page_cost: 1.1
    shared_buffers: 1GB
    shared_preload_libraries: pg_stat_statements,auto_explain
    syslog_facility: local3
    wal_buffers: 16MB
    wal_level: replica
    wal_log_hints: 'on'
    work_mem: 26MB
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30
Patroni log files
user=,db=,time=2024-01-29 14:08:43 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:08:48 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:08:53 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:08:58 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:03 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:08 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:13 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:18 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:23 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:28 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:33 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:38 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:43 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:48 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:53 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:58 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:10:03 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG: received promote request
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG: redo is not required
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG: selected new timeline ID: 3
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG: archive recovery complete
user=,db=,time=2024-01-29 14:10:05 CET,pid=6195LOG: checkpoint starting: force
user=,db=,time=2024-01-29 14:10:05 CET,pid=6192LOG: database system is ready to accept connections
user=,db=,time=2024-01-29 14:10:05 CET,pid=6195LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.010 s, sync=0.010 s, total=0.044 s; sync files=2, longest=0.009 s, average=0.005 s; distance=0 kB, estimate=0 kB
PostgreSQL log files
user=,db=,time=2024-01-29 14:08:53 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:08:58 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:03 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:08 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:13 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:18 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:23 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:28 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:33 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:38 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:43 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:48 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:53 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:58 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:10:03 CET,pid=6197LOG: waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG: received promote request
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG: redo is not required
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG: selected new timeline ID: 3
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG: archive recovery complete
user=,db=,time=2024-01-29 14:10:05 CET,pid=6195LOG: checkpoint starting: force
user=,db=,time=2024-01-29 14:10:05 CET,pid=6192LOG: database system is ready to accept connections
user=,db=,time=2024-01-29 14:10:05 CET,pid=6195LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.010 s, sync=0.010 s, total=0.044 s; sync files=2, longest=0.009 s, average=0.005 s; distance=0 kB, estimate=0 kB
Have you tried to use GitHub issue search?
- Yes
Anything else we need to know?
With Patroni 2.1.1, patroni-etcd 2.1.1 and etcd-3.5.0-1 we had no problem.
When I simulate the loss of quorum on etcd, I get Python errors from the command "patronictl -c /etc/patroni/patroni.yml list".
Are you aware of this bug?
Why do you think this is a bug?
With Patroni 2.1.1 and patroni-etcd 2.1.1, I had no Python errors.
Well, previously patronictl was showing stale information and giving the impression that everything was fine.
We need to get Patroni into production. Can you confirm that this is a normal situation and everything is correct?
Best regards
Yes, it is expected behavior. You can find more details on why it was changed in #1199.
Well, previously patronictl was showing stale information and giving the impression that everything was fine.
I agree that with Patroni 2.1.1 the stale information shown was not accurate.
But when upgrading to version 3.1.0, seeing the Python error, I thought it was a regression.
And with several Patroni/PostgreSQL instances, I need to check each PostgreSQL instance's status (primary or standby) manually, especially when the failsafe_mode: true option is used; apparently the PostgreSQL instances behave differently between losing all etcd nodes and losing quorum while at least one etcd node stays alive.
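For reference, failsafe mode is part of the dynamic (DCS) configuration rather than the static YAML file; a minimal sketch of the fragment, to be applied for example with patronictl edit-config:

```yaml
# Dynamic configuration fragment (sketch); apply with:
#   patronictl -c /etc/patroni/patroni.yml edit-config
ttl: 30
loop_wait: 10
retry_timeout: 10
failsafe_mode: true
```

With failsafe mode on, the leader can keep running through a full DCS outage as long as it can reach all other members over the REST API.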
Otherwise I have to run curl -s http://localhost:8008/patroni to query the PostgreSQL instances' status one by one. I have also used curl -s http://localhost:8008/cluster, which returns no result in some situations.
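Checking each node by hand can be scripted by parsing the JSON body that GET /patroni returns. A minimal sketch; the field names ("role", and the "master"/"primary" values) are assumptions based on recent Patroni releases, so verify them against your own /patroni output:

```python
import json

def classify_member(body: str) -> str:
    """Return 'leader', 'replica' or 'unknown' from a /patroni JSON body."""
    try:
        info = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    role = info.get("role", "")
    # Older Patroni versions report "master", newer ones "primary".
    if role in ("master", "primary"):
        return "leader"
    if role == "replica":
        return "replica"
    return "unknown"

# Example payload shaped like a (trimmed) /patroni response:
sample = '{"state": "running", "role": "primary", "timeline": 3}'
print(classify_member(sample))  # leader
```

A loop over the members' REST addresses (e.g. with urllib or curl) then gives a per-node status even when the DCS itself is down.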
Hello CyberDem0n,
Did you see the Python errors in my first comment (#3014 (comment)) with the command "patronictl -c /etc/patroni/patroni.yml list"?
RESULT : .... Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 808, in _load_cluster
cluster = loader(path)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 790, in _cluster_loader
for node in self._client.get_cluster(path)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 626, in get_cluster
ret = self._etcd3.retry(self.prefix, path, serializable).get('kvs', [])
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 495, in retry
return retry(method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/utils.py", line 612, in call ....
Best regards
Why isn't there a clean exit?
- the etcd cluster lost its quorum
- patronictl list does a quorum read, which fails
- if the command has failed, why do you expect a clean exit?
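For scripts that drive such commands, the traceback can still be turned into a clean non-zero exit by catching the exception in the caller. A minimal sketch; DcsUnavailableError and fetch_cluster are hypothetical stand-ins for patroni.dcs.etcd3.Etcd3Error and dcs.get_cluster() from the traceback above:

```python
import sys

class DcsUnavailableError(Exception):
    """Stand-in for patroni.dcs.etcd3.Etcd3Error (hypothetical)."""

def fetch_cluster():
    # Stand-in for dcs.get_cluster(): with no etcd quorum it raises.
    raise DcsUnavailableError("Etcd is not responding properly")

def main() -> int:
    try:
        fetch_cluster()
    except DcsUnavailableError as exc:
        # Clean failure message and exit code instead of a traceback.
        print(f"DCS unavailable: {exc}", file=sys.stderr)
        return 1
    return 0

rc = main()
print("exit code:", rc)  # exit code: 1
```

In a real script the return value of main() would be passed to sys.exit(), so callers can distinguish "DCS down" from success without parsing a traceback.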
Thank you for your answers