zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

Python error after migrating from patroni 2.1.1 / patroni-etcd 2.1.1 to patroni 3.2.1 / patroni-etcd 3.2.1

laurentnadot45 opened this issue · comments

What happened?

Hello everyone,
I have an etcd cluster with 3 nodes and a Patroni cluster with 2 nodes.
When I simulate the loss of quorum on etcd, the command "patronictl -c /etc/patroni/patroni.yml list" fails with Python errors.
Are you aware of this bug?

[root] # etcdctl -w table --endpoints=etcd1:2379,etcd2:2379,etcd3:2379 endpoint status
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  ENDPOINT  |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| etcd1:2379 | 6c4bbf766909a6df |   3.5.0 |   37 kB |      true |      false |         2 |         33 |                 33 |        |
| etcd2:2379 |  14f66fec485f4fe |   3.5.0 |   37 kB |     false |      false |         2 |         33 |                 33 |        |
| etcd3:2379 | e0214760976b7075 |   3.5.0 |   37 kB |     false |      false |         2 |         33 |                 33 |        |
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

[root] # patronictl -c /etc/patroni/patroni.yml list

+ Cluster: patroni (7329482298158406281) -------+----+-----------+
| Member   | Host         | Role    | State     | TL | Lag in MB |
+----------+--------------+---------+-----------+----+-----------+
| patroni1 | 172.16.17.95 | Leader  | running   |  3 |           |
| patroni2 | 172.16.17.96 | Replica | streaming |  3 |         0 |
+----------+--------------+---------+-----------+----+-----------+

1/ Stop two etcd services

-(lun. janv. 29 13:59:04)--(etcd152_socle2022_01.lnadot:/var/lib/etcd)-
[root] # etcdctl -w table --endpoints=etcd1:2379,etcd2:2379,etcd3:2379 endpoint status
{"level":"warn","ts":"2024-01-29T14:01:12.889+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002daa80/#initially=[etcd1:2379;etcd2:2379;etcd3:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = "transport: Error while dialing dial tcp 172.16.17.93:2379: connect: connection refused""}
Failed to get the status of endpoint etcd2:2379 (context deadline exceeded)
{"level":"warn","ts":"2024-01-29T14:01:17.889+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002daa80/#initially=[etcd1:2379;etcd2:2379;etcd3:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = "transport: Error while dialing dial tcp 172.16.17.94:2379: connect: connection refused""}
Failed to get the status of endpoint etcd3:2379 (context deadline exceeded)
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
|  ENDPOINT  |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |         ERRORS        |
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
| etcd1:2379 | 6c4bbf766909a6df |   3.5.0 |   41 kB |     false |      false |         2 |         55 |                 55 | etcdserver: no leader |
+------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+-----------------------+
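
As a quicker check, etcdctl endpoint health against the same three endpoints should report the two stopped members as unhealthy (and, since quorum is gone, typically the surviving one as well):

[root] # etcdctl --endpoints=etcd1:2379,etcd2:2379,etcd3:2379 endpoint health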

2/ Display the Patroni cluster information

[root] # patronictl -c /etc/patroni/patroni.yml list
2024-01-29 09:56:23,183 - ERROR - Failed to get list of machines from http://172.16.17.94:2379/v3beta: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b8956940>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:23,231 - ERROR - Request to server http://172.16.17.92:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.92', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b8956cf8>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:26,569 - ERROR - Request to server http://172.16.17.93:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='172.16.17.93', port=2379): Read timed out. (read timeout=3.332846157330399)",)
2024-01-29 09:56:26,570 - ERROR - Request to server http://172.16.17.94:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88eee10>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:26,572 - ERROR - Failed to get list of machines from http://172.16.17.94:2379/v3: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88ee0f0>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:27,357 - ERROR - Request to server http://172.16.17.92:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.92', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88ee1d0>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:29,318 - ERROR - Request to server http://172.16.17.93:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='172.16.17.93', port=2379): Read timed out. (read timeout=1.9578881942822288)",)
2024-01-29 09:56:29,320 - ERROR - Request to server http://172.16.17.94:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b8956dd8>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:29,321 - ERROR - Failed to get list of machines from http://172.16.17.94:2379/v3: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b8956780>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:29,726 - ERROR - Request to server http://172.16.17.92:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.92', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88ee4a8>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:30,897 - ERROR - Request to server http://172.16.17.93:2379 failed: ReadTimeoutError("HTTPConnectionPool(host='172.16.17.93', port=2379): Read timed out. (read timeout=1.1678187757109602)",)
2024-01-29 09:56:30,899 - ERROR - Request to server http://172.16.17.94:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88ee3c8>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:30,901 - ERROR - Failed to get list of machines from http://172.16.17.94:2379/v3: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88eee10>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:31,665 - ERROR - Request to server http://172.16.17.92:2379 failed: MaxRetryError("HTTPConnectionPool(host='172.16.17.92', port=2379): Max retries exceeded with url: /v3/kv/range (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88ee748>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:31,667 - ERROR - Failed to get list of machines from http://172.16.17.94:2379/v3: MaxRetryError("HTTPConnectionPool(host='172.16.17.94', port=2379): Max retries exceeded with url: /v3/cluster/member/list (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f25b88ee7f0>: Failed to establish a new connection: [Errno 111] Connection refused',))",)
2024-01-29 09:56:31,669 - ERROR - get_cluster
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 808, in _load_cluster
cluster = loader(path)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 790, in _cluster_loader
for node in self._client.get_cluster(path)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 626, in get_cluster
ret = self._etcd3.retry(self.prefix, path, serializable).get('kvs', [])
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 495, in retry
return retry(method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/utils.py", line 612, in call
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 363, in prefix
return self.range(key, prefix_range_end(key), serializable, retry=retry)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 208, in wrapper
return self.handle_auth_errors(func, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 334, in handle_auth_errors
return func(self, *args, retry=retry, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 360, in range
return self.call_rpc('/kv/range', params, retry)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 634, in call_rpc
ret = super(PatroniEtcd3Client, self).call_rpc(method, fields, retry)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 294, in call_rpc
return self.api_execute(self.version_prefix + method, self._MPOST, fields)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 301, in api_execute
raise ex
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 282, in api_execute
response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 257, in _do_http_request
raise etcd.EtcdConnectionFailed('No more machines in the cluster')
etcd.EtcdConnectionFailed: No more machines in the cluster
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 808, in _load_cluster
cluster = loader(path)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 790, in _cluster_loader
for node in self._client.get_cluster(path)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 626, in get_cluster
ret = self._etcd3.retry(self.prefix, path, serializable).get('kvs', [])
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 495, in retry
return retry(method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/utils.py", line 612, in call
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 363, in prefix
return self.range(key, prefix_range_end(key), serializable, retry=retry)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 208, in wrapper
return self.handle_auth_errors(func, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 334, in handle_auth_errors
return func(self, *args, retry=retry, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 360, in range
return self.call_rpc('/kv/range', params, retry)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 634, in call_rpc
ret = super(PatroniEtcd3Client, self).call_rpc(method, fields, retry)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 294, in call_rpc
return self.api_execute(self.version_prefix + method, self._MPOST, fields)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 301, in api_execute
raise ex
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 282, in api_execute
response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 257, in _do_http_request
raise etcd.EtcdConnectionFailed('No more machines in the cluster')
etcd.EtcdConnectionFailed: No more machines in the cluster

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/bin/patronictl", line 11, in
load_entry_point('patroni==3.2.1', 'console_scripts', 'patronictl')()
File "/usr/lib/python3.6/site-packages/click/core.py", line 721, in call
return self.main(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/core.py", line 696, in main
rv = self.invoke(ctx)
File "/usr/lib/python3.6/site-packages/click/core.py", line 1065, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/lib/python3.6/site-packages/click/core.py", line 894, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/lib/python3.6/site-packages/click/core.py", line 534, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/decorators.py", line 27, in new_func
return f(get_current_context().obj, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/ctl.py", line 1658, in members
cluster = dcs.get_cluster()
File "/usr/lib/python3.6/site-packages/patroni/dcs/init.py", line 1654, in get_cluster
cluster = self._get_citus_cluster() if self.is_citus_coordinator() else self.__get_patroni_cluster()
File "/usr/lib/python3.6/site-packages/patroni/dcs/init.py", line 1603, in __get_patroni_cluster
cluster = self._load_cluster(path, self._cluster_loader)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 812, in _load_cluster
self._handle_exception(e, 'get_cluster', raise_ex=Etcd3Error('Etcd is not responding properly'))
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 507, in _handle_exception
raise raise_ex
patroni.dcs.etcd3.Etcd3Error: Etcd is not responding properly

How can we reproduce it (as minimally and precisely as possible)?

Rocky Linux release 8.9 (Green Obsidian)
PG15.5
etcd-3.5.0-1
patroni-3.2.1
patroni-etcd-3.2.1

Simulate the failure by stopping two of the three etcd nodes:
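
For example (a minimal sketch, assuming etcd runs as a systemd unit named etcd on each node):

[root] # systemctl stop etcd    # run on etcd2
[root] # systemctl stop etcd    # run on etcd3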

What did you expect to happen?

[root] # patronictl -c /etc/patroni/patroni.yml list

+ Cluster: patroni (7329482298158406281) -------+----+-----------+
| Member   | Host         | Role    | State     | TL | Lag in MB |
+----------+--------------+---------+-----------+----+-----------+
| patroni1 | 172.16.17.95 | Leader  | running   |  3 |           |
| patroni2 | 172.16.17.96 | Replica | streaming |  3 |         0 |
+----------+--------------+---------+-----------+----+-----------+

Patroni/PostgreSQL/DCS version

  • Patroni version: 3.2.1 (patroni-etcd 3.2.1)
  • PostgreSQL version: 15.5
  • DCS (and its version): etcd 3.5.0-1

Patroni configuration file

[root] # cat /etc/patroni/patroni.yml
scope: patroni
name: patroni1

restapi:
  listen: 172.16.17.95:8008
  connect_address: 172.16.17.95:8008

log:
#  level: INFO
  level: DEBUG
  dir: /var/log/postgresql/patroni


etcd3:
  hosts:
  - 172.16.17.92:2379
  - 172.16.17.93:2379
  - 172.16.17.94:2379  

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
         archive_mode: true
         archive_command: 'test ! -f "/var/lib/pgsql/PATRONI/archives/%f" && cp "%p" "/var/lib/pgsql/PATRONI/archives/%f"'
         wal_level: replica
         wal_log_hints: 'on'
         max_wal_senders: 10
         hot_standby: on
         password_encryption: 'scram-sha-256'
         log_directory: '/var/log/postgresql/PATRONI'
         log_filename: 'PATRONI.log'
         log_file_mode: 0640
         log_truncate_on_rotation: off
         log_rotation_age: 0
         syslog_facility: 'local3'
         log_checkpoints: on
         log_line_prefix: 'user=%u,db=%d,time=%t,pid=%p'
         log_lock_waits: on
         log_statement: 'ddl'
         log_temp_files: 0
         idle_in_transaction_session_timeout: 604800000
         lc_messages: 'C'
         lc_monetary: 'C'
         lc_numeric: 'C'
         lc_time: 'C'
         default_text_search_config: 'pg_catalog.french'
         shared_preload_libraries: 'pg_stat_statements,auto_explain'
         max_connections: 100 
         shared_buffers: 1GB
         effective_cache_size: 3GB
         maintenance_work_mem: 256MB
         checkpoint_completion_target: 0.9
         wal_buffers: 16MB
         default_statistics_target: 100
         random_page_cost: 1.1
         effective_io_concurrency: 200
         work_mem: 26MB
         min_wal_size: 2GB
         max_wal_size: 8GB
         max_worker_processes: 4
         max_parallel_workers_per_gather: 2
         max_parallel_workers: 4



  initdb:
  - encoding: UTF8
  - data-checksums

  pg_hba:
  - local all postgres peer
  - local all all scram-sha-256
  - host all all 127.0.0.1/32 scram-sha-256
  - host replication replicator all scram-sha-256
  - host all rewind_user 172.16.14.51/24 scram-sha-256
  - host all rewind_user 172.16.14.52/24 scram-sha-256
  - host all sinaps 10.154.58.240/32 scram-sha-256
  - host all sinaps 172.19.226.0/27 scram-sha-256
  - host all sinaps 172.19.227.0/27 scram-sha-256
  - host all sinaps 172.19.228.0/27 scram-sha-256
  - host all sinaps 172.19.224.0/24 scram-sha-256
  - host all sinaps 192.168.144.0/24 scram-sha-256
  - host testina testina 172.16.7.72/32 scram-sha-256   
  - host hr hr  172.16.17.89/32  scram-sha-256
  - host pgbench pgbench 172.16.17.89/32  scram-sha-256

  users:
    postgres:
      password: admin
      options:
        - createrole
        - createdb
    replicator:
      password: admin
      options:
        - replication

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 172.16.17.95:5432
  data_dir: /var/lib/pgsql/PATRONI/data
  bin_dir: /usr/pgsql-15/bin
  authentication:
    replication:
      username: replicator
      password: admin
    superuser:
      username: postgres
      password: admin
    rewind:  
      username: rewind_user
      password: rewind_password
  parameters:
    unix_socket_directories: '/var/run/postgresql'

#watchdog:
#    mode: required # Allowed values: off, automatic, required
#    device: /dev/watchdog
#    safety_margin: 5

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

patronictl show-config

[root] # patronictl -c /etc/patroni/patroni.yml show-config
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    archive_command: test ! -f "/var/lib/pgsql/PATRONI/archives/%f" && cp "%p" "/var/lib/pgsql/PATRONI/archives/%f"
    archive_mode: true
    checkpoint_completion_target: 0.9
    default_statistics_target: 100
    default_text_search_config: pg_catalog.french
    effective_cache_size: 3GB
    effective_io_concurrency: 200
    hot_standby: true
    idle_in_transaction_session_timeout: 604800000
    lc_messages: C
    lc_monetary: C
    lc_numeric: C
    lc_time: C
    log_checkpoints: true
    log_directory: /var/log/postgresql/PATRONI
    log_file_mode: 416
    log_filename: PATRONI.log
    log_line_prefix: user=%u,db=%d,time=%t,pid=%p
    log_lock_waits: true
    log_rotation_age: 0
    log_statement: ddl
    log_temp_files: 0
    log_truncate_on_rotation: false
    maintenance_work_mem: 256MB
    max_connections: 100
    max_parallel_workers: 4
    max_parallel_workers_per_gather: 2
    max_wal_senders: 10
    max_wal_size: 8GB
    max_worker_processes: 4
    min_wal_size: 2GB
    password_encryption: scram-sha-256
    random_page_cost: 1.1
    shared_buffers: 1GB
    shared_preload_libraries: pg_stat_statements,auto_explain
    syslog_facility: local3
    wal_buffers: 16MB
    wal_level: replica
    wal_log_hints: 'on'
    work_mem: 26MB
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30

Patroni log files

user=,db=,time=2024-01-29 14:08:43 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:08:48 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:08:53 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:08:58 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:03 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:08 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:13 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:18 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:23 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:28 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:33 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:38 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:43 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:48 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:53 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:58 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:10:03 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG:  received promote request
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG:  redo is not required
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG:  selected new timeline ID: 3
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG:  archive recovery complete
user=,db=,time=2024-01-29 14:10:05 CET,pid=6195LOG:  checkpoint starting: force
user=,db=,time=2024-01-29 14:10:05 CET,pid=6192LOG:  database system is ready to accept connections
user=,db=,time=2024-01-29 14:10:05 CET,pid=6195LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.010 s, sync=0.010 s, total=0.044 s; sync files=2, longest=0.009 s, average=0.005 s; distance=0 kB, estimate=0 kB

PostgreSQL log files

user=,db=,time=2024-01-29 14:08:53 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:08:58 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:03 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:08 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:13 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:18 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:23 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:28 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:33 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:38 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:43 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:48 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:53 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:09:58 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:10:03 CET,pid=6197LOG:  waiting for WAL to become available at 0/50000B8
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG:  received promote request
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG:  redo is not required
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG:  selected new timeline ID: 3
user=,db=,time=2024-01-29 14:10:05 CET,pid=6197LOG:  archive recovery complete
user=,db=,time=2024-01-29 14:10:05 CET,pid=6195LOG:  checkpoint starting: force
user=,db=,time=2024-01-29 14:10:05 CET,pid=6192LOG:  database system is ready to accept connections
user=,db=,time=2024-01-29 14:10:05 CET,pid=6195LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.010 s, sync=0.010 s, total=0.044 s; sync files=2, longest=0.009 s, average=0.005 s; distance=0 kB, estimate=0 kB

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

With patroni 2.1.1, patroni-etcd 2.1.1 and etcd-3.5.0-1 we had no problem.

When I simulate the loss of quorum on etcd, the command "patronictl -c /etc/patroni/patroni.yml list" fails with Python errors.
Are you aware of this bug?

Why do you think this is a bug?

With patroni 2.1.1 and patroni-etcd 2.1.1, I have no Python error.

Well, previously patronictl was showing stale information and giving the impression that everything was fine.

We need to put Patroni into production. Can you confirm that this is a normal situation and that everything is correct?
Best regards

Yes, it is expected behavior. You can find more details on why it was changed in #1199.

Well, previously patronictl was showing stale information and giving the impression that everything was fine.

I agree that with Patroni 2.1.1 the stale information shown was not accurate.
But when upgrading to version 3.1.0, seeing the Python error, I thought this was a regression.

And with several Patroni/PostgreSQL instances, I need to check the status of every PostgreSQL instance manually (whether it is primary or standby). Especially when the failsafe_mode: true option is used, the PostgreSQL instances apparently end up in different states depending on whether all etcd nodes are lost, or quorum is lost but at least one etcd node is still alive.
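
For reference, failsafe_mode is a dynamic (DCS-level) setting; a minimal sketch of enabling it, assuming a Patroni 3.x patronictl where edit-config accepts --set:

[root] # patronictl -c /etc/patroni/patroni.yml edit-config --set failsafe_mode=true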

Otherwise I have to use the curl -s http://localhost:8008/patroni command to query the PostgreSQL instances' status one by one. I have also used the curl -s http://localhost:8008/cluster command, which returns no result in some situations.
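
A small loop makes that one-by-one check less tedious; a sketch assuming both members expose the REST API on port 8008 and that jq is installed:

[root] # for h in 172.16.17.95 172.16.17.96; do echo -n "$h: "; curl -s "http://$h:8008/patroni" | jq -r .role; done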

Hello CyberDem0n,
Did you see the Python errors in my first comment (#3014 (comment)) with the command "patronictl -c /etc/patroni/patroni.yml list"?

RESULT: .... Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 808, in _load_cluster
cluster = loader(path)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 790, in _cluster_loader
for node in self._client.get_cluster(path)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd3.py", line 626, in get_cluster
ret = self._etcd3.retry(self.prefix, path, serializable).get('kvs', [])
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 495, in retry
return retry(method, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/utils.py", line 612, in call ....

Best regards

Why isn't there a clean exit?

  1. The etcd cluster lost its quorum.
  2. patronictl list does a quorum read, which fails.
  3. If the command has failed, why would you expect a clean exit? (A script can still detect the failure from the exit status, as sketched below.)
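
A minimal sketch of that exit-status check (the unhandled exception makes patronictl exit non-zero, like any other failed command):

[root] # patronictl -c /etc/patroni/patroni.yml list || echo "cluster state unavailable - DCS unreachable?"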

Thank you for your answers