failsafe_mode doesn't work when K8s returns 409

Question

ChenChangAo opened this issue 5 months ago

ChenChangAo commented 5 months ago

What happened?

failsafe_mode doesn't take effect when the K8s API returns 409; the leader demotes itself instead.

How can we reproduce it (as minimally and precisely as possible)?

I'm not sure. The K8s API server may have been overloaded.

What did you expect to happen?

failsafe_mode should keep the leader running when the K8s API returns 409.

Patroni/PostgreSQL/DCS version

Patroni version: 3.1.0
PostgreSQL version:
DCS (and its version): k8s

Patroni configuration file

ttl: 30
loop_wait: 10
retry_timeout: 10
failsafe_mode: true

patronictl show-config

Not needed

Patroni log files

2024-02-18 07:15:35,544 INFO: no action. I am (node0), the leader with the lock
2024-02-18 07:15:41,360 INFO: Lock owner: node0; I am node0
2024-02-18 07:15:46,373 ERROR: Request to server https://10.59.230.148:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.59.230.148', port=443): Read timed out. (read timeout=4.999478869140148)",)
2024-02-18 07:15:49,314 ERROR: Request to server https://10.59.230.148:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.59.230.148', port=443): Read timed out. (read timeout=2.045989267528057)",)
2024-02-18 07:15:51,369 ERROR: Request to server https://10.59.230.148:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.59.230.148', port=443): Read timed out. (read timeout=1.631961651146412)",)
2024-02-18 07:15:51,369 ERROR: Error communicating with DCS
2024-02-18 07:15:51,377 INFO: Got response from node1 http://10.59.7.15:8009/patroni: Accepted
2024-02-18 07:15:51,471 INFO: continue to run as a leader because failsafe mode is enabled and all members are accessible
2024-02-18 07:15:51,473 WARNING: Loop time exceeded, rescheduling immediately.
2024-02-18 07:15:51,474 INFO: Lock owner: node0; I am node0
2024-02-18 07:15:56,485 ERROR: Request to server https://10.59.230.148:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.59.230.148', port=443): Read timed out. (read timeout=4.989643476903439)",)
2024-02-18 07:15:59,227 ERROR: Request to server https://10.59.230.148:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.59.230.148', port=443): Read timed out. (read timeout=2.2473123595118523)",)
2024-02-18 07:15:59,876 WARNING: Concurrent update of node-leader
2024-02-18 07:16:00,998 ERROR: failed to update leader lock
2024-02-18 07:16:00,998 INFO: Demoting self (immediate-nolock)
2024-02-18 07:16:03,956 INFO: demoted self because failed to update leader lock in DCS

PostgreSQL log files

Not needed

Have you tried to use GitHub issue search?

Yes

Anything else we need to know?

No response

Alexander Kukushkin · Answer 1 · Mon Mar 18 2024 21:50:13 GMT+0800 (China Standard Time)

HTTP 409 indicates a concurrent update.
That is, the K8s API is up and running, and someone else updated the leader object.
failsafe_mode can't help in that case.
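
To illustrate the distinction drawn in this answer, here is a minimal Python sketch. It is not Patroni's actual code: the names `LockUpdate`, `classify_lock_update`, and `failsafe_applies` are hypothetical, and treating every non-2xx, non-409 status as "unreachable" is a simplifying assumption. It only shows why a timeout and a 409 lead to different decisions.

```python
from enum import Enum
from typing import Optional


class LockUpdate(Enum):
    """Hypothetical outcomes of one attempt to update the leader object."""
    OK = "ok"
    DCS_UNREACHABLE = "dcs_unreachable"   # no usable answer from the API
    CONFLICT = "conflict"                 # API answered: someone else updated it


def classify_lock_update(status_code: Optional[int]) -> LockUpdate:
    """Classify a leader-lock update attempt.

    status_code is None when the request timed out or got no response.
    """
    if status_code is None:
        return LockUpdate.DCS_UNREACHABLE
    if status_code == 409:
        # The API server is reachable and rejected the write as a
        # concurrent update, so this is not an outage.
        return LockUpdate.CONFLICT
    if 200 <= status_code < 300:
        return LockUpdate.OK
    # Assumption for this sketch: any other status counts as unreachable.
    return LockUpdate.DCS_UNREACHABLE


def failsafe_applies(outcome: LockUpdate, all_members_accepted: bool) -> bool:
    """Failsafe can only cover a DCS outage, and only if all members agree."""
    return outcome is LockUpdate.DCS_UNREACHABLE and all_members_accepted


# Timeouts, and the other member answered "Accepted": keep running as leader.
print(failsafe_applies(classify_lock_update(None), True))
# 409 from a reachable API: failsafe does not apply, whatever the members say.
print(failsafe_applies(classify_lock_update(409), True))
```

Read against the log above, the first case roughly corresponds to 07:15:51 (read timeouts, node1 answers "Accepted", the node continues as leader), and the second to 07:15:59 ("Concurrent update of node-leader", followed by the demotion).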