zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

failover to async replica in a healthy synchronous_mode cluster returns 503

waynerv opened this issue

What happened?

Currently, when a member that is not a sync standby is given as the failover candidate and the cluster still has a leader, the request always returns a 503 status code with "Failover failed".

How can we reproduce it (as minimally and precisely as possible)?

[postgres@postgres-f123f207-0-0 /]$ patronictl list
+ Cluster: postgres-f123f207 ---------+--------------+---------+----+-----------+
| Member                | Host        | Role         | State   | TL | Lag in MB |
+-----------------------+-------------+--------------+---------+----+-----------+
| postgres-f123f207-0-0 | 245.0.0.100 | Replica      | running |  5 |         0 |
| postgres-f123f207-1-0 | 245.0.1.30  | Leader       | running |  5 |           |
| postgres-f123f207-2-0 | 245.0.0.70  | Sync Standby | running |  5 |         0 |
+-----------------------+-------------+--------------+---------+----+-----------+
[postgres@postgres-f123f207-0-0 /]$ curl -sv http://localhost:8009/failover -XPOST -d '{"candidate":"postgres-f123f207-0-0"}'
*   Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 8009 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8009 (#0)
> POST /failover HTTP/1.1
> Host: localhost:8009
> User-Agent: curl/7.61.1
> Accept: */*
> Content-Length: 37
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 37 out of 37 bytes
* HTTP 1.0, assume close after body
< HTTP/1.0 503 Service Unavailable
< Server: BaseHTTP/0.6 Python/3.6.8
< Date: Thu, 18 Jan 2024 08:11:52 GMT
< Content-Type: text/html
<
* Closing connection 0
Failover failed
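
For scripted reproduction, the same request can be sketched in Python with only the standard library. This is a minimal sketch: the helper names `build_failover_request` and `request_failover` are hypothetical, and sending `Content-Type: application/json` is an assumption (the curl call above used curl's default form content type with the same JSON body).

```python
import json
import urllib.error
import urllib.request


def build_failover_request(base_url, candidate):
    """Build a POST /failover request carrying the same JSON body as the curl call."""
    body = json.dumps({"candidate": candidate}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/failover",
        data=body,
        method="POST",
        # Assumption: the body is labeled as JSON here; curl above did not do this.
        headers={"Content-Type": "application/json"},
    )


def request_failover(base_url, candidate, timeout=30):
    """Send the request and return (status_code, response_text)."""
    req = build_failover_request(base_url, candidate)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # Non-2xx answers, such as the 503 shown above, are raised as HTTPError.
        return err.code, err.read().decode("utf-8", "replace")
```

Calling `request_failover("http://localhost:8009", "postgres-f123f207-0-0")` against the cluster above should correspond to the curl command; a connection failure would surface as `urllib.error.URLError` rather than a status code.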

What did you expect to happen?

Tracing the request, I found that the API handler wrote the failover key to the DCS successfully, but the leader later cleaned it up because the candidate did not match the sync standby, so the handler returned 503.

Can we do this check directly in the API request handler? One possible implementation is to modify the "is_failover_possible" function so that the request is rejected with 412 and a more specific message:

diff --git a/patroni/api.py b/patroni/api.py
index 5c9d6603..dbc84505 100644
--- a/patroni/api.py
+++ b/patroni/api.py
@@ -1036,7 +1036,7 @@ class RestApiHandler(BaseHTTPRequestHandler):
         if leader and (not cluster.leader or cluster.leader.name != leader):
             return 'leader name does not match'
         if candidate:
-            if action == 'switchover' and is_synchronous_mode and not cluster.sync.matches(candidate):
+            if (action == 'switchover' or action == 'failover' and cluster.leader) and is_synchronous_mode and not cluster.sync.matches(candidate):
                 return 'candidate name does not match with sync_standby'
             members = [m for m in cluster.members if m.name == candidate]
             if not members:
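
The changed condition relies on Python's operator precedence (`and` binds tighter than `or`). A standalone sketch with the grouping written out may make the intent easier to review; `candidate_precheck` and its boolean parameters are hypothetical stand-ins for the handler's `cluster.leader` and `cluster.sync.matches(candidate)` values, not Patroni code.

```python
def candidate_precheck(action, has_leader, is_synchronous_mode, candidate_is_sync):
    """Return an error message if the candidate must be rejected, else None.

    Hypothetical stand-in for the condition in the diff above:
    - a switchover always requires a sync candidate in synchronous mode;
    - a failover requires one only while the cluster still has a leader.
    """
    must_be_sync = action == 'switchover' or (action == 'failover' and has_leader)
    if must_be_sync and is_synchronous_mode and not candidate_is_sync:
        return 'candidate name does not match with sync_standby'
    return None


# The case from this report: failover to an async replica in a healthy cluster.
print(candidate_precheck('failover', True, True, False))
# A leaderless cluster still accepts an async candidate for failover.
print(candidate_precheck('failover', False, True, False))
```

With this grouping, a failover in a leaderless cluster is not restricted to sync standbys, which appears to be the point of adding `and cluster.leader` to the condition.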

Patroni/PostgreSQL/DCS version

  • Patroni version: 3.1.2
  • PostgreSQL version: 14.7
  • DCS (and its version):

Patroni configuration file

-

patronictl show-config

-

Patroni log files

leader log:

2024-01-18 16:11:52,453 INFO: Lock owner: postgres-f123f207-1-0; I am postgres-f123f207-1-0
2024-01-18 16:11:52,500 WARNING: Failover candidate=postgres-f123f207-0-0 does not match with sync_standbys=postgres-f123f207-2-0
2024-01-18 16:11:52,500 WARNING: manual failover: members list is empty
2024-01-18 16:11:52,500 WARNING: manual failover: no healthy members found, failover is not possible
2024-01-18 16:11:52,500 INFO: Cleaning up failover key
2024-01-18 16:11:52,605 INFO: no action. I am (postgres-f123f207-1-0), the leader with the lock
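
The sequence in the leader log can be summarized with a toy model. This is a simplified sketch that only mirrors the order of the messages above; the function `leader_processes_failover_key` and its parameters are hypothetical and do not reproduce Patroni's actual implementation.

```python
def leader_processes_failover_key(candidate, sync_standbys, healthy_members,
                                  synchronous_mode=True):
    """Toy model: return (accepted, log_lines) for a manual failover request."""
    log = []
    # Only the requested candidate is considered for a manual failover.
    members = [m for m in healthy_members if m == candidate]
    if synchronous_mode and candidate not in sync_standbys:
        log.append('Failover candidate={0} does not match with sync_standbys={1}'
                   .format(candidate, ','.join(sorted(sync_standbys))))
        members = []  # the async candidate is dropped from consideration
    if not members:
        log.append('manual failover: members list is empty')
        log.append('manual failover: no healthy members found, failover is not possible')
        log.append('Cleaning up failover key')
        return False, log
    return True, log


accepted, lines = leader_processes_failover_key(
    candidate='postgres-f123f207-0-0',
    sync_standbys={'postgres-f123f207-2-0'},
    healthy_members=['postgres-f123f207-0-0', 'postgres-f123f207-2-0'],
)
for line in lines:
    print(line)
```

Once the key is cleaned up while the leader stays the same, the API handler that is waiting for the result has nothing left to report except "Failover failed", which matches the 503 seen by the client.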


PostgreSQL log files

-

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

No response

Patroni version: 3.1.2

Please upgrade to the latest version.

OK, it seems that this behavior has changed in the newer version (#2980):

[postgres@postgres-6c8b5c3b-2-0 /]$ patronictl list
+ Cluster: postgres-6c8b5c3b (7325362323166597275) -+-----------+----+-----------+
| Member                | Host       | Role         | State     | TL | Lag in MB |
+-----------------------+------------+--------------+-----------+----+-----------+
| postgres-6c8b5c3b-0-0 | 245.0.1.35 | Sync Standby | streaming |  2 |         0 |
| postgres-6c8b5c3b-1-0 | 245.0.0.42 | Replica      | streaming |  2 |         6 |
| postgres-6c8b5c3b-2-0 | 245.0.1.21 | Leader       | running   |  2 |           |
+-----------------------+------------+--------------+-----------+----+-----------+
[postgres@postgres-6c8b5c3b-2-0 /]$ patronictl version
patronictl version 3.2.2
[postgres@postgres-6c8b5c3b-2-0 /]$ patronictl list
+ Cluster: postgres-6c8b5c3b (7325362323166597275) -+-----------+----+-----------+
| Member                | Host       | Role         | State     | TL | Lag in MB |
+-----------------------+------------+--------------+-----------+----+-----------+
| postgres-6c8b5c3b-0-0 | 245.0.1.35 | Sync Standby | streaming |  2 |         0 |
| postgres-6c8b5c3b-1-0 | 245.0.0.42 | Replica      | streaming |  2 |         0 |
| postgres-6c8b5c3b-2-0 | 245.0.1.21 | Leader       | running   |  2 |           |
+-----------------------+------------+--------------+-----------+----+-----------+
[postgres@postgres-6c8b5c3b-2-0 /]$ patronictl failover --candidate postgres-6c8b5c3b-1-0 --force
Current cluster topology
+ Cluster: postgres-6c8b5c3b (7325362323166597275) -+-----------+----+-----------+
| Member                | Host       | Role         | State     | TL | Lag in MB |
+-----------------------+------------+--------------+-----------+----+-----------+
| postgres-6c8b5c3b-0-0 | 245.0.1.35 | Sync Standby | streaming |  2 |         0 |
| postgres-6c8b5c3b-1-0 | 245.0.0.42 | Replica      | streaming |  2 |         0 |
| postgres-6c8b5c3b-2-0 | 245.0.1.21 | Leader       | running   |  2 |           |
+-----------------------+------------+--------------+-----------+----+-----------+
2024-01-18 17:13:36.78836 Successfully failed over to "postgres-6c8b5c3b-1-0"
+ Cluster: postgres-6c8b5c3b (7325362323166597275) --------+----+-----------+
| Member                | Host       | Role    | State     | TL | Lag in MB |
+-----------------------+------------+---------+-----------+----+-----------+
| postgres-6c8b5c3b-0-0 | 245.0.1.35 | Replica | streaming |  2 |         0 |
| postgres-6c8b5c3b-1-0 | 245.0.0.42 | Leader  | running   |  2 |           |
| postgres-6c8b5c3b-2-0 | 245.0.1.21 | Replica | stopped   |    |   unknown |
+-----------------------+------------+---------+-----------+----+-----------+