HAProxy node in Patroni does not fail over with Zookeeper HA in Docker Swarm
ykun91 opened this issue · comments
What happened?
I am using an HA cluster built with Patroni + Zookeeper in Docker Swarm:
Zookeeper
zookeeper-1 (on worker-1)
zookeeper-2 (on worker-2)
zookeeper-3 (on worker-3)
Patroni
postgres1 (HAProxy node, automatically deployed across the workers)
patroni1 (Postgres node, on worker-1)
patroni2 (Postgres node, on worker-2)
patroni3 (Postgres node, on worker-3)
I noticed that when worker-1 failed (taking zookeeper-1, patroni1, and postgres1 down with it), postgres1 was automatically redeployed to another worker, but it failed to start up because it could not resolve zookeeper-1:
********-postgres_postgres1.1.j42ys4960ssz@worker-4 | goroutine 1 [running]:
********-postgres_postgres1.1.j42ys4960ssz@worker-4 | github.com/kelseyhightower/confd/backends/zookeeper.NewZookeeperClient(0xc420078d00, 0x3, 0x4, 0x0, 0x0, 0xc42003a410)
********-postgres_postgres1.1.j42ys4960ssz@worker-4 | /go/src/github.com/kelseyhightower/confd/backends/zookeeper/client.go:20 +0xe0
********-postgres_postgres1.1.j42ys4960ssz@worker-4 | github.com/kelseyhightower/confd/backends.New(0x0, 0x0, 0x0, 0x0, 0x7ffd7ec25c99, 0x9, 0x0, 0x0, 0x0, 0x0, ...)
********-postgres_postgres1.1.j42ys4960ssz@worker-4 | /go/src/github.com/kelseyhightower/confd/backends/client.go:57 +0x107e
********-postgres_postgres1.1.j42ys4960ssz@worker-4 | main.main()
********-postgres_postgres1.1.j42ys4960ssz@worker-4 | /go/src/github.com/kelseyhightower/confd/confd.go:28 +0xb6
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6 | 2024-01-29T07:29:24Z 7ad0cbdbcaa9 confd[7]: INFO Backend set to zookeeper
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6 | 2024-01-29T07:29:24Z 7ad0cbdbcaa9 confd[7]: INFO Starting confd
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6 | 2024-01-29T07:29:24Z 7ad0cbdbcaa9 confd[7]: INFO Backend source(s) set to zookeeper-1:2181, zookeeper-2:2181, zookeeper-3:2181
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6 | panic: lookup zookeeper-1 on 127.0.0.11:53: no such host
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6 |
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6 | goroutine 1 [running]:
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6 | github.com/kelseyhightower/confd/backends/zookeeper.NewZookeeperClient(0xc420078dc0, 0x3, 0x4, 0x0, 0x0, 0xc42003a410)
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6 | /go/src/github.com/kelseyhightower/confd/backends/zookeeper/client.go:20 +0xe0
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2 | 2024-01-29T07:29:30Z 7eaa0f8427d0 confd[7]: INFO Backend set to zookeeper
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2 | 2024-01-29T07:29:30Z 7eaa0f8427d0 confd[7]: INFO Starting confd
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2 | 2024-01-29T07:29:30Z 7eaa0f8427d0 confd[7]: INFO Backend source(s) set to zookeeper-1:2181, zookeeper-2:2181, zookeeper-3:2181
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2 | panic: lookup zookeeper-1 on 127.0.0.11:53: no such host
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2 |
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2 | goroutine 1 [running]:
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2 | github.com/kelseyhightower/confd/backends/zookeeper.NewZookeeperClient(0xc42026e440, 0x3, 0x4, 0x0, 0x0, 0xc4200da410)
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2 | /go/src/github.com/kelseyhightower/confd/backends/zookeeper/client.go:20 +0xe0
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2 | github.com/kelseyhightower/confd/backends.New(0x0, 0x0, 0x0, 0x0, 0x7ffd5df0ac99, 0x9, 0x0, 0x0, 0x0, 0x0, ...)
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2 | /go/src/github.com/kelseyhightower/confd/backends/client.go:57 +0x107e
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2 | main.main()
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2 | /go/src/github.com/kelseyhightower/confd/confd.go:28 +0xb6
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6 | github.com/kelseyhightower/confd/backends.New(0x0, 0x0, 0x0, 0x0, 0x7ffec957bc99, 0x9, 0x0, 0x0, 0x0, 0x0, ...)
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6 | /go/src/github.com/kelseyhightower/confd/backends/client.go:57 +0x107e
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6 | main.main()
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6 | /go/src/github.com/kelseyhightower/confd/confd.go:28 +0xb6
********-postgres_postgres1.1.x4nswnl6u6qe@worker-5 | 2024-01-29T07:29:36Z 1290adf079f4 confd[7]: INFO Backend set to zookeeper
********-postgres_postgres1.1.x4nswnl6u6qe@worker-5 | 2024-01-29T07:29:36Z 1290adf079f4 confd[7]: INFO Starting confd
********-postgres_postgres1.1.x4nswnl6u6qe@worker-5 | 2024-01-29T07:29:36Z 1290adf079f4 confd[7]: INFO Backend source(s) set to zookeeper-1:2181, zookeeper-2:2181, zookeeper-3:2181
********-postgres_postgres1.1.x4nswnl6u6qe@worker-5 | panic: lookup zookeeper-1 on 127.0.0.11:53: no such host
How can we reproduce it (as minimally and precisely as possible)?
Use Docker Swarm to set up the Patroni cluster described above, then crash one of the Zookeeper nodes.
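A minimal stack file along these lines reproduces my layout (service names, images, and the environment variable are illustrative, not my exact file):

```yaml
# Illustrative fragment of the Swarm stack; images/versions are placeholders.
version: "3.8"
services:
  zookeeper-1:
    image: zookeeper:3.6.4
    deploy:
      placement:
        constraints: [node.hostname == worker-1]
  # zookeeper-2 / zookeeper-3 are identical, pinned to worker-2 / worker-3
  postgres1:
    # the HAProxy/confd node; no placement constraint, so Swarm
    # reschedules it onto any live worker when its node dies
    image: my-patroni-haproxy
    environment:
      ZOOKEEPER_HOSTS: zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181
```

With this layout, killing worker-1 removes zookeeper-1 from Swarm's internal DNS (127.0.0.11), which is exactly the lookup that makes confd panic on the rescheduled postgres1.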
What did you expect to happen?
The Patroni (HAProxy) node should not fail to start up when one of the Zookeeper nodes is down; it should retry using the remaining nodes as long as the Zookeeper cluster itself is still running.
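As a workaround, I am considering an entrypoint filter that drops unresolvable hosts before confd starts, so a single dead DNS name cannot crash the container. This is only a sketch; the `ZK_HOSTS` variable and the use of `getent` are my assumptions, not what the image actually does:

```shell
#!/bin/sh
# Sketch: keep only the zookeeper host:port entries whose hostname
# currently resolves, so confd never sees a dead DNS name at startup.

resolvable_hosts() {
    out=""
    for hp in $(printf '%s' "$1" | tr ',' ' '); do
        host="${hp%%:*}"
        # getent uses the same resolver confd would (127.0.0.11 in Swarm)
        if getent hosts "$host" >/dev/null 2>&1; then
            out="${out:+$out,}$hp"
        fi
    done
    printf '%s\n' "$out"
}

# Example: with zookeeper-1 gone, only the surviving entries remain.
LIVE=$(resolvable_hosts "${ZK_HOSTS:-zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181}")
echo "usable ZooKeeper nodes: $LIVE"
# The real entrypoint would then exec confd with one -node flag per
# surviving entry (or sleep and retry if $LIVE is empty).
```

This does not fix confd's panic, it only avoids triggering it; the proper fix would be for confd to tolerate a failed lookup on one of several backend nodes.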
Patroni/PostgreSQL/DCS version
- Patroni version: 3.2.0
- PostgreSQL version: 16.1
- DCS (and its version): Zookeeper 3.6.4
Patroni configuration file
# Patroni configuration file
# https://patroni.readthedocs.io/en/latest/yaml_configuration.html#yaml-configuration
restapi:
  listen: 0.0.0.0:8008
ctl:
  insecure: true
bootstrap:
  # this section will be written into /<namespace>/<scope>/config of the given configuration store
  # after initializing the new cluster, and all other cluster members will use it as the `global configuration`
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      pg_hba:
      - local all all trust
      - host replication replicator all scram-sha-256
      - host all all all scram-sha-256
      parameters:
        max_connections: 1000
  # some desired options for 'initdb'
  initdb:
  - locale: en_US.UTF-8
  - encoding: UTF8
  - data-checksums
  # some additional users that need to be created after initializing the new cluster
  users:
    admin:
      options:
      - createrole
      - createdb
postgresql:
  listen: 0.0.0.0:5432
  data_dir: /home/postgres/data
  pgpass: /tmp/pgpass0
tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
patronictl show-config
postgres@47c360557474:~$ patronictl show-config
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    max_connections: 1000
  pg_hba:
  - local all all trust
  - host replication replicator all scram-sha-256
  - host all all all scram-sha-256
  use_pg_rewind: true
retry_timeout: 10
ttl: 30
Patroni log files
I don't know where to find them.
PostgreSQL log files
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:17:06,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:17:16,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:17:26,188 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:17:36,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:17:46,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:17:56,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:06,188 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:16,188 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:26,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:36,196 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:46,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:53.405 UTC [3262] LOG: replication terminated by primary server
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:53.405 UTC [3262] DETAIL: End of WAL reached on timeline 6 at 0/57B9498.
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:53.405 UTC [3262] FATAL: could not send end-of-streaming message to primary: server closed the connection unexpectedly
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | This probably means the server terminated abnormally
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | before or while processing the request.
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | no COPY in progress
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:53.405 UTC [50] LOG: invalid record length at 0/57B9498: expected at least 24, got 0
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:53.416 UTC [1092732] FATAL: could not connect to the primary server: connection to server at "10.0.4.176", port 5432 failed: server closed the connection unexpectedly
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | This probably means the server terminated abnormally
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | before or while processing the request.
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:53.417 UTC [50] LOG: waiting for WAL to become available at 0/57B94B0
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54,430 INFO: Got response from patroni3 http://10.0.4.183:8008/patroni: {"state": "running", "postmaster_start_time": "2024-01-10 08:35:23.086102+00:00", "role": "replica", "server_version": 160001, "xlog": {"received_location": 91985048, "replayed_location": 91985048, "replayed_timestamp": "2024-01-29 06:44:54.077171+00:00", "paused": false}, "timeline": 6, "cluster_unlocked": true, "dcs_last_seen": 1706512734, "database_system_identifier": "7300929477418795032", "patroni": {"version": "3.2.0", "scope": "postgres1", "name": "patroni3"}}
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54,431 WARNING: Request failed to patroni1: GET http://10.0.4.176:8008/patroni (HTTPConnectionPool(host='10.0.4.176', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54,467 WARNING: Could not activate Linux watchdog device: Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54,475 INFO: promoted self to leader by acquiring session lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | server promoting
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54.477 UTC [50] LOG: received promote request
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54.477 UTC [50] LOG: redo done at 0/57B9420 system usage: CPU: user: 1.52 s, system: 2.61 s, elapsed: 1642332.62 s
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54.477 UTC [50] LOG: last completed transaction was at log time 2024-01-29 06:44:54.077171+00
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54.479 UTC [50] LOG: selected new timeline ID: 7
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54.530 UTC [50] LOG: archive recovery complete
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54.531 UTC [48] LOG: checkpoint starting: force
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54.533 UTC [48] LOG: checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=36 kB; lsn=0/57B9500, redo lsn=0/57B94C8
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54.536 UTC [46] LOG: database system is ready to accept connections
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54.742 UTC [1092748] ERROR: replication slot "patroni3" does not exist
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54.742 UTC [1092748] STATEMENT: START_REPLICATION SLOT "patroni3" 0/5000000 TIMELINE 6
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54.769 UTC [1092749] ERROR: replication slot "patroni3" does not exist
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:54.769 UTC [1092749] STATEMENT: START_REPLICATION SLOT "patroni3" 0/5000000 TIMELINE 7
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:18:55,535 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:19:05,524 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:19:15,508 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:19:25,506 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:19:35,504 INFO: Lock owner: patroni2; I am patroni2
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:19:35,505 INFO: Dropped unknown replication slot 'patroni1'
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:19:35,506 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:19:45,505 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:19:55,505 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:20:03,176 WARNING: Connection dropped: socket connection broken
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:20:03,176 WARNING: Transition to CONNECTING
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:20:03,176 INFO: Zookeeper connection lost
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:20:03,176 INFO: Connecting to zookeeper-2(10.0.4.178):2181, use_ssl: False
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:20:03,181 INFO: Zookeeper connection established, state: CONNECTED
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:20:03,247 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:20:13,192 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:20:23,184 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:20:33,185 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:20:43,185 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:20:53,184 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:21:03,185 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:21:13,184 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:21:23,185 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:21:33,185 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2 | 2024-01-29 07:21:43,185 INFO: no action. I am (patroni2), the leader with the lock
Have you tried to use GitHub issue search?
- Yes
Anything else we need to know?
No response
> but it failed to startup since it failed resolving zookeeper-1
Why do you think this is a Patroni problem?
@CyberDem0n Or is it a problem with confd? Sorry, I don't know much about how Patroni works with confd; I'm not sure whether Patroni controls how confd connects to Zookeeper or not.
Patroni doesn't know anything about confd.
confd works on its own: it can talk to etcd/consul/zookeeper and manage config files (for example, for HAProxy) based on keys stored there.
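For illustration, a confd template resource for HAProxy looks roughly like this (the watched key path is made up for the example):

```toml
# /etc/confd/conf.d/haproxy.toml -- rebuild haproxy.cfg whenever the
# watched keys change in ZooKeeper, then reload HAProxy
[template]
src        = "haproxy.cfg.tmpl"
dest       = "/etc/haproxy/haproxy.cfg"
keys       = ["/service/postgres1/members"]
reload_cmd = "haproxy -f /etc/haproxy/haproxy.cfg -sf $(pidof haproxy)"
```

The crash in the report happens before any of this runs: confd panics while first connecting to its backend, which is entirely outside Patroni's control.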
@CyberDem0n OK, I got it. Thank you for your reply.
And for confd, I'd suggest you build a new version from GitHub instead of using binaries that haven't been updated in a few years.