zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

HAProxy node in Patroni does not fail over with ZooKeeper HA in Docker Swarm

ykun91 opened this issue

What happened?

I am running an HA cluster built with Patroni + ZooKeeper in Docker Swarm:

ZooKeeper

zookeeper-1 (on worker-1)
zookeeper-2 (on worker-2)
zookeeper-3 (on worker-3)

Patroni

postgres1 (HAProxy node, automatically deployed across workers)
patroni1 (Postgres node, on worker-1)
patroni2 (Postgres node, on worker-2)
patroni3 (Postgres node, on worker-3)

I noticed that when worker-1 failed (taking zookeeper-1, patroni1, and postgres1 down with it), postgres1 was automatically redeployed to another worker, but it failed to start up because it could not resolve zookeeper-1:

********-postgres_postgres1.1.j42ys4960ssz@worker-4    | goroutine 1 [running]:
********-postgres_postgres1.1.j42ys4960ssz@worker-4    | github.com/kelseyhightower/confd/backends/zookeeper.NewZookeeperClient(0xc420078d00, 0x3, 0x4, 0x0, 0x0, 0xc42003a410)
********-postgres_postgres1.1.j42ys4960ssz@worker-4    |        /go/src/github.com/kelseyhightower/confd/backends/zookeeper/client.go:20 +0xe0
********-postgres_postgres1.1.j42ys4960ssz@worker-4    | github.com/kelseyhightower/confd/backends.New(0x0, 0x0, 0x0, 0x0, 0x7ffd7ec25c99, 0x9, 0x0, 0x0, 0x0, 0x0, ...)
********-postgres_postgres1.1.j42ys4960ssz@worker-4    |        /go/src/github.com/kelseyhightower/confd/backends/client.go:57 +0x107e
********-postgres_postgres1.1.j42ys4960ssz@worker-4    | main.main()
********-postgres_postgres1.1.j42ys4960ssz@worker-4    |        /go/src/github.com/kelseyhightower/confd/confd.go:28 +0xb6
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6    | 2024-01-29T07:29:24Z 7ad0cbdbcaa9 confd[7]: INFO Backend set to zookeeper
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6    | 2024-01-29T07:29:24Z 7ad0cbdbcaa9 confd[7]: INFO Starting confd
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6    | 2024-01-29T07:29:24Z 7ad0cbdbcaa9 confd[7]: INFO Backend source(s) set to zookeeper-1:2181, zookeeper-2:2181, zookeeper-3:2181
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6    | panic: lookup zookeeper-1 on 127.0.0.11:53: no such host
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6    |
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6    | goroutine 1 [running]:
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6    | github.com/kelseyhightower/confd/backends/zookeeper.NewZookeeperClient(0xc420078dc0, 0x3, 0x4, 0x0, 0x0, 0xc42003a410)
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6    |        /go/src/github.com/kelseyhightower/confd/backends/zookeeper/client.go:20 +0xe0
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2    | 2024-01-29T07:29:30Z 7eaa0f8427d0 confd[7]: INFO Backend set to zookeeper
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2    | 2024-01-29T07:29:30Z 7eaa0f8427d0 confd[7]: INFO Starting confd
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2    | 2024-01-29T07:29:30Z 7eaa0f8427d0 confd[7]: INFO Backend source(s) set to zookeeper-1:2181, zookeeper-2:2181, zookeeper-3:2181
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2    | panic: lookup zookeeper-1 on 127.0.0.11:53: no such host
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2    |
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2    | goroutine 1 [running]:
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2    | github.com/kelseyhightower/confd/backends/zookeeper.NewZookeeperClient(0xc42026e440, 0x3, 0x4, 0x0, 0x0, 0xc4200da410)
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2    |        /go/src/github.com/kelseyhightower/confd/backends/zookeeper/client.go:20 +0xe0
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2    | github.com/kelseyhightower/confd/backends.New(0x0, 0x0, 0x0, 0x0, 0x7ffd5df0ac99, 0x9, 0x0, 0x0, 0x0, 0x0, ...)
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2    |        /go/src/github.com/kelseyhightower/confd/backends/client.go:57 +0x107e
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2    | main.main()
********-postgres_postgres1.1.z9lrhl4xy3vc@worker-2    |        /go/src/github.com/kelseyhightower/confd/confd.go:28 +0xb6
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6    | github.com/kelseyhightower/confd/backends.New(0x0, 0x0, 0x0, 0x0, 0x7ffec957bc99, 0x9, 0x0, 0x0, 0x0, 0x0, ...)
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6    |        /go/src/github.com/kelseyhightower/confd/backends/client.go:57 +0x107e
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6    | main.main()
********-postgres_postgres1.1.ldrwhtvfyn8t@worker-6    |        /go/src/github.com/kelseyhightower/confd/confd.go:28 +0xb6
********-postgres_postgres1.1.x4nswnl6u6qe@worker-5    | 2024-01-29T07:29:36Z 1290adf079f4 confd[7]: INFO Backend set to zookeeper
********-postgres_postgres1.1.x4nswnl6u6qe@worker-5    | 2024-01-29T07:29:36Z 1290adf079f4 confd[7]: INFO Starting confd
********-postgres_postgres1.1.x4nswnl6u6qe@worker-5    | 2024-01-29T07:29:36Z 1290adf079f4 confd[7]: INFO Backend source(s) set to zookeeper-1:2181, zookeeper-2:2181, zookeeper-3:2181
********-postgres_postgres1.1.x4nswnl6u6qe@worker-5    | panic: lookup zookeeper-1 on 127.0.0.11:53: no such host
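
For what it's worth, the lookup failure itself is easy to reproduce from any container on the same overlay network: Docker's embedded DNS (127.0.0.11) answers "no such host" for a Swarm service whose tasks are all down, rather than returning a stale address. A minimal sketch of the check confd is effectively doing at startup (Python; the service names are the ones from my stack):

import socket

# Against Docker's embedded DNS (127.0.0.11), a Swarm service name with
# no running task does not resolve at all, which is what confd panics on.
for host in ("zookeeper-1", "zookeeper-2", "zookeeper-3"):
    try:
        print(host, "->", socket.gethostbyname(host))
    except socket.gaierror as err:
        print(host, "-> lookup failed:", err)

With worker-1 down, zookeeper-1 raises socket.gaierror while the other two still resolve, so the ZooKeeper ensemble itself is fine; only the name lookup fails.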

How can we reproduce it (as minimally and precisely as possible)?

Use Docker Swarm to set up the Patroni cluster described above, then crash the worker node that hosts one of the ZooKeeper replicas.

What did you expect to happen?

Patroni's HAProxy node should not fail to start up when one of the ZooKeeper nodes is down; it should retry using the remaining nodes as long as the ZooKeeper cluster itself is still running.
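
As a possible workaround (a sketch, not a tested fix): wrap the container entrypoint so that unresolvable hosts are dropped from the list before it is handed to confd. The ZOOKEEPER_HOSTS variable and the entrypoint wiring below are assumptions about my own image; the confd flags are the ones from its README.

#!/usr/bin/env python3
# Hypothetical entrypoint wrapper: filter out ZooKeeper hosts that do not
# resolve, so a single dead DNS name cannot make confd panic at startup.
import os
import socket
import sys

hosts = os.environ.get(
    "ZOOKEEPER_HOSTS", "zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181")

reachable = []
for hostport in hosts.split(","):
    host, _, port = hostport.partition(":")
    try:
        socket.getaddrinfo(host, port or "2181")
        reachable.append(hostport)
    except socket.gaierror:
        print("skipping unresolvable host", host, file=sys.stderr)

if not reachable:
    sys.exit("no ZooKeeper host resolvable; refusing to start confd")

# hand only the resolvable nodes to confd (-node may be repeated)
args = ["confd", "-backend", "zookeeper", "-watch"]
for hostport in reachable:
    args += ["-node", hostport]
os.execvp("confd", args)

This only helps at startup, of course; the real fix would be for confd to tolerate a failed lookup and retry, as requested above.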

Patroni/PostgreSQL/DCS version

  • Patroni version: 3.2.0
  • PostgreSQL version: 16.1
  • DCS (and its version): ZooKeeper 3.6.4

Patroni configuration file

# Patroni configuration file
# https://patroni.readthedocs.io/en/latest/yaml_configuration.html#yaml-configuration

restapi:
  listen: 0.0.0.0:8008

ctl:
  insecure: true

bootstrap:
  # this section will be written into /<namespace>/<scope>/config of the given configuration store
  # after initializing new cluster and all other cluster members will use it as a `global configuration`
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      pg_hba:
        - local all all trust
        - host replication replicator all scram-sha-256
        - host all all all scram-sha-256
      parameters:
        max_connections: 1000

  # some desired options for 'initdb'
  initdb:
    - locale: en_US.UTF-8
    - encoding: UTF8
    - data-checksums

  # some additional users which need to be created after initializing new cluster
  users:
    admin:
      options:
        - createrole
        - createdb

postgresql:
  listen: 0.0.0.0:5432
  data_dir: /home/postgres/data
  pgpass: /tmp/pgpass0

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false

patronictl show-config

postgres@47c360557474:~$ patronictl show-config
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    max_connections: 1000
  pg_hba:
  - local all all trust
  - host replication replicator all scram-sha-256
  - host all all all scram-sha-256
  use_pg_rewind: true
retry_timeout: 10
ttl: 30

Patroni log files

I don't know where to find them.

PostgreSQL log files

********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:17:06,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:17:16,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:17:26,188 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:17:36,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:17:46,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:17:56,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:06,188 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:16,188 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:26,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:36,196 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:46,187 INFO: no action. I am (patroni2), a secondary, and following a leader (patroni1)
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:53.405 UTC [3262] LOG:  replication terminated by primary server
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:53.405 UTC [3262] DETAIL:  End of WAL reached on timeline 6 at 0/57B9498.
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:53.405 UTC [3262] FATAL:  could not send end-of-streaming message to primary: server closed the connection unexpectedly
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 		This probably means the server terminated abnormally
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 		before or while processing the request.
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 	no COPY in progress
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:53.405 UTC [50] LOG:  invalid record length at 0/57B9498: expected at least 24, got 0
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:53.416 UTC [1092732] FATAL:  could not connect to the primary server: connection to server at "10.0.4.176", port 5432 failed: server closed the connection unexpectedly
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 		This probably means the server terminated abnormally
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 		before or while processing the request.
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:53.417 UTC [50] LOG:  waiting for WAL to become available at 0/57B94B0
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54,430 INFO: Got response from patroni3 http://10.0.4.183:8008/patroni: {"state": "running", "postmaster_start_time": "2024-01-10 08:35:23.086102+00:00", "role": "replica", "server_version": 160001, "xlog": {"received_location": 91985048, "replayed_location": 91985048, "replayed_timestamp": "2024-01-29 06:44:54.077171+00:00", "paused": false}, "timeline": 6, "cluster_unlocked": true, "dcs_last_seen": 1706512734, "database_system_identifier": "7300929477418795032", "patroni": {"version": "3.2.0", "scope": "postgres1", "name": "patroni3"}}
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54,431 WARNING: Request failed to patroni1: GET http://10.0.4.176:8008/patroni (HTTPConnectionPool(host='10.0.4.176', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54,467 WARNING: Could not activate Linux watchdog device: Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54,475 INFO: promoted self to leader by acquiring session lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | server promoting
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54.477 UTC [50] LOG:  received promote request
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54.477 UTC [50] LOG:  redo done at 0/57B9420 system usage: CPU: user: 1.52 s, system: 2.61 s, elapsed: 1642332.62 s
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54.477 UTC [50] LOG:  last completed transaction was at log time 2024-01-29 06:44:54.077171+00
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54.479 UTC [50] LOG:  selected new timeline ID: 7
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54.530 UTC [50] LOG:  archive recovery complete
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54.531 UTC [48] LOG:  checkpoint starting: force
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54.533 UTC [48] LOG:  checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=36 kB; lsn=0/57B9500, redo lsn=0/57B94C8
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54.536 UTC [46] LOG:  database system is ready to accept connections
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54.742 UTC [1092748] ERROR:  replication slot "patroni3" does not exist
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54.742 UTC [1092748] STATEMENT:  START_REPLICATION SLOT "patroni3" 0/5000000 TIMELINE 6
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54.769 UTC [1092749] ERROR:  replication slot "patroni3" does not exist
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:54.769 UTC [1092749] STATEMENT:  START_REPLICATION SLOT "patroni3" 0/5000000 TIMELINE 7
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:18:55,535 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:19:05,524 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:19:15,508 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:19:25,506 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:19:35,504 INFO: Lock owner: patroni2; I am patroni2
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:19:35,505 INFO: Dropped unknown replication slot 'patroni1'
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:19:35,506 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:19:45,505 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:19:55,505 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:20:03,176 WARNING: Connection dropped: socket connection broken
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:20:03,176 WARNING: Transition to CONNECTING
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:20:03,176 INFO: Zookeeper connection lost
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:20:03,176 INFO: Connecting to zookeeper-2(10.0.4.178):2181, use_ssl: False
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:20:03,181 INFO: Zookeeper connection established, state: CONNECTED
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:20:03,247 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:20:13,192 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:20:23,184 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:20:33,185 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:20:43,185 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:20:53,184 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:21:03,185 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:21:13,184 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:21:23,185 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:21:33,185 INFO: no action. I am (patroni2), the leader with the lock
********-postgres_patroni2.1.jbdzhpgtcaeh@worker-2    | 2024-01-29 07:21:43,185 INFO: no action. I am (patroni2), the leader with the lock
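
Note that Patroni itself handled the loss of zookeeper-1 gracefully: at 07:20:03 it lost the connection and immediately reconnected to zookeeper-2. As far as I understand, Patroni's ZooKeeper client (kazoo) takes the full host list and rotates through it on connection loss. A minimal sketch of that behavior (assuming kazoo is installed; host names as in this cluster):

from kazoo.client import KazooClient

# kazoo rotates through the configured hosts on connection loss, so a
# single dead ZooKeeper node is tolerated (hence the "Connecting to
# zookeeper-2 ... CONNECTED" lines in the log above).
zk = KazooClient(
    hosts="zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181",
    connection_retry={"max_tries": -1, "delay": 1, "max_delay": 10},
)
zk.start(timeout=30)
print(zk.state)  # CONNECTED, even while zookeeper-1 does not resolve
zk.stop()

The confd binary in the image apparently does not do the same, which is what this issue is about.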

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

No response

"it failed to start up because it could not resolve zookeeper-1"

Why do you think this is a Patroni problem?

@CyberDem0n
Or is it a problem with confd? Sorry, I don't know much about how Patroni works with confd; I'm not sure whether Patroni controls how confd connects to ZooKeeper or not.

Patroni doesn't know anything about confd.
confd runs on its own: it can talk to etcd/Consul/ZooKeeper and manages config files (for example, for HAProxy) based on keys stored there.
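
Conceptually, what the HAProxy node's config manager does is just this (a rough sketch in Python, not confd's actual code; the key layout follows Patroni's default /service/<scope> namespace, with scope postgres1 as in your logs):

import time
from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper-2:2181,zookeeper-3:2181")
zk.start()

# Watch the leader key and regenerate the HAProxy config when it changes;
# confd does the same via templates and then reloads HAProxy.
@zk.DataWatch("/service/postgres1/leader")
def on_leader_change(data, stat):
    if data is not None:
        print("leader is now", data.decode())

while True:
    time.sleep(1)  # keep the process alive so the watch keeps firing

So if confd cannot reach ZooKeeper at startup, there is nothing Patroni can do about it.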

@CyberDem0n OK, I got it. Thank you for your reply.

And for confd, I would suggest building a new version from GitHub instead of using binaries that haven't been updated in a few years.