zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

Random exceptions with the same Patroni configuration

nikodemusP opened this issue · comments

What happened?

I have 3 etcd and 3 Patroni containers based on these Dockerfiles/Compose files, but adapted for AlmaLinux 9.3 with podman and using the RPM packages from postgres.org.

One container starts normally and initializes the database; the other two raise seemingly random exceptions. On every attempt from scratch, a different container is the one that works, and the exceptions are not the same every time.

However, one warning
2024-01-11 13:36:55,368 WARNING: Detected Etcd version 3.0.0 is lower than 3.1.0, watches are not supported

shows up in most cases, so I would guess that it is more or less the source of the issue.

How can we reproduce it (as minimally and precisely as possible)?

Hard to say, since everything looks random.

What did you expect to happen?

All three are running.

Patroni/PostgreSQL/DCS version

  • Patroni version: 3.2.1
  • PostgreSQL version: 16
  • etcd version: 3.5.11

Etcd-Nodes:

$ curl -L -s http://10.11.60.151:5001/version | python -m json.tool
{
    "etcdserver": "3.5.11",
    "etcdcluster": "3.5.0"
}

$ curl -L -s http://10.11.60.152:5001/version | python -m json.tool
{
    "etcdserver": "3.5.11",
    "etcdcluster": "3.5.0"
}

$ curl -L -s http://10.11.60.153:5001/version | python -m json.tool
{
    "etcdserver": "3.5.11",
    "etcdcluster": "3.5.0"
}

Patroni configuration file

scope: sg_postgres
name: postgresql

restapi:
  listen: 127.0.0.1:8008
  connect_address: 127.0.0.1:8008

etcd3:


# The bootstrap configuration. Works only when the cluster is not yet initialized.
# If the cluster is already initialized, all changes in the `bootstrap` section are ignored!
bootstrap:
  # This section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  # and all other cluster members will use it as a `global configuration`.
  # WARNING! If you want to change any of the parameters that were set up
  # via `bootstrap.dcs` section, please use `patronictl edit-config`!
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      pg_hba:
      - host replication replicator 127.0.0.1/32 md5
      - host all all 0.0.0.0/0 md5
      parameters:

  initdb:  # Note: It needs to be a list (some options need values, others are switches)
  - encoding: UTF8
  - data-checksums

postgresql:
  listen: 127.0.0.1:5432
  connect_address: 127.0.0.1:5432

  data_dir: /var/lib/pgsql/16/data
  pgpass: /tmp/pgpass0
  authentication:
    replication:
      username: replicator
      password: rep-pass
    superuser:
      username: postgres
      password: zalando
    rewind:  # Has no effect on postgres 10 and lower
      username: rewind_user
      password: rewind_password
  parameters:
    unix_socket_directories: '..'  # parent directory of data_dir

watchdog:
  mode: off # Allowed values: off, automatic, required

tags:
    # failover_priority: 1
    noloadbalance: false
    clonefrom: false
    nosync: false

patronictl show-config

loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters: null
  pg_hba:
  - host replication replicator 127.0.0.1/32 md5
  - host all all 0.0.0.0/0 md5
  use_pg_rewind: true
retry_timeout: 10
ttl: 30

Patroni log files

Variant A:

+ exec patroni /opt/sgi/postgresql/cfg/patroni_config.yaml
2024-01-11 13:36:55,368 WARNING: Detected Etcd version 3.0.0 is lower than 3.1.0, watches are not supported
2024-01-11 13:36:55,369 ERROR: Failed to get list of machines from http://10.11.60.153:5001/v3alpha: <Unknown error: '404 page not found', code: 2>
2024-01-11 13:36:55,371 ERROR: Failed to get list of machines from http://10.11.60.151:5001/v3alpha: <Unknown error: '404 page not found', code: 2>
2024-01-11 13:36:55,380 ERROR: Failed to get list of machines from http://10.11.60.152:5001/v3alpha: <Unknown error: '404 page not found', code: 2>

Variant B:
+ exec patroni /opt/sgi/postgresql/cfg/patroni_config.yaml
2024-01-11 13:36:58,372 INFO: Selected new etcd server http://10.11.60.153:5001
2024-01-11 13:36:58,378 INFO: No PostgreSQL configuration items changed, nothing to reload.
2024-01-11 13:36:58,383 INFO: Lock owner: sgalcgms14_2_patroni_2; I am sgalcgms14_2_patroni_3
2024-01-11 13:36:58,389 INFO: trying to bootstrap from leader 'sgalcgms14_2_patroni_2'
pg_basebackup: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?
2024-01-11 13:36:58,396 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2024-01-11 13:36:58,397 WARNING: Trying again in 5 seconds
pg_basebackup: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?
2024-01-11 13:37:03,409 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2024-01-11 13:37:03,409 ERROR: failed to bootstrap from leader 'sgalcgms14_2_patroni_2'
2024-01-11 13:37:03,409 INFO: Removing data directory: /var/lib/pgsql/16/data
2024-01-11 13:37:03,409 ERROR: Could not remove data directory /var/lib/pgsql/16/data
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/patroni/postgresql/__init__.py", line 1350, in remove_data_directory
    shutil.rmtree(self._data_dir)
  File "/usr/lib64/python3.9/shutil.py", line 740, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib64/python3.9/shutil.py", line 738, in rmtree
    os.rmdir(path)
OSError: [Errno 16] Device or resource busy: '/var/lib/pgsql/16/data'
2024-01-11 13:37:03,410 INFO: renaming data directory to /var/lib/pgsql/16/data.failed
2024-01-11 13:37:03,410 ERROR: Could not rename data directory /var/lib/pgsql/16/data
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/patroni/postgresql/__init__.py", line 1350, in remove_data_directory
    shutil.rmtree(self._data_dir)
  File "/usr/lib64/python3.9/shutil.py", line 740, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib64/python3.9/shutil.py", line 738, in rmtree
    os.rmdir(path)
OSError: [Errno 16] Device or resource busy: '/var/lib/pgsql/16/data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/patroni/postgresql/__init__.py", line 1323, in move_data_directory
    os.rename(self._data_dir, new_name)
OSError: [Errno 16] Device or resource busy: '/var/lib/pgsql/16/data' -> '/var/lib/pgsql/16/data.failed'
2024-01-11 13:37:07,876 INFO: Lock owner: sgalcgms14_2_patroni_2; I am sgalcgms14_2_patroni_3
2024-01-11 13:37:07,877 INFO: trying to bootstrap from leader 'sgalcgms14_2_patroni_2'
pg_basebackup: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?
2024-01-11 13:37:07,885 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2024-01-11 13:37:07,885 WARNING: Trying again in 5 seconds
pg_basebackup: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?
2024-01-11 13:37:12,897 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2024-01-11 13:37:12,897 ERROR: failed to bootstrap from leader 'sgalcgms14_2_patroni_2'
2024-01-11 13:37:12,897 INFO: Removing data directory: /var/lib/pgsql/16/data
2024-01-11 13:37:12,897 ERROR: Could not remove data directory /var/lib/pgsql/16/data
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/patroni/postgresql/__init__.py", line 1350, in remove_data_directory
    shutil.rmtree(self._data_dir)
  File "/usr/lib64/python3.9/shutil.py", line 740, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib64/python3.9/shutil.py", line 738, in rmtree
    os.rmdir(path)
OSError: [Errno 16] Device or resource busy: '/var/lib/pgsql/16/data'

PostgreSQL log files

There are no Postgres log files; the database is not initialized on the failed instances.

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

$ podman container ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b4193030078b localhost/postgresql-etcd:1.0.1-RH --name=etcd-1 --i... 14 minutes ago Up 14 minutes 10.11.60.151:5000->2380/tcp, 10.11.60.151:5001->2379/tcp etcd-1
e0d2a09ce5a7 localhost/postgresql-etcd:1.0.1-RH --name=etcd-2 --i... 14 minutes ago Up 14 minutes 10.11.60.152:5000->2380/tcp, 10.11.60.152:5001->2379/tcp etcd-2
11e9f48e5ef3 localhost/postgresql-etcd:1.0.1-RH --name=etcd-3 --i... 14 minutes ago Up 14 minutes 10.11.60.153:5000->2380/tcp, 10.11.60.153:5001->2379/tcp etcd-3
25662e244e7a localhost/postgresql-server:16-1.0.1-RH patroni 13 minutes ago Up 14 minutes 10.11.60.151:5003->5432/tcp patroni_1
c3e29afd1bdf localhost/postgresql-server:16-1.0.1-RH patroni 13 minutes ago Up 13 minutes 10.11.60.152:5003->5432/tcp patroni_2
c576c9d6397d localhost/postgresql-server:16-1.0.1-RH patroni 13 minutes ago Up 13 minutes 10.11.60.153:5003->5432/tcp patroni_3

restapi:
  listen: 127.0.0.1:8008
  connect_address: 127.0.0.1:8008
postgresql:
  listen: 127.0.0.1:5432
  connect_address: 127.0.0.1:5432

The connect_address MUST be reachable from the other nodes (see https://patroni.readthedocs.io/en/latest/yaml_configuration.html#rest-api), and 127.0.0.1 definitely is not.
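
To illustrate the point about connect_address, here is a minimal sketch of a pre-flight check that flags addresses which can only ever point back at the local node. The helper name is_externally_usable is hypothetical and not part of Patroni. The second sample address is only an assumption taken from the published port in the podman listing above; the right value depends on how the containers are networked.

```python
import ipaddress


def is_externally_usable(connect_address: str) -> bool:
    """Return False when a host:port address can only point back at the local node."""
    # Expects the host:port form used in the config; a bare host yields False.
    host, _, _port = connect_address.rpartition(":")
    host = host.strip("[]")  # tolerate bracketed IPv6 literals such as [::1]:5432
    if not host or host == "localhost":
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: treat it as a hostname; resolving it is left to the operator.
        return True
    return not (ip.is_loopback or ip.is_unspecified)


# Sample values: the first is from the posted config, the second is an assumed alternative.
for address in ("127.0.0.1:5432", "10.11.60.151:5003"):
    print(address, is_externally_usable(address))
```

The check only rules out loopback and unspecified addresses; it does not prove that another node can actually open a connection.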

Device or resource busy: '/var/lib/pgsql/16/data'

This error could indicate that /var/lib/pgsql/16/data is a mount point.
The Postgres docs say that this is bad practice: it is not advisable to try to use the secondary volume's topmost directory (mount point) as the data directory.
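
Whether the data directory really is a mount point can be checked from inside the container. The following is a small sketch using only the Python standard library; the function name suggest_data_dir and the subdirectory name "data" are assumptions made for this example, not anything Patroni prescribes.

```python
import os


def suggest_data_dir(path: str) -> str:
    """Return path unchanged unless it is a mount point; then propose a subdirectory."""
    if os.path.ismount(path):
        # "data" is a hypothetical subdirectory name chosen for this example.
        return os.path.join(path, "data")
    return path


# Path taken from the posted config; run this inside the Patroni container.
print(suggest_data_dir("/var/lib/pgsql/16/data"))
```

If the printed path differs from the input, the volume is mounted directly on data_dir, and mounting it one level higher (or pointing data_dir at a subdirectory) would let Patroni remove or rename the directory.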

etcd version: 3.5.11

I am more inclined to trust the Patroni logs, which say that the Etcd cluster version is 3.0. Also, it is not clear what Patroni is connecting to, because the etcd3 section in your config is empty.
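
To compare what an endpoint reports with the 3.1.0 threshold from the warning, the /version response can be parsed as below. This is a rough sketch and not Patroni's own logic: the function names are hypothetical, and which of the two fields Patroni actually inspects is an assumption, so the sketch checks both.

```python
import json

MINIMUM = (3, 1, 0)  # threshold quoted in the Patroni warning


def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '3.5.11' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def fields_below_minimum(version_json: str) -> list:
    """List the /version fields whose value is lower than MINIMUM."""
    info = json.loads(version_json)
    return [
        field
        for field in ("etcdserver", "etcdcluster")
        if parse_version(info[field]) < MINIMUM
    ]


# Response body as shown by curl in the report above.
sample = '{"etcdserver": "3.5.11", "etcdcluster": "3.5.0"}'
print(fields_below_minimum(sample))
```

An empty list for every endpoint that Patroni is supposed to use, combined with the 3.0.0 warning in the log, would suggest that Patroni is talking to something other than those endpoints, which fits the empty etcd3 section.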