Random exceptions with the same Patroni configuration
nikodemusP opened this issue · comments
What happened?
I have 3 etcd and 3 Patroni containers based on these Dockerfiles/Compose files, but built for AlmaLinux 9.3 and podman, with the RPM packages from postgres.org.
One container starts normally and initializes the database; the other two raise seemingly random exceptions. On every attempt from scratch a different container is the one that works, and the exceptions are not the same every time.
But one message
2024-01-11 13:36:55,368 WARNING: Detected Etcd version 3.0.0 is lower than 3.1.0, watches are not supported
shows up in most cases, so I would guess that it is more or less the source of the issue.
How can we reproduce it (as minimally and precisely as possible)?
Hard to say, since everything appears to happen randomly.
What did you expect to happen?
All three are running.
Patroni/PostgreSQL/DCS version
- Patroni version: 3.2.1
- PostgreSQL version: 16
- etcd version: 3.5.11
Etcd-Nodes:
$ curl -L -s http://10.11.60.151:5001/version | python -m json.tool
{
    "etcdserver": "3.5.11",
    "etcdcluster": "3.5.0"
}
$ curl -L -s http://10.11.60.152:5001/version | python -m json.tool
{
    "etcdserver": "3.5.11",
    "etcdcluster": "3.5.0"
}
$ curl -L -s http://10.11.60.153:5001/version | python -m json.tool
{
    "etcdserver": "3.5.11",
    "etcdcluster": "3.5.0"
}
Patroni configuration file
scope: sg_postgres
name: postgresql
restapi:
  listen: 127.0.0.1:8008
  connect_address: 127.0.0.1:8008
etcd3:
# The bootstrap configuration. Works only when the cluster is not yet initialized.
# If the cluster is already initialized, all changes in the `bootstrap` section are ignored!
bootstrap:
  # This section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  # and all other cluster members will use it as a `global configuration`.
  # WARNING! If you want to change any of the parameters that were set up
  # via `bootstrap.dcs` section, please use `patronictl edit-config`!
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      pg_hba:
      - host replication replicator 127.0.0.1/32 md5
      - host all all 0.0.0.0/0 md5
      parameters:
  initdb: # Note: It needs to be a list (some options need values, others are switches)
  - encoding: UTF8
  - data-checksums
postgresql:
  listen: 127.0.0.1:5432
  connect_address: 127.0.0.1:5432
  data_dir: /var/lib/pgsql/16/data
  pgpass: /tmp/pgpass0
  authentication:
    replication:
      username: replicator
      password: rep-pass
    superuser:
      username: postgres
      password: zalando
    rewind: # Has no effect on postgres 10 and lower
      username: rewind_user
      password: rewind_password
  parameters:
    unix_socket_directories: '..' # parent directory of data_dir
watchdog:
  mode: off # Allowed values: off, automatic, required
tags:
  # failover_priority: 1
  noloadbalance: false
  clonefrom: false
  nosync: false
patronictl show-config
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters: null
  pg_hba:
  - host replication replicator 127.0.0.1/32 md5
  - host all all 0.0.0.0/0 md5
  use_pg_rewind: true
retry_timeout: 10
ttl: 30
Patroni log files
Variant A:
+ exec patroni /opt/sgi/postgresql/cfg/patroni_config.yaml
2024-01-11 13:36:55,368 WARNING: Detected Etcd version 3.0.0 is lower than 3.1.0, watches are not supported
2024-01-11 13:36:55,369 ERROR: Failed to get list of machines from http://10.11.60.153:5001/v3alpha: <Unknown error: '404 page not found', code: 2>
2024-01-11 13:36:55,371 ERROR: Failed to get list of machines from http://10.11.60.151:5001/v3alpha: <Unknown error: '404 page not found', code: 2>
2024-01-11 13:36:55,380 ERROR: Failed to get list of machines from http://10.11.60.152:5001/v3alpha: <Unknown error: '404 page not found', code: 2>
Variant B:
+ exec patroni /opt/sgi/postgresql/cfg/patroni_config.yaml
2024-01-11 13:36:58,372 INFO: Selected new etcd server http://10.11.60.153:5001
2024-01-11 13:36:58,378 INFO: No PostgreSQL configuration items changed, nothing to reload.
2024-01-11 13:36:58,383 INFO: Lock owner: sgalcgms14_2_patroni_2; I am sgalcgms14_2_patroni_3
2024-01-11 13:36:58,389 INFO: trying to bootstrap from leader 'sgalcgms14_2_patroni_2'
pg_basebackup: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
2024-01-11 13:36:58,396 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2024-01-11 13:36:58,397 WARNING: Trying again in 5 seconds
pg_basebackup: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
2024-01-11 13:37:03,409 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2024-01-11 13:37:03,409 ERROR: failed to bootstrap from leader 'sgalcgms14_2_patroni_2'
2024-01-11 13:37:03,409 INFO: Removing data directory: /var/lib/pgsql/16/data
2024-01-11 13:37:03,409 ERROR: Could not remove data directory /var/lib/pgsql/16/data
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/patroni/postgresql/__init__.py", line 1350, in remove_data_directory
shutil.rmtree(self._data_dir)
File "/usr/lib64/python3.9/shutil.py", line 740, in rmtree
onerror(os.rmdir, path, sys.exc_info())
File "/usr/lib64/python3.9/shutil.py", line 738, in rmtree
os.rmdir(path)
OSError: [Errno 16] Device or resource busy: '/var/lib/pgsql/16/data'
2024-01-11 13:37:03,410 INFO: renaming data directory to /var/lib/pgsql/16/data.failed
2024-01-11 13:37:03,410 ERROR: Could not rename data directory /var/lib/pgsql/16/data
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/patroni/postgresql/__init__.py", line 1350, in remove_data_directory
shutil.rmtree(self._data_dir)
File "/usr/lib64/python3.9/shutil.py", line 740, in rmtree
onerror(os.rmdir, path, sys.exc_info())
File "/usr/lib64/python3.9/shutil.py", line 738, in rmtree
os.rmdir(path)
OSError: [Errno 16] Device or resource busy: '/var/lib/pgsql/16/data'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/patroni/postgresql/__init__.py", line 1323, in move_data_directory
os.rename(self._data_dir, new_name)
OSError: [Errno 16] Device or resource busy: '/var/lib/pgsql/16/data' -> '/var/lib/pgsql/16/data.failed'
2024-01-11 13:37:07,876 INFO: Lock owner: sgalcgms14_2_patroni_2; I am sgalcgms14_2_patroni_3
2024-01-11 13:37:07,877 INFO: trying to bootstrap from leader 'sgalcgms14_2_patroni_2'
pg_basebackup: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
2024-01-11 13:37:07,885 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2024-01-11 13:37:07,885 WARNING: Trying again in 5 seconds
pg_basebackup: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
2024-01-11 13:37:12,897 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2024-01-11 13:37:12,897 ERROR: failed to bootstrap from leader 'sgalcgms14_2_patroni_2'
2024-01-11 13:37:12,897 INFO: Removing data directory: /var/lib/pgsql/16/data
2024-01-11 13:37:12,897 ERROR: Could not remove data directory /var/lib/pgsql/16/data
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/patroni/postgresql/__init__.py", line 1350, in remove_data_directory
shutil.rmtree(self._data_dir)
File "/usr/lib64/python3.9/shutil.py", line 740, in rmtree
onerror(os.rmdir, path, sys.exc_info())
File "/usr/lib64/python3.9/shutil.py", line 738, in rmtree
os.rmdir(path)
OSError: [Errno 16] Device or resource busy: '/var/lib/pgsql/16/data'
PostgreSQL log files
No PostgreSQL log files; the database is not initialized on the failed instances.
Have you tried to use GitHub issue search?
- Yes
Anything else we need to know?
$ podman container ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b4193030078b localhost/postgresql-etcd:1.0.1-RH --name=etcd-1 --i... 14 minutes ago Up 14 minutes 10.11.60.151:5000->2380/tcp, 10.11.60.151:5001->2379/tcp etcd-1
e0d2a09ce5a7 localhost/postgresql-etcd:1.0.1-RH --name=etcd-2 --i... 14 minutes ago Up 14 minutes 10.11.60.152:5000->2380/tcp, 10.11.60.152:5001->2379/tcp etcd-2
11e9f48e5ef3 localhost/postgresql-etcd:1.0.1-RH --name=etcd-3 --i... 14 minutes ago Up 14 minutes 10.11.60.153:5000->2380/tcp, 10.11.60.153:5001->2379/tcp etcd-3
25662e244e7a localhost/postgresql-server:16-1.0.1-RH patroni 13 minutes ago Up 14 minutes 10.11.60.151:5003->5432/tcp patroni_1
c3e29afd1bdf localhost/postgresql-server:16-1.0.1-RH patroni 13 minutes ago Up 13 minutes 10.11.60.152:5003->5432/tcp patroni_2
c576c9d6397d localhost/postgresql-server:16-1.0.1-RH patroni 13 minutes ago Up 13 minutes 10.11.60.153:5003->5432/tcp patroni_3
restapi:
listen: 127.0.0.1:8008
connect_address: 127.0.0.1:8008
postgresql:
listen: 127.0.0.1:5432
connect_address: 127.0.0.1:5432
The connect_address MUST be accessible from other nodes (https://patroni.readthedocs.io/en/latest/yaml_configuration.html#rest-api), and 127.0.0.1 definitely is not.
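As a quick sanity check, a loopback connect_address can be spotted before Patroni is even started. The following is only a minimal sketch using the Python standard library; the helper name is_loopback_address is hypothetical and not part of Patroni, and 10.11.60.151:5003 is taken from the podman listing purely as a contrasting non-loopback value.

```python
import ipaddress


def is_loopback_address(connect_address: str) -> bool:
    """Return True if the host part of a 'host:port' string is a loopback address."""
    # Split off the port and drop the brackets of an IPv6 literal such as [::1].
    host = connect_address.rsplit(":", 1)[0].strip("[]")
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name); this sketch cannot judge it.
        return False


if __name__ == "__main__":
    for addr in ("127.0.0.1:8008", "127.0.0.1:5432", "10.11.60.151:5003"):
        verdict = "loopback, other nodes cannot reach it" if is_loopback_address(addr) else "not loopback"
        print(f"{addr}: {verdict}")
```

Every address that is reported as loopback here would point each replica back at itself, which is consistent with pg_basebackup getting "Connection refused" on 127.0.0.1:5432 in the log above.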
Device or resource busy: '/var/lib/pgsql/16/data'
This error could indicate that /var/lib/pgsql/16/data is a mount-point.
And the Postgres docs say that this is bad practice: it is not advisable to try to use the secondary volume's topmost directory (mount point) as the data directory.
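Whether the data directory really is a mount point can be checked from inside the container. On Linux, rmdir and rename fail with EBUSY (Errno 16) on a directory that is currently used as a mount point, which matches the tracebacks above. The following is a rough sketch that reads /proc/self/mountinfo; the function names are hypothetical, and symlinks in the path are not resolved.

```python
import os
import re


def _unescape(field: str) -> str:
    # mountinfo writes special characters such as spaces as octal escapes (\040).
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def mount_points(mountinfo_text: str) -> set:
    """Collect the mount point column (fifth field) of every mountinfo line."""
    points = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            points.add(_unescape(fields[4]))
    return points


def is_mount_point(path: str, mountinfo_text: str) -> bool:
    """Return True if the normalized path is listed as a mount point."""
    return os.path.normpath(path) in mount_points(mountinfo_text)


if __name__ == "__main__":
    data_dir = "/var/lib/pgsql/16/data"  # data_dir from the config above
    try:
        with open("/proc/self/mountinfo") as handle:
            text = handle.read()
    except OSError:
        print("/proc/self/mountinfo is not available on this system")
    else:
        state = "is" if is_mount_point(data_dir, text) else "is not"
        print(f"{data_dir} {state} a mount point")
```

If the check confirms it, one common way around the problem is to mount the volume one level higher and let data_dir be a subdirectory of it, so that Patroni can remove or rename the directory.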
etcd version: 3.5.11
I am more inclined to trust the Patroni logs, which say that the Etcd cluster version is 3.0. Also, it is not clear where Patroni is connecting to, because the etcd3 section in your config is empty.
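The mismatch between the two sources can be made explicit by comparing the reported versions against the 3.1.0 threshold named in the warning. This is only an illustrative sketch, not Patroni's actual code: the function names are hypothetical, and it assumes that the "etcdcluster" field of the /version response is the value being compared.

```python
import json


def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as '3.5.0' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def cluster_version_too_old(version_json: str, minimum: tuple = (3, 1, 0)) -> bool:
    """Return True if the 'etcdcluster' field is lower than the given minimum."""
    info = json.loads(version_json)
    return parse_version(info["etcdcluster"]) < minimum


if __name__ == "__main__":
    # Response body as shown by the curl calls in the report.
    from_curl = '{"etcdserver": "3.5.11", "etcdcluster": "3.5.0"}'
    # Hypothetical body with the cluster version from the Patroni warning.
    as_logged = '{"etcdserver": "3.5.11", "etcdcluster": "3.0.0"}'
    print("curl output too old:", cluster_version_too_old(from_curl))
    print("logged version too old:", cluster_version_too_old(as_logged))
```

Only a cluster version of 3.0.0 would trigger the warning, so whatever Patroni queried at startup reported something different from what the curl calls returned later.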
@nikodemusP issues are for bugs.
For questions please use Slack: https://patroni.readthedocs.io/en/latest/contributing_guidelines.html#chatting