REST API returns unknown after postgres restart
XiuhuaRuan opened this issue
What happened?
First, set up a Patroni cluster with three PostgreSQL nodes. After restarting the PostgreSQL nodes, the REST API returns the "unknown" state:
[root@sophia-pghost5 patroni]# curl http://localhost:8008
{"state": "unknown", "role": "master", "dcs_last_seen": 1700478557, "database_system_identifier": "7303378219179270632", "patroni": {"version": "3.2.0", "scope": "postgres-cluster", "name": "postgres_01"}}
How can we reproduce it (as minimally and precisely as possible)?
Running patronictl restart on the cluster reproduces it:
[root@sophia-pghost5 patroni]# patronictl -c patroni.yml restart postgres-cluster
+ Cluster: postgres-cluster (7303378219179270632) ---+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader  | running   |  1 |           |
| postgres_02 | 192.168.61.106 | Replica | streaming |  1 |         0 |
| postgres_03 | 192.168.61.107 | Replica | streaming |  1 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
When should the restart take place (e.g. 2023-11-20T20:13) [now]:
Are you sure you want to restart members postgres_01, postgres_02, postgres_03? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Success: restart on member postgres_01
Success: restart on member postgres_02
Success: restart on member postgres_03
What did you expect to happen?
The REST API should return the correct "running" state, even after the PostgreSQL nodes are restarted.
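The expectation above can be written as a small check on the response body. This is only a sketch: the helper name `is_running` is made up for illustration, and the JSON string is the response copied from this report rather than a live request.

```python
import json


def is_running(status_body: str) -> bool:
    """Return True only if the status JSON reports state == "running"."""
    status = json.loads(status_body)
    return status.get("state") == "running"


# Response body copied from the curl output in this report (after the restart).
after_restart = (
    '{"state": "unknown", "role": "master", "dcs_last_seen": 1700478557, '
    '"database_system_identifier": "7303378219179270632", '
    '"patroni": {"version": "3.2.0", "scope": "postgres-cluster", '
    '"name": "postgres_01"}}'
)

print(is_running(after_restart))  # False: the state is "unknown"
```

In practice the body would come from a request such as the curl call shown earlier; the check itself does not depend on how the body is fetched.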
Patroni/PostgreSQL/DCS version
- Patroni version: 3.2.0
- PostgreSQL version: 14.0
- DCS (and its version): etcd 3.5.9
Patroni configuration file
scope: postgres-cluster
namespace: /service/
name: postgres_01
restapi:
  listen: 192.168.61.105:8008
  connect_address: 192.168.61.105:8008
etcd:
  hosts: 192.168.61.105:2379,192.168.61.106:2379,192.168.61.107:2379
log:
  level: INFO
  traceback_level: INFO
  dir: /home/postgres/patroni
  file_num: 10
  file_size: 104857600
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 300
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      parameters:
        listen_addresses: "*"
        port: 5432
        wal_level: replica
        hot_standby: "on"
        wal_keep_size: 100
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
        archive_mode: "off"
        archive_timeout: 1800s
        logging_collector: on
        log_destination: 'stderr'
        log_truncate_on_rotation: on
        log_checkpoints: on
        log_connections: on
        log_disconnections: on
        log_error_verbosity: default
        log_lock_waits: on
        log_temp_files: 0
        log_autovacuum_min_duration: 0
        log_min_duration_statement: 50
        log_timezone: 'PRC'
        log_filename: postgresql-%Y-%m-%d_%H.log
        log_line_prefix: '%t [%p]: db=%d,user=%u,app=%a,client=%h '
postgresql:
  database: postgres
  listen: 0.0.0.0:5432
  connect_address: 192.168.61.105:5432
  bin_dir: /usr/local/pgsql/bin
  data_dir: /usr/local/pgsql/data
  pgpass: /home/postgres/tmp/.pgpass
  authentication:
    replication:
      username: postgres
      password: postgres
    superuser:
      username: postgres
      password: postgres
    rewind:
      username: postgres
      password: postgres
  pg_hba:
    - local all all trust
    - host all all 0.0.0.0/0 trust
    - host all all ::1/128 trust
    - local replication all trust
    - host replication all 0.0.0.0/0 trust
    - host replication all ::1/128 trust
tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
patronictl show-config
loop_wait: 10
master_start_timeout: 300
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    archive_mode: 'off'
    archive_timeout: 1800s
    hot_standby: 'on'
    listen_addresses: '*'
    log_autovacuum_min_duration: 0
    log_checkpoints: true
    log_connections: true
    log_destination: stderr
    log_disconnections: true
    log_error_verbosity: default
    log_filename: postgresql-%Y-%m-%d_%H.log
    log_line_prefix: '%t [%p]: db=%d,user=%u,app=%a,client=%h '
    log_lock_waits: true
    log_min_duration_statement: 50
    log_temp_files: 0
    log_timezone: PRC
    log_truncate_on_rotation: true
    logging_collector: true
    max_replication_slots: 10
    max_wal_senders: 10
    port: 5432
    wal_keep_size: 100
    wal_level: replica
    wal_log_hints: 'on'
  use_pg_rewind: true
retry_timeout: 10
synchronous_mode: false
ttl: 30
Patroni log files
2023-11-20 19:13:47,546 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:13:57,543 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:14:07,535 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:14:09,367 INFO: Lock owner: postgres_01; I am postgres_01
2023-11-20 19:14:09,406 INFO: closed patroni connections to postgres
2023-11-20 19:14:09,712 INFO: postmaster pid=23713
2023-11-20 19:14:10,728 INFO: Lock owner: postgres_01; I am postgres_01
2023-11-20 19:14:10,728 INFO: establishing a new patroni heartbeat connection to postgres
2023-11-20 19:14:10,742 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:14:20,734 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:14:30,738 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:14:40,735 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:14:50,734 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:15:00,737 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:15:03,096 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/postgresql/connection.py", line 73, in query
    cursor.execute(sql.encode('utf-8'), params or None)
psycopg2.errors.AdminShutdown: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/api.py", line 1278, in get_postgresql_status
    row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/api.py", line 1211, in query
    return self.server.query(sql, *params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/api.py", line 1412, in query
    return connection.query(sql, *params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/postgresql/connection.py", line 84, in query
    raise PostgresConnectionException('connection problems') from exc
patroni.exceptions.PostgresConnectionException: connection problems
2023-11-20 19:15:10,736 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:15:20,735 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:15:30,733 INFO: no action. I am (postgres_01), the leader with the lock
PostgreSQL log files
2023-11-20 19:14:09 CST [23571]: db=postgres,user=postgres,app=Patroni heartbeat,client=127.0.0.1 FATAL: terminating connection due to administrator command
2023-11-20 19:14:09 CST [23571]: db=postgres,user=postgres,app=Patroni heartbeat,client=127.0.0.1 LOG: disconnection: session time: 0:05:01.847 user=postgres database=postgres host=127.0.0.1 port=35922
2023-11-20 19:14:09 CST [23558]: db=,user=,app=,client= LOG: background worker "logical replication launcher" (PID 23566) exited with exit code 1
2023-11-20 19:14:09 CST [23561]: db=,user=,app=,client= LOG: shutting down
2023-11-20 19:14:09 CST [23561]: db=,user=,app=,client= LOG: checkpoint starting: shutdown immediate
2023-11-20 19:14:09 CST [23561]: db=,user=,app=,client= LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.007 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
2023-11-20 19:14:09 CST [23572]: db=[unknown],user=postgres,app=postgres_02,client=192.168.31.106 LOG: disconnection: session time: 0:05:01.523 user=postgres database= host=192.168.31.106 port=57350
2023-11-20 19:14:09 CST [23576]: db=[unknown],user=postgres,app=postgres_03,client=192.168.31.107 LOG: disconnection: session time: 0:04:57.176 user=postgres database= host=192.168.31.107 port=38988
2023-11-20 19:14:09 CST [23558]: db=,user=,app=,client= LOG: database system is shut down
2023-11-20 19:14:09 CST [23713]: db=,user=,app=,client= LOG: starting PostgreSQL 14.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2023-11-20 19:14:09 CST [23713]: db=,user=,app=,client= LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-11-20 19:14:09 CST [23713]: db=,user=,app=,client= LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-11-20 19:14:09 CST [23716]: db=,user=,app=,client= LOG: database system was shut down at 2023-11-20 19:14:09 CST
2023-11-20 19:14:09 CST [23713]: db=,user=,app=,client= LOG: database system is ready to accept connections
2023-11-20 19:14:10 CST [23724]: db=[unknown],user=[unknown],app=[unknown],client=127.0.0.1 LOG: connection received: host=127.0.0.1 port=36260
2023-11-20 19:14:10 CST [23724]: db=postgres,user=postgres,app=[unknown],client=127.0.0.1 LOG: connection authorized: user=postgres database=postgres application_name=pg_isready
2023-11-20 19:14:10 CST [23724]: db=postgres,user=postgres,app=pg_isready,client=127.0.0.1 LOG: disconnection: session time: 0:00:00.003 user=postgres database=postgres host=127.0.0.1 port=36260
2023-11-20 19:14:10 CST [23726]: db=[unknown],user=[unknown],app=[unknown],client=127.0.0.1 LOG: connection received: host=127.0.0.1 port=36264
2023-11-20 19:14:10 CST [23726]: db=postgres,user=postgres,app=[unknown],client=127.0.0.1 LOG: connection authorized: user=postgres database=postgres application_name=pg_isready
2023-11-20 19:14:10 CST [23726]: db=postgres,user=postgres,app=pg_isready,client=127.0.0.1 LOG: disconnection: session time: 0:00:00.001 user=postgres database=postgres host=127.0.0.1 port=36264
2023-11-20 19:14:10 CST [23727]: db=[unknown],user=[unknown],app=[unknown],client=127.0.0.1 LOG: connection received: host=127.0.0.1 port=36270
2023-11-20 19:14:10 CST [23727]: db=postgres,user=postgres,app=[unknown],client=127.0.0.1 LOG: connection authorized: user=postgres database=postgres application_name=Patroni heartbeat
2023-11-20 19:14:11 CST [23728]: db=[unknown],user=[unknown],app=[unknown],client=192.168.31.106 LOG: connection received: host=192.168.31.106 port=57668
2023-11-20 19:14:11 CST [23728]: db=[unknown],user=postgres,app=[unknown],client=192.168.31.106 LOG: replication connection authorized: user=postgres application_name=postgres_02
2023-11-20 19:14:11 CST [23729]: db=[unknown],user=[unknown],app=[unknown],client=192.168.31.107 LOG: connection received: host=192.168.31.107 port=39296
2023-11-20 19:14:11 CST [23729]: db=[unknown],user=postgres,app=[unknown],client=192.168.31.107 LOG: replication connection authorized: user=postgres application_name=postgres_03
Have you tried to use GitHub issue search?
- Yes
Anything else we need to know?
It seems a function is misused when closing all named connections. I'm working on a PR to fix it.
Sorry, but I can't reproduce it.
Besides that, the logs you provided are heavily inconsistent:
- You query localhost:8008, but according to patroni.yml Patroni doesn't listen on localhost.
- "dcs_last_seen": 1700478557 reported by curl is Mon Nov 20 12:09:17 2023. We can skip all timezone differences and look only at minutes and seconds, which are 09:17. Neither the Patroni nor the Postgres logs you provided include this time.
- The Patroni and Postgres logs intersect only partially and, more importantly, the Postgres logs don't include the time when the PostgresConnectionException was raised.
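The timestamp comparison can be reproduced with a few lines of Python. The UTC+8 offset is an assumption based on `log_timezone: 'PRC'` in the configuration; the epoch value is the one from the first curl output.

```python
from datetime import datetime, timedelta, timezone

# Value of "dcs_last_seen" from the first curl output in this issue.
dcs_last_seen = 1700478557

seen_utc = datetime.fromtimestamp(dcs_last_seen, tz=timezone.utc)
# Assumption: the logs use UTC+8, as suggested by log_timezone: 'PRC'.
seen_log_tz = seen_utc.astimezone(timezone(timedelta(hours=8)))

print(seen_utc.isoformat())           # 2023-11-20T11:09:17+00:00
print(seen_log_tz.isoformat())        # 2023-11-20T19:09:17+08:00
print(seen_utc.strftime("%M:%S"))     # 09:17
```

Under that assumption the value corresponds to 19:09:17 local time, which is earlier than the first Patroni log line shown in the original report (19:13:47).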
Thanks for your reply, and sorry for the confusion. I reproduced this issue several times and may not have copied the latest logs. Also, in my setup, localhost resolves to the local IP configured in /etc/hosts.
I reproduced it again and copied the latest logs below:
1. First, curl returned the correct state.
[postgres@sophia-pghost5 patroni]$ curl -s http://192.168.61.105:8008 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-11-21 09:30:20.980955+08:00",
  "role": "master",
  "server_version": 140000,
  "xlog": {
    "location": 117441304
  },
  "timeline": 2,
  "replication": [
    {
      "usename": "postgres",
      "application_name": "postgres_02",
      "client_addr": "192.168.61.106",
      "state": "streaming",
      "sync_state": "async",
      "sync_priority": 0
    },
    {
      "usename": "postgres",
      "application_name": "postgres_03",
      "client_addr": "192.168.61.107",
      "state": "streaming",
      "sync_state": "async",
      "sync_priority": 0
    }
  ],
  "dcs_last_seen": 1700530302,
  "database_system_identifier": "7303378219179270632",
  "patroni": {
    "version": "3.2.0",
    "scope": "postgres-cluster",
    "name": "postgres_01"
  }
}
2. Then restart the cluster with patronictl.
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml restart postgres-cluster
+ Cluster: postgres-cluster (7303378219179270632) ---+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader  | running   |  2 |           |
| postgres_02 | 192.168.61.106 | Replica | streaming |  2 |         0 |
| postgres_03 | 192.168.61.107 | Replica | streaming |  2 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
When should the restart take place (e.g. 2023-11-21T10:32) [now]:
Are you sure you want to restart members postgres_01, postgres_02, postgres_03? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2) []:
Success: restart on member postgres_01
Success: restart on member postgres_02
Success: restart on member postgres_03
3. Then curl returns the "unknown" state.
[postgres@sophia-pghost5 patroni]$ curl -s http://192.168.61.105:8008 | jq .
{
  "state": "unknown",
  "role": "master",
  "dcs_last_seen": 1700530355,
  "database_system_identifier": "7303378219179270632",
  "patroni": {
    "version": "3.2.0",
    "scope": "postgres-cluster",
    "name": "postgres_01"
  }
}
4. patroni.log:
2023-11-21 09:29:36,809 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:29:46,816 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:29:56,810 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:30:06,809 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:30:16,809 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:30:20,627 INFO: Lock owner: postgres_01; I am postgres_01
2023-11-21 09:30:20,666 INFO: closed patroni connections to postgres
2023-11-21 09:30:20,974 INFO: postmaster pid=2336
2023-11-21 09:30:21,995 INFO: Lock owner: postgres_01; I am postgres_01
2023-11-21 09:30:21,995 INFO: establishing a new patroni heartbeat connection to postgres
2023-11-21 09:30:22,009 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:30:32,002 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:30:42,002 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:30:52,001 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:31:02,011 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:31:12,010 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:31:22,006 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:31:32,002 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:31:42,008 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:31:45,590 INFO: establishing a new patroni restapi connection to postgres
2023-11-21 09:31:52,002 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:02,001 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:12,001 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:22,001 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:24,621 INFO: Lock owner: postgres_01; I am postgres_01
2023-11-21 09:32:24,659 INFO: closed patroni connections to postgres
2023-11-21 09:32:24,964 INFO: postmaster pid=2431
2023-11-21 09:32:25,979 INFO: Lock owner: postgres_01; I am postgres_01
2023-11-21 09:32:25,979 INFO: establishing a new patroni heartbeat connection to postgres
2023-11-21 09:32:25,996 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:35,987 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:44,880 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/postgresql/connection.py", line 73, in query
    cursor.execute(sql.encode('utf-8'), params or None)
psycopg2.errors.AdminShutdown: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/api.py", line 1278, in get_postgresql_status
    row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/api.py", line 1211, in query
    return self.server.query(sql, *params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/api.py", line 1412, in query
    return connection.query(sql, *params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/postgresql/connection.py", line 84, in query
    raise PostgresConnectionException('connection problems') from exc
patroni.exceptions.PostgresConnectionException: connection problems
2023-11-21 09:32:46,009 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:55,992 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:33:05,988 INFO: no action. I am (postgres_01), the leader with the lock
I should clarify that the first curl request succeeds. After the PostgreSQL node is restarted, the second curl request returns "unknown" because the first REST API connection was not closed properly. You may not have reproduced it because the request you sent after the restart was your first curl request.
I created pull request #2956 for this issue. Please help review it. Thanks.
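To make the sequence above concrete, here is a toy model of a cache of named connections. It is not Patroni's code: `FakeConnection`, `NamedConnections`, and the `close_all` flag are all hypothetical and only illustrate how a cached connection that survives a restart can make the next status query fail.

```python
class FakeConnection:
    """Stand-in for a database connection (not Patroni or psycopg2 code)."""

    def __init__(self, server_generation):
        self.server_generation = server_generation
        self.closed = False

    def query(self, current_generation):
        # A connection opened before the restart is dead afterwards.
        if self.closed or self.server_generation != current_generation:
            raise ConnectionError("terminating connection due to administrator command")
        return "running"


class NamedConnections:
    """Hypothetical cache of connections keyed by purpose."""

    def __init__(self):
        self.generation = 0  # bumped on every simulated server restart
        self.cache = {}

    def get(self, name):
        conn = self.cache.get(name)
        if conn is None or conn.closed:
            conn = self.cache[name] = FakeConnection(self.generation)
        return conn

    def restart(self, close_all):
        # close_all=False models the suspected bug: only one connection is closed.
        names = list(self.cache) if close_all else ["heartbeat"]
        for name in names:
            if name in self.cache:
                self.cache[name].closed = True
        self.generation += 1

    def rest_state(self):
        try:
            return self.get("restapi").query(self.generation)
        except ConnectionError:
            return "unknown"


def states_around_restart(close_all):
    pool = NamedConnections()
    pool.get("heartbeat")
    first = pool.rest_state()  # first request opens the "restapi" connection
    pool.restart(close_all)
    return first, pool.rest_state()  # second request, after the restart


print(states_around_restart(close_all=False))  # ('running', 'unknown')
print(states_around_restart(close_all=True))   # ('running', 'running')
```

In this model, a request sent only after the restart opens a fresh connection and succeeds, which would match the case where the issue could not be reproduced.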