zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

REST API returns unknown after postgres restart

XiuhuaRuan opened this issue

What happened?

First, set up a Patroni cluster with three PostgreSQL nodes. After a PostgreSQL node is restarted, the REST API returns the "unknown" state:

[root@sophia-pghost5 patroni]# curl http://localhost:8008
{"state": "unknown", "role": "master", "dcs_last_seen": 1700478557, "database_system_identifier": "7303378219179270632", "patroni": {"version": "3.2.0", "scope": "postgres-cluster", "name": "postgres_01"}}

How can we reproduce it (as minimally and precisely as possible)?

Running patronictl restart against the cluster reproduces it.

[root@sophia-pghost5 patroni]# patronictl -c patroni.yml restart postgres-cluster
+ Cluster: postgres-cluster (7303378219179270632) ---+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader  | running   |  1 |           |
| postgres_02 | 192.168.61.106 | Replica | streaming |  1 |         0 |
| postgres_03 | 192.168.61.107 | Replica | streaming |  1 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
When should the restart take place (e.g. 2023-11-20T20:13)  [now]:
Are you sure you want to restart members postgres_01, postgres_02, postgres_03? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2)  []:
Success: restart on member postgres_01
Success: restart on member postgres_02
Success: restart on member postgres_03

What did you expect to happen?

The REST API should return the correct "running" state, even after the PostgreSQL nodes are restarted.
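
The expectation above could be checked from a script along the following lines. This is only a minimal sketch: the helper names fetch_status and is_running are hypothetical, and the URL in the usage comment is the REST API address from this report.

```python
import json
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 5.0) -> dict:
    """GET the Patroni REST API and decode the JSON body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except urllib.error.HTTPError as err:
        # A non-2xx response may still carry a JSON status body.
        body = err.read()
    return json.loads(body.decode("utf-8"))


def is_running(status: dict) -> bool:
    """True only when the reported PostgreSQL state is "running"."""
    return status.get("state") == "running"


# Usage (requires a reachable Patroni node):
#   status = fetch_status("http://192.168.61.105:8008")
#   print(status.get("state"), is_running(status))
```

Running the check once before and once after the restart is what exposes the problem, since only the second request reports "unknown".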

Patroni/PostgreSQL/DCS version

  • Patroni version: 3.2.0
  • PostgreSQL version: 14.0
  • DCS (and its version): etcd 3.5.9

Patroni configuration file

scope: postgres-cluster
namespace: /service/
name: postgres_01

restapi:
  listen: 192.168.61.105:8008
  connect_address: 192.168.61.105:8008

etcd:
  hosts: 192.168.61.105:2379,192.168.61.106:2379,192.168.61.107:2379

log:
  level: INFO
  traceback_level: INFO
  dir: /home/postgres/patroni
  file_num: 10
  file_size: 104857600

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 300
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      parameters:
        listen_addresses: "*"
        port: 5432
        wal_level: replica
        hot_standby: "on"
        wal_keep_size: 100
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
        archive_mode: "off"
        archive_timeout: 1800s
        logging_collector: on
        log_destination: 'stderr'
        log_truncate_on_rotation: on
        log_checkpoints: on
        log_connections: on
        log_disconnections: on
        log_error_verbosity: default
        log_lock_waits: on
        log_temp_files: 0
        log_autovacuum_min_duration: 0
        log_min_duration_statement: 50
        log_timezone: 'PRC'
        log_filename: postgresql-%Y-%m-%d_%H.log
        log_line_prefix: '%t [%p]: db=%d,user=%u,app=%a,client=%h '


postgresql:
  database: postgres
  listen: 0.0.0.0:5432
  connect_address: 192.168.61.105:5432
  bin_dir: /usr/local/pgsql/bin
  data_dir: /usr/local/pgsql/data
  pgpass: /home/postgres/tmp/.pgpass

  authentication:
    replication:
      username: postgres
      password: postgres
    superuser:
      username: postgres
      password: postgres
    rewind:
      username: postgres
      password: postgres

  pg_hba:
  - local   all             all                                     trust
  - host    all             all             0.0.0.0/0               trust
  - host    all             all             ::1/128                 trust
  - local   replication     all                                     trust
  - host    replication     all             0.0.0.0/0               trust
  - host    replication     all             ::1/128                 trust

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

patronictl show-config

loop_wait: 10
master_start_timeout: 300
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    archive_mode: 'off'
    archive_timeout: 1800s
    hot_standby: 'on'
    listen_addresses: '*'
    log_autovacuum_min_duration: 0
    log_checkpoints: true
    log_connections: true
    log_destination: stderr
    log_disconnections: true
    log_error_verbosity: default
    log_filename: postgresql-%Y-%m-%d_%H.log
    log_line_prefix: '%t [%p]: db=%d,user=%u,app=%a,client=%h '
    log_lock_waits: true
    log_min_duration_statement: 50
    log_temp_files: 0
    log_timezone: PRC
    log_truncate_on_rotation: true
    logging_collector: true
    max_replication_slots: 10
    max_wal_senders: 10
    port: 5432
    wal_keep_size: 100
    wal_level: replica
    wal_log_hints: 'on'
  use_pg_rewind: true
retry_timeout: 10
synchronous_mode: false
ttl: 30

Patroni log files

2023-11-20 19:13:47,546 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:13:57,543 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:14:07,535 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:14:09,367 INFO: Lock owner: postgres_01; I am postgres_01
2023-11-20 19:14:09,406 INFO: closed patroni connections to postgres
2023-11-20 19:14:09,712 INFO: postmaster pid=23713
2023-11-20 19:14:10,728 INFO: Lock owner: postgres_01; I am postgres_01
2023-11-20 19:14:10,728 INFO: establishing a new patroni heartbeat connection to postgres
2023-11-20 19:14:10,742 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:14:20,734 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:14:30,738 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:14:40,735 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:14:50,734 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:15:00,737 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:15:03,096 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/postgresql/connection.py", line 73, in query
    cursor.execute(sql.encode('utf-8'), params or None)
psycopg2.errors.AdminShutdown: terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/api.py", line 1278, in get_postgresql_status
    row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/api.py", line 1211, in query
    return self.server.query(sql, *params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/api.py", line 1412, in query
    return connection.query(sql, *params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/postgresql/connection.py", line 84, in query
    raise PostgresConnectionException('connection problems') from exc
patroni.exceptions.PostgresConnectionException: connection problems
2023-11-20 19:15:10,736 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:15:20,735 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-20 19:15:30,733 INFO: no action. I am (postgres_01), the leader with the lock

PostgreSQL log files

2023-11-20 19:14:09 CST [23571]: db=postgres,user=postgres,app=Patroni heartbeat,client=127.0.0.1 FATAL:  terminating connection due to administrator command
2023-11-20 19:14:09 CST [23571]: db=postgres,user=postgres,app=Patroni heartbeat,client=127.0.0.1 LOG:  disconnection: session time: 0:05:01.847 user=postgres database=postgres host=127.0.0.1 port=35922
2023-11-20 19:14:09 CST [23558]: db=,user=,app=,client= LOG:  background worker "logical replication launcher" (PID 23566) exited with exit code 1
2023-11-20 19:14:09 CST [23561]: db=,user=,app=,client= LOG:  shutting down
2023-11-20 19:14:09 CST [23561]: db=,user=,app=,client= LOG:  checkpoint starting: shutdown immediate
2023-11-20 19:14:09 CST [23561]: db=,user=,app=,client= LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.007 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
2023-11-20 19:14:09 CST [23572]: db=[unknown],user=postgres,app=postgres_02,client=192.168.31.106 LOG:  disconnection: session time: 0:05:01.523 user=postgres database= host=192.168.31.106 port=57350
2023-11-20 19:14:09 CST [23576]: db=[unknown],user=postgres,app=postgres_03,client=192.168.31.107 LOG:  disconnection: session time: 0:04:57.176 user=postgres database= host=192.168.31.107 port=38988
2023-11-20 19:14:09 CST [23558]: db=,user=,app=,client= LOG:  database system is shut down
2023-11-20 19:14:09 CST [23713]: db=,user=,app=,client= LOG:  starting PostgreSQL 14.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2023-11-20 19:14:09 CST [23713]: db=,user=,app=,client= LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-11-20 19:14:09 CST [23713]: db=,user=,app=,client= LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-11-20 19:14:09 CST [23716]: db=,user=,app=,client= LOG:  database system was shut down at 2023-11-20 19:14:09 CST
2023-11-20 19:14:09 CST [23713]: db=,user=,app=,client= LOG:  database system is ready to accept connections
2023-11-20 19:14:10 CST [23724]: db=[unknown],user=[unknown],app=[unknown],client=127.0.0.1 LOG:  connection received: host=127.0.0.1 port=36260
2023-11-20 19:14:10 CST [23724]: db=postgres,user=postgres,app=[unknown],client=127.0.0.1 LOG:  connection authorized: user=postgres database=postgres application_name=pg_isready
2023-11-20 19:14:10 CST [23724]: db=postgres,user=postgres,app=pg_isready,client=127.0.0.1 LOG:  disconnection: session time: 0:00:00.003 user=postgres database=postgres host=127.0.0.1 port=36260
2023-11-20 19:14:10 CST [23726]: db=[unknown],user=[unknown],app=[unknown],client=127.0.0.1 LOG:  connection received: host=127.0.0.1 port=36264
2023-11-20 19:14:10 CST [23726]: db=postgres,user=postgres,app=[unknown],client=127.0.0.1 LOG:  connection authorized: user=postgres database=postgres application_name=pg_isready
2023-11-20 19:14:10 CST [23726]: db=postgres,user=postgres,app=pg_isready,client=127.0.0.1 LOG:  disconnection: session time: 0:00:00.001 user=postgres database=postgres host=127.0.0.1 port=36264
2023-11-20 19:14:10 CST [23727]: db=[unknown],user=[unknown],app=[unknown],client=127.0.0.1 LOG:  connection received: host=127.0.0.1 port=36270
2023-11-20 19:14:10 CST [23727]: db=postgres,user=postgres,app=[unknown],client=127.0.0.1 LOG:  connection authorized: user=postgres database=postgres application_name=Patroni heartbeat
2023-11-20 19:14:11 CST [23728]: db=[unknown],user=[unknown],app=[unknown],client=192.168.31.106 LOG:  connection received: host=192.168.31.106 port=57668
2023-11-20 19:14:11 CST [23728]: db=[unknown],user=postgres,app=[unknown],client=192.168.31.106 LOG:  replication connection authorized: user=postgres application_name=postgres_02
2023-11-20 19:14:11 CST [23729]: db=[unknown],user=[unknown],app=[unknown],client=192.168.31.107 LOG:  connection received: host=192.168.31.107 port=39296
2023-11-20 19:14:11 CST [23729]: db=[unknown],user=postgres,app=[unknown],client=192.168.31.107 LOG:  replication connection authorized: user=postgres application_name=postgres_03

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

It seems a function is misused when closing all named connections. I'm working on a PR to fix it.

Sorry, but I can't reproduce it.

Besides that, the logs you provide are heavily inconsistent:

  1. You query localhost:8008, but according to patroni.yaml it doesn't listen on localhost.
  2. "dcs_last_seen": 1700478557 reported by curl is Mon Nov 20 12:09:17 2023. We can skip all timezone differences and look only at the minutes and seconds, which are 09:17. Neither the Patroni nor the Postgres logs you provide include this time.
  3. The Patroni and Postgres logs overlap only partially and, more importantly, the Postgres logs don't cover the time when PostgresConnectionException was raised.
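
The timestamp check in point 2 can be reproduced with a few lines of Python. This is a sketch under one assumption: the logs are written in UTC+8, as suggested by log_timezone: 'PRC' and the CST suffix in the PostgreSQL log. The hour depends on the time zone used for display, but the minutes and seconds (09:17) do not.

```python
from datetime import datetime, timedelta, timezone

DCS_LAST_SEEN = 1700478557  # value from the curl output in the report

utc = datetime.fromtimestamp(DCS_LAST_SEEN, tz=timezone.utc)
# Assumption: the Patroni and PostgreSQL logs use UTC+8.
log_zone = utc.astimezone(timezone(timedelta(hours=8)))

print(utc.isoformat())       # 2023-11-20T11:09:17+00:00
print(log_zone.isoformat())  # 2023-11-20T19:09:17+08:00
```

In the assumed log time zone this is 19:09:17, which is earlier than the first Patroni log line shown (19:13:47).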

Thanks for your reply, and sorry for the confusion. I reproduced this issue several times and may not have copied the latest logs. Also, in my setup, localhost resolves to the local IP configured in /etc/hosts.
I reproduced it again and copied the latest logs below:

1. First, curl returned the correct state.
[postgres@sophia-pghost5 patroni]$ curl -s http://192.168.61.105:8008 | jq .
{
  "state": "running",
  "postmaster_start_time": "2023-11-21 09:30:20.980955+08:00",
  "role": "master",
  "server_version": 140000,
  "xlog": {
    "location": 117441304
  },
  "timeline": 2,
  "replication": [
    {
      "usename": "postgres",
      "application_name": "postgres_02",
      "client_addr": "192.168.61.106",
      "state": "streaming",
      "sync_state": "async",
      "sync_priority": 0
    },
    {
      "usename": "postgres",
      "application_name": "postgres_03",
      "client_addr": "192.168.61.107",
      "state": "streaming",
      "sync_state": "async",
      "sync_priority": 0
    }
  ],
  "dcs_last_seen": 1700530302,
  "database_system_identifier": "7303378219179270632",
  "patroni": {
    "version": "3.2.0",
    "scope": "postgres-cluster",
    "name": "postgres_01"
  }
}

2. Then restart the cluster with patronictl.
[postgres@sophia-pghost5 patroni]$ patronictl -c patroni.yml  restart postgres-cluster
+ Cluster: postgres-cluster (7303378219179270632) ---+----+-----------+
| Member      | Host           | Role    | State     | TL | Lag in MB |
+-------------+----------------+---------+-----------+----+-----------+
| postgres_01 | 192.168.61.105 | Leader  | running   |  2 |           |
| postgres_02 | 192.168.61.106 | Replica | streaming |  2 |         0 |
| postgres_03 | 192.168.61.107 | Replica | streaming |  2 |         0 |
+-------------+----------------+---------+-----------+----+-----------+
When should the restart take place (e.g. 2023-11-21T10:32)  [now]:
Are you sure you want to restart members postgres_01, postgres_02, postgres_03? [y/N]: y
Restart if the PostgreSQL version is less than provided (e.g. 9.5.2)  []:
Success: restart on member postgres_01
Success: restart on member postgres_02
Success: restart on member postgres_03

3. Then curl returns the "unknown" state.

[postgres@sophia-pghost5 patroni]$ curl -s http://192.168.61.105:8008 | jq .
{
  "state": "unknown",
  "role": "master",
  "dcs_last_seen": 1700530355,
  "database_system_identifier": "7303378219179270632",
  "patroni": {
    "version": "3.2.0",
    "scope": "postgres-cluster",
    "name": "postgres_01"
  }
}

4. patroni.log:
2023-11-21 09:29:36,809 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:29:46,816 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:29:56,810 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:30:06,809 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:30:16,809 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:30:20,627 INFO: Lock owner: postgres_01; I am postgres_01
2023-11-21 09:30:20,666 INFO: closed patroni connections to postgres
2023-11-21 09:30:20,974 INFO: postmaster pid=2336
2023-11-21 09:30:21,995 INFO: Lock owner: postgres_01; I am postgres_01
2023-11-21 09:30:21,995 INFO: establishing a new patroni heartbeat connection to postgres
2023-11-21 09:30:22,009 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:30:32,002 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:30:42,002 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:30:52,001 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:31:02,011 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:31:12,010 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:31:22,006 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:31:32,002 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:31:42,008 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:31:45,590 INFO: establishing a new patroni restapi connection to postgres
2023-11-21 09:31:52,002 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:02,001 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:12,001 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:22,001 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:24,621 INFO: Lock owner: postgres_01; I am postgres_01
2023-11-21 09:32:24,659 INFO: closed patroni connections to postgres
2023-11-21 09:32:24,964 INFO: postmaster pid=2431
2023-11-21 09:32:25,979 INFO: Lock owner: postgres_01; I am postgres_01
2023-11-21 09:32:25,979 INFO: establishing a new patroni heartbeat connection to postgres
2023-11-21 09:32:25,996 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:35,987 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:44,880 ERROR: get_postgresql_status
Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/postgresql/connection.py", line 73, in query
    cursor.execute(sql.encode('utf-8'), params or None)
psycopg2.errors.AdminShutdown: terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/api.py", line 1278, in get_postgresql_status
    row = self.query(stmt.format(postgresql.wal_name, postgresql.lsn_name,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/api.py", line 1211, in query
    return self.server.query(sql, *params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/api.py", line 1412, in query
    return connection.query(sql, *params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3/lib/python3.11/site-packages/patroni/postgresql/connection.py", line 84, in query
    raise PostgresConnectionException('connection problems') from exc
patroni.exceptions.PostgresConnectionException: connection problems
2023-11-21 09:32:46,009 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:32:55,992 INFO: no action. I am (postgres_01), the leader with the lock
2023-11-21 09:33:05,988 INFO: no action. I am (postgres_01), the leader with the lock
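
As a cross-check, the two dcs_last_seen values from the curl outputs in steps 1 and 3 can be converted to the log time zone. The sketch below assumes UTC+8, matching the +08:00 offset in postmaster_start_time; the helper name local_time is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: logs use UTC+8, as in postmaster_start_time (+08:00).
UTC8 = timezone(timedelta(hours=8))


def local_time(epoch_seconds: int) -> str:
    """Format an epoch timestamp in the assumed log time zone."""
    return datetime.fromtimestamp(epoch_seconds, tz=UTC8).strftime("%Y-%m-%d %H:%M:%S")


before_restart = local_time(1700530302)  # dcs_last_seen from step 1
after_restart = local_time(1700530355)   # dcs_last_seen from step 3

print(before_restart)  # 2023-11-21 09:31:42
print(after_restart)   # 2023-11-21 09:32:35
```

Both values line up with heartbeat lines in patroni.log (09:31:42,008 and 09:32:35,987), and the second one falls between the restart at 09:32:24 and the get_postgresql_status error at 09:32:44.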

To clarify: the first curl request succeeds. After the PostgreSQL node is restarted, the second curl request returns "unknown" because the first REST API connection was not closed properly. You may not have reproduced it because the request you made after the restart was your first curl request.
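
The sequence can be modelled without Patroni or PostgreSQL. The sketch below is purely illustrative: FakeServer, FakeConnection and NamedConnections are hypothetical stand-ins, not Patroni's classes, and the connection names "heartbeat" and "restapi" are borrowed from the log messages. It only shows how a cached connection that survives a restart fails on its next use, while one that was closed is simply re-established.

```python
class FakeServer:
    """Stand-in for PostgreSQL; each restart invalidates older sessions."""

    def __init__(self):
        self.generation = 0

    def restart(self):
        self.generation += 1


class FakeConnection:
    """Stand-in for a database session bound to one server generation."""

    def __init__(self, server):
        self.server = server
        self.generation = server.generation

    def query(self):
        if self.generation != self.server.generation:
            # Plays the role of the AdminShutdown error in the traceback.
            raise ConnectionError("terminating connection due to administrator command")
        return "running"


class NamedConnections:
    """Cache of lazily opened connections keyed by name."""

    def __init__(self, server):
        self.server = server
        self._conns = {}

    def get(self, name):
        if name not in self._conns:
            self._conns[name] = FakeConnection(self.server)
        return self._conns[name]

    def close(self, names):
        for name in names:
            self._conns.pop(name, None)


def state(conns):
    """Status as an endpoint might report it: "unknown" on connection errors."""
    try:
        return conns.get("restapi").query()
    except ConnectionError:
        return "unknown"


def scenario(close_names):
    server = FakeServer()
    conns = NamedConnections(server)
    conns.get("heartbeat")
    first = state(conns)      # first request opens the "restapi" connection
    conns.close(close_names)  # connections closed ahead of the restart
    server.restart()
    second = state(conns)     # second request, after the restart
    return first, second


print(scenario(["heartbeat"]))             # ('running', 'unknown')
print(scenario(["heartbeat", "restapi"]))  # ('running', 'running')
```

If no request is made before the restart, the "restapi" connection is never cached, so the first request afterwards opens a fresh one and succeeds in this model.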

I created pull request #2956 for this issue. Please help review it. Thanks.