pgpool / pgpool2

This is the official mirror of git://git.postgresql.org/git/pgpool2.git. Note that this is just a *mirror* - we don't work with pull requests on GitHub. Please subscribe to the pgpool-hackers mailing list from our website and submit your patch to that mailing list.

Home Page: https://www.pgpool.net

Error when disconnecting the network on postgres and pgpool standby node

hernanpezzani opened this issue

The PostgreSQL and pgpool configuration is the following:
PostgreSQL 14.12 and Pgpool-II 4.4.7.
Node 0 is the primary PostgreSQL server (backend 0) and a standby pgpool node,
Node 1 is the standby PostgreSQL server (backend 1) and a standby pgpool node,
Node 2 is the pgpool LEADER with the delegate_IP up; it does not run PostgreSQL.
Watchdog is enabled with heartbeat.

PostgreSQL works perfectly: reads, writes, and connections through pgpool on port 9999 all work.

Detail of the failure:
When the network of Node 1 is TURNED DOWN (VMware network disconnect), the pgpool cluster delays connections to the database through the delegate_IP on port 9999. We have checked it: it takes 1-2 minutes to prompt for the password.
psql -h delegate_IP -p 9999 -U prueba -d prueba
after 1-2 min it prompts for the password

While the network is DOWN on Node 1, direct access to PostgreSQL on Node 0 through port 5432 works correctly, replication between the nodes becomes asynchronous, and PostgreSQL behaves as expected; however, access through the delegate IP on port 9999 is still delayed.
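As a rough illustration of the two paths being compared (a sketch; the user, database, and host names are taken from the report above):

psql -h nodo0.dominio.com -p 5432 -U prueba -d prueba   # direct to the primary: responds immediately
psql -h delegate_IP -p 9999 -U prueba -d prueba         # through pgpool and the delegate IP: 1-2 min delay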

When the network is TURNED ON again, pgpool keeps delaying connections until the service is restarted on Node 2 (the pgpool LEADER).

PGPOOL STATE (NETWORK DOWN):
3 3 YES nodo2.dominio.com:9999 Linux nodo2.dominio.com nodo2.dominio.com
nodo0.dominio.com:9999 Linux nodo0.dominio.com nodo0.dominio.com 9999 90000 7 STANDBY 0 MEMBER
nodo1.dominio.com:9999 Linux nodo1.dominio.com nodo1.dominio.com 9999 90000 8 LOST 0 MEMBER
nodo2.dominio.com:9999 Linux nodo2.dominio.com nodo2.dominio.com 9999 90000 4 LEADER 0 MEMBER
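This looks like watchdog status output; for reference, a similar snapshot can be taken on any node with pcp_watchdog_info (a sketch; the pcp port 9898 matches the one used later in this thread, and the credentials are whatever is configured in pcp.conf):

pcp_watchdog_info -h localhost -p 9898 -v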

pgpool.conf (excerpt):
delegate_IP = '172.27.37.56'
if_cmd_path = '/sbin'
if_up_cmd = '/bin/sudo /sbin/ip addr add $IP$/24 dev ens192 label ens192:0'
if_down_cmd = '/bin/sudo /sbin/ip addr del $IP$/24 dev ens192'
arping_path = '/usr/sbin'
arping_cmd = '/bin/sudo /sbin/arping -U $IP$ -w 1 -I ens192'

load_balance_mode = off
connection_cache = on

backend_hostname0 = 'nodo0.dominio.com'
backend_port0 = 5432
backend_data_directory0 = '/postgresql/14/data/mdona-cpdm-pre-cluster07'
backend_weight0 = 1
backend_application_name0 = 'nodo0'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = 'nodo1.dominio.com'
backend_port1 = 5432
backend_data_directory1 = '/postgresql/14/data/mdona-cpdm-pre-cluster07'
backend_weight1 = 1
backend_application_name1 = 'nodo1'
backend_flag1 = 'ALLOW_TO_FAILOVER'

use_watchdog = on
wd_escalation_command = '/etc/pgpool-II/escalation.sh'
hostname0 = 'nodo0.dominio.com'
pgpool_port0 = 9999
wd_port0 = 9000
heartbeat_hostname0 = 'nodo0.dominio.com'
heartbeat_port0 = 9694
hostname1 = 'nodo1.dominio.com'
pgpool_port1 = 9999
wd_port1 = 9000
heartbeat_hostname1 = 'nodo1.dominio.com'
heartbeat_port1 = 9694
hostname2 = 'nodo2.dominio.com'
pgpool_port2 = 9999
wd_port2 = 9000
heartbeat_hostname2 = 'nodo2.dominio.com'
heartbeat_port2 = 9694

process_management_mode = dynamic
process_management_strategy = gentle
min_spare_children = 10
max_spare_children = 20

health_check_period = 5
health_check_timeout = 20
health_check_user = 'pgpool2'
health_check_password = ''
health_check_database = ''
health_check_max_retries = 0
health_check_retry_delay = 1
connect_timeout = 10000

num_init_children = 500
max_pool = 2

child_life_time = 120
child_max_connections = 0
connection_life_time = 120
client_idle_limit = 0

backend_clustering_mode = 'streaming_replication'
listen_addresses = '*'
port = 9999
socket_dir = '/tmp'
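A minimal way to sanity-check the virtual-IP commands outside of pgpool, based on the if_up_cmd, if_down_cmd and arping_cmd above (a sketch; sudo -n fails instead of prompting, so a missing NOPASSWD rule shows up immediately):

sudo -n /sbin/ip addr add 172.27.37.56/24 dev ens192 label ens192:0   # mirrors if_up_cmd
sudo -n /sbin/arping -U 172.27.37.56 -w 1 -I ens192                   # mirrors arping_cmd
sudo -n /sbin/ip addr del 172.27.37.56/24 dev ens192                  # mirrors if_down_cmd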

Hernan Pezzani

@hernanpezzani Thank you for reporting this issue.

I tried to reproduce this issue, but I couldn't reproduce it.

Detail of the failure:
When the network of Node 1 is TURNED DOWN (VMware network disconnect), the pgpool cluster delays connections to the database through the delegate_IP on port 9999. We have checked it: it takes 1-2 minutes to prompt for the password.
psql -h delegate_IP -p 9999 -U prueba -d prueba
after 1-2 min it prompts for the password

Node1 is the standby pgpool and standby postgres.
Does this issue only occur when you TURN DOWN the network on Node 1?
If you TURN DOWN the network on the other nodes, does the same issue occur?

I could not find the configuration of failover scripts.
Does pgpool manage failover of PostgreSQL?

Could you enable the debug log and reproduce this issue again?

log_min_messages = debug5
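If pgpool is already running, the new log level can be picked up with a reload instead of a restart (a sketch; it assumes pgpool was started with its default configuration path):

pgpool reload   # re-reads pgpool.conf on the running instance; log_min_messages is reloadable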

Please share:

  • pgpool.conf
  • the debug log
  • time when you turned down the network
  • time when you ran psql command

Hi @pengbo0328

Node1 is the standby pgpool and standby postgres. Yes

Does this issue only occur when you TURN DOWN the network on Node 1?
If you TURN DOWN the network on the other nodes, does the same issue occur?

We have not simulated it because we focused on understanding why it failed when both nodes were standby (postgres and pgpool).
The network loss simulation is done by disconnecting the virtual machine network (unchecking "connected"); when the services are stopped, or the node is restarted or stopped, the behavior is as expected.

I could not find the configuration of failover scripts.
Does pgpool manage failover of PostgreSQL? Yes

Could you enable the debug log and reproduce this issue again? Yes

time when you turned down the network: "5 -10 min approximately"

time when you ran psql command: "every 1 sec we try to connect from an external device"

We are preparing (waiting for the OK from security) the file with the .sh scripts, the .conf, and only the debug5 logs of the primary pgpool node (node2). The file is about 50 MB; should we attach it to the issue?

Do you need the logs of all 3 nodes?

In the log with debug5 enabled, there are lines with the format of the following example; the part in question is the string after "->" (marked in bold):

example: `postgresSELECT count() FROM pg_catalog.pg_proc AS p, pg_catalog.pg_namespace AS n WHERE p.proname = 'pg_read_file' AND n.oid = p.pronamespace AND n.nspname ~ '.' AND p.provolatile = 'v'postgres' -> '5564261919auUa8uyt9ceceb0626896e'.

Is it the user's password?

Is it the user's password?

No.

I cannot upload files larger than 25 MB. If I split it with split -b 20M file.tar.gz "file.part", it does not allow me to upload files with extensions like .partaa, etc.
(Attached screenshots of the upload errors: "2020-08-26 13_31_23-Error del Servidor de Contenidos" and "2020-08-26 16_21_38-Error del Servidor de Contenidos".)
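One possible workaround (a sketch; it assumes the tracker accepts attachments with a .zip extension, as suggested later in this thread) is to give each piece an accepted extension and reassemble them on the receiving side:

split -b 20M file.tar.gz file.part.
for f in file.part.*; do mv "$f" "$f.zip"; done   # rename pieces so they can be uploaded
cat file.part.*.zip > file.tar.gz                 # receiving side: reassemble in glob order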

@hernanpezzani

We have not simulated it because we focused on understanding why it failed when both nodes were standby (postgres and pgpool).
The network loss simulation is done by disconnecting the virtual machine network (unchecking "connected"); when the services are stopped, or the node is restarted or stopped, the behavior is as expected.

I tried the same way to disconnect the network (unchecking "connected" in the VMware settings).
Then the pgpool node is in LOST status, and the standby PostgreSQL is in down status.

I cannot upload files larger than 25 MB. If I split it with split -b 20M file.tar.gz "file.part",
it does not allow me to upload files with extensions like .partaa, etc.

Thank you for preparing the debug log.
OK.
Because it is so large, could you share the default logs with "log_min_messages = warning" instead?
I will try to check whether the cause can be identified from the default logs.

If you could share the pgpool.conf and the scripts (the failover and escalation scripts), I will try to reproduce this issue and enable the debug logs myself.

(Attachment: issue55)

Please rename the file to .zip.

Hello,
Have you seen any error in the configuration that could cause the failure when we lose the network on a standby node?
Can we contribute any log or local test?
Thanks

@hernanpezzani Sorry for the late response.

From your logs, because failover completed in one second, I think the failover script doesn't cause this issue.

Because the log file is large (1.3 GB), it is difficult to identify the child process which you connected to with the psql command.
If you have a test environment, could you disable the connections from other applications and connect to pgpool only with the psql command?

time when you ran psql command: "every 1 sec we try to connect from an external device"

Running psql every second will generate a lot of debug logs; could you please try it just once?

Also, it would be appreciated if you could let me know the time at which you connected to pgpool with the psql command.
It is helpful for identifying the child process that the psql command connected to.
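A minimal way to capture both pieces of information in one shot (a sketch; the connection parameters are the ones from the original report):

date; time psql -h delegate_IP -p 9999 -U prueba -d prueba -c 'SELECT 1'
# "date" records when the attempt started, "time" shows how long the connection took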

I also noticed that wd_monitoring_interfaces_list caused pgpool to shut down on Node 1 when the network was turned down on Node 1. Could you disable this parameter and try again?

Hello

We are going to perform the tests by limiting the connection attempts to keep the log as simple as possible.

We are also going to perform new tests with the parameter wd_monitoring_interfaces_list deactivated:
wd_monitoring_interfaces_list = ''

Hello,
With the parameter wd_monitoring_interfaces_list = '' the error is still reproduced.

For the test of disabling all connections, we changed the pgpool listening port to 9998 to avoid any external connections. With all application connections removed and connections made only from node3 and from an external node, everything was OK.

We changed the port back to 9999 to let the application connections come in again (about 220), and when we did this the failure occurred again and the connection took about 2 minutes.

We have run the test in another cluster with less load and the error was not reproduced, but in loaded clusters it is reproduced: of the roughly 35 clusters we have at the moment, about 15 are affected by this case.

@hernanpezzani Thank you for testing.
I will try to increase the load and see if it can be reproduced.

@pengbo0328 If you need us to run any test or configuration change that we can collaborate on, we will do it if it is within our reach.

@hernanpezzani Could you check if the same issue occurs when setting the following parameters?

process_management_mode = static
reserved_connections = 1

The settings above enable static process_management_mode and make pgpool report an error message when the number of connections reaches the maximum.
It is possible that new connections were hanging because there were no idle child processes.

If the same issue occurs, could you run the following command to display the status of the connections:

pcp_proc_info -p 9898 -h <delegate_IP>
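For example, a rough way to see how busy the child processes are when the hang occurs (a sketch; the exact output columns vary by pgpool version):

pcp_proc_info -p 9898 -h <delegate_IP> -v | less   # per-child detail, one field per line
pcp_proc_info -p 9898 -h <delegate_IP> | wc -l     # rough count of pooled backend connections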

Hi @pengbo0328
we have changed the configuration and performed the test.
At about 12:56 we disconnected the network and the instability lasted only about 2 minutes. After this time the cluster is up, the standby node is down, and the pgbench test keeps running. We observed errors in the log, but we are not able to tell whether they come from the network downtime or from the connected applications.
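For reference, the kind of load used in the test could be generated roughly like this (a sketch; the client count and duration are assumptions, and the prueba database from the earlier psql example is reused):

pgbench -i -h delegate_IP -p 9999 -U prueba prueba                  # one-time initialization
pgbench -c 50 -j 4 -T 600 -h delegate_IP -p 9999 -U prueba prueba   # sustained load through pgpool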

(Attached: log_prueba.txt)

Can you confirm whether log_prueba.txt was uploaded?