Intermittent "ERROR: Can not fetch local timeline and lsn from replication connection"
rmelilloii opened this issue · comments
What happened?
Hello, good morning/afternoon/evening and happy Wednesday!
I ran into what I believe is a bug: it does not seem to affect any of my clusters functionally, yet all of them show the same behaviour. I have a mix of PG v14 (spilo-14:2.1-p7) and PG v15 (spilo-15:3.0-p1), with operator v1.10.1.
I run my Kubernetes (v1.25.9+k3s1) clusters on a cloud host (Hetzner).
My PG instance is a 2-node cluster (1 master, 1 replica).
How can we reproduce it (as minimally and precisely as possible)?
Look at the logs of a replica pod.
What did you expect to happen?
I expected either a more descriptive log message or no connection error at all.
Patroni/PostgreSQL/DCS version
- Patroni version: 1.10.1
- PostgreSQL version: 14 and 15
- DCS (and its version):
Patroni configuration file
bootstrap:
dcs:
failsafe_mode: false
loop_wait: 10
maximum_lag_on_failover: 33554432
postgresql:
parameters:
archive_mode: 'on'
archive_timeout: 1800s
autovacuum_analyze_scale_factor: 0.02
autovacuum_max_workers: 5
autovacuum_vacuum_scale_factor: 0.05
checkpoint_completion_target: 0.9
hot_standby: 'on'
log_autovacuum_min_duration: 0
log_checkpoints: 'on'
log_connections: 'on'
log_disconnections: 'on'
log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
log_lock_waits: 'on'
log_min_duration_statement: 500
log_statement: ddl
log_temp_files: 0
max_connections: '300'
max_replication_slots: 10
max_wal_senders: 10
tcp_keepalives_idle: 900
tcp_keepalives_interval: 100
track_functions: all
wal_compression: 'on'
wal_level: hot_standby
wal_log_hints: 'on'
use_pg_rewind: true
use_slots: true
retry_timeout: 10
ttl: 30
initdb:
- auth-host: md5
- auth-local: trust
post_init: /scripts/post_init.sh "zalandos"
users:
zalandos:
options:
- CREATEDB
- NOLOGIN
password: ''
kubernetes:
bypass_api_service: true
labels:
application: spilo
port: tcp://10.43.0.1:443
port_443_tcp: tcp://10.43.0.1:443
port_443_tcp_addr: 10.43.0.1
port_443_tcp_port: '443'
port_443_tcp_proto: tcp
ports:
- name: postgresql
port: 5432
role_label: spilo-role
scope_label: cluster-name
service_host: 10.43.0.1
service_port: '443'
service_port_https: '443'
use_endpoints: true
postgresql:
authentication:
replication:
password: ******
username: standby
superuser:
password: ******
username: postgres
basebackup_fast_xlog:
command: /scripts/basebackup.sh
retries: 2
bin_dir: /usr/lib/postgresql/15/bin
callbacks:
on_role_change: /scripts/on_role_change.sh zalandos true
connect_address: 10.244.1.17:5432
create_replica_method:
- basebackup_fast_xlog
data_dir: /home/postgres/pgdata/pgroot/data
listen: '*:5432'
name: expireon-postgres-1-1
parameters:
archive_command: /bin/true
bg_mon.history_buckets: 120
bg_mon.listen_address: '::'
extwlist.custom_path: /scripts
extwlist.extensions: btree_gin,btree_gist,citext,extra_window_functions,first_last_agg,hll,hstore,hypopg,intarray,ltree,pgcrypto,pgq,pgq_node,pg_trgm,postgres_fdw,tablefunc,uuid-ossp,pg_partman
log_destination: csvlog
log_directory: ../pg_log
log_file_mode: '0644'
log_filename: postgresql-%u.log
log_rotation_age: 1d
log_truncate_on_rotation: 'on'
logging_collector: 'on'
pg_stat_statements.track_utility: 'off'
shared_buffers: 800MB
shared_preload_libraries: bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,pg_cron,pg_stat_kcache
ssl: 'on'
ssl_cert_file: /run/certs/server.crt
ssl_key_file: /run/certs/server.key
pg_hba:
- local all all trust
- host all all 0.0.0.0/0 md5
- host all all ::1/128 md5
- host replication postgres 127.0.0.1/0 trust
- host replication standby ::1/128 trust
- host replication standby 10.0.0.0/8 trust
pgpass: /run/postgresql/pgpass
use_unix_socket: true
use_unix_socket_repl: true
restapi:
connect_address: 10.244.1.17:8008
listen: :8008
scope: expireon-postgres-1
patronictl show-config
failsafe_mode: false
loop_wait: 10
maximum_lag_on_failover: 33554432
pause: false
pg_hba:
- local all all trust
- host all all 0.0.0.0/0 md5
- host all all ::1/128 md5
- host replication postgres 127.0.0.1/0 trust
- host replication standby ::1/128 trust
- host replication standby 10.0.0.0/8 trust
postgresql:
parameters:
archive_mode: 'on'
archive_timeout: 1800s
autovacuum_analyze_scale_factor: 0.02
autovacuum_max_workers: 5
autovacuum_vacuum_scale_factor: 0.05
checkpoint_completion_target: 0.9
hot_standby: 'on'
log_autovacuum_min_duration: 0
log_checkpoints: 'on'
log_connections: 'on'
log_disconnections: 'on'
log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
log_lock_waits: 'on'
log_min_duration_statement: 500
log_statement: ddl
log_temp_files: 0
max_connections: '300'
max_replication_slots: 10
max_wal_senders: 10
tcp_keepalives_idle: 900
tcp_keepalives_interval: 100
track_functions: all
wal_compression: 'on'
wal_level: hot_standby
wal_log_hints: 'on'
use_pg_rewind: true
use_slots: true
retry_timeout: 10
ttl: 30
Patroni log files
Cluster dashboard (v2.7.5) summary, condensed from the UI:
Deployment postgres-operator (namespace default, image registry.opensource.zalan.do/acid/postgres-operator:v1.10.0), 1/1 ready, up-to-date, 0 pod restarts, age 1.3 hours, labels application=postgres-operator.
Pod postgres-operator-57b869fc86-q5kk6: Running, 1/1 ready, 0 restarts, IP 10.244.2.10, node rome-pg-cx41-master3, age 1.3 hours.
Container: postgres-operator
time="2024-01-03T14:35:57Z" level=info msg="SYNC event has been queued" cluster-name=default/expireon-postgres-1 pkg=controller worker=0
time="2024-01-03T14:35:57Z" level=info msg="there are 1 clusters running" pkg=controller
time="2024-01-03T14:35:57Z" level=info msg="syncing of the cluster started" cluster-name=default/expireon-postgres-1 pkg=controller worker=0
time="2024-01-03T14:35:57Z" level=warning msg="cannot initialize a new manifest robot role with the name of the system user \"postgres\"" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="team API is disabled" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=info msg="syncing secrets" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing master service" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing replica service" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing volumes using \"pvc\" storage resize mode" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=info msg="volume claims do not require changes" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing statefulsets" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing Patroni config" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="making GET http request: http://10.244.2.12:8008/config" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="making GET http request: http://10.244.1.17:8008/config" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="making GET http request: http://10.244.2.12:8008/patroni" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="making GET http request: http://10.244.1.17:8008/patroni" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing pod disruption budgets" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing logical backup job" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="Generating logical backup pod template" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=info msg="Mount additional volumes: []" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing roles" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="closing database connection" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing databases" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="closing database connection" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing prepared databases with schemas" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing connection pooler (master, replica) from (false, nil) to (false, nil)" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="making GET http request: http://10.244.2.12:8008/patroni" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="making GET http request: http://10.244.1.17:8008/patroni" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=info msg="healthy cluster ready to upgrade, current: 150002 desired: 150000" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=info msg="cluster has been synced" cluster-name=default/expireon-postgres-1 pkg=controller worker=0
PostgreSQL log files
2024-01-03 13:50:28,838 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:50:38,842 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:50:48,847 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:50:58,837 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:51:08,836 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:51:18,836 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:51:18,955 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 904, in get_replica_timeline
with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 899, in get_replication_connection_cursor
with get_connection_cursor(**conn_kwargs) as cur:
File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor
conn = psycopg.connect(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 42, in connect
ret = _connect(*args, **kwargs)
File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: no pg_hba.conf entry for replication connection from host "[local]", user "standby", no encryption
2024-01-03 13:51:28,843 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:51:38,835 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:51:48,840 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:51:58,842 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:52:08,837 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:52:18,852 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:52:19,001 ERROR: Can not fetch local timeline and lsn from replication connection
(traceback identical to the one above, ending in the same psycopg2.OperationalError: no pg_hba.conf entry for replication connection from host "[local]", user "standby", no encryption)
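The failing connection is the local (Unix-socket) replication connection Patroni opens to read the replica's timeline and LSN; with `use_unix_socket_repl: true` it arrives as a `local` connection, and none of the `host replication` rules in the pg_hba list above can match it. The following is an illustrative sketch of the relevant pg_hba matching semantics (not Patroni's or PostgreSQL's actual matcher) showing why every rule falls through:

```python
# Sketch of pg_hba matching for a replication connection over a Unix
# socket. Key rules: a "host" line never matches a socket ("local")
# connection, and the "all" database keyword does NOT cover replication
# connections -- only the literal "replication" keyword does.

PG_HBA = [
    "local all all trust",
    "host  all all 0.0.0.0/0 md5",
    "host  all all ::1/128 md5",
    "host  replication postgres 127.0.0.1/0 trust",
    "host  replication standby ::1/128 trust",
    "host  replication standby 10.0.0.0/8 trust",
]

def matches_local_replication(line: str, user: str) -> bool:
    """Would this pg_hba line match a local replication connection for `user`?"""
    conn_type, database, hba_user = line.split()[:3]
    if conn_type != "local":        # "host" lines only match TCP connections
        return False
    if database != "replication":   # "all" does not include replication
        return False
    return hba_user in ("all", user)

# No rule matches -> PostgreSQL rejects the connection, exactly the
# FATAL error in the traceback above.
assert not any(matches_local_replication(line, "standby") for line in PG_HBA)

# The entry the reporter found to be missing does match:
assert matches_local_replication("local replication all trust", "standby")
```

This is why the error is intermittent only in appearance: it fires each time Patroni retries the local replication connection, while streaming replication itself (which goes over TCP and matches the `host replication standby 10.0.0.0/8` rule) keeps working.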
Have you tried to use GitHub issue search?
- Yes
Anything else we need to know?
My 2-node PG cluster is always in a synced state without any lag, and failover works fine without data loss.
The entry I was missing in my pg_hba was:
local replication all trust
Thanks for your help and the extensive explanation @CyberDem0n. 🍻🙇🏽
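In terms of the Patroni/Spilo config shown above, the fix amounts to adding that rule to the `pg_hba` list. A sketch of the corrected list follows; the rule's position and the `trust` auth method are assumptions to adapt to your own security requirements:

```yaml
pg_hba:
- local all all trust
- local replication all trust   # the entry that was missing
- host all all 0.0.0.0/0 md5
- host all all ::1/128 md5
- host replication postgres 127.0.0.1/0 trust
- host replication standby ::1/128 trust
- host replication standby 10.0.0.0/8 trust
```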