zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

Intermittent "ERROR: Can not fetch local timeline and lsn from replication connection"

rmelilloii opened this issue

What happened?

Hello, good morning/afternoon/evening and happy Wednesday!

I am hitting what I believe is a bug: it apparently does not affect the operation of any of my clusters, but all of them show the same behaviour. I have a mix of PG v14 (spilo-14:2.1-p7) and PG v15 (spilo-15:3.0-p1), running operator v1.10.1.

I run my Kubernetes (v1.25.9+k3s1) clusters on a cloud hosting provider (Hetzner).
My PG instance is a 2-node cluster (1 master, 1 replica).

How can we reproduce it (as minimally and precisely as possible)?

Look at the logs on a replica pod; the error shows up there intermittently.

What did you expect to happen?

I expect to see either a more descriptive log message or no connection error at all.

Patroni/PostgreSQL/DCS version

  • Patroni version: 1.10.1
  • PostgreSQL version: 14 and 15
  • DCS (and its version): Kubernetes (v1.25.9+k3s1)

Patroni configuration file

bootstrap:
  dcs:
    failsafe_mode: false
    loop_wait: 10
    maximum_lag_on_failover: 33554432
    postgresql:
      parameters:
        archive_mode: 'on'
        archive_timeout: 1800s
        autovacuum_analyze_scale_factor: 0.02
        autovacuum_max_workers: 5
        autovacuum_vacuum_scale_factor: 0.05
        checkpoint_completion_target: 0.9
        hot_standby: 'on'
        log_autovacuum_min_duration: 0
        log_checkpoints: 'on'
        log_connections: 'on'
        log_disconnections: 'on'
        log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
        log_lock_waits: 'on'
        log_min_duration_statement: 500
        log_statement: ddl
        log_temp_files: 0
        max_connections: '300'
        max_replication_slots: 10
        max_wal_senders: 10
        tcp_keepalives_idle: 900
        tcp_keepalives_interval: 100
        track_functions: all
        wal_compression: 'on'
        wal_level: hot_standby
        wal_log_hints: 'on'
      use_pg_rewind: true
      use_slots: true
    retry_timeout: 10
    ttl: 30
  initdb:
  - auth-host: md5
  - auth-local: trust
  post_init: /scripts/post_init.sh "zalandos"
  users:
    zalandos:
      options:
      - CREATEDB
      - NOLOGIN
      password: ''
kubernetes:
  bypass_api_service: true
  labels:
    application: spilo
  port: tcp://10.43.0.1:443
  port_443_tcp: tcp://10.43.0.1:443
  port_443_tcp_addr: 10.43.0.1
  port_443_tcp_port: '443'
  port_443_tcp_proto: tcp
  ports:
  - name: postgresql
    port: 5432
  role_label: spilo-role
  scope_label: cluster-name
  service_host: 10.43.0.1
  service_port: '443'
  service_port_https: '443'
  use_endpoints: true
postgresql:
  authentication:
    replication:
      password: ******
      username: standby
    superuser:
      password: ******
      username: postgres
  basebackup_fast_xlog:
    command: /scripts/basebackup.sh
    retries: 2
  bin_dir: /usr/lib/postgresql/15/bin
  callbacks:
    on_role_change: /scripts/on_role_change.sh zalandos true
  connect_address: 10.244.1.17:5432
  create_replica_method:
  - basebackup_fast_xlog
  data_dir: /home/postgres/pgdata/pgroot/data
  listen: '*:5432'
  name: expireon-postgres-1-1
  parameters:
    archive_command: /bin/true
    bg_mon.history_buckets: 120
    bg_mon.listen_address: '::'
    extwlist.custom_path: /scripts
    extwlist.extensions: btree_gin,btree_gist,citext,extra_window_functions,first_last_agg,hll,hstore,hypopg,intarray,ltree,pgcrypto,pgq,pgq_node,pg_trgm,postgres_fdw,tablefunc,uuid-ossp,pg_partman
    log_destination: csvlog
    log_directory: ../pg_log
    log_file_mode: '0644'
    log_filename: postgresql-%u.log
    log_rotation_age: 1d
    log_truncate_on_rotation: 'on'
    logging_collector: 'on'
    pg_stat_statements.track_utility: 'off'
    shared_buffers: 800MB
    shared_preload_libraries: bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,pg_cron,pg_stat_kcache
    ssl: 'on'
    ssl_cert_file: /run/certs/server.crt
    ssl_key_file: /run/certs/server.key
  pg_hba:
  - local     all  all  trust
  - host      all  all  0.0.0.0/0   md5
  - host      all  all  ::1/128     md5
  - host    replication    postgres             127.0.0.1/0          trust
  - host    replication    standby             ::1/128              trust
  - host    replication    standby             10.0.0.0/8              trust
  pgpass: /run/postgresql/pgpass
  use_unix_socket: true
  use_unix_socket_repl: true
restapi:
  connect_address: 10.244.1.17:8008
  listen: :8008
scope: expireon-postgres-1
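
Worth noting for the error reported below: this configuration combines use_unix_socket_repl: true with a pg_hba list whose replication entries are all of type "host". The two relevant fragments, copied from the configuration above (the comments are mine):

postgresql:
  pg_hba:
  - host    replication    standby             10.0.0.0/8              trust   # "host" entries match TCP connections only
  use_unix_socket_repl: true   # local replication checks go over the unix socket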

patronictl show-config

failsafe_mode: false
loop_wait: 10
maximum_lag_on_failover: 33554432
pause: false
pg_hba:
- local     all  all  trust
- host      all  all  0.0.0.0/0   md5
- host      all  all  ::1/128     md5
- host    replication    postgres             127.0.0.1/0          trust
- host    replication    standby             ::1/128              trust
- host    replication    standby             10.0.0.0/8              trust
postgresql:
  parameters:
    archive_mode: 'on'
    archive_timeout: 1800s
    autovacuum_analyze_scale_factor: 0.02
    autovacuum_max_workers: 5
    autovacuum_vacuum_scale_factor: 0.05
    checkpoint_completion_target: 0.9
    hot_standby: 'on'
    log_autovacuum_min_duration: 0
    log_checkpoints: 'on'
    log_connections: 'on'
    log_disconnections: 'on'
    log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
    log_lock_waits: 'on'
    log_min_duration_statement: 500
    log_statement: ddl
    log_temp_files: 0
    max_connections: '300'
    max_replication_slots: 10
    max_wal_senders: 10
    tcp_keepalives_idle: 900
    tcp_keepalives_interval: 100
    track_functions: all
    wal_compression: 'on'
    wal_level: hot_standby
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30

Patroni log files

The postgres-operator container logs below were captured from the cluster dashboard (deployment postgres-operator, namespace default, image registry.opensource.zalan.do/acid/postgres-operator:v1.10.0, pod postgres-operator-57b869fc86-q5kk6 on node rome-pg-cx41-master3, 1/1 ready, 0 restarts, age 1.3 hours):
time="2024-01-03T14:35:57Z" level=info msg="SYNC event has been queued" cluster-name=default/expireon-postgres-1 pkg=controller worker=0
time="2024-01-03T14:35:57Z" level=info msg="there are 1 clusters running" pkg=controller
time="2024-01-03T14:35:57Z" level=info msg="syncing of the cluster started" cluster-name=default/expireon-postgres-1 pkg=controller worker=0
time="2024-01-03T14:35:57Z" level=warning msg="cannot initialize a new manifest robot role with the name of the system user \"postgres\"" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="team API is disabled" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=info msg="syncing secrets" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing master service" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing replica service" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing volumes using \"pvc\" storage resize mode" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=info msg="volume claims do not require changes" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing statefulsets" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing Patroni config" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="making GET http request: http://10.244.2.12:8008/config" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="making GET http request: http://10.244.1.17:8008/config" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="making GET http request: http://10.244.2.12:8008/patroni" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="making GET http request: http://10.244.1.17:8008/patroni" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing pod disruption budgets" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing logical backup job" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="Generating logical backup pod template" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=info msg="Mount additional volumes: []" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing roles" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="closing database connection" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing databases" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="closing database connection" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing prepared databases with schemas" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="syncing connection pooler (master, replica) from (false, nil) to (false, nil)" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="making GET http request: http://10.244.2.12:8008/patroni" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=debug msg="making GET http request: http://10.244.1.17:8008/patroni" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=info msg="healthy cluster ready to upgrade, current: 150002 desired: 150000" cluster-name=default/expireon-postgres-1 pkg=cluster worker=0
time="2024-01-03T14:35:57Z" level=info msg="cluster has been synced" cluster-name=default/expireon-postgres-1 pkg=controller worker=0

PostgreSQL log files

2024-01-03 13:50:28,838 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:50:38,842 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:50:48,847 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:50:58,837 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:51:08,836 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:51:18,836 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:51:18,955 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 904, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 899, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor
    conn = psycopg.connect(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 42, in connect
    ret = _connect(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  no pg_hba.conf entry for replication connection from host "[local]", user "standby", no encryption

2024-01-03 13:51:28,843 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:51:38,835 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:51:48,840 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:51:58,842 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:52:08,837 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:52:18,852 INFO: no action. I am (expireon-postgres-1-0), a secondary, and following a leader (expireon-postgres-1-1)
2024-01-03 13:52:19,001 ERROR: Can not fetch local timeline and lsn from replication connection
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 904, in get_replica_timeline
    with self.get_replication_connection_cursor(**self.config.local_replication_address) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/__init__.py", line 899, in get_replication_connection_cursor
    with get_connection_cursor(**conn_kwargs) as cur:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/patroni/postgresql/connection.py", line 43, in get_connection_cursor
    conn = psycopg.connect(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/patroni/psycopg.py", line 42, in connect
    ret = _connect(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL:  no pg_hba.conf entry for replication connection from host "[local]", user "standby", no encryption
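
For context on what is failing here: the traceback ends in Patroni's get_replica_timeline, which periodically opens a replication connection to the local PostgreSQL instance to read its timeline. Because the configuration above sets use_unix_socket_repl: true, that connection goes over the unix socket, which pg_hba treats as a connection of type "local". Every replication entry in the configured pg_hba is of type "host", which matches TCP connections only, so nothing matches and the server rejects the connection with exactly the FATAL shown above. A sketch of an entry that would match it (user and auth method are illustrative and should be adapted):

local   replication   standby   trust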

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

My 2-node PG cluster is always in a synced state without any lag. Failover works fine without data loss.

I was missing the following entry in my pg_hba:
local replication all trust
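
For anyone hitting the same error, here is a sketch of how the corrected pg_hba list could look (order matters, since the first matching entry wins; trust is what I used and should be adapted to your own security requirements):

pg_hba:
- local     all  all  trust
- local     replication    all   trust
- host      all  all  0.0.0.0/0   md5
- host      all  all  ::1/128     md5
- host    replication    postgres             127.0.0.1/0          trust
- host    replication    standby             ::1/128              trust
- host    replication    standby             10.0.0.0/8              trust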

Thanks for your help and the extensive explanation, @CyberDem0n. 🍻🙇🏽