zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

FATAL: could not connect to the primary server: connection to server at "x.x.x.x", port 5432 failed: session is read-only

novanbramantya opened this issue

What happened?

While setting up a new Patroni cluster needed for rollback, I want to replicate it from a standby Patroni cluster that has not been promoted yet. After the basebackup step it somehow does not continue the replication process and fails with a FATAL error saying the session is read-only.

How can we reproduce it (as minimally and precisely as possible)?

I have an existing Patroni PostgreSQL cluster at one cloud provider and need to migrate it to another cloud provider. Replication to the new provider is done and that cluster is set as a standby leader. After that I need to set up the rollback Patroni cluster, but when I start the Patroni service I get this error:
FATAL: could not connect to the primary server: connection to server at "x.x.x.x", port 5432 failed: session is read-only
Most likely Patroni doesn't allow replication from a replica instead of the leader.

What did you expect to happen?

I need to replicate the data from the standby Patroni cluster to the rollback cluster while it is still in a read-only session, until it gets promoted.

Patroni/PostgreSQL/DCS version

  • Patroni version: patroni 3.2.2
  • PostgreSQL version: 12.18
  • DCS (and its version): etcd 3.4

Patroni configuration file

namespace: patroni
scope: test_rollback_postgres
name: test-rollback-postgres-n1
etcd3:
  url: https://test-patroni-etcd.service.consul:2379
  cert: /etc/etcd/etcd.pem
  key: /etc/etcd/etcd-key.pem
  cacert: /etc/etcd/root-ca.pem
postgresql:
  bin_dir: /usr/lib/postgresql/12/bin
  use_unix_socket: true
  listen: 0.0.0.0:5432
  config_dir: /etc/postgresql/patroni
  data_dir: /data/postgres/data/
  pgpass: /var/lib/postgresql/.pgpass
  pg_ctl_timeout: 60
  connect_address: x.x.x.x:5432
  authentication:
    superuser:
      username: admin_user
      password: admin_pass
    replication:
      username: repl_user
      password: repl_pass
    rewind:
      username: repl_user
      password: repl_pass
  basebackup:
    - progress
    - slot: "rollback_master"
    - verbose
  parameters:
    work_mem: 8MB
    archive_timeout: 60
    archive_command: "/usr/local/bin/wal-g-push-wal.sh %p"
    checkpoint_completion_target: 0.7
    checkpoint_timeout: 15min
    hot_standby_feedback: on
    log_checkpoints: on
    log_destination: 'stderr'
    log_directory: '/var/log/postgresql'
    log_file_mode: 0600
    log_filename: 'postgresql-patroni.log'
    log_line_prefix: '%t [%p] %q%u@%d %h '
    log_rotation_age: 0
    log_rotation_size: 0
    log_min_duration_statement: 100
    maintenance_work_mem: 512MB
    log_statement: 'ddl'
    log_timezone: 'Asia/Jakarta'
    timezone: 'Asia/Jakarta'
    max_wal_size: 1GB
    min_wal_size: 80MB
    ssl: off
  pg_hba:
bootstrap:
  dcs:
    standby_cluster:
      host: x.x.x.x
      port: 5432
      primary_slot_name: rollback_master
      create_replica_methods:
        - basebackup
        - slot: "rollback_master"
        - progress
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 50000000
    master_start_timeout: 300
    synchronous_mode: false
    synchronous_mode_strict: false
    check_timeline: false
    ignore_slots:
      - type: logical
        plugin: wal2json
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_connections: 1024
        max_locks_per_transaction: 64
        max_worker_processes: 8
        max_prepared_transactions: 0
        wal_level: logical
        wal_log_hints: on
        track_commit_timestamp: off
        shared_preload_libraries: pg_stat_statements,pglogical,wal2json,pg_partman_bgw
        archive_mode: on
        shared_buffers: 1GB
        pg_stat_statements.track: all
        hot_standby: on
        logging_collector: on
        log_truncate_on_rotation: off
        log_lock_waits: on
        wal_keep_segments: 100
        max_wal_senders: 15
        max_replication_slots: 20
        pg_partman_bgw.interval: 3600
        pg_partman_bgw.dbname: 'testdb'
  initdb:
    - encoding: UTF8
    - data-checksums
  users:
    admin_user:
      password: admin_pass
    repl_user:
      password: repl_pass
      options:
        - replication
watchdog:
  mode: automatic
  device: /dev/watchdog
  safety_margin: 5
log:
  level: INFO
  traceback_level: ERROR
restapi:
  listen: 0.0.0.0:8008
  connect_address: x.x.x.x:8008
tags:
  nofailover: false
  noloadbalance: false

patronictl show-config

check_timeline: false
ignore_slots:
- plugin: wal2json
  type: logical
loop_wait: 10
master_start_timeout: 300
maximum_lag_on_failover: 50000000
postgresql:
  parameters:
    archive_mode: true
    hot_standby: true
    log_lock_waits: true
    log_truncate_on_rotation: false
    logging_collector: true
    max_connections: 1024
    max_locks_per_transaction: 64
    max_prepared_transactions: 0
    max_replication_slots: 20
    max_wal_senders: 15
    max_worker_processes: 8
    pg_partman_bgw.dbname: testdb
    pg_partman_bgw.interval: 3600
    pg_stat_statements.track: all
    shared_buffers: 1GB
    shared_preload_libraries: pg_stat_statements,pglogical,wal2json,pg_partman_bgw
    track_commit_timestamp: false
    wal_keep_segments: 100
    wal_level: logical
    wal_log_hints: true
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
standby_cluster:
  create_replica_methods:
  - basebackup
  - slot: rollback_master
  - progress
  host: x.x.x.x
  port: 5432
  primary_slot_name: rollback_master
synchronous_mode: false
synchronous_mode_strict: false
ttl: 30

Patroni log files

    self.handle_one_request()
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 1338, in handle_one_request
    BaseHTTPRequestHandler.handle_one_request(self)
  File "/usr/lib/python3.10/http/server.py", line 421, in handle_one_request
    method()
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 446, in do_GET_patroni
    self._write_status_response(200, response)
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 218, in _write_status_response
    self._write_json_response(status_code, response)
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 167, in _write_json_response
    self.write_response(status_code, json.dumps(response, default=str), content_type='application/json')
  File "/usr/local/lib/python3.10/dist-packages/patroni/api.py", line 157, in write_response
    self.wfile.write(body.encode('utf-8'))
  File "/usr/lib/python3.10/socketserver.py", line 826, in write
    self._sock.sendall(b)
BrokenPipeError: [Errno 32] Broken pipe

PostgreSQL log files

2024-03-28 12:25:14 WIB [23648] LOG:  database system is ready to accept read only connections
2024-03-28 12:25:15 WIB [23686] FATAL:  could not connect to the primary server: connection to server at "x.x.x.x", port 5444 failed: session is read-only
2024-03-28 12:25:16 WIB [23747] FATAL:  could not connect to the primary server: connection to server at "x.x.x.x", port 5444 failed: session is read-only

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

No response

Most likely Patroni doesn't allow replication from a replica instead of the leader.

To be precise, the standby leader wants to replicate from the primary. Yes.

I need to replicate the data from the standby Patroni cluster to the rollback cluster while it is still in a read-only session, until it gets promoted.

There is no real need to do that. What you should do instead is gracefully convert the cluster in DC1 to a standby, as described here: #1660 (comment)
In fact, you can put the host and port into the standby_cluster section in DC1 right away, so that after the standby cluster in DC2 is promoted, the cluster in DC1 will start replicating from it.
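
A minimal sketch of what that could look like in the DC1 cluster's dynamic configuration, applied with patronictl edit-config (the host, port, and slot name below are placeholders, not values taken from this issue):

standby_cluster:
  host: x.x.x.x                    # connect address of the DC2 standby leader (the future primary)
  port: 5432
  primary_slot_name: dc2_to_dc1    # optional, hypothetical slot name on the DC2 side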

Thanks for the explanation! Really appreciate it.

But in my use case the service has a very large transaction volume. If I immediately set up fail-forward replication from DC2 back to DC1, it might work, but it might not; it's hard to guarantee.

So instead of doing that, we create a separate rollback cluster in DC1, independent of the existing cluster in DC1.
So we actually have a migration project between DC1 and DC2.

So the topology is:
existing Patroni cluster DC1 (leader) -> standby Patroni cluster DC2 (standby leader / replica of the existing DC1 leader) -> rollback Patroni cluster DC1 (standby leader / replica of the DC2 standby leader)
The two DCs have separate DCS clusters.

With the above topology, if anything goes wrong in DC2 we can immediately promote the rollback cluster in DC1, and we don't need to worry about transaction WAL because we set up a replication slot between DC2 and the rollback cluster in DC1 from the beginning.
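
As a rough sketch of that promotion step, assuming promotion is done by removing the standby_cluster section from the rollback cluster's dynamic configuration with patronictl edit-config (cluster name and values taken from the configuration above):

# patronictl edit-config test_rollback_postgres
# delete this whole block; once it is gone, the standby leader gets promoted to a regular leader:
standby_cluster:
  host: x.x.x.x
  port: 5432
  primary_slot_name: rollback_master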

Thanks

Hi @CyberDem0n, sorry to tag you again. Regarding this PR, is there any information on when it will get approval? Thanks!