zalando / patroni

A template for PostgreSQL High Availability with Etcd, Consul, ZooKeeper, or Kubernetes

Master Postgres Pod Was Out Of Memory - Postgres Was Inaccessible.

avi1818 opened this issue

What happened?

Hi,

The master Postgres pod ran out of memory and, as a result, Postgres became inaccessible. I thought that Patroni would either promote one of the slaves or restart the pod that was the master.
What is the expected behavior in such a case?

kernel: postgres invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=993
kernel: [] oom_kill_process+0x2cd/0x490
kernel: Task in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7eab2e7a_917c_4996_bb0f_0ccbb75719ce.slice/docker-df58adbf214854f03fbdef15ac336f1b319038a93995a993a43abe286bbca815.scope killed as a result of limit of /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod7eab2e7a_917c_4996_bb0f_0ccbb75719ce.slice
kernel: Memory cgroup out of memory: Kill process 23316 (postgres) score 1033 or sacrifice child
kernel: Killed process 23316 (postgres), UID 101, total-vm:8221844kB, anon-rss:1195384kB, file-rss:19160kB, shmem-rss:112164kB
abrt-hook-ccpp: Process 438 (postgres) of user 101 killed by SIGABRT - dumping core

How can we reproduce it (as minimally and precisely as possible)?

none

What did you expect to happen?

none

Patroni/PostgreSQL/DCS version

  • Patroni version: 2.0.1
  • PostgreSQL version: 13
  • DCS (and its version):

Patroni configuration file

none

patronictl show-config

2024-02-29 19:23:51,905 - WARNING - Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
loop_wait: 10
maximum_lag_on_failover: 33554432
postgresql:
  parameters:
    archive_mode: false
    archive_timeout: 1800s
    autovacuum_analyze_scale_factor: 0.02
    autovacuum_max_workers: 5
    autovacuum_vacuum_scale_factor: 0.05
    checkpoint_completion_target: 0.9
    hot_standby: 'on'
    log_autovacuum_min_duration: 0
    log_checkpoints: 'on'
    log_connections: 'on'
    log_destination: stderr
    log_disconnections: 'on'
    log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
    log_lock_waits: 'on'
    log_min_duration_statement: 500
    log_statement: ddl
    log_temp_files: 0
    logging_collector: false
    max_connections: 533
    max_logical_replication_workers: 90
    max_replication_slots: 90
    max_slot_wal_keep_size: 5000
    max_wal_senders: 90
    max_worker_processes: '90'
    tcp_keepalives_idle: 900
    tcp_keepalives_interval: 100
    track_commit_timestamp: 'on'
    track_functions: all
    wal_level: logical
    wal_log_hints: 'on'
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30
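
As a side note, the warning at the top of the show-config output means the Patroni service account is not allowed to GET the 'kubernetes' endpoint in the 'default' namespace, so Patroni disables 'bypass_api_service'. A minimal sketch of the missing RBAC permission is shown below; the Role/RoleBinding names and the service account are assumptions, adjust them to your deployment:

# Hypothetical RBAC sketch granting Patroni read access to the 'kubernetes'
# endpoint in the 'default' namespace (needed for 'bypass_api_service').
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: patroni-k8s-ep-access      # placeholder name
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["kubernetes"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: patroni-k8s-ep-access      # placeholder name
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: patroni-k8s-ep-access
subjects:
  - kind: ServiceAccount
    name: patroni          # placeholder: the service account the Patroni pods run as
    namespace: default     # placeholder: the namespace the Patroni pods run in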

Patroni log files

none

PostgreSQL log files

none

Have you tried to use GitHub issue search?

  • Yes

Anything else we need to know?

none

I thought that Patroni would either promote one of the slaves or restart the pod that was the master.

There are no slaves in the PostgreSQL world. What Patroni does in this situation depends on many things; by default it will start the failed Postgres back up, and I am quite confident that it did exactly that. The proof should be in the Patroni logs, but you didn't check or provide them.
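
For what it's worth, this is easy to verify yourself. A quick sketch, assuming a Kubernetes deployment (the config path, cluster name and pod name below are placeholders):

# Show the current role, state and lag of every cluster member.
patronictl -c /home/postgres/postgres.yml list my-cluster

# Patroni logs to the container's stdout/stderr, so the post-OOM restart of
# postgres should be visible in the pod logs.
kubectl logs my-cluster-0 --since=24h | grep -Ei 'starting|promote|failover|crash'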

Patroni version: 2.0.1

This is a very old version; please update to the latest (3.2.2 at the time of writing) as soon as possible.
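
Patroni is published on PyPI, so inside the image the upgrade is essentially a pip install; a sketch is below (in a Kubernetes setup you would normally bump the image tag rather than patch a running container, and the version pin is just the one current at the time of this reply):

# Upgrade Patroni with the Kubernetes DCS extra; the pinned version is an example.
pip install --upgrade 'patroni[kubernetes]==3.2.2'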

In general, OOM is not a Patroni problem. It is your task to give Patroni/Postgres enough resources to work.
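
The kernel log above shows a memory-cgroup OOM, i.e. the postgres container hit its own memory limit, so that limit needs to comfortably cover shared_buffers plus per-backend memory for the configured max_connections of 533. A rough sketch of the relevant part of the pod/StatefulSet spec follows; the numbers are placeholders, not recommendations:

# Sketch of the container resources section; size the memory limit from
# shared_buffers, work_mem/maintenance_work_mem and the 533 max_connections
# configured above, rather than copying these placeholder values.
resources:
  requests:
    memory: 4Gi     # placeholder
    cpu: "1"        # placeholder
  limits:
    memory: 4Gi     # matching requests and limits for all containers yields Guaranteed QoS
    cpu: "1"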