openshift / kubernetes

This is the repo that tracks all patches to the OpenShift distribution of Kubernetes on branches corresponding to OpenShift releases. See https://github.com/openshift/kubernetes/blob/master/README.openshift.md for more

Home Page: http://kubernetes.io


SystemReservedMemory default value insufficient on clusters under load.

pbmoses opened this issue · comments


What happened:

OpenShift 4.6 introduced the SystemMemoryExceedsReservation alert. The default systemReserved memory on OpenShift nodes appears to be 1Gi. I am seeing a pattern where this is insufficient on clusters under load (on VMware): immediately following upgrades, the alert fires (unless there is an actual issue with the alert itself).

/etc/kubernetes/kubelet.conf:

```yaml
...
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
...
```
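As I understand the alert semantics, it fires when actual system-slice memory usage exceeds the configured reservation. A minimal sketch of that comparison (the 0.95 threshold is an assumption for illustration, not the actual alerting rule):

```python
# Hypothetical sketch of the SystemMemoryExceedsReservation comparison.
# The 0.95 threshold is an illustrative assumption, not the real rule.
GiB = 1024 ** 3

def system_memory_exceeds_reservation(system_usage_bytes, reserved_bytes, threshold=0.95):
    """Return True when system-slice usage crosses the reservation threshold."""
    return system_usage_bytes > reserved_bytes * threshold

# With the 1Gi default, ~1.1Gi of system.slice usage would trip the alert:
print(system_memory_exceeds_reservation(1.1 * GiB, 1 * GiB))  # True
print(system_memory_exceeds_reservation(0.5 * GiB, 1 * GiB))  # False
```

Under this model, any node whose system daemons routinely use close to 1Gi will alert after the upgrade even though nothing is wrong with the workload.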

What you expected to happen:

Following an upgrade, I would expect an unmodified cluster not to alert due to insufficient systemReserved memory. This can be alleviated by modifying the kubelet config via a custom KubeletConfig CR; however, I believe the default systemReserved should be re-evaluated to appropriately accommodate clusters under load.
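For reference, the workaround via a custom KubeletConfig CR looks roughly like the following. The name, the 2Gi/1000m values, and the worker pool label are illustrative assumptions, not recommended defaults:

```yaml
# Hypothetical KubeletConfig CR raising systemReserved on worker nodes.
# Values and pool selector are illustrative, not recommendations.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: increased-system-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:
      cpu: 1000m
      memory: 2Gi
```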

How to reproduce it (as minimally and precisely as possible):

Upgrade from 4.5 to 4.6+ with the monitoring stack enabled, then watch for the alert SystemMemoryExceedsReservation.
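To confirm what a node is actually reserving, one can read the rendered kubelet config. A self-contained sketch against a sample file mirroring the snippet above (on a live node the file is /etc/kubernetes/kubelet.conf, reachable e.g. via `oc debug node/<node>` — that invocation is the usual pattern, shown here only as a comment):

```shell
# On a real cluster you would run something like:
#   oc debug node/<node> -- chroot /host grep -A3 'systemReserved:' /etc/kubernetes/kubelet.conf
# Self-contained demo using a sample file with the values from this report:
cat > /tmp/kubelet-sample.conf <<'EOF'
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
EOF
grep -A3 'systemReserved:' /tmp/kubelet-sample.conf
```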

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    server: 4.7.13
    kubernetes: v1.20.0+df9c838
  • Cloud provider or hardware configuration: VMware UPI
  • OS (e.g. cat /etc/os-release): RHCOS 4.7
  • Kernel (e.g. uname -a): 4.18.0-240.22.1.el8_3.x86_64
  • Others:

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.