openshift / kubernetes

This is the repo that tracks all patches to the OpenShift distribution of Kubernetes on branches corresponding to OpenShift releases. See https://github.com/openshift/kubernetes/blob/master/README.openshift.md for more

Home Page: http://kubernetes.io


SystemReservedMemory default value insufficient on clusters under load.

pbmoses opened this issue · comments


What happened:

OpenShift 4.6 introduced the SystemMemoryExceedsReservation alert. The default systemReserved memory on OpenShift nodes appears to be 1Gi. I am seeing a pattern where this is insufficient on clusters under load (on VMware): immediately following upgrades, the alert fires (unless there is an actual issue with the alert itself).

/etc/kubernetes/kubelet.conf:

```yaml
...
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
...
```
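As I understand the alert semantics, it fires when actual system-slice memory usage exceeds the configured reservation. A minimal sketch of that comparison (the 0.95 threshold is an assumption for illustration, not the actual alerting rule):

```python
# Hypothetical sketch of the SystemMemoryExceedsReservation comparison.
# The 0.95 threshold is an illustrative assumption, not the real rule.
GiB = 1024 ** 3

def system_memory_exceeds_reservation(system_usage_bytes, reserved_bytes, threshold=0.95):
    """Return True when system-slice usage crosses the reservation threshold."""
    return system_usage_bytes > reserved_bytes * threshold

# With the 1Gi default, ~1.1Gi of system.slice usage would trip the alert:
print(system_memory_exceeds_reservation(1.1 * GiB, 1 * GiB))  # True
print(system_memory_exceeds_reservation(0.5 * GiB, 1 * GiB))  # False
```

Under this model, any node whose system daemons routinely use close to 1Gi will alert after the upgrade even though nothing is wrong with the workload.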

What you expected to happen:

Following an upgrade, I would expect an unmodified cluster not to alert due to insufficient systemReserved memory. This can be alleviated by modifying the kubelet config via a custom KubeletConfig CR; however, I believe the default systemReserved should be re-evaluated to appropriately accommodate clusters under load.
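For reference, the workaround via a custom KubeletConfig CR looks roughly like the following. The name, the 2Gi/1000m values, and the worker pool label are illustrative assumptions, not recommended defaults:

```yaml
# Hypothetical KubeletConfig CR raising systemReserved on worker nodes.
# Values and pool selector are illustrative, not recommendations.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: increased-system-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:
      cpu: 1000m
      memory: 2Gi
```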

How to reproduce it (as minimally and precisely as possible):

Upgrade from 4.5 to 4.6+ with the monitoring stack enabled, then watch for the alert SystemMemoryExceedsReservation.
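To confirm what a node is actually reserving, one can read the rendered kubelet config. A self-contained sketch against a sample file mirroring the snippet above (on a live node the file is /etc/kubernetes/kubelet.conf, reachable e.g. via `oc debug node/<node>` — that invocation is the usual pattern, shown here only as a comment):

```shell
# On a real cluster you would run something like:
#   oc debug node/<node> -- chroot /host grep -A3 'systemReserved:' /etc/kubernetes/kubelet.conf
# Self-contained demo using a sample file with the values from this report:
cat > /tmp/kubelet-sample.conf <<'EOF'
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
EOF
grep -A3 'systemReserved:' /tmp/kubelet-sample.conf
```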

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    server: 4.7.13
    kubernetes: v1.20.0+df9c838
  • Cloud provider or hardware configuration: VMware UPI
  • OS (e.g. cat /etc/os-release): RHCOS 4.7
  • Kernel (e.g. uname -a): 4.18.0-240.22.1.el8_3.x86_64
  • Others:

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.