SystemReservedMemory default value insufficient on clusters under load.
pbmoses opened this issue
What happened:
OpenShift 4.6 introduced a SystemMemoryExceedsReservation alert. The default for systemReserved memory on OpenShift nodes appears to be 1Gi. I am seeing a pattern where this is insufficient on clusters under load (on VMware): immediately following upgrades, the alert fires (unless there is an actual issue with the alert itself).
/etc/kubernetes/kubelet.conf
...
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
...
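For reference, the alert compares actual system.slice memory usage against this reservation, which the kubelet derives as capacity minus allocatable. A minimal sketch of that comparison (the 95% threshold factor and all byte counts are assumptions for illustration, not taken from the shipped alerting rule):

```shell
#!/bin/sh
# Sketch of the SystemMemoryExceedsReservation comparison.
# All numbers are hypothetical; the reservation is derived the way the
# alert does it: capacity - allocatable (here it works out to exactly 1Gi).
capacity_bytes=17179869184      # example node memory capacity (16Gi)
allocatable_bytes=16106127360   # example node allocatable (15Gi)
rss_bytes=1100000000            # example measured system.slice RSS

reserved=$((capacity_bytes - allocatable_bytes))
threshold=$((reserved * 95 / 100))   # assumed 95% threshold factor

if [ "$rss_bytes" -gt "$threshold" ]; then
  echo "SystemMemoryExceedsReservation would fire ($rss_bytes > $threshold)"
else
  echo "within reservation"
fi
```

With these sample numbers the measured usage (~1.02 GiB) is already over 95% of the 1Gi reservation, which matches the pattern described above.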
What you expected to happen:
Following an upgrade, I would expect an unmodified cluster not to alert due to insufficient systemReserved memory. This can be alleviated by modifying the kubelet configuration via a custom KubeletConfig CR; however, I believe the default systemReserved should be re-evaluated to appropriately accommodate clusters under load.
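As a concrete example of that workaround, a KubeletConfig CR along these lines raises the reservation (the CR name, the pool selector label, and the 2Gi value are assumptions for illustration, not recommended defaults):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-system-reserved        # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: system-reserved   # assumes this label is set on the target MCP
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 2Gi                  # illustrative value, not a recommendation
      ephemeral-storage: 1Gi
```

The Machine Config Operator rolls this out to the nodes in the selected pool, so the change triggers a reboot of those nodes.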
How to reproduce it (as minimally and precisely as possible):
Upgrade from 4.5 to 4.6+ with the monitoring stack enabled and watch for the alert "SystemMemoryExceedsReservation".
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
  server: 4.7.13
  kubernetes: v1.20.0+df9c838
- Cloud provider or hardware configuration: VMware UPI
- OS (e.g. `cat /etc/os-release`): RHCOS 4.7
- Kernel (e.g. `uname -a`): 4.18.0-240.22.1.el8_3.x86_64
- Others:
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
In response to this:
> Rotten issues close after 30d of inactivity.
> Reopen the issue by commenting /reopen.
> Mark the issue as fresh by commenting /remove-lifecycle rotten.
> Exclude this issue from closing again by commenting /lifecycle frozen.
> /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.