gardener / gardener

Homogeneous Kubernetes clusters at scale on any infrastructure using hosted control planes.

Home Page: https://gardener.cloud


[Flaky Test] HA Multi zones tests timeout frequently

MichaelEischer opened this issue

How to categorize this issue?

/area testing
/kind flake

Which test(s)/suite(s) are flaking:

The tests run by the ProwJob `pull-gardener-e2e-kind-ha-multi-zone`, more specifically `[It] Shoot Tests Shoot with workers Create, Update, Delete [Shoot, default, basic, simple]`.

CI link:

https://prow.gardener.cloud/view/gs/gardener-prow/pr-logs/pull/gardener_gardener/9449/pull-gardener-e2e-kind-ha-multi-zone/1779763882739372032
https://testgrid.k8s.io/gardener-gardener#ci-gardener-e2e-kind-ha-multi-zone, for example https://prow.gardener.cloud/view/gs/gardener-prow/logs/ci-gardener-e2e-kind-ha-multi-zone/1779501481980858368

Reason for failure:

Apparently, there are too many machines, so the `EveryNodeReady` condition never becomes true and the shoot reconciliation times out:

{"level":"info","ts":"2024-04-14T18:03:00.661Z","logger":"shoot-test.test","msg":"Shoot is not yet reconciled","shoot":{"name":"e2e-default","namespace":"garden-local"},"reason":"condition type EveryNodeReady is not true yet, had message too many worker nodes are registered. Exceeding maximum desired machine count (4/3) with reason NodesScalingDown"}

I noticed this flaky test as part of #9449, but the test runs on Testgrid also contain the exact same error message.
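For illustration, the kind of check behind the reported message can be sketched as a simple comparison of the registered node count against the worker pool's maximum desired machine count. This is a minimal, hypothetical sketch (the function name and structure are illustrative assumptions, not Gardener's actual health-check implementation):

```python
def check_node_count(registered_nodes: int, max_desired: int) -> tuple[bool, str]:
    """Return (healthy, message) for a worker pool.

    Hypothetical sketch: the EveryNodeReady-style condition fails while
    more nodes are registered than the maximum desired machine count,
    e.g. while surplus nodes are still scaling down.
    """
    if registered_nodes > max_desired:
        return False, (
            "too many worker nodes are registered. Exceeding maximum "
            f"desired machine count ({registered_nodes}/{max_desired})"
        )
    return True, "all worker nodes are within the desired machine count"


# The failing run above reported 4 registered nodes against a maximum of 3:
healthy, msg = check_node_count(4, 3)
print(healthy, "-", msg)
```

Under this reading, the flake is a timing issue: the test times out while the condition is still false because the surplus node (4/3) has not finished scaling down yet.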

The Gardener project currently lacks enough active contributors to adequately respond to all issues.
This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Mark this issue as rotten with /lifecycle rotten
  • Close this issue with /close

/lifecycle stale