There might be a bug which leads to ClusterAutoScaler fails to handle the ResourceExhausted situation

Question

There might be a bug which leads to ClusterAutoScaler fails to handle the ResourceExhausted situation

dongyingbo opened this issue 9 months ago · comments

What happened:

We are creating multiple worker groups for every kind of workload so that we can fallback to other worker groups when the major one is unhealthy, resource exhausted for example.
It is hard for me to reproduce it in the real world.
But if I read the source codes correctly there might be a defect for the situation.

The provider's CreateMachine function returns error code codes.Internal for all kind of errors.
But cluster auto scaler needs error code codes.ResourceExhausted here, so that it can backoff the group when handleInstanceCreationErrorsForNodeGroup
Here is how aws provider handle with it.

What you expected to happen:
When resource exhausted, it should return correct error code to cluster auto scaler.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

Environment:

Albert Dong · Answer 1 · Fri Oct 20 2023 19:08:22 GMT+0800 (China Standard Time)

Similar to gardener/machine-controller-manager-provider-alicloud#57

Rishabh Patel · Answer 2 · Fri Oct 20 2023 19:34:52 GMT+0800 (China Standard Time)

Hi @dongyingbo, with #105, we move to use the latest azure API. With this change we also introduce the ResourceExhausted error code.

Madhav Bhargava · Answer 3 · Sun Oct 22 2023 01:37:58 GMT+0800 (China Standard Time)

In the first increment we only map ZoneAllocationFailed. See here (This link is from the PR @rishabh-11 pointed to above, which is current under review). As and when we learn of other error codes we will keep adding to this list.

Rishabh Patel · Answer 4 · Fri Mar 29 2024 13:29:03 GMT+0800 (China Standard Time)

/close

#105 is already merged and released.