There might be a bug which leads to ClusterAutoScaler fails to handle the ResourceExhausted situation
dongyingbo opened this issue · comments
What happened:
We are creating multiple worker groups for every kind of workload so that we can fallback to other worker groups when the major one is unhealthy, resource exhausted for example.
It is hard for me to reproduce it in the real world.
But if I read the source codes correctly there might be a defect for the situation.
The provider's CreateMachine function returns error code codes.Internal
for all kind of errors.
But cluster auto scaler needs error code codes.ResourceExhausted
here, so that it can backoff the group when handleInstanceCreationErrorsForNodeGroup
Here is how aws provider handle with it.
What you expected to happen:
When resource exhausted, it should return correct error code to cluster auto scaler.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
Environment:
Hi @dongyingbo, with #105, we move to use the latest azure API. With this change we also introduce the ResourceExhausted
error code.
In the first increment we only map ZoneAllocationFailed
. See here (This link is from the PR @rishabh-11 pointed to above, which is current under review). As and when we learn of other error codes we will keep adding to this list.
/close
#105 is already merged and released.