gardener / machine-controller-manager-provider-azure

This repository is the out-of-tree implementation of the machine driver for the Azure cloud provider.

There might be a bug that causes ClusterAutoScaler to fail to handle the ResourceExhausted situation

dongyingbo opened this issue

What happened:

We create multiple worker groups for every kind of workload so that we can fall back to other worker groups when the primary one is unhealthy, for example when its resources are exhausted.
The problem is hard for me to reproduce in a real environment.
But if I read the source code correctly, there might be a defect in how this situation is handled.

The provider's CreateMachine function returns the error code codes.Internal for all kinds of errors.
But the cluster autoscaler needs the error code codes.ResourceExhausted here, so that it can back off the group in handleInstanceCreationErrorsForNodeGroup.
Here is how the AWS provider handles it.

What you expected to happen:
When resources are exhausted, it should return the correct error code to the cluster autoscaler.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

Environment:

Hi @dongyingbo, with #105 we are moving to the latest Azure API. With this change we also introduce the ResourceExhausted error code.

In the first increment we only map ZoneAllocationFailed. See here (this link is from the PR @rishabh-11 pointed to above, which is currently under review). As and when we learn of other error codes, we will keep adding to this list.
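
The mapping described in this comment could look roughly like the sketch below. It is an assumption-laden illustration, not the code from the PR: `StatusCode`, `resourceExhaustedAzureCodes`, and `statusCodeFor` are hypothetical names, the constants are local stand-ins mirroring the numeric values of gRPC's codes.ResourceExhausted (8) and codes.Internal (13), and "SomeOtherError" is a placeholder rather than a real Azure error code.

```go
package main

import "fmt"

// StatusCode is a local stand-in for the gRPC status code type (assumption for this sketch).
type StatusCode uint32

const (
	CodeResourceExhausted StatusCode = 8  // same numeric value as gRPC codes.ResourceExhausted
	CodeInternal          StatusCode = 13 // same numeric value as gRPC codes.Internal
)

// resourceExhaustedAzureCodes lists the Azure error codes treated as capacity
// exhaustion. Per the comment above, only ZoneAllocationFailed is mapped in
// the first increment; further codes would be added to this set.
var resourceExhaustedAzureCodes = map[string]struct{}{
	"ZoneAllocationFailed": {},
}

// statusCodeFor translates an Azure error code into the status code that is
// returned to the caller. Unknown codes fall back to Internal.
func statusCodeFor(azureErrorCode string) StatusCode {
	if _, ok := resourceExhaustedAzureCodes[azureErrorCode]; ok {
		return CodeResourceExhausted
	}
	return CodeInternal
}

func main() {
	fmt.Println(statusCodeFor("ZoneAllocationFailed")) // 8
	fmt.Println(statusCodeFor("SomeOtherError"))       // 13
}
```

Keeping the known codes in a single set means that supporting a newly discovered exhaustion error is a one-line change.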

/close

#105 is already merged and released.