gardener / machine-controller-manager-provider-azure

This repository is the out-of-tree implementation of the machine driver for the Azure cloud provider.

VMSS support is incomplete and not tested

unmarshall opened this issue

What happened:

Currently, the VirtualMachineScaleSet (VMSS) feature is supported via the MachineSet configuration. This feature was introduced via issue.
When the feature was introduced, VMSS had two variants: flexible and uniform. The flexible variant allowed VMs to be distributed across fault domains but had no auto-scaling capabilities.

However, auto-scaling is now available with the flexible variant of VMSS as well. See here. This introduces yet another actor (apart from CA and MCM) that will try to scale the VMs in the VMSS autonomously, without the knowledge of CA or MCM.

Missing Validations

  • Even though we support only the Flexible variant, no validation anywhere enforces this constraint. The constraint is also not documented as part of the API.
  • There is also no validation that, when VMSS Flexible is chosen, no auto-scaling is configured for it.
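
The two missing checks could look roughly like the sketch below. It is only an illustration: `ScaleSetInfo`, `ValidateScaleSet`, and the error variables are hypothetical names, not part of this provider or of the Azure Go SDK. It assumes the orchestration mode and the auto-scaling state have already been fetched from Azure by some other code.

```go
package main

import (
	"errors"
	"fmt"
)

// OrchestrationMode names the two VMSS variants discussed in this issue.
type OrchestrationMode string

const (
	Flexible OrchestrationMode = "Flexible"
	Uniform  OrchestrationMode = "Uniform"
)

// ScaleSetInfo is a hypothetical, simplified view of a VMSS. In a real
// implementation these fields would be populated from Azure API responses.
type ScaleSetInfo struct {
	ID               string
	Mode             OrchestrationMode
	AutoscaleEnabled bool
}

// Hypothetical sentinel errors for the two constraints.
var (
	ErrUnsupportedMode     = errors.New("only the Flexible VMSS variant is supported")
	ErrAutoscaleConfigured = errors.New("VMSS must not have auto-scaling configured")
)

// ValidateScaleSet returns nil when both constraints hold, otherwise an
// error that joins every violated constraint.
func ValidateScaleSet(s ScaleSetInfo) error {
	var errs []error
	if s.Mode != Flexible {
		errs = append(errs, fmt.Errorf("%w: %s has mode %q", ErrUnsupportedMode, s.ID, s.Mode))
	}
	if s.AutoscaleEnabled {
		errs = append(errs, fmt.Errorf("%w: %s", ErrAutoscaleConfigured, s.ID))
	}
	// errors.Join (Go 1.20+) returns nil for an empty slice.
	return errors.Join(errs...)
}

func main() {
	candidates := []ScaleSetInfo{
		{ID: "vmss-a", Mode: Flexible},
		{ID: "vmss-b", Mode: Uniform},
		{ID: "vmss-c", Mode: Flexible, AutoscaleEnabled: true},
	}
	for _, c := range candidates {
		fmt.Printf("%s: %v\n", c.ID, ValidateScaleSet(c))
	}
}
```

Reporting all violations at once, rather than stopping at the first, would let a user fix the VMSS configuration in a single pass.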

Side effect: If the VMSS configuration has issues, VM creation fails. Because our error handling is not comprehensive, MCM keeps retrying even though every retry hits the same failure and can never succeed.
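
One way to stop these futile retries is to mark configuration failures with a distinct error type and skip retries for them. The following is a minimal sketch under that assumption. `ConfigError`, `ErrThrottled`, and `ShouldRetry` are hypothetical names, and it does not reflect how MCM currently classifies errors.

```go
package main

import (
	"errors"
	"fmt"
)

// ConfigError is a hypothetical marker for failures caused by the VMSS
// configuration itself. Retrying with the same configuration cannot succeed.
type ConfigError struct {
	Reason string
}

func (e *ConfigError) Error() string {
	return "invalid VMSS configuration: " + e.Reason
}

// ErrThrottled is a hypothetical example of a transient failure.
var ErrThrottled = errors.New("request throttled")

// ShouldRetry reports whether a failed VM creation is worth retrying.
// Errors that are, or wrap, a *ConfigError are treated as permanent.
func ShouldRetry(err error) bool {
	if err == nil {
		return false
	}
	var cfg *ConfigError
	return !errors.As(err, &cfg)
}

func main() {
	permanent := fmt.Errorf("create VM: %w", &ConfigError{Reason: "auto-scaling is enabled"})
	transient := fmt.Errorf("create VM: %w", ErrThrottled)

	fmt.Println("retry permanent:", ShouldRetry(permanent))
	fmt.Println("retry transient:", ShouldRetry(transient))
}
```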

Additionally, AvailabilitySet was marked as deprecated. It should not be, since it is still fully supported by Azure and they have no plans to deprecate it. In fact, AvailabilitySet also offers distribution of VMs across fault and update domains and has no option to auto-scale, which works quite well for the MCM + CA combination.

Reference: Comparison between VMSS and AvailabilitySet is best described here.

What you expected to happen:

  • Check the new Azure Go SDK for existing APIs to find out:
    • Variant: flexible or uniform. Add validation for this.
    • Auto-scaling configuration: if an auto-scaling configuration is set, it should ideally be disallowed. However, this is still error-prone. Consider a case where a VMSS was first created with no auto-scaling policy and its ID was then specified when creating a VM. Later, the customer changes the VMSS configuration and enables an auto-scaling policy. While we might prevent new VM launches, the existing VMs will still have an issue, and in general there will be an issue with VMs started for that MachineSet/MachineDeployment.
  • Create a VMSS using the Azure portal, experiment with and without auto-scaling, and document all the issues that we foresee. Based on these issues, we should be able to decide whether to continue supporting VMSS (with additional validations) or to drop support completely.
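
The drift scenario above (auto-scaling enabled after VMs already exist) cannot be caught by a creation-time check alone. A periodic re-check could at least surface the affected machines. The sketch below assumes a hypothetical `Machine` record that remembers the VMSS ID used at creation, and a map of current auto-scaling state that some other code would fill from Azure. None of these names come from the provider.

```go
package main

import (
	"fmt"
	"sort"
)

// Machine is a hypothetical record linking a VM to the VMSS ID that was
// specified when the VM was created.
type Machine struct {
	Name   string
	VMSSID string
}

// AffectedMachines returns the sorted names of machines whose VMSS
// currently has auto-scaling enabled. The autoscale map goes from VMSS ID
// to the current auto-scaling state; missing IDs count as not enabled.
func AffectedMachines(machines []Machine, autoscale map[string]bool) []string {
	var names []string
	for _, m := range machines {
		if autoscale[m.VMSSID] {
			names = append(names, m.Name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	machines := []Machine{
		{Name: "worker-2", VMSSID: "vmss-a"},
		{Name: "worker-1", VMSSID: "vmss-a"},
		{Name: "worker-3", VMSSID: "vmss-b"},
	}
	// Assumed state: auto-scaling was enabled on vmss-a after the VMs were created.
	current := map[string]bool{"vmss-a": true, "vmss-b": false}

	fmt.Println(AffectedMachines(machines, current))
}
```

Such a check only reports the problem. What to do with the affected MachineSet/MachineDeployment would still need to be decided as part of the experiments listed above.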