VMSS support is incomplete and not tested
unmarshall opened this issue
What happened:
Currently the `VirtualMachineScaleSet` (VMSS) feature is supported via the MachineSet configuration. This feature was introduced via issue.
At the time this feature was introduced there were two variants: `flexible` and `uniform`. Under the `flexible` variant, VMs could be distributed across fault domains without any auto-scaling capabilities.
However, auto-scaling is now available in the `flexible` variant of VMSS as well. See here. This introduces yet another actor (apart from CA + MCM) that will try to autonomously scale the VMs in the VMSS without the knowledge of CA or MCM.
Missing Validations
- Even though we only support the `Flexible` variant, there is no validation anywhere to enforce this constraint. The API also lacks any documentation of this constraint.
- There is also no validation to check that, when the VMSS `Flexible` variant is chosen, no auto-scaling is configured for it.
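The two missing checks could look roughly like the sketch below. It is only an illustration: `ScaleSetInfo`, `ValidateScaleSet` and the error variables are hypothetical names, not part of the provider or the Azure SDK. It assumes the caller has already looked up the scale set's orchestration mode and determined whether any enabled autoscale setting targets the VMSS.

```go
package main

import (
	"errors"
	"fmt"
)

// OrchestrationMode mirrors the two VMSS variants discussed in this issue.
type OrchestrationMode string

const (
	Flexible OrchestrationMode = "Flexible"
	Uniform  OrchestrationMode = "Uniform"
)

// ScaleSetInfo is a hypothetical, provider-local view of a VMSS.
// A real implementation would populate it from Azure API responses.
type ScaleSetInfo struct {
	ID               string
	Mode             OrchestrationMode
	AutoscaleEnabled bool // assumed: true if an enabled autoscale setting targets this VMSS
}

var (
	ErrUnsupportedVariant = errors.New("only the Flexible VMSS variant is supported")
	ErrAutoscaleEnabled   = errors.New("VMSS must not have auto-scaling configured")
)

// ValidateScaleSet enforces both constraints from the list above.
func ValidateScaleSet(s ScaleSetInfo) error {
	if s.Mode != Flexible {
		return fmt.Errorf("%w: %s has orchestration mode %q", ErrUnsupportedVariant, s.ID, s.Mode)
	}
	if s.AutoscaleEnabled {
		return fmt.Errorf("%w: %s", ErrAutoscaleEnabled, s.ID)
	}
	return nil
}

func main() {
	candidates := []ScaleSetInfo{
		{ID: "vmss-a", Mode: Flexible},
		{ID: "vmss-b", Mode: Uniform},
		{ID: "vmss-c", Mode: Flexible, AutoscaleEnabled: true},
	}
	for _, s := range candidates {
		if err := ValidateScaleSet(s); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", s.ID)
	}
}
```

A check like this only covers the moment of validation. It does not detect a VMSS whose auto-scaling is enabled after machines have already been created.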
Side effect: If the VMSS configuration is invalid, VM creation fails. Since our error handling is not comprehensive, MCM keeps retrying even though every retry hits the same failure and can never succeed.
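One way to avoid the pointless retries is to mark configuration failures as permanent so the caller can tell them apart from transient ones. The sketch below is a minimal illustration using only the Go standard library. `ConfigError`, `IsRetryable` and `errTimeout` are hypothetical names and do not reflect how MCM currently classifies errors.

```go
package main

import (
	"errors"
	"fmt"
)

// ConfigError marks a failure caused by an invalid VMSS configuration.
// Retrying cannot fix it; the configuration has to change first.
type ConfigError struct {
	Reason string
}

func (e *ConfigError) Error() string {
	return "invalid VMSS configuration: " + e.Reason
}

// errTimeout stands in for a transient failure in this example.
var errTimeout = errors.New("request timed out")

// IsRetryable reports whether repeating the operation could succeed.
// Any error that wraps a *ConfigError is treated as permanent.
func IsRetryable(err error) bool {
	if err == nil {
		return false
	}
	var cfgErr *ConfigError
	return !errors.As(err, &cfgErr)
}

func main() {
	permanent := fmt.Errorf("create VM: %w", &ConfigError{Reason: "orchestration mode is Uniform"})
	fmt.Println(IsRetryable(errTimeout)) // true
	fmt.Println(IsRetryable(permanent))  // false
}
```

Because `errors.As` walks the wrap chain, the classification survives additional context added with `%w` on the way up to the retry loop.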
Additionally, `AvailabilitySet` was marked as deprecated. It should not be, as it is still fully supported by Azure and they have no plans to deprecate it. In fact, `AvailabilitySet` also offers distribution of VMs across fault and update domains and has no option to auto-scale, which works quite well for the MCM + CA combination.
Reference: Comparison between VMSS and AvailabilitySet is best described here.
What you expected to happen:
- Check the new Azure Go SDK for existing APIs to find out:
  - Variant: `flexible` or `uniform`, and add this validation.
  - Auto-scaling configuration: if an auto-scaling configuration is set, it should ideally be disallowed. However, this is still error prone. Consider a case where the VMSS was first created without an auto-scaling policy and the VMSS ID was then specified when creating a VM. Later the customer changes the VMSS configuration and enables an auto-scaling policy. While we might prevent new VM launches, existing VMs will still be affected, and in general there will be an issue with VMs started for that MachineSet/MachineDeployment.
- Create a VMSS using the Azure portal, experiment with and without auto-scaling, and document all issues that we foresee. Based on these issues we should be able to decide whether to continue supporting VMSS (with additional validations) or drop support completely.