gardener / machine-controller-manager-provider-azure

This repository is the out of tree implementation of the machine driver for Azure cloud provider

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Upgrade to latest Azure Go SDK from out-of-support go-autorest

himanshu-kun opened this issue · comments

What would you like to be added:

Currently the go-autorest version we are using is out of support as of March 31 2023. Also there is no way to re-generate mocks using Makefile, which makes it tough for external contributors to contribute.

Tasks:
API Changes

CRD Changes

  • Change the CRD and mark MachineSet as deprecated and introduce VirtualMachineScaleSet

Implementation Changes

  • Refactor validation code and utils code
  • Upgrade go.mod to use the latest Azure GO SDK.
  • new go-sdk provides fake package use that instead of using mocks. See issue
  • Refactor SPI package
  • When creating VirtualMachine set DeletionOption as Delete for both NIC and Disks. This will ensure that when the VM gets deleted the associated disks and nics will also be deleted.
  • Adapt to all new APIs (there has been a significant change in the APIs)
  • In the existing code SDK defaults the VM creation timeout to 15 mins. Check if there is a default timeout in the new SDK, if there is none then create a child context with timeout set to 15 mins to ensure that it is also in line what we have today. Check for operations - delete VM, NIC, Disk and create NIC what are the timeouts (either default or explicitly specified) and for now ensure that in the new code they are the same. We can revise the timeouts once we have metrics.

VMSS Support
During refactoring an issue w.r.t VMSS support was identified. It was collectively decided that we will not handle this issue as part of this issue so a separate issue has been created to fix VMSS support.

Metrics

  • Introduce metrics to capture:
    • API request duration - captured for all APIs made to azure.
    • Time taken to complete driver operations (e.g total time take to complete CreateMachine, DeleteMachine)
    • Status of an operation - capture failures/success

Misc

  • Convert the new API test code into OQ tests project and schedule it for regular run in CI/CD.
  • Take a look at #41
  • GetMachineStatus - the only usage of this is in triggerCreationFlow (MCM). We should just get machine names from VMs and not from NICs and Disks. This will resolve the issue of create machine not called if only a NIC is present. Currently GetMachineStatus leverages ListMachines which should no longer be the case.
  • ListMachines - we collect VM names from NICs and VMs (no need for disks as these are created and deleted along with the VM) since it impacts orphan collection and any change should be handled in a separate PR.
  • Implement error code mapping for Resource exhaustion.
  • #111

Tests

  • Remove ginkgo based unit tests and use native golang tests instead for better readability and maintainability.
  • Update integration tests to use the new API. (There is currently no stand-alone integration tests for a MCM-provider)

Out of Scope

  • AzureNetworkProfile.NetworkInterfaces is plural and should have had a type []AzureNetworkInterfaceReference but instead it only has a single AzureNetworkInterfaceReference. The provider API also provides an ability to set more than one NICs for a VM. So either the name of the API property needs to be change and reflect that it only allows one NIC per VM or it should make it a slice. This will not be handled as part of this issue.
  • DataDisk#Lun is currently a pointer which indicates that this field is optional and can accept a nil value. This is not true. It is a mandatory field and there is a explicit validation to ensure that it is always set. This is a API design mistake and will not be corrected as part of this PR. A separate Issue/PR should be created if there is a need to change this.

Why is this needed:
Current go-autorest is no longer supported by Azure/MS. It is therefore recommended by Azure to move to their new set of APIs.

A VMImage used today is a marketplace image. This requires a purchase plan to exist and the customer is expected to create an agreement. If an agreement GET returns an error then we just return the error. In a conversation with @dkistner and @kon-angelo we agreed on the following:

  • Keep the creation of agreement as a prerequisite for the customer.
  • Handle 404 error and have a log explicitly stating that agreement needs to be created. Still return the error and do not continue VM creation.
  • Keep the existing behavior of checking if agreement is not accepted then update the agreement to be Accepted= true