microsoft / CromwellOnAzure

Microsoft Genomics implementation of the Broad Institute's Cromwell workflow engine on Azure

Create and pass vm worker vnet with Service Endpoints to Batch to avoid Batch LB traffic bottleneck

jlester-msft opened this issue

Problem:
Currently, Cromwell on Azure does not provide a vnet for Azure Batch to run workers in. The default Azure Batch vnet does not have Service Endpoints configured (as of Q2/2023), so traffic to the ACR or Storage Account flows through the Azure Batch load balancer. This creates additional costs and a bandwidth bottleneck. The correct behavior is for CoA to provide a vnet that has Service Endpoints configured so that traffic to the SA/ACR is routed directly to the Azure resource.

Solution:
The overview of changes is:

  1. Add appropriate CoA managed identity permissions to the Azure Batch vm worker subnet
  2. Configure the vnet address space to be larger and create a subnet for Batch vm workers that has Service Endpoints configured
  3. Change CoA defaults/deployer to use this vnet by default

Managed identity permissions
The CoA managed identity needs Contributor permissions to access the vnet. Contributor is likely too broad, and a job-specific permission would be better. The minimal set of permissions is TBD.

Configure the CoA Virtual Network

  1. Increase the address space for the default CoA vnet from the current 10.0.0.0/16 to 10.0.0.0/13. This allows for the creation of the 3 required subnets while leaving enough address space for a very large vm worker subnet (approximately 262k IPs). The subnet can be larger if needed, but this seems a reasonable size.
    See the SolarWinds Advanced Subnet Calculator for the address breakdown.

  2. Add a new subnet batch_worker_subnet that will be provided to Azure Batch. Assign this subnet the address space 10.4.0.0/14, which has approximately 262k IPs. Provide at least the following 3 Service Endpoints:
    Microsoft.ContainerRegistry, Microsoft.Sql, Microsoft.Storage
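
The address plan above can be sanity-checked without a web calculator. The following is a minimal sketch using Python's standard ipaddress module; the three CIDR ranges come from this issue, while the variable names are only illustrative.

```python
import ipaddress

# CIDR ranges taken from the proposal in this issue.
old_vnet = ipaddress.ip_network("10.0.0.0/16")       # current CoA vnet address space
new_vnet = ipaddress.ip_network("10.0.0.0/13")       # proposed CoA vnet address space
worker_subnet = ipaddress.ip_network("10.4.0.0/14")  # proposed batch_worker_subnet

# The enlarged vnet should still contain the original address space,
# so existing subnets keep their addresses.
old_fits = old_vnet.subnet_of(new_vnet)

# The worker subnet should sit inside the enlarged vnet
# without overlapping the original /16.
worker_fits = worker_subnet.subnet_of(new_vnet)
worker_overlaps_old = worker_subnet.overlaps(old_vnet)

# Raw address count of the worker subnet (2**18).
# Azure reserves five addresses in every subnet, so slightly fewer are usable.
worker_size = worker_subnet.num_addresses

print(old_fits, worker_fits, worker_overlaps_old, worker_size)
print(new_vnet[0], "-", new_vnet[-1])
```

The /14 worker subnet takes the upper half of the /13, which leaves 10.1.0.0 through 10.3.255.255 free for other subnets.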

Have CoA use the provided vm worker subnet
The batch_worker_subnet has a resource name that follows this pattern:
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Network/virtualNetworks/{virtualNetworkName}/subnets/batch_worker_subnet

This can be provided to the BatchNodesSubnetId parameter. Testing with the deployer can be done by adding the following updater arguments:
--BatchNodesSubnetId "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RESOURCE_GROUP_NAME}/providers/Microsoft.Network/virtualNetworks/${COA_VNET_NAME}/subnets/batch_worker_subnet"
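
For scripted testing, the resource ID can be assembled from its parts rather than edited by hand. This is a small sketch that follows the pattern shown above; the function name batch_nodes_subnet_id and the placeholder values are hypothetical and not part of CoA.

```python
def batch_nodes_subnet_id(subscription_id, resource_group, vnet_name,
                          subnet_name="batch_worker_subnet"):
    """Build the subnet resource ID in the pattern given in this issue.

    Hypothetical helper for illustration; not part of the CoA deployer.
    """
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Network/virtualNetworks/{vnet_name}"
        f"/subnets/{subnet_name}"
    )


# Placeholder values; substitute the real subscription, resource group, and vnet.
subnet_id = batch_nodes_subnet_id(
    "00000000-0000-0000-0000-000000000000", "my-coa-rg", "my-coa-vnet"
)
print(f'--BatchNodesSubnetId "{subnet_id}"')
```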

After these changes are made, Batch should assign workers inside the 10.4.0.0/14 address space, and each node should be able to communicate directly with the Azure Storage backend, ACR, and Sql as needed.
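
One way to spot-check the first half of that expectation is to compare node IPs against the worker subnet range. The sketch below assumes the node IP addresses have already been collected by some other means (for example from the Batch node list); the sample addresses are made up, and the helper name is hypothetical. It only checks address placement, not whether traffic actually uses the Service Endpoints.

```python
import ipaddress

# Worker subnet range proposed in this issue.
WORKER_SUBNET = ipaddress.ip_network("10.4.0.0/14")


def nodes_outside_worker_subnet(node_ips, subnet=WORKER_SUBNET):
    """Return the node IPs that do not fall inside the worker subnet."""
    return [ip for ip in node_ips if ipaddress.ip_address(ip) not in subnet]


# Made-up sample addresses: two inside 10.4.0.0/14, one in the old 10.0.0.0/16 range.
sample_ips = ["10.4.0.4", "10.7.255.250", "10.0.0.5"]
misplaced = nodes_outside_worker_subnet(sample_ips)
print(misplaced)
```

An empty list means every listed node landed in batch_worker_subnet.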

@jsaun please also consider what impact this has on Terra and on existing Terra Cromwell deployments, and what upgrading entails.