Create and pass vm worker vnet with Service Endpoints to Batch to avoid Batch LB traffic bottleneck
jlester-msft opened this issue · comments
Problem:
Currently, Cromwell on Azure (CoA) does not provide a vnet for Azure Batch to run workers in. The default Azure Batch vnet does not have Service Endpoints configured (as of Q2/2023), so traffic to the ACR or Storage Account flows through the Azure Batch load balancer. This creates additional costs and a bandwidth bottleneck. The correct behavior is for CoA to provide a vnet with Service Endpoints configured so that traffic to the SA/ACR is routed directly to the Azure resource.
Solution:
The overview of changes is:
- Add appropriate CoA managed identity permissions to the Azure Batch vm worker subnet
- Configure the vnet address space to be larger and create a subnet for Batch vm workers that has Service Endpoints configured
- Change CoA defaults/deployer to use this vnet by default
Managed identity permissions
The CoA managed identity needs Contributor permissions to access the vnet. Contributor is likely too broad, and a job-specific permission would be better. The minimal set of permissions is TBD.
Configure the CoA Virtual Network
- Increase the address space for the default CoA vnet. The current address space is 10.0.0.0/16; change this to 10.0.0.0/13. This allows for the creation of the 3 required subnets while leaving enough address space for a very large vm worker subnet (approximately 262k IPs). The subnet can be larger if needed, but this seems a reasonable size. See the SolarWinds Advanced Subnet Calculator for the address math.
- Add a new subnet, batch_worker_subnet, that will be provided to Azure Batch. Assign this subnet the address space 10.4.0.0/14, which has approximately 262k IPs. Provide at least the following 3 Service Endpoints: Microsoft.ContainerRegistry, Microsoft.Sql, Microsoft.Storage
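The address math above can be sanity-checked without an external calculator. This is a minimal sketch using Python's standard ipaddress module; the variable names are illustrative only, and the figure of 5 reserved addresses per subnet is Azure's documented reservation, included here as an assumption about how many IPs are actually usable.

```python
import ipaddress

# Proposed CoA vnet address space and the current one it replaces.
new_vnet_space = ipaddress.ip_network("10.0.0.0/13")
current_vnet_space = ipaddress.ip_network("10.0.0.0/16")

# Proposed subnet handed to Azure Batch for vm workers.
batch_worker_subnet = ipaddress.ip_network("10.4.0.0/14")

# Assumption: Azure reserves 5 addresses in every subnet.
AZURE_RESERVED_PER_SUBNET = 5

# The enlarged vnet must still contain the existing range and the new subnet.
assert current_vnet_space.subnet_of(new_vnet_space)
assert batch_worker_subnet.subnet_of(new_vnet_space)

# The worker subnet must not collide with the range existing subnets live in.
assert not batch_worker_subnet.overlaps(current_vnet_space)

total_worker_ips = batch_worker_subnet.num_addresses
usable_worker_ips = total_worker_ips - AZURE_RESERVED_PER_SUBNET

print(f"vnet {new_vnet_space}: {new_vnet_space.num_addresses} addresses")
print(f"subnet {batch_worker_subnet}: {total_worker_ips} addresses, "
      f"{usable_worker_ips} usable")
```

The worker subnet occupies the upper half of the /13, so 10.0.0.0/14 remains free for the existing subnets and any future ones.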
Have CoA use the provided vm worker subnet
The batch_worker_subnet has a resource name that follows this pattern:
/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Network/virtualNetworks/{virtualNetworkName}/subnets/batch_worker_subnet
This can be provided to the BatchNodesSubnetId parameter. Testing with the deployer can be done by adding the following updater arguments:
--BatchNodesSubnetId "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RESOURCE_GROUP_NAME}/providers/Microsoft.Network/virtualNetworks/${COA_VNET_NAME}/subnets/batch_worker_subnet"
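For scripted testing, the subnet ID and the updater arguments can be assembled programmatically. The sketch below is only an illustration: batch_nodes_subnet_id and updater_args are hypothetical helper names, not part of the CoA deployer, and the subscription, resource group, and vnet values are placeholders.

```python
def batch_nodes_subnet_id(
    subscription_id: str,
    resource_group_name: str,
    virtual_network_name: str,
    subnet_name: str = "batch_worker_subnet",
) -> str:
    """Build the subnet resource ID following the pattern shown above."""
    parts = {
        "subscription_id": subscription_id,
        "resource_group_name": resource_group_name,
        "virtual_network_name": virtual_network_name,
        "subnet_name": subnet_name,
    }
    # Reject empty segments or embedded slashes, which would yield a malformed ID.
    for name, value in parts.items():
        if not value or "/" in value:
            raise ValueError(f"invalid {name}: {value!r}")
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group_name}"
        f"/providers/Microsoft.Network/virtualNetworks/{virtual_network_name}"
        f"/subnets/{subnet_name}"
    )


def updater_args(subnet_id: str) -> list[str]:
    """Return the extra arguments to append to the deployer invocation."""
    return ["--BatchNodesSubnetId", subnet_id]


# Placeholder values, for illustration only.
example_id = batch_nodes_subnet_id(
    "00000000-0000-0000-0000-000000000000", "my-coa-rg", "my-coa-vnet"
)
print(" ".join(updater_args(example_id)))
```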
After these changes are made, Batch should assign workers inside the 10.4.0.0/14 address space, and each node should be able to communicate directly with the Azure Storage backend, ACR, and SQL as needed.
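One way to confirm the first part of that expectation is to check the private IPs reported for the Batch nodes against the worker subnet. This is a rough sketch: in_worker_subnet is a hypothetical helper, and the sample addresses are made up rather than taken from a real deployment.

```python
import ipaddress

BATCH_WORKER_SUBNET = ipaddress.ip_network("10.4.0.0/14")


def in_worker_subnet(node_ip: str) -> bool:
    """Return True if a node's private IP lies inside the worker subnet."""
    return ipaddress.ip_address(node_ip) in BATCH_WORKER_SUBNET


# Made-up sample addresses: two inside the worker subnet, one in the old /16.
for ip in ["10.4.0.5", "10.7.255.254", "10.0.1.4"]:
    status = "ok" if in_worker_subnet(ip) else "outside worker subnet"
    print(f"{ip}: {status}")
```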
@jsaun please also consider what impact this has on Terra and on existing Terra Cromwell deployments, and what upgrading entails.