hyviquel / FlexHPC

Deploy a flexible Linux HPC cluster & head node into Azure

Flexible HPC Cluster

Modular Microsoft Azure HPC infrastructure deployment ARM template.

Key Features of this ARM template collection

  • Choose between multiple CentOS, Ubuntu, SUSE, or Red Hat Linux images, or use your own image.
  • RDMA (FDR/QDR InfiniBand), GPU (NVIDIA K80), and CPU-only compute nodes are all supported.
  • All appropriate hardware drivers are installed and configured for you by the installation scripts.
  • NFS server with up to 32 TB of Standard_LRS storage attached (defaults to 10 TB), built with Azure Managed Disks.
  • Dynamically add or remove nodes from your cluster (built with Azure VM scale sets).
  • Add head nodes or fat nodes to your cluster(s), or simply build standalone nodes.
  • Append your own scripts to install applications or customize the nodes further (see the sketch below).
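As a sketch of that last point, a custom post-install script appended to the node setup could look like the following. The script name, package list, and /share/apps path are illustrative assumptions, not part of the template.

    #!/bin/bash
    # custom-setup.sh - hypothetical post-install hook run after the template's
    # own node setup scripts. Installs build tools and stages an application on
    # shared storage so every node can see it (paths are assumptions).
    set -euo pipefail

    # Build tools for a typical HPC application (CentOS/RHEL package names shown).
    yum install -y gcc gcc-c++ make

    # Stage the application under an assumed shared path.
    APP_DIR=/share/apps/myapp
    mkdir -p "$APP_DIR"
    # ... build or copy your application into $APP_DIR here ...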

If you find a problem, please report it here.

1. Deploy a Complete Cluster with Head Node & NFS Server.

*** Under Maintenance DO NOT USE - 9/9/2017 ***

This template deploys a complete cluster composed of a head node and NFS server (combined on the same VM) plus a compute cluster with a selectable number of nodes (1-100), built as a VM scale set.


  • Deployment takes around 12 minutes. Login is disabled during deployment to prevent conflicts.
  • The head node and compute nodes will be the same VM type (use the modular templates below if you need different VM types).

2. Modular Step-by-Step Deployment

This section lets you deploy the cluster infrastructure step by step. You must deploy all components of your infrastructure into the same VNET so they can connect to each other.

For example, you can set up a "permanent" head node and NFS and/or BeeGFS server, with your application software and data stored safely, and then tear compute nodes (fat nodes and scale sets) up and down as you require.


2a. [Mandatory] Create the Network Infrastructure & Head Node

This template creates the main VNET & subnets for the cluster - deploy this template first. You can treat this system purely as a standalone Head/Master/JumpBox node, or as a combined NFS server & Head/Master/JumpBox node.
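If you prefer the command line over the portal, this step can be deployed with the Azure CLI roughly as sketched below. The resource group name, location, template file name, and parameter are placeholders - substitute the actual template and parameter files from this repository.

    # Sketch only: deploy the network + head node template with the Azure CLI.
    # Resource group, location, template file, and parameter names are placeholders.
    az group create --name flexhpc-rg --location westeurope

    az group deployment create \
        --resource-group flexhpc-rg \
        --template-file azuredeploy-headnode.json \
        --parameters adminUsername=hpcadmin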

2b. [Optional] Deploy a Standalone Linux NFS Server

Standalone Linux NFS Server
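Once the standalone NFS server is running, other nodes in the same VNET can mount its exports. A minimal sketch, assuming the server is reachable at a private IP of 10.0.0.5 and exports /share/data (both are assumptions; use the values from your deployment):

    # Sketch: mount an export from the standalone NFS server on another node.
    # Server IP and export path are assumptions for illustration only.
    sudo mkdir -p /share/data
    sudo mount -t nfs 10.0.0.5:/share/data /share/data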

2c. [Optional] Deploy a Standalone BeeGFS Storage Cluster

This template deploys a BeeGFS storage cluster built as a VM scale set, with mixed data + metadata capability on each node. The number of storage/metadata disks and their sizes are configurable. Premium_LRS storage is recommended.
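After deployment, a quick way to confirm that the storage and metadata services registered correctly is with the standard BeeGFS tools, run from any BeeGFS client node (sketch below; it assumes the beegfs-utils package is installed on that client).

    # Sketch: check that BeeGFS storage and metadata nodes are registered,
    # then show free space per target (run from a BeeGFS client node).
    beegfs-ctl --listnodes --nodetype=storage
    beegfs-ctl --listnodes --nodetype=meta
    beegfs-df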


2d. Deploy a Scale Set of Linux Compute Nodes

Deploy a scale set with N nodes into the same existing VNET as your NFS Server + Head Node.


  • Ensure your head node and network are deployed first, as per step 2a above.
  • The compute node install script automatically mounts the home directory and other shares from the head node.
  • The NFS server is currently assumed to be at 10.0.0.4.
  • The scale set instances record their hostnames and IP addresses in the /clustermap mount on the NFS server (see the sketch after this list).
  • VM scale set overprovisioning is disabled in this version to keep node configuration predictable.
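As a sketch of how the clustermap can be used, the loop below runs a command on every compute node from the head node. The exact format of the hosts file is an assumption here (one node entry per line, address first); adjust the field handling to match what your deployment writes.

    # Sketch: run a command on every scale set node from the head node,
    # driven by the host list recorded in the clustermap share.
    while read -r node _; do
        ssh -o StrictHostKeyChecking=no "$node" uname -n
    done < /share/clustermap/hosts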

2e. Deploy Fat Node VM(s) with Optional Storage Attached

TBD


3. Manually Increase or Decrease The Number of Compute Nodes in a Scale Set Cluster

The advantage of scale sets is that you can easily grow or shrink the number of compute nodes as you need them. You can do this automatically, or manually using this template - just enter the number of nodes you want to end up with (higher or lower than the current count). Additional compute instances are configured exactly like the existing instances, using the same cn-setup.sh installation script.
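The same resize can also be done directly from the Azure CLI, as in the sketch below; the resource group and scale set names are placeholders for the ones in your deployment.

    # Sketch: resize the compute scale set to 20 instances with the Azure CLI.
    # Resource group and scale set names are placeholders.
    az vmss scale --resource-group flexhpc-rg --name compute-vmss --new-capacity 20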



4. Cluster Access Instructions

  • To ssh into the head node or NFS server after deployment: ssh username@headnode-public-ip-address
  • username is the cluster admin username you entered into the template when you deployed.
  • The home directory is NFS-mounted from the head node onto all compute nodes in the scale set.
  • Your ssh keys are stored in /share/home/username/.ssh, so passwordless ssh works across the cluster.
  • The private IP addresses of the scale set nodes are listed in /share/clustermap/hosts (head node) or /clustermap/hosts (compute nodes).
  • Upload your data & applications to /share/data with scp or rsync (see the sketch below).
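Putting the access steps together, a typical first session might look like the sketch below; substitute your admin username, the head node's public IP address, and your local data path.

    # Sketch: log in to the head node, then upload a dataset to the shared
    # data area with rsync (username, IP, and paths are placeholders).
    ssh username@headnode-public-ip-address
    rsync -av ./my-dataset/ username@headnode-public-ip-address:/share/data/my-dataset/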

Linux Image Support Matrix

You can mix and match VM SKU types and Linux versions on your head node, NFS server, scale set compute nodes, and fat nodes.
The table below documents hardware support across the various Linux distributions and versions. YES means the relevant RDMA or GPU drivers are either included in the image or added dynamically during deployment by this template.

OS Image                             RDMA Support    GPU Support
Canonical:UbuntuServer:16.04-LTS     NO              YES*
Canonical:UbuntuServer:16.10         NO              YES*
OpenLogic:CentOS-HPC:6.5             YES             TBD
OpenLogic:CentOS:6.8                 NO              TBD
OpenLogic:CentOS-HPC:7.1             YES             TBD
OpenLogic:CentOS:7.2                 NO              TBD
OpenLogic:CentOS:7.3                 NO              TBD
RedHat:RHEL:7.3                      NO              YES
SUSE:SLES-HPC:12-SP2                 YES*            TBD

(*added by the installation scripts from this template at time of deployment)


Cluster Topology Overview

(Architecture diagram: hpc_vmss_architecture)

Credit: Taylor Newill, Xavier Pillons & Thomas Varlet for original base templates.
