Terraform module for scalable self hosted GitHub action runners

This Terraform modules create the required infra structure needed to host GitHub Actions self hosted auto scaling runners on AWS spot instances. By default the GitHub action runner, runs on a single host and does provide scaling. This module also provide the required logic to handle the life cycle for scaling up and down by a set of AWS Lambda functions. The module scaled the runners back to zero to avoid any costs when no workflows are active.

Motivation
Overview
- ARM64 support via Graviton/Graviton2 instance-types
Usages
Examples
Sub modules
- ARM64 configuration for submodules
Requirements
Providers
Inputs
Outputs
Philips Forest

Motivation

GitHub Actions self hosted runners provides you with a flexible option to run your CI workloads on compute of your choice. Currently there is no option provided to automate the creation and scaling of action runners. This module takes care of creating the AWS infra structure to host action runners on spot instances. And provides lambda modules to orchestrate the life cycle of the action runners.

Lambda is chosen as runtime for two major reasons. First it allows to create small components with minimal access to AWS and GitHub. Secondly it provides a scalable setup for minimal costs that works on repo level and scales to organization level. The lambdas will create Linux based EC2 instances with Docker to serve CI workloads that can run on Linux and/or Docker. The main goal is here to support Docker based workloads.

A logical question would be why not Kubernetes? In the current approach we stay close to the way the GitHub action runners are available today. The approach is to install the runner on a host where the required software is available. With this setup we stay quite close to the current GitHub approach. Another logical choice would be AWS Auto Scaling groups. This choice would typically require much more permissions on instance level to GitHub. And besides that, scaling up and down is not trivial.

Overview

The moment a GitHub action workflow requiring a self-hosted runner is triggered, GitHub will try to find a runner which can execute the workload. This is the moment this module hooks in. GitHub created a check run event for the triggered workflow. This is the event that is used to decide if the creation of a new runner is necessary.

For receiving the check run event a GItHub App needs to be created with a webhook to which the event will be published. Installing the GitHub App to a specific repository or all repositories ensures the check run event will be sent to the webhook.

In AWS a API gateway endpoint is created that is able to receive the GitHub webhook events via HTTP post. The gateway triggers the webhook lambda which will verify the signature of the event, this check guarantees the event is sent by the GitHub App. The lambda only handles check run events with status created. The accepted events are posted on a SQS queue. Messages on this queue will be delayed for a configurable amount of seconds (default 30 seconds) to give the available runners time to pick up this build.

The scale up runner is listening to the SQS queue and picks up events. The lambda runs checks like, is the build not started? Is the maximum number of runners not reached? In case one of checks fails, the event is dropped and it is assumed an available runner already has finished the build or will do this once a previous workload is finished. In case the build is not picked up and the maximum amount of runners is not reached, the lambda will create an EC2 spot instance to host the action runner for executing the workflow.

The Lambda first requests a registration token from GitHub, the token is needed later by the runner to register itself. This also avoids that the EC2 instance, that later in the process will install the agent, needs administration permissions to register the runner. Next the EC2 spot instance is created via the launch template. The launch template defines the specifications of the required instance and contains a user_data script. This script will install the required software and configure it. The registration token for the action runner is stored in the parameter store (SSM) from which the user data script will fetch it and delete it once it has been retrieved. Once the user data script is finished the action runner should be online and the workflow will start in seconds.

Scaling down the runners is at the moment brute-forced, every configurable amount of minutes a lambda will check every runner (instance) if it is busy. In case the runner is not busy it will be removed from GitHub and the instance terminated in AWS. At the moment there seems no other option to scale down more smoothly.

Downloading the GitHub Action Runner distribution can be occasionally slow (more than 10 minutes). Therefore a lambda is introduced that synchronizes the action runner binary from GitHub to an S3 bucket. The EC2 instance will fetch the distribution from the S3 bucket instead of the internet.

Secrets and private keys which are passed the Lambda's as environment variables are encrypted by default by a KMS key managed by the module. Alternatively you can pass your own KMS key. Encryption via KMS can be complete disabled by setting encrypt_secrets to false.

Permission are managed on several places. Below the most important ones. For details check the Terraform sources.

The GitHub App requires access to actions and publish check_run events to AWS.
The scale up lambda should have access to EC2 for creating and tagging instances.
The scale down lambda should have access to EC2 to terminate instances.

Besides these permissions, the lambdas also need permission to CloudWatch (for logging and scheduling), SSM and S3. For more details about the required permissions see the documentation of the IAM module which uses permission boundaries.

ARM64 support via Graviton/Graviton2 instance-types

When using the default example or top-level module, specifying an instance_type that matches a Graviton/Graviton 2 (ARM64) architecture (e.g. a1 or any 6th-gen g or gd type), the sub-modules will be automatically configured to provision with ARM64 AMIs and leverage GitHub's ARM64 action runner. See below for more details.

Usages

Examples are provided in the example directory. Please ensure you have installed the following tools.

Terraform, or tfenv.
Bash shell or compatible
Docker (optional, to build lambdas without node).
AWS cli (optional)
Node and yarn (for lambda development).

The module support two main scenarios for creating runners. On repository level a runner will be dedicated to only one repository, no other repository can use the runner. On organization level you can use the runner(s) for all the repositories within the organization. See https://help.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners for more information. Before starting the deployment you have to choose one option.

GitHub workflows will fail immediately if there is no action runner available for your builds. Since this module supports to scale from 0 and up, your builds will fail in case there is no active runner available. So we recommend to create an offline runner with matching labels to the configuration. Create this runner by following the GitHub instruction on your local machine. You can stop the process after the step of running the config.sh. This offline runner will ensure your builds will not fail immediately and stay queued until there is a runner to pick it up.

The setup consists of running Terraform to create all AWS resources and configure the GitHub App. The Terraform module requires configuration from the GitHub App and the GitHub app requires output from Terraform. Therefore you should first create the GitHub App, configure the basics. Then run Terraform and finalize the configuration of the GitHub App afterwards.

Setup GitHub App (part 1)

Go to GitHub and create a new app. Beware you can create apps your organization or for a user. For now we handle only the organization level app.

Create app in Github
Choose a name
Choose a website (mandatory, not required for the module).
Disable the webhook for now (we will configure this later).
Repository permissions, enable Checks to receive events for new builds.
Only for repo level runners! - Repository permissions, Administration - Read and Write (to register runner)
Only for organization level runners! - Organization permissions, Administration - Read and Write (to register runner)
Save the new app.
Next generate a private key on the General page.
Make a note of the following app parameters: app id , client ID, and client secret

Setup terraform module

First you need to download the lambda releases. The lambda code is available as a GitHub release asset. Downloading can be done with the provided terraform module for example. Note that this requires curl to be installed on your machine. Create an empty workspace with the following terraform code:

module "lambdas" {
  source  = "philips-labs/github-runner/aws//modules/download-lambda"
  version = "0.2.0"

  lambdas = [
    {
      name = "webhook"
      tag  = "v0.4.0"
    },
    {
      name = "runners"
      tag  = "v0.4.0"
    },
    {
      name = "runner-binaries-syncer"
      tag  = "v0.4.0"
    }
  ]
}

output "files" {
  value = module.lambdas.files
}

Next run terraform init && terraform apply as result the lambdas will be download to the same directory. Alternatively you can download the zip artifacts with any other tool of you favour.

For local development you can build all the lambda's at once using .ci/build.sh or per lambda using yarn dist.

Next create a second terraform workspace and initiate the module, see the examples for more details.

module "github-runner" {
  source  = "philips-labs/github-runner/aws"
  version = "0.2.0"

  aws_region = "eu-west-1"
  vpc_id     = "vpc-123"
  subnet_ids = ["subnet-123", "subnet-456"]

  environment = "gh-ci"

  github_app = {
    key_base64     = "base64string"
    id             = "1"
    client_id      = "c-123"
    client_secret  = "client_secret"
    webhook_secret = "webhook_secret"
  }

  webhook_lambda_zip                = "lambdas-download/webhook.zip"
  runner_binaries_syncer_lambda_zip = "lambdas-download/runner-binaries-syncer.zip"
  runners_lambda_zip                = "lambdas-download/runners.zip"
  enable_organization_runners = true
}

ARM64 support: Specify an a1 or *6g* (6th-gen Graviton2) instance type to stand up an ARM64 runner, otherwise the default is x86_64.

Run terraform by using the following commands

terraform init
terraform apply

Check the terraform output for the API gateway url (endpoint), which you need in the next step. The lambda for syncing the GitHub distribution will be executed by a trigger via CloudWatch. After deployment the function is triggered via S3 to ensure the distribution is cached.

Setup GitHub App (part 2)

Go back to the GitHub App and update the following settings.

Enable the webhook.
Provide the webhook url, should be part of the output of terraform.
Provide the webhook secret.
Enable the Check run event for the webhook.

You are now ready to run action workloads on self hosted runner, remember builds will fail if there is no (offline) runner available with matching labels.

Encryption

The module support 3 scenario's to manage environment secrets and private key of the Lambda functions.

Encrypted via a module managed KMS key (default)

This is the default, no additional configuration is required.

Encrypted via a provided KMS key

You have to create an configure you KMS key. The module will use the context with key: Environment and value var.environment as encryption context.

resource "aws_kms_key" "github" {
  is_enabled = true
}

module "runners" {

  ...
  manage_kms_key = false
  kms_key_id     = aws_kms_key.github.key_id
  ...

No encryption

Not advised but you can disable the encryption as by setting the variable encrypt_secrets to false.

Idle runners

The module will scale down to zero runners be default, by specifying a idle_config config idle runners can be kept active. The scale down lambda checks if any of the cron expressions matches the current time with a marge of 5 seconds. When there is a match the number of runners specified in the idle config will be kept active. In case multiple cron expressions matches only the first one is taken in to account. Below a idle configuration for keeping runners active from 9 to 5 on working days.

idle_config = [{
   cron      = "* * 9-17 * * 1-5"
   timeZone  = "Europe/Amsterdam"
   idleCount = 2
}]

Supported config

Cron expressions are parsed by cron-parser. The supported syntax.

*    *    *    *    *    *
┬    ┬    ┬    ┬    ┬    ┬
│    │    │    │    │    |
│    │    │    │    │    └ day of week (0 - 7) (0 or 7 is Sun)
│    │    │    │    └───── month (1 - 12)
│    │    │    └────────── day of month (1 - 31)
│    │    └─────────────── hour (0 - 23)
│    └──────────────────── minute (0 - 59)
└───────────────────────── second (0 - 59, optional)

For time zones please check TZ database name column for the supported values.

Examples

Examples are located in the examples directory. The following examples are provided:

Default: The default example of the module
Permissions boundary: Example usages of permissions boundaries.

Sub modules

The module contains several submodules, you can use the module via the main module or assemble your own setup by initializing the submodules yourself.

The following submodules are the core of the module and are mandatory:

runner-binaries-syncer - Syncs the action runner distribution.
runners - Scales the action runners up and down
webhook - Handles GitHub webhooks

The following sub modules are optional and are provided as example or utility:

download-lambda - Utility module to download lambda artifacts from GitHub Release
setup-iam-permissions - Example module to setup permission boundaries

ARM64 configuration for submodules

When not using the top-level module and specifying an a1 or *6g* (6th-gen Graviton2) instance_type, the runner-binaries-syncer and runners submodules need to be configured appropriately for pulling the ARM64 GitHub action runner binary and leveraging the arm64 AMI for the runners.

When configuring runner-binaries-syncer

runner_architecture - set to arm64, defaults to x64

When configuring runners

ami_filter - set to ["amzn2-ami-hvm-2*-arm64-gp2"], defaults to ["amzn2-ami-hvm-2.*-x86_64-ebs"]

Requirements

No requirements.

Providers

Name	Version
aws	n/a
random	n/a

Inputs

Name	Description	Type	Default	Required
aws_region	AWS region.	`string`	n/a	yes
enable_organization_runners	n/a	`bool`	n/a	yes
encrypt_secrets	Encrypt secret variables for lambda's such as secrets and private keys.	`bool`	`true`	no
environment	A name that identifies the environment, used as prefix and for tagging.	`string`	n/a	yes
github_app	GitHub app parameters, see your github app. Ensure the key is base64 encoded.	object({ key_base64 = string id = string client_id = string client_secret = string webhook_secret = string })	n/a	yes
idle_config	List of time period that can be defined as cron expression to keep a minimum amount of runners active instead of scaling down to 0. By defining this list you can ensure that in time periods that match the cron expression within 5 seconds a runner is kept idle.	list(object({ cron = string timeZone = string idleCount = number }))	`[]`	no
instance_profile_path	The path that will be added to the instance_profile, if not set the environment name will be used.	`string`	`null`	no
instance_type	Instance type for the action runner.	`string`	`"m5.large"`	no
kms_key_id	Custom KMS key to encrypted lambda secrets, if not provided and `encrypt_secrets` = `true` a KMS key will be created by the module. Secrets will be encrypted with a context `Environment = var.environment`.	`string`	`null`	no
logging_retention_in_days	Specifies the number of days you want to retain log events for the lambda log group. Possible values are: 0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, and 3653.	`number`	`7`	no
manage_kms_key	Let the module manage the KMS key.	`bool`	`true`	no
minimum_running_time_in_minutes	The time an ec2 action runner should be running at minimum before terminated if non busy.	`number`	`5`	no
role_path	The path that will be added to role path for created roles, if not set the environment name will be used.	`string`	`null`	no
role_permissions_boundary	Permissions boundary that will be added to the created roles.	`string`	`null`	no
runner_as_root	Run the action runner under the root user.	`bool`	`false`	no
runner_binaries_syncer_lambda_timeout	Time out of the binaries sync lambda in seconds.	`number`	`300`	no
runner_binaries_syncer_lambda_zip	File location of the binaries sync lambda zip file.	`string`	`null`	no
runner_extra_labels	Extra labels for the runners (GitHub). Separate each label by a comma	`string`	`""`	no
runners_lambda_zip	File location of the lambda zip file for scaling runners.	`string`	`null`	no
runners_maximum_count	The maximum number of runners that will be created.	`number`	`3`	no
runners_scale_down_lambda_timeout	Time out for the scale up lambda in seconds.	`number`	`60`	no
runners_scale_up_lambda_timeout	Time out for the scale down lambda in seconds.	`number`	`60`	no
scale_down_schedule_expression	Scheduler expression to check every x for scale down.	`string`	`"cron(/5 * * ? *)"`	no
subnet_ids	List of subnets in which the action runners will be launched, the subnets needs to be subnets in the `vpc_id`.	`list(string)`	n/a	yes
tags	Map of tags that will be added to created resources. By default resources will be tagged with name and environment.	`map(string)`	`{}`	no
userdata_post_install	Script to be ran after the GitHub Actions runner is installed on the EC2 instances	`string`	`""`	no
userdata_pre_install	Script to be ran before the GitHub Actions runner is installed on the EC2 instances	`string`	`""`	no
vpc_id	The VPC for security groups of the action runners.	`string`	n/a	yes
webhook_lambda_timeout	Time out of the webhook lambda in seconds.	`number`	`10`	no
webhook_lambda_zip	File location of the webhook lambda zip file.	`string`	`null`	no

Outputs

Name	Description
binaries_syncer	n/a
runners	n/a
webhook	n/a

Philips Forest

This module is part of the Philips Forest.


                                                     ___                   _
                                                    / __\__  _ __ ___  ___| |_
                                                   / _\/ _ \| '__/ _ \/ __| __|
                                                  / / | (_) | | |  __/\__ \ |_
                                                  \/   \___/|_|  \___||___/\__|

                                                                 Infrastructure

Talk to the forestkeepers in the forest-channel on Slack.

AntonPuko / terraform-aws-github-runner