gudata / terraform-aws-ecs-airflow

A Terraform module that creates an Airflow instance on AWS ECS.


DEPRECATED: We are no longer actively maintaining this module; we recommend AWS MWAA (Amazon Managed Workflows for Apache Airflow) as a replacement.

Terraform module: Airflow on AWS ECS

This Terraform module deploys Airflow on AWS ECS.

Setup

  • An ECS Cluster with:
    • Sidecar injection container
    • Airflow init container
    • Airflow webserver container
    • Airflow scheduler container
  • An ALB
  • An RDS instance (optional but recommended)
  • A DNS record (optional but recommended)
  • An S3 bucket (optional)

Average cost of the minimal setup (with RDS): ~$50/month

Why do I need an RDS instance?

  1. It makes Airflow stateful: you can rerun failed DAGs, keep a history of failed/succeeded runs, and so on.
  2. It allows DAGs to run concurrently; without it, two DAGs cannot run at the same time.
  3. The state of your DAGs persists even if the Airflow container fails or you update the container definition (which triggers an update of the ECS task).

Intent

The Airflow setup provided by this module is one where Airflow's only job is to manage your jobs/workflows, not to do the heavy lifting itself (SQL queries, Spark jobs, and so on). Offload as many tasks as possible to services such as AWS Lambda, AWS EMR, and AWS Glue. If you want Airflow to have access to these services, take the module's output role and grant it permissions to them through IAM, as sketched below.
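
As an illustration, a minimal sketch of granting the task role extra permissions. The policy content is an assumption, and module.airflow.airflow_task_iam_role is taken to be the role name (see Outputs below); verify both against the module before use.

resource "aws_iam_policy" "airflow_extra" {
  name = "airflow-extra-permissions" # hypothetical name
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["lambda:InvokeFunction"] # example: let DAGs trigger Lambda
      Resource = "*"                       # scope this down in real use
    }]
  })
}

resource "aws_iam_role_policy_attachment" "airflow_extra" {
  # Assumes the output is the role name; adjust if it is an ARN.
  role       = module.airflow.airflow_task_iam_role
  policy_arn = aws_iam_policy.airflow_extra.arn
}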

Usage

module "airflow" {
    source = "datarootsio/ecs-airflow/aws"

    resource_prefix = "my-awesome-company"
    resource_suffix = "env"

    vpc_id             = "vpc-123456"
    public_subnet_ids  = ["subnet-456789", "subnet-098765"]

    rds_password = "super-secret-pass"
}

(This creates Airflow backed by an RDS instance, both in a public subnet, and served without HTTPS.)
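
To serve Airflow over HTTPS instead, a hedged sketch based on the inputs listed below; the DNS name and certificate ARN are placeholders:

module "airflow" {
  source = "datarootsio/ecs-airflow/aws"

  resource_prefix = "my-awesome-company"
  resource_suffix = "env"

  vpc_id            = "vpc-123456"
  public_subnet_ids = ["subnet-456789", "subnet-098765"]

  rds_password = "super-secret-pass"

  use_https       = true
  dns_name        = "airflow.example.com" # placeholder
  certificate_arn = "arn:aws:acm:eu-west-1:123456789012:certificate/example" # placeholder

  # Alternatively, leave certificate_arn empty and set route53_zone_name
  # so the certificate can be validated in that Route53 zone.
  # route53_zone_name = "example.com"
}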

See the examples in the repository for more configurations.

Note: after Terraform has finished deploying everything, it can take up to a minute before Airflow is available over HTTP(S).

Adding DAGs

To add DAGs, upload them to the "dags/" subdirectory of the created S3 bucket. After uploading them, run the seed DAG; it syncs the S3 bucket with the local DAGs folder of the ECS container.
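
If you manage DAG files with Terraform too, a minimal sketch of an upload using the AWS provider version pinned below; the bucket and file names are placeholders, and the bucket is assumed to be the one passed as s3_bucket_name:

resource "aws_s3_bucket_object" "example_dag" {
  bucket = "my-airflow-bucket"   # placeholder: the bucket passed as s3_bucket_name
  key    = "dags/example_dag.py" # DAGs must live under "dags/"
  source = "${path.module}/dags/example_dag.py"
  etag   = filemd5("${path.module}/dags/example_dag.py") # re-upload on change
}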

Authentication

For now the only authentication option is RBAC ("rbac"). When you enable it, the module creates a default admin user, but only if there are no users in the database yet. This default user is a one-time entry point into the Airflow web interface: change its password immediately after your first login! You can then use it to create any other users you want.
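
A sketch of enabling RBAC with a non-default admin; all rbac_admin_* values are placeholders (without them the default login is admin/admin):

module "airflow" {
  source = "datarootsio/ecs-airflow/aws"

  resource_prefix   = "my-awesome-company"
  resource_suffix   = "env"
  vpc_id            = "vpc-123456"
  public_subnet_ids = ["subnet-456789", "subnet-098765"]
  rds_password      = "super-secret-pass"

  airflow_authentication = "rbac"
  rbac_admin_username    = "admin"
  rbac_admin_password    = "change-me-on-first-login" # placeholder
  rbac_admin_email       = "admin@example.com"        # placeholder
  rbac_admin_firstname   = "Jane"                     # placeholder
  rbac_admin_lastname    = "Doe"                      # placeholder
}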

Todo

  • RDS backup options
  • Option to use SQL instead of Postgres
  • Add a Lambda function that triggers the sync DAG (so that you can auto-sync through CI/CD)
  • RBAC
  • Support for Google OAuth

Requirements

Name | Version
terraform | ~> 0.15
aws | ~> 3.12.0

Providers

Name | Version
aws | ~> 3.12.0

Inputs

Name | Description | Type | Default | Required
airflow_authentication | Authentication backend to be used; supported backends are ["", "rbac"]. When "rbac" is selected, an admin user is created if there are no other users in the db; from there you can create all other users. Make sure to change the admin password directly upon first login! (If you don't change the rbac_admin options, the default login is username: admin, password: admin.) | string | "" | no
airflow_container_home | Working dir for Airflow (only change if you are using a different image) | string | "/opt/airflow" | no
airflow_example_dag | Add an example DAG on startup (mostly for sanity checks) | bool | true | no
airflow_executor | The executor mode that Airflow will use. Allowed values are ["Local", "Sequential"]. "Local": run DAGs in parallel (will create an RDS); "Sequential": DAGs cannot run in parallel (will NOT create an RDS) | string | "Local" | no
airflow_image_name | The name of the Airflow image | string | "apache/airflow" | no
airflow_image_tag | The tag of the Airflow image | string | "2.0.1" | no
airflow_log_region | The region you want your Airflow logs in; defaults to the region variable | string | "" | no
airflow_log_retention | The number of days to keep the logs of the Airflow container | string | "7" | no
airflow_py_requirements_path | The relative path to a Python requirements.txt file to install extra packages in the container that you can use in your DAGs | string | "" | no
airflow_variables | The variables passed to Airflow as environment variables (see the Airflow docs for more info: https://airflow.apache.org/docs/). You cannot specify "AIRFLOW__CORE__SQL_ALCHEMY_CONN" or "AIRFLOW__CORE__EXECUTOR" (managed by this module) | map(string) | {} | no
certificate_arn | The ARN of the certificate that will be used | string | "" | no
dns_name | The DNS name that will be used to expose Airflow. Optional if not serving over HTTPS. Will be autogenerated if not provided | string | "" | no
ecs_cpu | The allocated CPU for your Airflow instance | number | 1024 | no
ecs_memory | The allocated memory for your Airflow instance | number | 2048 | no
extra_tags | Extra tags to add to all created resources | map(string) | {} | no
ip_allow_list | A list of IP ranges that are allowed to access the Airflow webserver; default: full access | list(string) | ["0.0.0.0/0"] | no
postgres_uri | The Postgres URI of your Postgres db; if none is provided, a Postgres db in RDS is made. Format: "<db_username>:<db_password>@<db_endpoint>:<db_port>/<db_name>" | string | "" | no
private_subnet_ids | A list of subnet ids where the ECS and RDS reside; this will only work if you have a NAT gateway in your VPC | list(string) | [] | no
public_subnet_ids | A list of subnet ids where the ALB will reside; if the "private_subnet_ids" variable is not provided, ECS and RDS will also reside in these subnets | list(string) | n/a | yes
rbac_admin_email | RBAC email (only when airflow_authentication = "rbac") | string | "admin@admin.com" | no
rbac_admin_firstname | RBAC first name (only when airflow_authentication = "rbac") | string | "admin" | no
rbac_admin_lastname | RBAC last name (only when airflow_authentication = "rbac") | string | "airflow" | no
rbac_admin_password | RBAC password (only when airflow_authentication = "rbac") | string | "admin" | no
rbac_admin_username | RBAC username (only when airflow_authentication = "rbac") | string | "admin" | no
rds_allocated_storage | The allocated storage for the RDS db in gibibytes | number | 20 | no
rds_availability_zone | Availability zone for the RDS instance | string | "eu-west-1a" | no
rds_deletion_protection | Deletion protection for the RDS instance | bool | false | no
rds_engine | The database engine to use. For supported values, see the Engine parameter in the CreateDBInstance API action | string | "postgres" | no
rds_instance_class | The instance class for your RDS db | string | "db.t2.micro" | no
rds_password | Password of the RDS instance | string | "" | no
rds_skip_final_snapshot | Whether to skip the final snapshot before deleting (mainly for tests) | bool | false | no
rds_storage_type | One of "standard" (magnetic), "gp2" (general purpose SSD), or "io1" (provisioned IOPS SSD) | string | "standard" | no
rds_username | Username of the RDS instance | string | "airflow" | no
rds_version | The DB version to use for the RDS instance | string | "12.7" | no
region | The region to deploy your solution to | string | "eu-west-1" | no
resource_prefix | A prefix for the created resources, for example your company name (be aware of the resource name length) | string | n/a | yes
resource_suffix | A suffix for the created resources, for example the environment Airflow runs in (be aware of the resource name length) | string | n/a | yes
route53_zone_name | The name of a Route53 zone that will be used for the certificate validation | string | "" | no
s3_bucket_name | The S3 bucket name where the DAGs and startup scripts will be stored; leave this blank to let this module create an S3 bucket for you. WARNING: this module will put files into the paths "dags/" and "startup/" of the bucket | string | "" | no
use_https | Whether to expose traffic over HTTPS | bool | false | no
vpc_id | The id of the VPC where you will run ECS/RDS | string | n/a | yes
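
For example, a sketch combining a few of the commonly tweaked inputs above; all values are placeholders:

module "airflow" {
  source = "datarootsio/ecs-airflow/aws"

  resource_prefix   = "my-awesome-company"
  resource_suffix   = "env"
  vpc_id            = "vpc-123456"
  public_subnet_ids = ["subnet-456789", "subnet-098765"]
  rds_password      = "super-secret-pass"

  # Pin the image tag explicitly and keep logs for a month.
  airflow_image_tag     = "2.0.1"
  airflow_log_retention = "30"

  # Restrict webserver access to an office IP range (placeholder).
  ip_allow_list = ["203.0.113.0/24"]

  # Extra Airflow configuration via environment variables.
  airflow_variables = {
    AIRFLOW__WEBSERVER__NAVBAR_COLOR = "#e27d60" # placeholder
  }

  # Install extra Python packages for your DAGs.
  airflow_py_requirements_path = "./requirements.txt"
}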

Outputs

Name | Description
airflow_alb_dns | The DNS name of the ALB; use this to access the Airflow webserver
airflow_connection_sg | The security group with which you can connect other instances to Airflow, for example EMR Livy
airflow_dns_record | The created DNS record (only if "use_https" = true)
airflow_task_iam_role | The IAM role of the Airflow task; use this to give Airflow more permissions
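
As an illustration, a sketch of consuming these outputs; it assumes airflow_connection_sg is a security group id and that the webserver is served over plain HTTP, so verify both against the module:

# Convenience output with the webserver URL.
output "airflow_url" {
  value = "http://${module.airflow.airflow_alb_dns}"
}

# Attach the connection security group to another instance (for example an
# EMR master running Livy) so that instance and Airflow can reach each other.
resource "aws_instance" "livy" {
  ami                    = "ami-0123456789abcdef0" # placeholder
  instance_type          = "t3.micro"
  subnet_id              = "subnet-456789"
  vpc_security_group_ids = [module.airflow.airflow_connection_sg]
}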

Makefile Targets

Available targets:

  tools                             Pull Go and Terraform dependencies
  fmt                               Format Go and Terraform code
  lint/lint-tf/lint-go              Lint Go and Terraform code
  test/testverbose                  Run tests

Contributing

Contributions to this repository are very welcome! Found a bug or have a suggestion? Please open an issue. Know how to fix it? Pull requests are welcome too! To get you started faster, a Makefile is provided.

Make sure to install Terraform, Go (for automated testing) and Make (optional, if you want to use the Makefile) on your computer. Install tflint to be able to run the linting.

  • Setup tools & dependencies: make tools
  • Format your code: make fmt
  • Linting: make lint
  • Run tests: make test (or go test -timeout 2h ./... without Make)

Make sure you branch from the 'open-pr-here' branch, and submit a PR back to the 'open-pr-here' branch.

License

MIT license. Please see LICENSE for details.
