This repository contains resources for repeatable deployments of Hail on AWS EMR, along with an AWS SageMaker notebook instance configuration for interacting with the Hail cluster.
The hail-s3.yml template is deployed once and creates three S3 buckets with server-side encryption (SSE):
- Hail Bucket - contains VEP cache and configuration JSON for Hail
- Log Bucket - EMR logs will be written here
- SageMaker Jupyter Bucket - User notebooks are backed up here, and common/example notebooks are stored here as well.
The hail-jupyter.yml template can be deployed multiple times (once per user). It deploys a SageMaker notebook instance for operations against the Hail EMR cluster. The user's `/home/ec2-user/SageMaker` directory is backed up via crontab to the SageMaker Jupyter bucket created by the `hail-s3` CloudFormation template. Each user's SageMaker notebook instance has full S3 CLI control over its respective subdirectory.
For example, if a notebook instance is named `aperry`, the user could open a terminal on that instance and have full AWS CLI control over objects under `s3://YOUR_JUPYTER_BUCKET/aperry/`.
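
As a rough illustration of the per-user layout, the Python sketch below derives the prefix a notebook instance controls and checks whether an object URI falls under it. The helper names `user_prefix` and `is_under_user_prefix` are hypothetical and not part of this repository; the real access boundary is enforced by the IAM policy in the template, not by code like this.

```python
def user_prefix(bucket: str, instance_name: str) -> str:
    """Return the S3 prefix that a notebook instance is expected to manage."""
    return f"s3://{bucket}/{instance_name}/"


def is_under_user_prefix(uri: str, bucket: str, instance_name: str) -> bool:
    """True when the object URI sits inside the instance's own subdirectory."""
    # The trailing slash in the prefix keeps "aperry2/" from matching "aperry/".
    return uri.startswith(user_prefix(bucket, instance_name))


print(user_prefix("YOUR_JUPYTER_BUCKET", "aperry"))
```

The trailing slash matters: without it, an instance named `aperry` would also appear to own objects under a sibling prefix such as `aperry2/`.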
When a new SageMaker instance launches, it syncs the contents of the following directories, located in the root of the bucket, to the noted locations:
- common-notebooks => /home/ec2-user/SageMaker/common-notebooks
- scripts => /home/ec2-user/SageMaker/bin
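
The mapping above can be sketched in Python as follows. This is only an illustration of which bucket directory lands where; the function name `launch_sync_commands` is hypothetical, and the instance's actual launch script may perform the sync differently. The only external command referenced is `aws s3 sync`.

```python
import shlex
from typing import List

# Bucket directory -> local destination, as listed above.
SYNC_TARGETS = {
    "common-notebooks": "/home/ec2-user/SageMaker/common-notebooks",
    "scripts": "/home/ec2-user/SageMaker/bin",
}


def launch_sync_commands(bucket: str) -> List[str]:
    """Build one `aws s3 sync` command line per bucket directory."""
    commands = []
    for prefix, local_dir in SYNC_TARGETS.items():
        argv = ["aws", "s3", "sync", f"s3://{bucket}/{prefix}/", local_dir]
        commands.append(shlex.join(argv))  # quote-safe command string
    return commands


for command in launch_sync_commands("YOUR_JUPYTER_BUCKET"):
    print(command)
```

The sketch only prints the commands, so it can be run anywhere to confirm the source and destination pairs before touching a real bucket.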
You may wish to seed those directories in S3 with the identically named directories under `jupyter` in this repository. Doing so provides a working Hail plotting example.
CLI example from the repository root directory:
aws --profile <PROFILE> s3 sync jupyter/ s3://<YOUR_JUPYTER_BUCKET>/ --acl bucket-owner-full-control
Post upload, the bucket contents should look similar to this:
$ aws --profile <PROFILE> s3 ls --recursive s3://<YOUR_JUPYTER_BUCKET>/
2019-09-30 14:14:36 13025 common-notebooks/hail-plotting-example.ipynb
2019-09-30 14:14:36 1244 scripts/list-clusters
2019-09-30 14:14:36 1244 scripts/ssm
The hail-ami.yml template leverages Packer in AWS CodeBuild to create AMIs for use with EMR. You can specify the Hail version, VEP version, and target VPC and subnet.
Review the expanded documentation for further details.
The hail-emr.yml template deploys the EMR cluster using the custom Hail AMI. There is a single master node, a minimum of one core node, and optional autoscaling task nodes. Task nodes can be set to `0` to omit them. The target market, `SPOT` or `ON_DEMAND`, is also set via parameters. If `SPOT` is selected, the bid price is set to the current on-demand price of the selected instance type.
The following scaling actions are set by default:
- +2 instances when YARNMemoryAvailablePercentage < 15% over 5 min
- +2 instances when ContainerPendingRatio > 0.75 over 5 min
- -2 instances when YARNMemoryAvailablePercentage > 75% over 5 min
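
To make the default thresholds concrete, here is a minimal Python sketch that reports which of the rules above would fire for a given pair of metric readings. The rule names and the function `triggered_adjustments` are hypothetical, and the sketch ignores the 5-minute evaluation window and any cooldowns that EMR applies; it is not how the template itself defines the policy.

```python
from typing import Dict, List, Tuple

# (rule name, CloudWatch metric, comparison, threshold, instance adjustment)
# Rule names are illustrative only.
SCALING_RULES: List[Tuple[str, str, str, float, int]] = [
    ("scale-out-memory", "YARNMemoryAvailablePercentage", "<", 15.0, +2),
    ("scale-out-containers", "ContainerPendingRatio", ">", 0.75, +2),
    ("scale-in-memory", "YARNMemoryAvailablePercentage", ">", 75.0, -2),
]


def triggered_adjustments(metrics: Dict[str, float]) -> List[Tuple[str, int]]:
    """Return (rule name, adjustment) for every rule whose threshold is breached."""
    fired = []
    for name, metric, op, threshold, adjustment in SCALING_RULES:
        value = metrics[metric]
        breached = value < threshold if op == "<" else value > threshold
        if breached:
            fired.append((name, adjustment))
    return fired


# A cluster that is short on memory and has a backlog of pending containers.
print(triggered_adjustments(
    {"YARNMemoryAvailablePercentage": 10.0, "ContainerPendingRatio": 0.9}
))
```

Note that the two scale-out rules can fire together, while a reading between 15% and 75% available memory with a low pending ratio triggers nothing.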
EMR steps are used to add a location for Livy to output Hail plots directly to files on the master node. Once written there, those plots can be retrieved in the SageMaker notebook instance and plotted inline. See jupyter/common-notebooks/hail-plotting-example.ipynb for an example.
The plotting pass-through is required because Sparkmagic/Livy can only pass Spark and pandas DataFrames back to the notebook.
The AWS Systems Manager (SSM) Agent can be used to gain ingress to the EMR nodes. This agent is pre-installed on the AMI. CloudFormation parameters exist on both the EMR stack and the Jupyter stack to optionally allow notebook IAM roles shell access to the EMR nodes via SSM. The respective stack parameters must both be set to `true` to allow proper IAM access.
Example connection from Jupyter Lab shell:
For expected results, deploy the templates in the following order. Resources created by one stack may be used as parameter entries to later stacks.
- hail-s3.yml
- hail-jupyter.yml
- hail-ami.yml
- hail-emr.yml
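
The ordering above could be scripted along these lines. This is a sketch under several assumptions: the template files are assumed to be reachable at the paths given relative to the current directory, the stack names (derived from the file names) are hypothetical, and a real deployment will also need `--parameter-overrides` and possibly `--capabilities`, which are omitted here. The function only prints `aws cloudformation deploy` command lines; it does not run them.

```python
from typing import List

# Deployment order from the list above.
DEPLOY_ORDER = ["hail-s3.yml", "hail-jupyter.yml", "hail-ami.yml", "hail-emr.yml"]


def deploy_commands(profile: str) -> List[str]:
    """Build one `aws cloudformation deploy` command per template, in order."""
    commands = []
    for template in DEPLOY_ORDER:
        # Hypothetical stack name: the template file name without ".yml".
        stack_name = template[: -len(".yml")]
        commands.append(
            f"aws --profile {profile} cloudformation deploy "
            f"--template-file {template} --stack-name {stack_name}"
        )
    return commands


for command in deploy_commands("<PROFILE>"):
    print(command)
```

Because later stacks take outputs of earlier ones as parameters, each command would normally be run and allowed to finish before the next one is started.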
Public AMIs are available in specific regions. Select the AMI for your target region and deploy with the noted version of EMR for best results.
Region | Hail Version | VEP Version | EMR Version | AMI ID |
---|---|---|---|---|
us-east-1 | 0.2.29 | 98 | 5.28.0 | ami-0b016dfca524fec33 |
us-east-2 | 0.2.29 | 98 | 5.28.0 | ami-082b3c5dadecc4a87 |
us-west-2 | 0.2.29 | 98 | 5.28.0 | ami-0aa2d49e3149759e9 |
us-east-1 | 0.2.27 | 98 | 5.28.0 | ami-0eff76d452e943507 |
us-east-2 | 0.2.27 | 98 | 5.28.0 | ami-074bd78cf15dce0a5 |
us-west-2 | 0.2.27 | 98 | 5.28.0 | ami-010e68c2c559b37cf |
us-east-1 | 0.2.25 | 98 | 5.27.0 | ami-0b16f8ef3418e707a |
us-east-2 | 0.2.25 | 98 | 5.27.0 | ami-0fc5abc51396918fd |
us-west-2 | 0.2.25 | 98 | 5.27.0 | ami-0feddab8068926b24 |

AMIs without VEP:

Region | Hail Version | EMR Version | AMI ID |
---|---|---|---|
us-east-1 | 0.2.29 | 5.28.0 | ami-05e440db5d3e3bcba |
us-east-2 | 0.2.29 | 5.28.0 | ami-064ce48aad3e10749 |
us-west-2 | 0.2.29 | 5.28.0 | ami-0d8c99d07ae2ebc5b |
us-east-1 | 0.2.27 | 5.28.0 | ami-038d051a8baaf60ff |
us-east-2 | 0.2.27 | 5.28.0 | ami-0b6d8fea9018ff7ac |
us-west-2 | 0.2.27 | 5.28.0 | ami-096d1b6615904cbe0 |
us-east-1 | 0.2.25 | 5.27.0 | ami-073f98d578b35345d |
us-east-2 | 0.2.25 | 5.27.0 | ami-0c2ab8dbb74c44e36 |
us-west-2 | 0.2.25 | 5.27.0 | ami-0842116d93dd08609 |
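
For scripted deployments, the tables can be turned into a small lookup. The Python sketch below copies only the Hail 0.2.29 rows; older versions can be added the same way. It assumes that the second table (the one without a VEP Version column) lists AMIs built without VEP. The names `AmiEntry`, `PUBLIC_AMIS`, and `find_ami` are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple


class AmiEntry(NamedTuple):
    ami_id: str
    emr_version: str


# Key: (region, Hail version, includes VEP). Values copied from the tables above.
PUBLIC_AMIS: Dict[Tuple[str, str, bool], AmiEntry] = {
    ("us-east-1", "0.2.29", True): AmiEntry("ami-0b016dfca524fec33", "5.28.0"),
    ("us-east-2", "0.2.29", True): AmiEntry("ami-082b3c5dadecc4a87", "5.28.0"),
    ("us-west-2", "0.2.29", True): AmiEntry("ami-0aa2d49e3149759e9", "5.28.0"),
    # Assumption: rows from the table without a VEP column are VEP-free AMIs.
    ("us-east-1", "0.2.29", False): AmiEntry("ami-05e440db5d3e3bcba", "5.28.0"),
    ("us-east-2", "0.2.29", False): AmiEntry("ami-064ce48aad3e10749", "5.28.0"),
    ("us-west-2", "0.2.29", False): AmiEntry("ami-0d8c99d07ae2ebc5b", "5.28.0"),
}


def find_ami(region: str, hail_version: str = "0.2.29", includes_vep: bool = True) -> AmiEntry:
    """Look up the public AMI and matching EMR version for a region."""
    try:
        return PUBLIC_AMIS[(region, hail_version, includes_vep)]
    except KeyError:
        raise LookupError(
            f"No public AMI listed for {region}, Hail {hail_version}, VEP={includes_vep}"
        ) from None


entry = find_ami("us-west-2")
print(entry.ami_id, entry.emr_version)
```

Returning the EMR version alongside the AMI ID keeps the two paired, which matches the advice to deploy each AMI with its noted EMR release.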