Exasol EMR Cluster Setup on AWS using Terraform
This is an infrastructure-as-code Terraform project that creates Exasol and EMR clusters on AWS.
Motivation
Exasol provides CloudFormation templates for building Exasol clusters; please have a look at SOL-605. However, if you prefer to work from the command line and do not want to click around in the AWS console, this project might help you.
I also added an EMR cluster, which I usually need to test the integration of Exasol with Hadoop tools. This allows me to create both clusters automatically on demand and terminate them when I finish working.
There are still manual steps required for setting up Exasol buckets.
Prerequisites
Several tools and accounts should be available before using the project.
AWS Account
Create an AWS account if you do not have one already. You can sign up here. The account should have admin access, plus an access key and a secret key so that the AWS command line tools can be used.
AWS CLI
Install the AWS command line interface. You can follow the instructions provided at aws-cli to install it.
AWS CLI Profile
Create a credentials profile for aws-cli with the access and secret keys of your account.
$ aws configure --profile my-user-profile
AWS Access Key ID [None]: <Your AWS Account Access Key>
AWS Secret Access Key [None]: <Your AWS Account Secret Key>
Default region name [None]:
Default output format [None]:
We keep the region and output format empty.
You can manually edit the credentials file, ~/.aws/credentials, anytime if you want to update it later.
Install Terraform
In order to install Terraform, you can follow the instructions from here.
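Before moving on, it can help to confirm that both command line tools are actually on your PATH. The following is a minimal sketch, not part of this project; the function name missing_tools is hypothetical, and it only checks that the commands exist, not their versions.

```shell
#!/bin/sh
# Hypothetical helper (not part of this project).
# Prints each command from the argument list that is not found on PATH
# and returns non-zero if at least one is missing.
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `missing_tools aws terraform || echo "please install the tools listed above"` prints the names of any missing tools followed by the hint.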
Usage
Follow these steps for a quick start.
Update Configuration File
Copy the configuration file config.tfvars.example to config.tfvars and modify the parameters inside it. Make sure you provide the correct aws profile name and other variables.
An example configuration:
profile = "exasol"
project = "SPRKCT"
environment = "staging"
exa_image_id = "EXASOL-6.0.6-4-BYOL"
exa_license_file_path = "./mor_byol_license.xml"
exa_db_password = "my-awesome-password"
exa_db_node_count = "3"
exa_db_node_type = "m4.2xlarge"
exa_db_replication_factor = "1"
exa_db_standby_node = "0"
emr_release_label = "emr-5.19.0"
emr_master_type = "m4.xlarge"
emr_master_count = "1"
emr_core_type = "m4.2xlarge"
emr_core_count = "3"
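Since a forgotten variable only shows up once Terraform runs, a quick check of config.tfvars beforehand can save a round trip. This is a hedged sketch under the assumption that each variable is assigned on its own line as `name = value`, like in the example above; the function name missing_vars is hypothetical and not provided by this project.

```shell
#!/bin/sh
# Hypothetical helper (not part of this project).
# Usage: missing_vars FILE NAME...
# Prints every NAME that has no "NAME = ..." assignment line in FILE.
missing_vars() {
  file=$1
  shift
  for name in "$@"; do
    # Match the exact variable name at the start of a line, followed by "=".
    grep -Eq "^[[:space:]]*${name}[[:space:]]*=" "$file" || echo "$name"
  done
}
```

For example, `missing_vars config.tfvars profile project environment exa_db_password` prints nothing when all four variables are set.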
User Public SSH Keys
Additionally, you can add public SSH keys so that you can ssh to the EMR master node without providing a private pem file.
Edit the file bootstrap_user_keys.sh as follows:
#!/bin/bash
cat <<EOT >> ~/.ssh/authorized_keys
ssh-rsa SSH_PUBLIC_KEY <username>
#
# ADD MORE HERE
#
EOT
Once the clusters are running, this makes it easy to ssh into the EMR master node:
ssh hadoop@$(terraform output out-emr-master-dns)
Similarly, with a SOCKS proxy enabled:
ssh -D 8157 hadoop@$(terraform output out-emr-master-dns)
Run
To start setting up the clusters, run:
terraform init
terraform get -update
terraform plan -var-file config.tfvars -out terraform.tfplan
terraform apply -auto-approve -var-file config.tfvars
This will take some time until everything is set up, so you can go and grab a coffee.
When you want to destroy the clusters, run:
terraform plan -destroy -var-file config.tfvars -out terraform.tfplan
terraform apply terraform.tfplan
Makefile
You can also use Makefile commands to create the clusters.
Command | Description |
---|---|
make | runs terraform init, plan and apply |
make init | terraform init, run this if it is the first run |
make update | terraform update |
make plan | terraform plan |
make apply | terraform apply, create both clusters |
make destroy | terraform destroy, destroy everything |
make exasol | create only Exasol cluster |
make emr | create only EMR cluster |
make clean | remove plan or generated files |
make run-hive | creates hive tables in EMR Hive using HDFS |
make run-etl-import | runs etl loader scripts to populate Exasol tables |
Configuration Variables
The following Terraform configuration variables should be provided.
Configuration | Default | Description |
---|---|---|
profile | | An aws-cli profile name defined in ~/.aws/credentials |
project | | An identifier string for project name used in tagging resources |
environment | | An identifier string for environment used in tagging resources |
exa_image_id | | An AWS AMI image id for creating an Exasol cluster |
exa_license_file_path | | A path to the license file if a BYOL (Bring Your Own License) image id is used |
exa_db_password | | A password to use for authentication of admin and sys users |
exa_db_node_count | 3 | The number of nodes for Exasol cluster |
exa_db_replication_factor | 1 | A replication factor for Exasol cluster |
exa_db_standby_node | 0 | The number of standby nodes for Exasol cluster |
emr_release_label | emr-5.19.0 | A release version for EMR cluster |
emr_master_type | m4.xlarge | An EC2 instance type for EMR cluster master node |
emr_master_count | 1 | The number of master nodes for EMR cluster |
emr_core_type | m4.2xlarge | An EC2 instance type for EMR cluster core nodes |
emr_core_count | 3 | The number of core nodes for EMR cluster |
The project configuration variable is also used to create an exa:project tag.
Manual Steps
This is not fully automated yet; there are still some manual steps you need to follow. Some of them are:
- Open Exasol BucketFS http & https ports
- Create an Exasol bucket
- Upload jars to Exasol buckets
- Run Hive tables creations using make run-hive. This creates hive tables in HDFS that will be loaded to Exasol later.
- Run ETL loader scripts to populate Exasol tables using make run-etl-import; however, for this to work ETL jars should be uploaded to bucket/buckets/bfsdefault/bucket1/.
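The jar upload step can be scripted with curl once the BucketFS port is open and the bucket exists. The sketch below is an illustration under stated assumptions, not part of this project: the function names bucketfs_url and upload_jar are hypothetical, the host, port, and bucket name are whatever you configured in Exasol, and it assumes the usual BucketFS convention of a plain HTTP PUT authenticated as the write user w.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this project).

# Usage: bucketfs_url HOST PORT BUCKET FILE
# Prints the HTTP URL under which FILE would be stored in BUCKET.
bucketfs_url() {
  host=$1 port=$2 bucket=$3 file=$4
  printf 'http://%s:%s/%s/%s\n' "$host" "$port" "$bucket" "$(basename "$file")"
}

# Usage: upload_jar HOST PORT BUCKET FILE
# Uploads FILE with an HTTP PUT as user "w" (assumed BucketFS write user);
# curl prompts for the bucket write password.
upload_jar() {
  curl --fail -X PUT -T "$4" --user "w" "$(bucketfs_url "$1" "$2" "$3" "$4")"
}
```

For example, `upload_jar <exasol-node-address> <bucketfs-http-port> bucket1 ./my-etl.jar` would place the jar in bucket1; the placeholders and the jar name are yours to fill in.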