This project creates a Hadoop and Spark cluster on Amazon AWS with Terraform.
| Name | Description | Default |
|---|---|---|
| region | AWS region | us-east-1 |
| access_key | AWS access key | |
| secret_key | AWS secret key | |
| token | AWS token | null |
| instance_type | AWS instance type | m5.xlarge |
| ami_image | AWS AMI image | ami-0885b1f6bd170450c |
| key_name | Name of the key pair used between nodes | localkey |
| key_path | Path of the key pair used between nodes | . |
| aws_key_name | AWS key pair used to connect to nodes | amzkey |
| amz_key_path | Path of the AWS key pair used to connect to nodes | amzkey.pem |
| namenode_count | Namenode count | 1 |
| datanode_count | Datanode count | 3 |
| ips | Default private IPs used for nodes | See variables.tf |
| hostnames | Default private hostnames used for nodes | See variables.tf |
- Default AMI image: ami-0885b1f6bd170450c (Ubuntu 20.04, amd64, hvm-ssd)
- Spark: 3.0.1
- Hadoop: 2.7.7
- Python: latest available (currently 3.8)
- Java: OpenJDK 8u275 (JDK)
- app/: folder where you can put your application; it will be copied to the namenode
- install-all.sh: script executed on every node; it installs Hadoop/Spark and does all the configuration for you
- main.tf: definition of the resources
- output.tf: terraform output declaration
- variables.tf: terraform variable declaration
- Download and install Terraform
- Download the project and unzip it
- Open the terraform project folder "spark-terraform-master/"
- Create a file named "terraform.tfvars" and paste this:
access_key="<YOUR AWS ACCESS KEY>"
secret_key="<YOUR AWS SECRET KEY>"
token="<YOUR AWS TOKEN>"
Note: if you don't set the other variables (you can find them in variables.tf), Terraform will create a cluster in the "us-east-1" region, with 1 namenode, 3 datanodes, and an instance type of m5.xlarge.
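As an illustrative sketch, a terraform.tfvars that overrides some of those defaults might look like this (the region, instance type, and datanode count below are example values, not recommendations):

```hcl
# terraform.tfvars -- example values only
access_key     = "<YOUR AWS ACCESS KEY>"
secret_key     = "<YOUR AWS SECRET KEY>"
token          = "<YOUR AWS TOKEN>"
region         = "eu-west-1"   # instead of the default us-east-1
instance_type  = "m5.large"    # instead of the default m5.xlarge
datanode_count = 5             # instead of the default 3
```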
- Put your application files into the "app" terraform project folder
- Open a terminal and generate a new SSH key:
ssh-keygen -f <PATH_TO_SPARK_TERRAFORM>/spark-terraform-master/localkey
Where <PATH_TO_SPARK_TERRAFORM> is the path to the spark-terraform-master/ folder (e.g. /home/user/).
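For a non-interactive run, the key can also be generated like this (a sketch; key type, size, and output path are illustrative):

```shell
# -f sets the output path, -N "" sets an empty passphrase, -q suppresses prompts
ssh-keygen -t rsa -b 4096 -f ./localkey -N "" -q

# The key pair is written as ./localkey (private) and ./localkey.pub (public)
ls localkey localkey.pub
```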
- Login to AWS and create a key pair named amzkey in PEM file format (follow the guide in the AWS docs). Download the key and put it in the spark-terraform-master/ folder.
- Open a terminal, go to the spark-terraform-master/ folder, and execute:
terraform init
terraform apply
After a while (be patient!) it should print some public DNS names in green; these are the public DNS names of your instances. You can re-print them at any time with `terraform output`.
- Connect via ssh to each of your instances:
ssh -i <PATH_TO_SPARK_TERRAFORM>/spark-terraform-master/amzkey.pem ubuntu@<PUBLIC DNS>
- Execute on the master (one by one):
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slaves.sh spark://s01:7077
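The five commands above can also be bundled into a small helper script on the master (a sketch; the filename start-cluster.sh is illustrative, and $HADOOP_HOME/$SPARK_HOME are set on the nodes by install-all.sh):

```shell
# Write a one-shot start script; the quoted heredoc keeps
# $HADOOP_HOME and $SPARK_HOME unexpanded until the script runs.
cat > start-cluster.sh <<'EOF'
#!/bin/bash
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slaves.sh spark://s01:7077
EOF
chmod +x start-cluster.sh
```

Then a single `./start-cluster.sh` on the master brings everything up.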
- You are ready to execute your app! Execute this command on the master
/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --master spark://s01:7077 --executor-cores 2 --executor-memory 2g yourfile.py
Depending on the machine type you chose, you can change the number of cores used and the amount of RAM allocated to the tasks. If you would like to use a dataset different from the ENGB one, pay attention to the output of this command; if you get this warning:
WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
That means you have allocated an insufficient amount of resources, or other tasks hold a lock on them. You can check the jobs being executed in the Spark UI at the following link:
<PUBLIC DNS OF YOUR MASTER NODE>:8080
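For reference, a minimal application you could place in the app/ folder and pass to spark-submit as `yourfile.py` might look like this (a sketch; the HDFS path and filename are hypothetical, and the master URL is supplied by spark-submit):

```python
# Minimal word-count sketch for this cluster (illustrative, not the
# project's bundled app). Submit it with the spark-submit command above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-example").getOrCreate()
sc = spark.sparkContext

# Hypothetical input: a text file you previously uploaded to HDFS.
lines = sc.textFile("hdfs://s01:9000/user/ubuntu/input.txt")

# Classic word count: split, pair each word with 1, sum per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, n in counts.take(10):
    print(word, n)

spark.stop()
```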
- Remember to run
terraform destroy
to delete your EC2 instances
Note: The steps from 0 to 5 (inclusive) are needed only on the very first run.
- GraphComparison PySpark: an application using this project
- TransE PySpark: the first application using this project
- hadoop-spark-cluster-deployment: the starting point of this project