This project creates a Hadoop and Spark cluster on Amazon AWS with Terraform.
| Name | Description | Default |
|---|---|---|
| region | AWS region | us-east-1 |
| access_key | AWS access key | |
| secret_key | AWS secret key | |
| token | AWS token | null |
| instance_type | AWS instance type | m5.xlarge |
| ami_image | AWS AMI image | ami-0885b1f6bd170450c |
| key_name | Name of the key pair used between nodes | localkey |
| key_path | Path of the key pair used between nodes | . |
| aws_key_name | AWS key pair used to connect to nodes | amzkey |
| amz_key_path | Path of the AWS key pair used to connect to nodes | amzkey.pem |
| namenode_count | Namenode count | 1 |
| datanode_count | Datanode count | 3 |
| ips | Default private IPs used for nodes | See variables.tf |
| hostnames | Default private hostnames used for nodes | See variables.tf |
- Default AMI image: ami-0885b1f6bd170450c (Ubuntu 20.04, amd64, hvm-ssd)
- Spark: 3.0.1
- Hadoop: 2.7.7
- Python: latest available (currently 3.8)
- Java: OpenJDK 8u275 (JDK)
- app/: folder where you can put your application; it will be copied to the namenode
- install-all.sh: script executed on every node; it installs Hadoop/Spark and does all the configuration for you
- main.tf: definition of the resources
- output.tf: terraform output declaration
- variables.tf: terraform variable declaration
- Download and install Terraform
- Download the project and unzip it
- Open the terraform project folder "spark-terraform-master/"
- Create a file named "terraform.tfvars" and paste this:
access_key="<YOUR AWS ACCESS KEY>"
secret_key="<YOUR AWS SECRET KEY>"
token="<YOUR AWS TOKEN>"
Note: if you don't set the other variables (you can find them in variables.tf), Terraform will create a cluster in the "us-east-1" region, with 1 namenode, 3 datanodes, and an instance type of m5.xlarge.
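As an illustrative sketch, a terraform.tfvars that overrides some of those defaults might look like this (the region, instance type, and datanode count below are example values, not recommendations):

```hcl
# terraform.tfvars -- example values only
access_key     = "<YOUR AWS ACCESS KEY>"
secret_key     = "<YOUR AWS SECRET KEY>"
token          = "<YOUR AWS TOKEN>"
region         = "eu-west-1"   # instead of the default us-east-1
instance_type  = "m5.large"    # instead of the default m5.xlarge
datanode_count = 5             # instead of the default 3
```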
- Put your application files into the "app" terraform project folder
- Open a terminal and generate a new SSH key:
ssh-keygen -f <PATH_TO_SPARK_TERRAFORM>/spark-terraform-master/localkey
Where <PATH_TO_SPARK_TERRAFORM> is the path to the spark-terraform-master/ folder (e.g. /home/user/).
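For a non-interactive run, the key can also be generated like this (a sketch; key type, size, and output path are illustrative):

```shell
# -f sets the output path, -N "" sets an empty passphrase, -q suppresses prompts
ssh-keygen -t rsa -b 4096 -f ./localkey -N "" -q

# The key pair is written as ./localkey (private) and ./localkey.pub (public)
ls localkey localkey.pub
```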
- Login to AWS and create a key pair named amzkey in PEM file format (follow the guide in the AWS docs). Download the key and put it in the spark-terraform-master/ folder.
- Open a terminal, go to the spark-terraform-master/ folder, and execute:
terraform init
terraform apply
After a while (be patient!) it should print some public DNS names in green; these are the public DNS names of your instances. You can re-print them at any time with `terraform output`.
- Connect via ssh to each of your instances:
ssh -i <PATH_TO_SPARK_TERRAFORM>/spark-terraform-master/amzkey.pem ubuntu@<PUBLIC DNS>
- Execute on the master (one by one):
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slaves.sh spark://s01:7077
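The five commands above can also be bundled into a small helper script on the master (a sketch; the filename start-cluster.sh is illustrative, and $HADOOP_HOME/$SPARK_HOME are set on the nodes by install-all.sh):

```shell
# Write a one-shot start script; the quoted heredoc keeps
# $HADOOP_HOME and $SPARK_HOME unexpanded until the script runs.
cat > start-cluster.sh <<'EOF'
#!/bin/bash
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slaves.sh spark://s01:7077
EOF
chmod +x start-cluster.sh
```

Then a single `./start-cluster.sh` on the master brings everything up.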
- You are ready to execute your app! Execute this command on the master
/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --master spark://s01:7077 --executor-cores 2 --executor-memory 2g yourfile.py
Depending on the machine type you chose, you can change the number of cores used and the amount of RAM allocated to the tasks. If you would like to use a dataset different from the ENGB one, pay attention to the output of this command; if you get this warning:
WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
That means you have allocated an insufficient amount of resources, or other tasks hold a lock on them. You can check the jobs being executed in the Spark UI at the following link:
<PUBLIC DNS OF YOUR MASTER NODE>:8080
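For reference, a minimal application you could place in the app/ folder and pass to spark-submit as `yourfile.py` might look like this (a sketch; the HDFS path and filename are hypothetical, and the master URL is supplied by spark-submit):

```python
# Minimal word-count sketch for this cluster (illustrative, not the
# project's bundled app). Submit it with the spark-submit command above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-example").getOrCreate()
sc = spark.sparkContext

# Hypothetical input: a text file you previously uploaded to HDFS.
lines = sc.textFile("hdfs://s01:9000/user/ubuntu/input.txt")

# Classic word count: split, pair each word with 1, sum per word.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, n in counts.take(10):
    print(word, n)

spark.stop()
```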
- Remember to run
terraform destroy
to delete your EC2 instances
Note: The steps from 0 to 5 (inclusive) are needed only on the very first run.
- GraphComparison PySpark: an application using this project
- TransE PySpark: the first application using this project
- hadoop-spark-cluster-deployment: the starting point of this project