Apache Spark Cluster
1. Pull the image
$ sudo docker pull seedotech/spark:2.4.0
2. Create a Spark network
$ sudo docker network create --driver=bridge spark
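You can confirm the network was created:
$ sudo docker network ls --filter name=spark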
3. Create and run Docker containers
# The 1st argument (CREATE_SPARK_MASTER) is set to YES if you want to create a Spark master; the default is YES
# The 2nd argument (SPARK_MASTER_IP) is the Spark master IP (the IP of the host machine running the Spark master Docker container); the default is spark-master
# The 3rd argument (SPARK_WORKER_NUMBER) is the number of workers you want to create; the default is 2
# The default Spark cluster has 3 nodes: 1 master and 2 workers
$ start_containers.sh
# If you want to create 1 Spark worker on another machine (assuming the IP of the Spark master is 192.168.1.10), run:
$ start_containers.sh NO 192.168.1.10 1
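For example, following the argument order described above, you can create a master on the current machine together with 3 workers:
$ start_containers.sh YES spark-master 3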
4. Get into the Spark master container
$ sudo docker exec -it spark-master bash
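Inside the container you can verify the cluster is reachable, for example by starting an interactive shell against the master (this assumes spark-shell is on the PATH in the image):
$ spark-shell --master spark://spark-master:7077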
5. Monitor the Spark cluster
Access http://spark-master:8080 to monitor the Spark cluster.
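For a quick liveness check from the command line, you can query the master UI instead (assuming curl is installed and spark-master resolves from where you run it):
$ curl -s http://spark-master:8080 | grep -i "alive workers"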
6. Run a Spark application with Hadoop
6.1. Create a shared network so the Hadoop cluster and the Spark cluster can connect to each other
$ sudo docker network create --driver=bridge spark-hadoop
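If some containers were already started on another network, you can also attach them to the shared network afterwards, for example:
$ sudo docker network connect spark-hadoop spark-master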
6.2. Create a Hadoop cluster on the spark-hadoop network
e.g.
$ sudo docker run -itd \
--net=spark-hadoop \
-p 50070:50070 \
-p 8088:8088 \
-e HADOOP_SLAVE_NUMBER=$HADOOP_SLAVE_NUMBER \
--name hadoop-master \
--hostname hadoop-master \
seedotech/hadoop:2.9.2 &> /dev/null
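You can verify that the Hadoop master joined the shared network by inspecting it (the container should be listed under "Containers"):
$ sudo docker network inspect spark-hadoop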
6.3. Create a Spark cluster on the spark-hadoop network
e.g.
$ sudo docker run -itd \
--net=spark-hadoop \
-p 6066:6066 -p 7077:7077 -p 8080:8080 \
-e IS_SPARK_MASTER=YES \
--name spark-master \
--hostname spark-master \
seedotech/spark:2.4.0 &> /dev/null
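To confirm the two masters can resolve each other over the shared network (assuming ping is available in the image):
$ sudo docker exec spark-master ping -c 2 hadoop-master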
6.4. Execute the Spark application from another machine
$ spark-submit --class com.seedotech.spark.SparkJavaApp \
--master spark://spark-master:7077 \
--deploy-mode cluster \
hdfs://hadoop-master:9000/apps/spark/spark-java-1.0.jar \
hdfs://hadoop-master:9000/apps/spark/demo.txt \
hdfs://hadoop-master:9000/apps/spark/out
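Before this command can succeed, the jar and the input file must already exist in HDFS. A sketch of uploading them from inside the hadoop-master container (assuming the hdfs client is on the PATH there and the files sit in the current directory):
$ sudo docker exec -it hadoop-master bash
$ hdfs dfs -mkdir -p /apps/spark
$ hdfs dfs -put spark-java-1.0.jar demo.txt /apps/spark/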
6.5. Debug the Spark application
Use the container logs to trace bugs/issues:
$ sudo docker logs -f spark-master
$ sudo docker logs -f spark-worker1
NOTE: the job might fail the second time you run it because the output folder hdfs://hadoop-master:9000/apps/spark/out already exists. Remove it before re-running (use http://localhost:50070/explorer.html to upload/delete folders/files).
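Alternatively, you can remove the output folder from the command line (again assuming the hdfs client is available in the hadoop-master container):
$ sudo docker exec hadoop-master hdfs dfs -rm -r /apps/spark/out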