
Docker Hadoop YARN cluster for Spark 2.3.2

docker-spark-yarn-cluster

This application lets you deploy a multi-node Hadoop cluster with Spark 2.3.2 on YARN.

Setup

  • Clone the repo
  • cd into the docker-spark-yarn-cluster directory
  • Follow the instructions below

Run docker-compose

docker-compose build
docker-compose up -d

Stop

docker-compose down

Host setup

On Linux machines, add the following IP addresses to /etc/hosts:

10.7.0.2 mycluster-master
10.7.0.3 mycluster-slave-1
10.7.0.4 mycluster-slave-2

SSH access:

sh setup_ssh_access_root.sh

Access

  • Docker

    docker exec -it mycluster-master bash
  • SSH

    ssh root@mycluster-master

Run Spark applications from the host:

WARNING: you must have set up the /etc/hosts file as described above first.

First, create your user's home directory in HDFS so that Spark can upload its jar files. Get a shell on the cluster (see above); then, assuming your user name on your laptop (the Docker host) is foo:

hdfs dfs -mkdir -p /user/foo
hdfs dfs -chown foo:foo /user/foo

Then run, for instance, the SparkPi example:

# Point to spark install on your host, matching the workspace spark version
export SPARK_HOME=<path to spark home>
# Need to point to cluster config files so that spark-submit knows how to connect to the cluster
export HADOOP_CONF_DIR=${PWD}/config
# Run the SparkPi example. You could also use --deploy-mode cluster. The jar file name must match your Spark version.
# If you want the run to appear in the History Server, add: --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs:///tmp/spark/history
$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client --num-executors 2 --executor-memory 2G --executor-cores 4 --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.3.2.jar
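
If you want to submit your own application the same way, it can be packaged into a jar (e.g. with sbt) and launched with the same spark-submit flags. The sketch below is only illustrative and is not a file shipped with this repo: the object name, jar path and HDFS path are assumptions.

// HdfsLineCount.scala - minimal sketch of a custom Spark application (hypothetical name).
// --master yarn and --deploy-mode are passed on the spark-submit command line,
// so the code itself does not hard-code the cluster manager.
import org.apache.spark.sql.SparkSession

object HdfsLineCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("HdfsLineCount").getOrCreate()

    // Read a text file from the cluster HDFS (NameNode on port 9000, as used elsewhere in this README)
    val lines = spark.read.textFile("hdfs://mycluster-master:9000/apps/hello_world.txt")
    println(s"Line count: ${lines.count()}")

    spark.stop()
  }
}

Once packaged, submit it like the SparkPi example, replacing the --class and jar arguments, for instance: $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client --class HdfsLineCount <path to your jar>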

Run Spark applications from the cluster:

  • spark-shell:

     spark-shell --master yarn --deploy-mode client
  • Examples:

    • 1

      echo "hello world from cluster" > hello_world.txt
      
      hadoop fs -put -f hello_world.txt /apps/hello_world.txt 
      
      echo 'spark.read.text("hdfs://mycluster-master:9000/apps/hello_world.txt").show(1000,1000)' > HelloWorld.scala
      echo 'sys.exit' >> HelloWorld.scala
      
      spark-shell --master yarn --deploy-mode client -i HelloWorld.scala  
    • 2

      spark-shell -i /app/workspace/files/examples/HelloWorld.scala
      (a sketch of what HelloWorld.scala likely contains is given at the end of this section)
    • 3

      spark-submit --deploy-mode cluster --master yarn /app/workspace/files/examples/hello_world.py

    should output something similar to:

    • 1 and 2
    20/05/19 15:08:36 INFO Client: Application report for application_1589900555706_0002 (state: RUNNING)
    20/05/19 15:08:36 INFO Client: 
    	 client token: N/A
    	 diagnostics: N/A
    	 ApplicationMaster host: 10.7.0.3
    	 ApplicationMaster RPC port: -1
    	 queue: default
    	 start time: 1589900911818
    	 final status: UNDEFINED
    	 tracking URL: http://mycluster-master:8088/proxy/application_1589900555706_0002/
    	 user: root
    20/05/19 15:08:36 INFO YarnClientSchedulerBackend: Application application_1589900555706_0002 has started running.
    20/05/19 15:08:42 INFO Main: Created Spark session with Hive support
    Spark context Web UI available at http://mycluster-master:4040
    Spark context available as 'sc' (master = yarn, app id = application_1589900555706_0002).
    Spark session available as 'spark'.
    
    20/05/19 15:08:53 INFO DAGScheduler: Job 1 finished: show at HelloWorld.scala:24, took 0.102030 s
    +------------------------+
    |                   value|
    +------------------------+
    |hello world from cluster|
    +------------------------+
    
    20/05/19 15:08:53 INFO SparkContext: Invoking stop() from shutdown hook
    20/05/19 15:08:53 INFO SparkUI: Stopped Spark web UI at http://mycluster-master:4040
    • 3: output shown as a screenshot in the repository
  • Access to the Hadoop cluster Web UI: http://mycluster-master:8088

  • Access to the Spark Web UI: http://mycluster-master:8080

  • Access to the HDFS Web UI: http://mycluster-master:50070

  • Access to the Spark History Server Web UI: http://mycluster-master:18080

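For reference, here is a hedged sketch of what /app/workspace/files/examples/HelloWorld.scala presumably contains. Since examples 1 and 2 produce the same output, it is assumed to mirror the inline script built in example 1; the actual file shipped in the image may differ.

// Assumed contents of HelloWorld.scala, mirroring example 1 above.
// The script is meant to be passed to spark-shell with -i, so it uses the shell's
// implicit `spark` session directly and exits when done.
spark.read.text("hdfs://mycluster-master:9000/apps/hello_world.txt").show(1000, 1000)
sys.exit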

Original approach for running the cluster

Build image

  • Clone the repo
  • cd into the docker-spark-yarn-cluster directory
  • Run docker build -t pierrekieffer/spark-hadoop-cluster .

Run

  • Run ./startHadoopCluster.sh
  • Access to master docker exec -it mycluster-master bash

Stop

  • docker stop $(docker ps -a -q)
  • docker container prune

About

This application lets you deploy a multi-node Hadoop cluster with Spark 3.0.0 on YARN.

License: Apache License 2.0


Languages

Shell 79.9%, Dockerfile 16.0%, Scala 2.9%, Python 1.1%