Spark client docker image

This repository contains a docker image to run Apache Spark client.

To run simple spark shell :

docker run -it epahomov/docker-spark:lightweighted /spark/bin/spark-shell

To run simple python spark shell (known as pyspark) :

docker run -it epahomov/docker-spark:lightweighted /spark/bin/pyspark

Examples before used lightweighted version of this image. It's very small, so it would download very fast, but it's not very flexible. All next examples would be with default version

To run simple spark R shell :

docker run -it epahomov/docker-spark /spark/bin/sparkR

To run simple spark sql shell :

docker run -it epahomov/docker-spark /spark/bin/spark-sql

To run simple spark shell with some changed properties like here :

docker run -it epahomov/docker-spark /spark/bin/spark-shell  --master local[4]

To run simple spark shell with changed spark-defaults.conf do:

printf "spark.master local[4] \nspark.executor.cores 4" > spark-defaults.conf
sudo docker run -v $(pwd)/spark-defaults.conf:/spark/conf/spark-defaults.conf -it epahomov/docker-spark /spark/bin/spark-shell

First line write conf into file spark-defaults.conf, and second line mount this file from host file system to filesystem in container and puts it in conf directory.

To be able to use spark ui, add " -p 4040:4040 " argument:

docker run -ti -p 4040:4040 epahomov/docker-spark /spark/bin/spark-shell

To run some python script do:

echo "import pyspark\nprint(pyspark.SparkContext().parallelize(range(0, 10)).count())" > count.py
docker run -it -p 4040:4040 -v $(pwd)/count.py:/count.py epahomov/docker-spark /spark/bin/spark-submit /count.py

Hadoop

With this image you can connect to Hadoop cluster from spark. All you need is specify HADOOP_CONF_DIR and pass directory with hadoop configs as volume

docker run -v $(pwd)/hadoop:/etc/hadoop/conf -e "HADOOP_CONF_DIR=/etc/hadoop/conf" --net=host  -it epahomov/docker-spark /spark/bin/spark-shell --master yarn-client

Versions

This container exists in next versions:

spark_2.0_hadoop_2.7
spark_2.0_hadoop_2.6
spark_2.1_hadoop_2.7
spark_2.1_hadoop_2.6
lightweighted - lightweighted version of this image. It's based on alpine linux and downloaded binary, not build from source with all possible plags(like -Pyarn).
old-spark - Old functionality with setting up spark cluster. Not supported, not recommended to use.

Master has version spark_2.1_hadoop_2.7

Zeppelin

This image is base image for Apache Zeppelin Image

CarlosLopezRoa / docker-spark

Spark client docker image

Hadoop

Versions

Zeppelin

About