This project lets you create a Spark cluster based on `ubuntu:18.04`, OpenJDK 8, Spark 2.4.5, Hadoop 2.8.5, and Python 3.6 in a few easy steps. Highlights:

- Python 3.6 by default
- It creates three Docker containers based on the same image: `spark-master`, `spark-slave`, and `spark-driver`
- PySpark jobs are executed on the `spark-driver` container in client mode
- The `spark-driver` container provides a Jupyter web interface to the cluster, together with PySpark helper functions and demo notebooks for Python+Spark and Scala+Spark
- The HDFS filesystem and the `notebooks` directory are persisted in the host-mapped `/data` directory
- It matches the setup of AWS EMR version `emr-5.28.1` (EMR uses Spark 2.4.4; this project already ships 2.4.5)
- Python packages for data science and AWS integration are already installed: `numpy`, `scipy`, `pandas`, `matplotlib`, `jupyterlab`, `boto3`, `awscli`, `s3fs`
- Jupyter extensions enabled by default: `jupyter-spark`, `toc2`, `python-markdown`, `code_prettify`, `execute_time`
Install pyenv:

```
brew update
brew install pyenv
brew install pyenv-virtualenv
```

Add to `~/.zshrc`:

```
if command -v pyenv 1>/dev/null 2>&1; then
  eval "$(pyenv init -)"
  export PYENV_VIRTUALENV_DISABLE_PROMPT=1
fi
if which pyenv-virtualenv-init > /dev/null; then eval "$(pyenv virtualenv-init -)"; fi
```

Create the environment:

```
pyenv virtualenv 3.6.10 spark_cluster
pyenv activate spark_cluster
pip install Fabric3==1.14.post1
```
The cluster runs on Docker containers orchestrated with `docker-compose`. All tasks are automated using Fabric.

Provide at least 4 GB of memory to the Docker engine: "Preferences" > "Advanced" > "Memory".
These steps apply to both Linux and macOS environments, and let you build, start, and stop the cluster:

- Install `docker`: https://docs.docker.com/install/
- Install `docker-compose`: https://docs.docker.com/compose/install/
- Install the `Fabric3` Python package: `pip install Fabric3==1.14.post1`
- Generate an SSH key pair with the command `fab generate_ssh_keys`
- Build the Docker image: `fab build`
- Start the cluster: `fab start`
- Stop the cluster: `fab stop`

On Linux, you might need to add your user to the `docker` group to be able to execute `docker` commands: `sudo gpasswd -a $USER docker ; newgrp docker`

To remove all existing Docker images and caches: `docker system prune -a`
PySpark configuration profiles are defined in `spark_cluster/spark_cluster/config.py`. By default, two profiles are defined: `profile_local` and `profile_server`. You can add more, depending on your needs and environment(s).

Changes to the `spark_cluster` Python package on the host are immediately reflected inside the `spark-driver` container, including the contents of `spark_config.py`.

By default, the cluster starts using the `profile_local` PySpark configuration profile and `./data` as the `/data` directory. These default values can be changed in `config.py`.
You can start the cluster specifying different parameter values as follows: `fab start:profile=profile_server,data_dir=/data/spark_cluster`
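For orientation, a configuration profile in `config.py` could look roughly like the sketch below. The key names and values are illustrative assumptions about typical Spark settings, not the actual contents of the file:

```python
# Hypothetical sketch of PySpark configuration profiles in config.py.
# Key names and values are illustrative; check the real file for the actual settings.
profile_local = {
    "spark.executor.memory": "1g",   # small executors for local experiments
    "spark.executor.cores": "1",
}

profile_server = {
    "spark.executor.memory": "4g",   # larger executors for a beefier remote host
    "spark.executor.cores": "4",
}
```

A profile of this shape could be applied to a session with `SparkConf().setAll(profile_local.items())` before building the `SparkSession`.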
Web access to the cluster is handled with FoxyProxy:

- Install the FoxyProxy plugin in your browser
- Import `foxyproxy-settings.xml` from FoxyProxy
- Activate the proxy configuration
- Start the SSH tunnel: `fab tunnel`

You can now access these web services inside the cluster (make sure to start the tunnel first with `fab tunnel`):

- Jupyter Lab: http://spark-driver:8888/lab
- Jupyter Notebook: http://spark-driver:8888/tree
- Spark monitoring dashboard: http://spark-master:8080
- Hadoop monitoring dashboard: http://spark-master:8088/cluster
- HDFS monitoring dashboard: http://spark-master:50070/explorer.html

To SSH into the master node: `fab ssh_master`.
In some situations, you might want a local cluster for quick experiments and a remote cluster for bigger ones. The following steps make it faster to manage the additional remote instance:

- Configure access to the remote host by editing the `srv_*` variables in the `config.py` file (see the sketch after this list)
- SSH into the remote server with `fab ssh` and create the destination directory as specified by the `srv_dir` config variable
- Sync the `spark_cluster` directory to the remote server: `fab rsync`
- Copy the SSH keys from the local cluster setup to the remote server: `fab copy_keys`
- SSH into the remote server with `fab ssh` and start the cluster with `fab start`
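As a rough illustration, the `srv_*` variables in `config.py` might look like this; apart from `srv_dir`, the variable names and all values below are assumptions made for the sake of the example:

```python
# Hypothetical example of the srv_* variables in config.py.
# Only srv_dir is referenced elsewhere in this README; the other names
# and all values are illustrative placeholders.
srv_host = "remote.example.com"        # address of the remote server
srv_user = "ubuntu"                    # SSH user on the remote server
srv_dir = "/data/spark_cluster_repo"   # destination directory used by `fab rsync`
```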
You can access the web dashboards and Jupyter on the remote cluster by specifying a custom host/address `XYZ` for the SSH tunnel: `fab tunnel:XYZ`.
The same applies to accessing the remote Spark master node: `fab ssh_master:XYZ`.
The following `fab` tasks are defined:

- `build`: stop containers (if running) and build the image
- `copy_keys`: copy local SSH keys to the server host
- `fix_pep8`: fix a few common and easy-to-fix PEP8 mistakes
- `generate_ssh_keys`: generate the SSH key pair used within the cluster and to SSH into the spark-master container
- `kill`: terminate containers as fast as possible; might cause corrupted files in the /data directory
- `logs`: print live stdout/stderr from the Docker containers
- `ps`: list active containers
- `rsync`: synchronise the local spark_cluster directory to the remote server
- `shell`: open a shell on the spark-driver container
- `ssh`: SSH into the remote server
- `ssh_master`: SSH into the spark-master Docker container
- `start`: start the Docker containers: spark-master, spark-slave(s), spark-driver
- `stop`: terminate the Docker containers nicely and cleanly
- `test`: run all tests
- `test_pep8`: run only the PEP8 test
- `test_sx`: run all tests, printing output and terminating at the first failure
- `tunnel`: create an SSH port-forwarding tunnel from localhost to the spark-driver container network
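These are Fabric3 tasks. If you want to add your own, a task definition in a Fabric3 fabfile might look roughly like the following sketch (the task body is an illustrative assumption, not the project's actual implementation):

```python
# Hypothetical sketch of a Fabric3 task; illustrative only.
from fabric.api import local, task


@task
def ps():
    """List active containers."""
    local("docker-compose ps")
```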
You can access the HDFS filesystem from any container. Examples:

- Copy a file to the root directory: `hdfs dfs -put /shared/README.md /`
- List the contents of the root directory: `hdfs dfs -ls /`
- Remove a directory: `hdfs dfs -rm -r /out/`
- Move a directory: `hdfs dfs -mv /out /data`
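HDFS is also reachable from PySpark. A minimal sketch, assuming the cluster's HDFS is the default filesystem and that `README.md` was copied to the root directory as shown above:

```python
# Minimal sketch: read a file from HDFS with PySpark.
# Assumes /README.md exists in HDFS (see the `hdfs dfs -put` example above).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
lines = spark.read.text("hdfs:///README.md")
print(lines.count())
```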
The following code demonstrates:

- How to add a directory to `PYTHONPATH` to load your own packages
- How to create PySpark dataframes
- How to perform a join

```python
# Add directory of spark_cluster repository, containing spark_cluster Python package
import sys
sys.path = ["/shared"] + sys.path

# Imports
from random import randint
from pyspark.sql import Row
from spark_helpers.spark_helpers import create_df

# Create two dataframes and join them
N = 10000
df1 = create_df([Row(id=randint(0, N-1)) for i in range(N)])
df2 = create_df([Row(id=randint(0, N-1)) for i in range(N)])
df1.join(df2, df1.id == df2.id).distinct().count()
```
This example is also included in the demo notebook.
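The `create_df` helper comes from the `spark_helpers` package shipped with the project. If you want to run the example without it, a plain-PySpark stand-in might look like this sketch (not the project's actual implementation):

```python
# Illustrative stand-in for create_df: build a DataFrame from a list of Rows
# using the active SparkSession. Not the actual spark_helpers implementation.
from pyspark.sql import Row, SparkSession


def create_df(rows):
    spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame(rows)


df = create_df([Row(id=i) for i in range(5)])
df.show()
```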
- Docker image: `docker/spark` defines the image for the containers `spark-master`, `spark-slave`, and `spark-driver`
- Container entrypoints, dependencies, and health checks are defined in `docker-compose.yml`: `spark-driver` depends on `spark-slave` and `spark-master`, and `spark-slave` depends on `spark-master`
- Entrypoints differentiate the behavior of the containers and are defined in the `docker/spark/entrypoint_*.sh` scripts
- Entrypoints initialize the required environment variables by sourcing `/shared/docker/spark/init_env.sh` before starting their services: `dfs`, `yarn`, `spark`, and `jupyter`
- The `init_env.sh` script sources the Spark, Hadoop, and Python paths from `/env.sh`, built during the construction of the image
Tests verify the PEP8 compliance of the Python code and the correct functioning of HDFS and Spark.
Tests are executed inside the `spark-driver` container.

To run all tests, ignoring some innocuous "RuntimeWarning: numpy.dtype size changed" warnings: `fab test:--disable-pytest-warnings`

To run tests printing output and stopping at the first error: `fab test_sx`

To run the PEP8 test: `fab test_pep8`

To fix some common PEP8 errors in the code: `fab fix_pep8`
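To give an idea of the kind of check that runs inside `spark-driver`, a Spark smoke test could look like the sketch below; the test name and body are illustrative, not taken from the project's actual test suite:

```python
# Illustrative Spark smoke test, in the spirit of the project's test suite.
# Run with pytest inside the spark-driver container.
from pyspark.sql import Row, SparkSession


def test_spark_count():
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([Row(x=i) for i in range(100)])
    assert df.count() == 100
```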
The spark_cluster project is released under the MIT license. Please see LICENSE.txt.
Contributors include:
- Michele Dallachiesa: https://github.com/elehcimd
- Fork it
- Create your feature branch: `git checkout -b my-new-feature`
- Commit your changes: `git commit -am 'Add some feature'`
- Push to the branch: `git push origin my-new-feature`
- Create a new Pull Request