Use Docker Compose to start the setup:

```shell
docker compose up
```
This starts:

- Spark Master (at localhost:8090)
- Spark Worker with 2 CPUs and 4 GB RAM (at localhost:8081)
- Spark Worker with 4 CPUs and 4 GB RAM (at localhost:8082)
- Spark History Server (at localhost:18081)
- Jupyter Lab (at localhost:8888)
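The services above map onto a compose file roughly like the following. This is only a sketch to show the shape of the setup; service names, port mappings, and environment variable values here are assumptions, and the repo's actual `docker-compose.yml` is authoritative:

```yaml
# Sketch only -- see docker-compose.yml for the real definitions.
services:
  spark-master:
    build: { context: ., target: master }
    ports: ["8090:8080"]          # Spark Master web UI
  spark-worker-1:
    build: { context: ., target: worker }
    environment:
      SPARK_WORKER_CORES: "2"
      SPARK_WORKER_MEMORY: "4g"
    ports: ["8081:8081"]          # worker web UI
  jupyter:
    build: { context: ., target: jupyter }
    ports: ["8888:8888"]
```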
Open JupyterLab by connecting to the Jupyter server at 127.0.0.1:8888
and use the following token:

5f69150501c3c0c4f94f5d4ae38123e2f556777f794bf48b
Use the Aggregation Pipelines notebook as a starting point.
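Inside the notebook, setting up a session mostly amounts to pointing PySpark at the standalone master and the connector packages. The helper below is a minimal sketch: the master URL (default standalone port 7077) and the Maven coordinates for the two connectors are assumptions, not values taken from this repo — check the notebook for the actual configuration.

```python
def spark_conf(master_url="spark://localhost:7077"):
    """Build the option dict for a SparkSession against this cluster.

    master_url and the package coordinates below are assumed values,
    matching the versions named in this README.
    """
    return {
        "spark.master": master_url,
        "spark.jars.packages": ",".join([
            "org.mongodb.spark:mongo-spark-connector_2.12:10.2.0",
            "com.singlestore:singlestore-jdbc-client:1.1.9",
        ]),
    }

# Applying it to a builder (requires pyspark; shown for context only):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("aggregation-pipelines")
# for key, value in spark_conf().items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```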
The Dockerfile (as used in docker-compose.yml) provides three Docker targets: `master`, `worker`, and `jupyter`.
All three targets share the same base image, consisting of:
- Spark 3.4.1 (Scala 2.12 + Hadoop 3.3) + PySpark 3.4.1 + MongoDB Connector for Spark 10.2 + SingleStore JDBC 1.1.9
- Ubuntu 23.04 with Java/OpenJDK 17 and Python 3.11
Using the same base image for JupyterLab and Spark was the only way to
get this setup working: with separate `master` and `worker` images plus a
predefined PySpark image, runs would consistently fail, either because
JARs were not found or because of serialization errors when running
PySpark programs.
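The shared-base design can be sketched as a multi-stage Dockerfile like the one below. The stage names match the targets above, but the install steps, paths, and commands are assumptions; the real Dockerfile is authoritative:

```dockerfile
# Sketch of the multi-stage layout -- details live in the real Dockerfile.
FROM ubuntu:23.04 AS base
# ... install OpenJDK 17, Python 3.11, Spark 3.4.1, PySpark,
#     the MongoDB connector and SingleStore JDBC JARs ...

FROM base AS master
CMD ["/opt/spark/bin/spark-class", "org.apache.spark.deploy.master.Master"]

FROM base AS worker
# SPARK_MASTER_URL is an assumed env var name set via docker-compose.yml
CMD ["sh", "-c", \
     "/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker $SPARK_MASTER_URL"]

FROM base AS jupyter
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--no-browser"]
```

Because every stage inherits from `base`, the driver (in Jupyter), the master, and the workers all see identical Spark, Java, and Python versions and the same JARs, which avoids the classpath and serialization mismatches described above.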