Use Docker Compose to start the setup:

```shell
docker compose up
```
This starts:

- Spark Master (at localhost:8090)
- Spark Worker with 2 CPUs and 4 GB RAM (at localhost:8081)
- Spark Worker with 4 CPUs and 4 GB RAM (at localhost:8082)
- Spark History Server (at localhost:18081)
- Jupyter Lab (at localhost:8888)
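The services above map onto a compose file roughly like the following. This is only a sketch to show the shape of the setup; service names, port mappings, and environment variable values here are assumptions, and the repo's actual `docker-compose.yml` is authoritative:

```yaml
# Sketch only -- see docker-compose.yml for the real definitions.
services:
  spark-master:
    build: { context: ., target: master }
    ports: ["8090:8080"]          # Spark Master web UI
  spark-worker-1:
    build: { context: ., target: worker }
    environment:
      SPARK_WORKER_CORES: "2"
      SPARK_WORKER_MEMORY: "4g"
    ports: ["8081:8081"]          # worker web UI
  jupyter:
    build: { context: ., target: jupyter }
    ports: ["8888:8888"]
```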
Open JupyterLab by connecting to the Jupyter server at 127.0.0.1:8888
and use the following token:

5f69150501c3c0c4f94f5d4ae38123e2f556777f794bf48b
Use the Aggregation Pipelines notebook as a starting point.
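Inside the notebook, setting up a session mostly amounts to pointing PySpark at the standalone master and the connector packages. The helper below is a minimal sketch: the master URL (default standalone port 7077) and the Maven coordinates for the two connectors are assumptions, not values taken from this repo — check the notebook for the actual configuration.

```python
def spark_conf(master_url="spark://localhost:7077"):
    """Build the option dict for a SparkSession against this cluster.

    master_url and the package coordinates below are assumed values,
    matching the versions named in this README.
    """
    return {
        "spark.master": master_url,
        "spark.jars.packages": ",".join([
            "org.mongodb.spark:mongo-spark-connector_2.12:10.2.0",
            "com.singlestore:singlestore-jdbc-client:1.1.9",
        ]),
    }

# Applying it to a builder (requires pyspark; shown for context only):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("aggregation-pipelines")
# for key, value in spark_conf().items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```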
The Dockerfile (as used in docker-compose.yml) provides three Docker targets: `master`, `worker`, and `jupyter`.
All three targets share the same base image, consisting of:
- Spark 3.4.1 (Scala 2.12 + Hadoop 3.3) + PySpark 3.4.1 + MongoDB Connector for Spark 10.2 + SingleStore JDBC 1.1.9
- Ubuntu 23.04 with Java/OpenJDK 17 and Python 3.11
Using the same base image for JupyterLab and Spark was the only way to
get this setup working: with separate `master` and `worker` images plus a
predefined PySpark image, runs would consistently fail, either because
JARs were not found or because of serialization errors when running
PySpark programs.
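The shared-base design can be sketched as a multi-stage Dockerfile like the one below. The stage names match the targets above, but the install steps, paths, and commands are assumptions; the real Dockerfile is authoritative:

```dockerfile
# Sketch of the multi-stage layout -- details live in the real Dockerfile.
FROM ubuntu:23.04 AS base
# ... install OpenJDK 17, Python 3.11, Spark 3.4.1, PySpark,
#     the MongoDB connector and SingleStore JDBC JARs ...

FROM base AS master
CMD ["/opt/spark/bin/spark-class", "org.apache.spark.deploy.master.Master"]

FROM base AS worker
# SPARK_MASTER_URL is an assumed env var name set via docker-compose.yml
CMD ["sh", "-c", \
     "/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker $SPARK_MASTER_URL"]

FROM base AS jupyter
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--no-browser"]
```

Because every stage inherits from `base`, the driver (in Jupyter), the master, and the workers all see identical Spark, Java, and Python versions and the same JARs, which avoids the classpath and serialization mismatches described above.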