It's been a while since I've used PySpark. In this repo I'm giving myself a quick refresher on the basics. I'm setting up a Spark cluster with Docker and running code against it through a JupyterLab container. Every bit of this setup should be completely reproducible.
I'll be working through a lot of the examples/functionality from the docs, exploring some standard datasets, and finally building a simple ML pipeline. The code itself lives in the jupyter/notebooks directory.
To start the Jupyter notebook container with a connection to the standalone Spark cluster, run
docker compose up --build jupyter
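Inside a notebook, connecting to the cluster looks something like this. This is a minimal sketch: the master URL is the standalone cluster's address used throughout this README, but the app name is just an illustration.

from pyspark.sql import SparkSession

# Connect to the standalone master started by docker compose.
spark = (
    SparkSession.builder
    .master("spark://spark:7077")
    .appName("pyspark-refresher")  # hypothetical app name, anything works
    .getOrCreate()
)

# Quick sanity check: run a tiny job on the cluster.
spark.range(5).show()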
You might not want to interact with Spark through notebooks, preferring instead to work from a shell on a standalone Spark container. To start just the Spark cluster, run
docker compose up --build -d spark
To open a Scala Spark shell, run
docker exec -it pyspark-refresher_spark_1 /opt/bitnami/spark/bin/spark-shell --master spark://spark:7077
To open a Python Spark shell, run
docker exec -it pyspark-refresher_spark_1 /opt/bitnami/spark/bin/pyspark --master spark://spark:7077
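Once the PySpark shell is up, it already has a SparkSession bound to the name spark, so you can run a trivial job to confirm the connection to the master. A quick sanity check:

# `spark` is predefined in the pyspark shell; count a small range to
# confirm the master is reachable and executors are doing work.
df = spark.range(10)
print(df.count())  # prints 10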
For anything else, you can open a regular bash shell in the container with
docker exec -it pyspark-refresher_spark_1 /bin/bash
I've used a couple of standard datasets in this repo, which live in the data/ directory. To get them yourself, run
docker compose run get_data
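With the data in place, reading a file into a DataFrame from a notebook or shell is straightforward. A sketch, where data/example.csv is a hypothetical filename to be replaced with one of the actual files in the data/ directory:

# Read a CSV from the data/ directory into a DataFrame.
# NOTE: "data/example.csv" is a placeholder, not a real file in this repo.
df = spark.read.csv("data/example.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)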
By default, everything above will run with a single worker node. To add more workers, run e.g.
docker compose up --build jupyter --scale spark-worker=3
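One rough way to confirm the extra workers are actually picking up tasks is to run a job with several partitions and collect the hostname each task ran on. A sketch, assuming the session from earlier:

import socket

# Each worker container has its own hostname, so the number of distinct
# hostnames seen across tasks approximates the number of active workers.
hosts = (
    spark.sparkContext
    .parallelize(range(100), 12)          # spread across 12 partitions
    .map(lambda _: socket.gethostname())  # hostname of the executing worker
    .distinct()
    .collect()
)
print(hosts)  # with --scale spark-worker=3, expect up to three names

You can also check the registered workers in the Spark master's web UI, which listens on port 8080 inside the container by default, if that port is published in the compose file.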