Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks. Complete information about this project can be found in the related blog post, *Getting Started with PySpark for Big Data Analytics, using Jupyter Notebooks and Docker*.
- `git clone` this project from GitHub
- Create a `$HOME/data/postgres` directory for the PostgreSQL files
- For local development, install the Python packages with `pip3 install -r requirements.txt` or `python3 -m pip install -r requirements.txt`
- Deploy the Docker stack: `docker stack deploy -c stack.yml pyspark`
- Download 'BreadBasket_DMS.csv' from Kaggle to the `work/` subdirectory. *This dataset was recently removed from Kaggle; however, a copy, 'BreadBasket_DMS.csv', is included as part of this project and is also available elsewhere on GitHub. Thanks, wsargent, for this update!*
- From the Jupyter terminal, install the Psycopg PostgreSQL adapter for Python: `pip install psycopg2-binary` (a quick connectivity check is sketched after this list)
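After installing `psycopg2-binary`, a quick connectivity check can confirm the stack's PostgreSQL service is reachable before loading any data. The sketch below is only illustrative; the host name, database, and credentials are placeholders and should be replaced with whatever your `stack.yml` actually defines:

```python
# Minimal connectivity check for the stack's PostgreSQL service.
# The host, database, user, and password are placeholders; substitute the
# values defined in stack.yml.
import psycopg2

conn = psycopg2.connect(
    host="postgres",          # assumed service name of the PostgreSQL container
    port=5432,
    dbname="demo",            # placeholder database name
    user="postgres",          # placeholder credentials
    password="postgres1234",
)

with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])

conn.close()
```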
From a Jupyter terminal window:

- Sample Python script: `python3 ./01_simple_script.py`
- Sample PySpark script: `$SPARK_HOME/bin/spark-submit 02_bakery_dataframes.py` (see the DataFrame sketch after this list)
- Load the PostgreSQL sample data: `python3 ./03_load_sql.py`
- Sample Jupyter Notebook: open `04_pyspark_demo_notebook.ipynb` from the Jupyter Console
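For orientation, here is a minimal sketch of the kind of DataFrame work `02_bakery_dataframes.py` performs. It is not the project's exact code; the column names (`Date`, `Time`, `Transaction`, `Item`) reflect the Kaggle bakery dataset, and the aggregation shown is just one example:

```python
# Minimal PySpark sketch: load the bakery CSV and run a simple aggregation.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = (SparkSession.builder
         .appName("bakery_dataframes_sketch")
         .getOrCreate())

# Read the Kaggle bakery transactions CSV into a DataFrame
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("BreadBasket_DMS.csv"))

df.printSchema()

# Ten most frequently purchased items
(df.groupBy("Item")
   .agg(count("Transaction").alias("count"))
   .orderBy("count", ascending=False)
   .show(10))

spark.stop()
```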
Helpful commands:

```bash
# (host) pull the latest Jupyter all-spark-notebook image
docker pull jupyter/all-spark-notebook:latest

# (host) check the state of the stack's services
docker stack ps pyspark --no-trunc

# (host) follow the Jupyter service container's logs (e.g., to retrieve the notebook URL and token)
docker logs $(docker ps | grep pyspark_pyspark | awk '{print $NF}') --follow

# (host) monitor container resource usage
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# (inside the Jupyter container) optionally install htop for process monitoring
apt-get update -y && apt-get upgrade -y
apt-get install htop
htop --sort-key help          # list the valid sort keys
htop --sort-key PERCENT_CPU   # for example, sort processes by CPU usage
```
```bash
# optional, from the Jupyter terminal, if the PostgreSQL JDBC driver is not
# already referenced via the SparkSession's spark.driver.extraClassPath
cp postgresql-42.2.8.jar /usr/local/spark/jars
```
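As an alternative to copying the JAR into `/usr/local/spark/jars`, the driver can be referenced when the SparkSession is built. This is a minimal sketch, not the project's exact code; the JDBC URL, table name, and credentials are placeholders for whatever `stack.yml` and `03_load_sql.py` actually define:

```python
# Minimal sketch: point the driver classpath at the PostgreSQL JDBC JAR and
# read a table over JDBC. URL, table, and credentials are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bakery_jdbc_sketch")
         .config("spark.driver.extraClassPath", "postgresql-42.2.8.jar")
         .getOrCreate())

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://postgres:5432/demo")  # placeholder host/db
      .option("dbtable", "transactions")                      # placeholder table
      .option("user", "postgres")                             # placeholder credentials
      .option("password", "postgres1234")
      .option("driver", "org.postgresql.Driver")
      .load())

df.show(10)
spark.stop()
```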