Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks. Complete information about this project can be found in the related blog post, *Getting Started with PySpark for Big Data Analytics, using Jupyter Notebooks and Docker*.
- `git clone` this project from GitHub
- Create a `$HOME/data/postgres` directory for the PostgreSQL files
- For local development, install the Python packages with `pip3 install -r requirements.txt` or `python3 -m pip install -r requirements.txt`
- Deploy the Docker stack: `docker stack deploy -c stack.yml pyspark`
- Download 'BreadBasket_DMS.csv' from Kaggle to the `work/` subdirectory. *This dataset was recently removed from Kaggle; however, a copy, 'BreadBasket_DMS.csv', is included as part of this project and is also available elsewhere on GitHub. Thanks, wsargent, for this update!*
- From the Jupyter terminal, install the Psycopg PostgreSQL adapter for Python: `pip install psycopg2-binary` (a quick connectivity check is sketched after this list)
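After installing `psycopg2-binary`, a quick connectivity check can confirm the stack's PostgreSQL service is reachable before loading any data. The sketch below is only illustrative; the host name, database, and credentials are placeholders and should be replaced with whatever your `stack.yml` actually defines:

```python
# Minimal connectivity check for the stack's PostgreSQL service.
# The host, database, user, and password are placeholders; substitute the
# values defined in stack.yml.
import psycopg2

conn = psycopg2.connect(
    host="postgres",          # assumed service name of the PostgreSQL container
    port=5432,
    dbname="demo",            # placeholder database name
    user="postgres",          # placeholder credentials
    password="postgres1234",
)

with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])

conn.close()
```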
From a Jupyter terminal window:

- Sample Python script: `python3 ./01_simple_script.py`
- Sample PySpark script: `$SPARK_HOME/bin/spark-submit 02_bakery_dataframes.py` (see the DataFrame sketch after this list)
- Load the PostgreSQL sample data: `python3 ./03_load_sql.py`
- Sample Jupyter Notebook: open `04_pyspark_demo_notebook.ipynb` from the Jupyter Console
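For orientation, here is a minimal sketch of the kind of DataFrame work `02_bakery_dataframes.py` performs. It is not the project's exact code; the column names (`Date`, `Time`, `Transaction`, `Item`) reflect the Kaggle bakery dataset, and the aggregation shown is just one example:

```python
# Minimal PySpark sketch: load the bakery CSV and run a simple aggregation.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = (SparkSession.builder
         .appName("bakery_dataframes_sketch")
         .getOrCreate())

# Read the Kaggle bakery transactions CSV into a DataFrame
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("BreadBasket_DMS.csv"))

df.printSchema()

# Ten most frequently purchased items
(df.groupBy("Item")
   .agg(count("Transaction").alias("count"))
   .orderBy("count", ascending=False)
   .show(10))

spark.stop()
```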
Helpful commands:

```bash
# (host) pull the latest Jupyter all-spark-notebook image
docker pull jupyter/all-spark-notebook:latest

# (host) check the state of the stack's services
docker stack ps pyspark --no-trunc

# (host) follow the Jupyter service container's logs (e.g., to retrieve the notebook URL and token)
docker logs $(docker ps | grep pyspark_pyspark | awk '{print $NF}') --follow

# (host) monitor container resource usage
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# (inside the Jupyter container) optionally install htop for process monitoring
apt-get update -y && apt-get upgrade -y
apt-get install htop
htop --sort-key help          # list the valid sort keys
htop --sort-key PERCENT_CPU   # for example, sort processes by CPU usage
```
```bash
# optional, from the Jupyter terminal, if the PostgreSQL JDBC driver is not
# already referenced via the SparkSession's spark.driver.extraClassPath
cp postgresql-42.2.8.jar /usr/local/spark/jars
```
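As an alternative to copying the JAR into `/usr/local/spark/jars`, the driver can be referenced when the SparkSession is built. This is a minimal sketch, not the project's exact code; the JDBC URL, table name, and credentials are placeholders for whatever `stack.yml` and `03_load_sql.py` actually define:

```python
# Minimal sketch: point the driver classpath at the PostgreSQL JDBC JAR and
# read a table over JDBC. URL, table, and credentials are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bakery_jdbc_sketch")
         .config("spark.driver.extraClassPath", "postgresql-42.2.8.jar")
         .getOrCreate())

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://postgres:5432/demo")  # placeholder host/db
      .option("dbtable", "transactions")                      # placeholder table
      .option("user", "postgres")                             # placeholder credentials
      .option("password", "postgres1234")
      .option("driver", "org.postgresql.Driver")
      .load())

df.show(10)
spark.stop()
```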