marpontes / pyspark-setup-demo

Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks

PySpark / Jupyter Notebook Demo

Demo of PySpark and Jupyter Notebook with the Jupyter Docker Stacks. For complete information on this project, see the related blog post, "Getting Started with PySpark for Big Data Analytics, using Jupyter Notebooks and Docker".

Architecture

Set-up

  1. git clone this project from GitHub
  2. Create $HOME/data/postgres directory for PostgreSQL files
  3. For local development, install Python packages with pip3 install -r requirements.txt or python3 -m pip install -r requirements.txt
  4. Deploy Docker Stack: docker stack deploy -c stack.yml pyspark
  5. Download 'BreadBasket_DMS.csv' from Kaggle to the work/ subdirectory. Note: this dataset was recently removed from Kaggle. However, a copy, 'BreadBasket_DMS.csv', is included as part of this project, and copies are also available elsewhere on GitHub. Thanks to wsargent for this update!
  6. From the Jupyter terminal, install Psycopg Python PostgreSQL adapter: pip install psycopg2-binary
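
Steps 2 and 5 above can also be checked from Python before deploying the stack. The following is a minimal sketch, not part of the project: the function name `prepare_local_dirs` and its return format are hypothetical, and it assumes the dataset belongs at `work/BreadBasket_DMS.csv` under the cloned project directory.

```python
from pathlib import Path


def prepare_local_dirs(home: Path, project_dir: Path) -> dict:
    """Create <home>/data/postgres and report whether the dataset is in work/.

    Hypothetical helper for illustration; the project itself does not ship it.
    """
    # Step 2: directory for the PostgreSQL files (no error if it already exists).
    pg_dir = home / "data" / "postgres"
    pg_dir.mkdir(parents=True, exist_ok=True)

    # Step 5: the dataset is expected in the work/ subdirectory of the project.
    dataset = project_dir / "work" / "BreadBasket_DMS.csv"
    return {"postgres_dir": pg_dir, "dataset_present": dataset.is_file()}
```

Calling `prepare_local_dirs(Path.home(), Path.cwd())` from the project root would create `$HOME/data/postgres` and indicate whether the CSV still needs to be copied into `work/`.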

Demo

From a Jupyter terminal window:

  1. Sample Python script: python3 ./01_simple_script.py
  2. Sample PySpark script: $SPARK_HOME/bin/spark-submit 02_bakery_dataframes.py
  3. Load PostgreSQL sample data: python3 ./03_load_sql.py
  4. Sample Jupyter Notebook: open 04_pyspark_demo_notebook.ipynb from Jupyter Console
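
The first three steps above could be run in order from a single Python script. This is a rough sketch under stated assumptions: `demo_commands` and `run_demo` are hypothetical names, the scripts are assumed to be in the current working directory, and `SPARK_HOME` is assumed to be set in the Jupyter terminal's environment.

```python
import os
import subprocess


def demo_commands(spark_home: str) -> list:
    """Build the argument lists for demo steps 1 to 3 (hypothetical helper)."""
    return [
        ["python3", "./01_simple_script.py"],
        [os.path.join(spark_home, "bin", "spark-submit"), "02_bakery_dataframes.py"],
        ["python3", "./03_load_sql.py"],
    ]


def run_demo(spark_home: str = None) -> None:
    """Run each step in order, stopping at the first one that fails."""
    # Falls back to the SPARK_HOME environment variable; raises KeyError if unset.
    spark_home = spark_home or os.environ["SPARK_HOME"]
    for cmd in demo_commands(spark_home):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
```

Step 4 is left out because the notebook is opened interactively rather than executed from the terminal.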

Jupyter Notebook

Misc. Commands

docker pull jupyter/all-spark-notebook:latest
docker stack ps pyspark --no-trunc
docker logs $(docker ps | grep pyspark_pyspark | awk '{print $NF}') --follow

docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

apt-get update -y && apt-get upgrade -y
apt-get install htop
htop --sort-key help
htop --sort-key <COLUMN>  # substitute a column name listed by the previous command

# optional from Jupyter terminal if not part of SparkSession spark.driver.extraClassPath
cp postgresql-42.2.8.jar /usr/local/spark/jars
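
As an alternative to copying the jar, the driver can be put on the classpath through the `spark.driver.extraClassPath` setting mentioned in the comment above. The sketch below shows one way this might look; it is not taken from the project. The helper names, the host, database, credentials, and table passed in, and port 5432 (the PostgreSQL default) are all assumptions to adapt to your stack.

```python
# Jar location matching the cp command above; adjust if the jar lives elsewhere.
JDBC_JAR = "/usr/local/spark/jars/postgresql-42.2.8.jar"


def jdbc_options(host: str, database: str, user: str, password: str, table: str) -> dict:
    """Build the option map for Spark's JDBC data source (hypothetical helper)."""
    return {
        "url": f"jdbc:postgresql://{host}:5432/{database}",  # 5432 is assumed
        "driver": "org.postgresql.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_table(options: dict, jar_path: str = JDBC_JAR):
    """Read a PostgreSQL table into a Spark DataFrame."""
    # Imported lazily so jdbc_options can be used where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("jdbc-sketch")
        # Only takes effect if set before the driver JVM has started.
        .config("spark.driver.extraClassPath", jar_path)
        .getOrCreate()
    )
    return spark.read.format("jdbc").options(**options).load()
```

Because the classpath setting is read when the driver JVM starts, it has no effect on a session that is already running; in that case copying the jar as shown above is the simpler route.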

License: MIT License

Languages

Jupyter Notebook 93.2%, Python 3.2%, Shell 2.7%, TSQL 0.9%