shaunryan / spark_python_template

Introduction

Template project for local Spark development with an IDE. It serves primarily as a local Python build template and as a playground for Hudi and Delta Lake.

Requires a Spark standalone installation or a cluster. See Dependencies.

Dependencies

To work locally, Spark must be installed, SPARK_HOME must be set, and $SPARK_HOME/bin must be on the PATH. To use Delta Lake and Hudi, also add the following to $SPARK_HOME/conf/spark-defaults.conf.

For Delta Lake:

spark.jars.packages               io.delta:delta-core_2.12:0.8.0
spark.sql.extensions              io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog   org.apache.spark.sql.delta.catalog.DeltaCatalog

For Hudi:

spark.serializer                 org.apache.spark.serializer.KryoSerializer

spark.jars.packages               org.apache.parquet:parquet-hive-bundle:1.11.1,org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.1.1

For both:

spark.serializer                 org.apache.spark.serializer.KryoSerializer

spark.jars.packages               io.delta:delta-core_2.12:0.8.0,org.apache.parquet:parquet-hive-bundle:1.11.1,org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.1.1
spark.sql.extensions              io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog   org.apache.spark.sql.delta.catalog.DeltaCatalog
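As an alternative to editing spark-defaults.conf, the same properties can usually be set on the session builder from Python, provided they are applied before the first session starts. The following is a minimal sketch under that assumption; lake_conf and build_session are hypothetical helper names and are not part of this repo. The package coordinates are copied from the configuration above.

```python
from typing import Dict, List

# Package coordinates copied verbatim from the configuration above.
DELTA_PACKAGES = ["io.delta:delta-core_2.12:0.8.0"]
HUDI_PACKAGES = [
    "org.apache.parquet:parquet-hive-bundle:1.11.1",
    "org.apache.hudi:hudi-spark-bundle_2.12:0.7.0",
    "org.apache.spark:spark-avro_2.12:3.1.1",
]


def lake_conf(delta: bool = True, hudi: bool = True) -> Dict[str, str]:
    """Return the Spark properties for the selected formats (hypothetical helper)."""
    conf: Dict[str, str] = {}
    packages: List[str] = []
    if delta:
        packages += DELTA_PACKAGES
        conf["spark.sql.extensions"] = "io.delta.sql.DeltaSparkSessionExtension"
        conf["spark.sql.catalog.spark_catalog"] = (
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"
        )
    if hudi:
        packages += HUDI_PACKAGES
        conf["spark.serializer"] = "org.apache.spark.serializer.KryoSerializer"
    if packages:
        conf["spark.jars.packages"] = ",".join(packages)
    return conf


def build_session(app_name: str = "spark_python_demo", **flags):
    """Create a SparkSession with the properties applied (requires pyspark)."""
    # Imported lazily so lake_conf can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in lake_conf(**flags).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling build_session(delta=True, hudi=False) would request only the Delta Lake package and extensions. Keeping the settings in spark-defaults.conf, as described above, remains the approach this template assumes.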

Setup

Create a virtual environment and install the dependencies for local development:

python3.7 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -r dev_requirements.txt

Check the variables in .env and adjust them for your machine. The data roots shown here are absolute paths and will need to point at your own checkout.

ENVIRONMENT=local
HUDI_DATAROOT="file:///Users/shaunryan/code/spark_python_template/data/hudi/"
DELTA_DATAROOT="file:///Users/shaunryan/code/spark_python_template/data/delta/"
LOG_LEVEL=WARN
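How the project reads .env is not shown here. As an illustration only, the sketch below parses KEY=VALUE lines with the standard library and exports them to the process environment; parse_env and load_env are hypothetical names, and the repo may well use a library or the VS Code launch configuration for this instead.

```python
import os
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments (hypothetical helper)."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, as in HUDI_DATAROOT="...".
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_env(path: str = ".env") -> Dict[str, str]:
    """Read the file at path and export its variables without overriding existing ones."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

After load_env(), a value such as the Delta data root would be available as os.environ["DELTA_DATAROOT"]. Variables already set in the shell take precedence because setdefault does not overwrite them.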

Run

This is a VS Code project, and .vscode/launch.json is included in the repo. Press F5 to run main.py, which calls into the spark_python_demo module.
