- build docker image: `docker build .`
- start docker compose: `AIRFLOW_IMAGE_NAME={hash from above} docker-compose up`
- navigate to http://localhost:8081 and log in with airflow:airflow
- go to Admin -> Connections and add `nessie-default` as a `Nessie` connection type (host = `http://nessie:19120/api/v1`)
- go to Admin -> Connections and add `spark-cluster` and `spark-cluster-sql` (host = `spark://spark`, port = `7077`); the connection types are `spark` and `spark_sql` respectively
- go to Admin -> Connections and add `aws-nessie` as an `aws` type with user = access_key and pass = secret_key (these connections can also be added with a small script, sketched after this list)
- run the `example_spark_operator` dag
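If you would rather script the connections than click through the UI, something along these lines should work when run inside the Airflow environment (e.g. the scheduler container). This is a minimal sketch: the `nessie` conn_type string is an assumption, so use whatever type the provider actually registers, and swap in your real AWS credentials.

```python
# Sketch: create the three connections described above via Airflow's metadata DB.
from airflow import settings
from airflow.models import Connection

connections = [
    Connection(conn_id="nessie-default", conn_type="nessie",  # conn_type is assumed
               host="http://nessie:19120/api/v1"),
    Connection(conn_id="spark-cluster", conn_type="spark",
               host="spark://spark", port=7077),
    Connection(conn_id="spark-cluster-sql", conn_type="spark_sql",
               host="spark://spark", port=7077),
    Connection(conn_id="aws-nessie", conn_type="aws",
               login="access_key", password="secret_key"),  # your real AWS keys
]

session = settings.Session()
for conn in connections:
    # only add connections that don't exist yet
    if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
        session.add(conn)
session.commit()
```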
- nessie_provider.hooks.nessie_hook - Defines a Hook in Airflow and exposes a connection in the UI
- nessie_provider.operators.create - runs pynessie to create a ref
- nessie_provider.operators.merge - runs pynessie to execute a merge
See dags/dummy_dag.py, which will (a rough sketch of the wiring follows the list below):
- Create branch
- run spark jobs to add two tables to branch
- merge branch
- delete branch
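The steps above translate to a DAG roughly like the one below. This is a hedged sketch, not the contents of dags/dummy_dag.py: the operator class names (`CreateBranchOperator`, `MergeBranchOperator`), their parameters, and the job paths are assumptions, so check `nessie_provider.operators.create` / `nessie_provider.operators.merge` for the real API.

```python
# Rough sketch of the example dag's wiring; class names and parameters are assumed.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

from nessie_provider.operators.create import CreateBranchOperator  # assumed class name
from nessie_provider.operators.merge import MergeBranchOperator    # assumed class name

with DAG("nessie_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    create_branch = CreateBranchOperator(
        task_id="create_branch",
        nessie_conn_id="nessie-default",  # assumed parameter name
        branch="etl",
    )
    load_table_1 = SparkSubmitOperator(
        task_id="load_table_1",
        conn_id="spark-cluster",
        application="/opt/airflow/jobs/load_table_1.py",  # placeholder path
    )
    load_table_2 = SparkSubmitOperator(
        task_id="load_table_2",
        conn_id="spark-cluster",
        application="/opt/airflow/jobs/load_table_2.py",  # placeholder path
    )
    merge_branch = MergeBranchOperator(
        task_id="merge_branch",
        nessie_conn_id="nessie-default",  # assumed parameter name
        from_branch="etl",
        to_branch="main",
    )
    # (dags/dummy_dag.py also deletes the branch afterwards)

    create_branch >> [load_table_1, load_table_2] >> merge_branch
```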
- figure out how to realistically handle the `NessieSparkSql` job - possibly create a sensor and a job that uses it. The sensor could, for example, a) wait for a table to change or b) wait for a commit on a branch (see the sensor sketch at the end of this section)
- possibly add operators to expose more Nessie functionality
- correctly package and push to PyPI - correct names, correct docs, typing, black, testing, etc.
- add `packages` and `env` to the spark sql operator on airflow github - the SparkSql operator is annoying, it appears you can only run one at a time as it does something funky w/ a derby db. May just want to use spark-submit? (see the spark-submit sketch at the end of this section)
- is there a way to submit SQL via a REST API or something? Or via a pyspark job? Probably with Databricks
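One possible shape for the sensor idea above: poke until the head commit of a branch moves. This is only a sketch; `NessieHook`, its constructor argument, and the `get_reference(...).hash_` call are assumptions standing in for whatever `nessie_provider.hooks.nessie_hook` actually exposes.

```python
# Sketch: a sensor that succeeds once a Nessie branch has moved past a known commit.
from airflow.sensors.base import BaseSensorOperator

from nessie_provider.hooks.nessie_hook import NessieHook  # assumed class name


class NessieCommitSensor(BaseSensorOperator):
    """Succeeds once the given branch's head differs from a previously seen hash."""

    def __init__(self, *, nessie_conn_id="nessie-default", branch="main",
                 last_seen_hash=None, **kwargs):
        super().__init__(**kwargs)
        self.nessie_conn_id = nessie_conn_id
        self.branch = branch
        self.last_seen_hash = last_seen_hash

    def poke(self, context):
        hook = NessieHook(nessie_conn_id=self.nessie_conn_id)  # assumed constructor
        # assumed hook method/attribute returning the branch's current head hash
        current_hash = hook.get_reference(self.branch).hash_
        return current_hash != self.last_seen_hash
```

Sketch of the "just use spark-submit" alternative: unlike the SparkSql operator, `SparkSubmitOperator` accepts `packages` and `conf`, so the Nessie/Iceberg settings can ride along with each job. The package coordinates and catalog settings below are illustrative placeholders, not tested values.

```python
# Sketch: running the SQL through a pyspark script submitted with SparkSubmitOperator.
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# (would live inside a DAG, as in the example dag sketch above)
run_sql_job = SparkSubmitOperator(
    task_id="run_sql_job",
    conn_id="spark-cluster",
    application="/opt/airflow/jobs/run_sql.py",  # a pyspark script that runs the SQL
    packages="org.apache.iceberg:iceberg-spark3-runtime:0.11.1",  # placeholder version
    conf={
        "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        "spark.sql.catalog.nessie.uri": "http://nessie:19120/api/v1",
        "spark.sql.catalog.nessie.ref": "main",
    },
)
```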
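If a hosted SQL endpoint turns out to be the better route, the Databricks provider's operators (e.g. submitting a job/notebook that runs the SQL) would be the place to look; that is outside the scope of this demo, so no sketch is included here.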