rymurr / airflow-data


Steps

  1. Build the Docker image: docker build .
  2. Start Docker Compose: AIRFLOW_IMAGE_NAME={hash from above} docker-compose up
  3. Navigate to http://localhost:8081 and log in with airflow:airflow
  4. Go to Admin -> Connections and add nessie-default as a Nessie connection type (host = http://nessie:19120/api/v1)
  5. Go to Admin -> Connections and add spark-cluster and spark-cluster-sql (host = spark://spark, port = 7077); their types are spark and spark_sql respectively
  6. Go to Admin -> Connections and add aws-nessie as an aws type with user = access_key and pass = secret_key (these connections can also be created from a script; see the sketch after this list)
  7. Run the example_spark_operator DAG
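If clicking through the UI gets tedious, the same connections can be registered from a short script run inside the Airflow container. This is only a sketch based on the values in the steps above; the nessie and spark_sql conn_type strings are assumptions about how the Nessie and Spark providers register themselves.

```python
# Hedged alternative to steps 4-6: create the connections programmatically
# instead of through the Admin -> Connections UI.
from airflow import settings
from airflow.models import Connection

connections = [
    # conn_type "nessie" is an assumption based on step 4 above.
    Connection(conn_id="nessie-default", conn_type="nessie",
               host="http://nessie:19120/api/v1"),
    Connection(conn_id="spark-cluster", conn_type="spark",
               host="spark://spark", port=7077),
    Connection(conn_id="spark-cluster-sql", conn_type="spark_sql",
               host="spark://spark", port=7077),
    Connection(conn_id="aws-nessie", conn_type="aws",
               login="access_key", password="secret_key"),
]

session = settings.Session()
for conn in connections:
    # Skip connections that already exist so the script stays idempotent.
    if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
        session.add(conn)
session.commit()
```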

Airflow provider

  • nessie_provider.hooks.nessie_hook - defines a Hook in Airflow and exposes a connection in the UI (sketched below)
  • nessie_provider.operators.create - runs pynessie to create a ref
  • nessie_provider.operators.merge - runs pynessie to execute a merge
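For orientation, here is a minimal sketch of the shape such a hook might take on Airflow 2. The class name, the get_conn() behaviour, and the pynessie configuration keys are assumptions, not the provider's actual API; only the BaseHook base class and pynessie's init() entry point are taken as given.

```python
# Hedged sketch of a hook like nessie_provider.hooks.nessie_hook.
from airflow.hooks.base import BaseHook


class NessieHook(BaseHook):
    """Turns the nessie-default Airflow connection into a pynessie client."""

    def __init__(self, conn_id: str = "nessie-default"):
        super().__init__()
        self.conn_id = conn_id

    def get_conn(self):
        conn = self.get_connection(self.conn_id)
        # pynessie.init() is the library's documented entry point; passing the
        # endpoint via config_dict is an assumption about its options.
        from pynessie import init
        return init(config_dict={"endpoint": conn.host})
```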

Example job

See dags/dummy_dag.py, which does the following (a rough sketch of the flow follows the list):

  • Create a branch
  • Run Spark jobs to add two tables to the branch
  • Merge the branch
  • Delete the branch
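A rough reconstruction of that flow, not the actual contents of dags/dummy_dag.py: the operator class names imported from nessie_provider, their arguments, and the Spark application paths are all assumptions; only the create -> load -> merge -> delete ordering comes from the list above.

```python
# Hedged sketch of the example_spark_operator flow described above.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from nessie_provider.operators.create import CreateBranchOperator  # assumed class name
from nessie_provider.operators.merge import MergeBranchOperator    # assumed class name

with DAG(
    dag_id="example_spark_operator",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # "branch" as the argument name is also an assumption.
    create = CreateBranchOperator(task_id="create_branch", branch="etl")
    load_a = SparkSubmitOperator(task_id="load_table_a", conn_id="spark-cluster",
                                 application="/opt/jobs/load_table_a.py")  # placeholder path
    load_b = SparkSubmitOperator(task_id="load_table_b", conn_id="spark-cluster",
                                 application="/opt/jobs/load_table_b.py")  # placeholder path
    merge = MergeBranchOperator(task_id="merge_branch", branch="etl")
    # Deleting the branch afterwards is not shown: the provider modules listed
    # above only cover create and merge.
    create >> [load_a, load_b]
    [load_a, load_b] >> merge
```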

Still to do

  1. Figure out how to realistically handle the NessieSparkSql job
  2. Possibly create a sensor and a job that uses it. The sensor could, for example, (a) wait for a table to change or (b) wait for a commit on a branch
  3. Possibly add operators to expose more Nessie functionality
  4. Correctly package and push to PyPI - correct names, correct docs, typing, black, testing, etc.
  5. Add packages and env to the Spark SQL operator on the Airflow GitHub
  6. The SparkSql operator is annoying: it appears you can only run one at a time because it does something funky with a Derby DB. It may be simpler to use spark-submit instead (see the sketch after this list)
  7. Is there a way to submit SQL via a REST API, or via a PySpark job? Probably with Databricks
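On item 6, one possible workaround is to put the SQL in a small PySpark script and launch it with SparkSubmitOperator against the spark-cluster connection, keeping the Derby-backed spark-sql shell out of the picture entirely. The script below is illustrative only; the file path and the SQL are placeholders.

```python
# Hedged sketch for to-do item 6: run SQL from a PySpark script submitted via
# spark-submit instead of the SparkSql operator. Saved e.g. as run_sql.py.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("nessie-sql").getOrCreate()
    # Any SQL the DAG needs can run here against the configured catalog.
    spark.sql("SELECT 1").show()
    spark.stop()
```

From the DAG this would be launched with SparkSubmitOperator(task_id="run_sql", conn_id="spark-cluster", application="run_sql.py"), in the same way as the jobs in the earlier sketch.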

Languages

Python 92.0%, Dockerfile 8.0%