- Source code - GitHub
- Author - Gavin Noronha - gavinln@hotmail.com
This project provides an Ubuntu (14.04) Vagrant Virtual Machine (VM) with Airflow, a data workflow management system from Airbnb.
Puppet scripts automatically install the software when the VM is started.
-
To start the virtual machine (VM), type
vagrant up
-
Connect to the VM
vagrant ssh
-
Set up the Airflow home directory
export AIRFLOW_HOME=~/airflow
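The export above only lasts for the current shell session, so it has to be repeated after each new `vagrant ssh` unless it is added to a startup file such as ~/.bashrc. As a minimal sketch, assuming you want to keep any value that is already set and only fall back to ~/airflow otherwise:

```shell
# Keep an existing AIRFLOW_HOME if one is set; otherwise fall back to ~/airflow.
export AIRFLOW_HOME="${AIRFLOW_HOME:-$HOME/airflow}"

# Show the value the airflow commands in this session will use.
echo "Using AIRFLOW_HOME=$AIRFLOW_HOME"
```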
-
Initialize the sqlite database
airflow initdb
-
Start the web server
airflow webserver -p 8080
-
Open a web browser to the UI at http://192.168.33.10:8080
-
List DAGs
airflow list_dags
-
List tasks for the example_bash_operator DAG
airflow list_tasks example_bash_operator
-
List tasks for the example_bash_operator DAG in a tree view
airflow list_tasks example_bash_operator -t
-
Run the runme_0 task on the example_bash_operator DAG today
airflow run example_bash_operator runme_0 `date +%Y-%m-%d`
-
Backfill a DAG
export START_DATE=$(date -d "-2 days" "+%Y-%m-%d")
airflow backfill -s $START_DATE example_bash_operator
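The backfill window can also be bounded at both ends. The sketch below is an illustration rather than part of the original steps: it assumes GNU date (as shipped with Ubuntu), and the END_DATE variable is a name introduced here. It prints the command instead of running it, so the dates can be checked first.

```shell
# Compute a backfill window that starts two days ago and ends today.
# Relies on GNU date's -d option, which is available on Ubuntu.
START_DATE=$(date -d "-2 days" "+%Y-%m-%d")
END_DATE=$(date "+%Y-%m-%d")   # END_DATE is a name chosen for this example

# Print the command so the window can be reviewed before running it.
echo "airflow backfill -s $START_DATE -e $END_DATE example_bash_operator"
```

Copy the printed line into the shell to run the backfill once the dates look right.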
-
Clear the history of DAG runs
airflow clear example_bash_operator
-
Go to the Airflow config directory
cd ~/airflow
-
Set the Airflow DAGs directory in airflow.cfg by changing the dags_folder line to:
dags_folder = /vagrant/airflow/dags
-
Restart the web server
airflow webserver -p 8080
-
Main documentation
-
Videos on Airflow
-
Slides
-
Airflow reviews
-
Airflow tips and tricks
- https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb#.i2hu0syug
- https://stlong0521.github.io/20161023%20-%20Airflow.html
- https://databricks.com/blog/2016/12/08/integrating-apache-airflow-databricks-building-etl-pipelines-apache-spark.html
- http://site.clairvoyantsoft.com/installing-and-configuring-apache-airflow/
- https://gtoonstra.github.io/etl-with-airflow/principles.html
- https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls
-
Change to the airflow directory
cd /vagrant/airflow
-
Set airflow environment
source set_airflow_env.sh
-
Run airflow without any logging messages
-
Edit file ~/airflow/airflow.cfg
-
Set the following:
dags_folder = /vagrant/airflow/dags
load_examples = False
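These settings can also be changed from the command line instead of a text editor. The following is a rough sketch, not the project's own tooling: `set_cfg_value` is a hypothetical helper, it assumes GNU sed and values that contain no `|` or `&` characters, and it edits a scratch file so the real airflow.cfg is left untouched.

```shell
# Hypothetical helper: rewrite an existing "key = value" line in an ini-style file.
set_cfg_value() {
    cfg_file=$1; key=$2; value=$3
    # Fail if the key is missing rather than guessing which section it belongs to.
    grep -q "^${key} *=" "$cfg_file" || { echo "no ${key} in ${cfg_file}" >&2; return 1; }
    # GNU sed edits the file in place; "|" is the delimiter so paths need no escaping.
    sed -i "s|^${key} *=.*|${key} = ${value}|" "$cfg_file"
}

# Scratch file standing in for ~/airflow/airflow.cfg (its contents are made up).
CFG=$(mktemp)
printf '[core]\ndags_folder = /home/vagrant/airflow/dags\nload_examples = True\n' > "$CFG"

set_cfg_value "$CFG" dags_folder /vagrant/airflow/dags
set_cfg_value "$CFG" load_examples False
cat "$CFG"
```

To apply the same changes to the real file, pass ~/airflow/airflow.cfg as the first argument after making a backup copy.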
-
Start the scheduler by running the following
airflow scheduler
The following software is needed to get the source code from GitHub and to run Vagrant, which sets up the Python development environment. The Git environment also provides an SSH client for Windows.
- By default, Vagrant shares your project directory (the directory with the Vagrantfile) to /vagrant. Read more here: https://www.vagrantup.com/docs/synced-folders/