Using Airflow to implement our ETL pipelines
- ods/opening_crawler: Crawlers written by @Rain. The scraped openings can be used for the recruitment board, which was implemented by @tai271828 and @stacy (see the DAG sketch after this list).
- ods/survey_cake: A manually triggered uploader that uploads questionnaire data to BigQuery. The uploader should be invoked after we receive the SurveyCake questionnaire results.
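For orientation, the following is a minimal sketch of what a DAG like the ones above could look like on Airflow 1.10.x. The dag_id, schedule, and callable body are illustrative assumptions, not the repository's actual code.

```python
# Illustrative sketch only; the dag_id, schedule, and callable body are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path


def crawl_openings(**context):
    # Placeholder for the crawling/uploading logic described above.
    print("crawl job openings and load them into BigQuery")


with DAG(
    dag_id="ods_opening_crawler_sketch",  # hypothetical dag_id
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="crawl_openings",
        python_callable=crawl_openings,
        provide_context=True,
    )
```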
- Docker image:
docker pull puckel/docker-airflow:1.10.9
- Python dependencies:
virtualenv venv; . venv/bin/activate
pip install poetry
poetry install
- npm dependencies for the linter, formatter, and commit linter (optional):
brew install npm
npm ci
git add <files>
npm run check   # Apply all the linters and formatters
npm run commit
- Build docker image:
docker build -t davidtnfsh/pycon_etl:cache --cache-from davidtnfsh/pycon_etl:cache .
- Start the Airflow server:
docker run --rm -p 80:8080 --name airflow -v $(pwd)/dags:/usr/local/airflow/dags -v $(pwd)/service-account.json:/usr/local/airflow/service-account.json davidtnfsh/pycon_etl:cache webserver
- service-account.json: Please contact @david30907d via email, Telegram, or Discord.
- Set up GCP authentication: https://googleapis.dev/python/google-api-core/latest/auth.html
- After invoking `gcloud auth application-default login`, you'll get a credentials file at `/Users/<xxx>/.config/gcloud/application_default_credentials.json`. Invoke `export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json"` if you have a key file.
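To confirm that credentials are picked up (either from `gcloud auth application-default login` or from `GOOGLE_APPLICATION_CREDENTIALS`), a quick check such as the following can help. This snippet is only an illustration and not part of the repository.

```python
# Illustrative check only: verify that Application Default Credentials resolve.
from google.auth import default

credentials, project_id = default()
print("Authenticated, default project:", project_id)
```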
- Give the toy examples a try.
Please check `.github/workflows` for details.
BigQuery Example:
from google.cloud import bigquery

client = bigquery.Client(project="pycontw-225217")

# Perform a query.
QUERY = """
    SELECT scenario.day2checkin.attr.diet
    FROM `pycontw-225217.ods.ods_opass_attendee_timestamp`
"""
query_job = client.query(QUERY)  # API request
rows = query_job.result()        # Waits for the query to finish
for row in rows:
    print(row.diet)
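The example above reads from BigQuery; the survey_cake uploader described earlier writes to it instead. A minimal sketch of such an insert with the same client might look like the following; the table name and row shape are made up for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client(project="pycontw-225217")

# Hypothetical table and row, purely to illustrate the streaming-insert API.
table = client.get_table("pycontw-225217.ods.ods_survey_cake_example")
rows_to_insert = [{"question": "diet", "answer": "vegetarian"}]

errors = client.insert_rows_json(table, rows_to_insert)  # API request
if errors:
    print("Encountered errors while inserting rows:", errors)
```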