ADS 507: Practical Data Engineering - Team 5 Final Project
Collaborators
Topic:
Build an automated ELT data pipeline using Python, Google Cloud services such as BigQuery and Cloud Functions, and dbt for data transformation.
Data Set
Design Document
Code Review
- Split the “yearmonth” column into “year” and “month” in stg_monthly_total.sql and stg_monthly_avg.sql
- Create a “marts” folder to match the YAML file path of models > marts > core > dim tables
- Standardize the code format using SQLFluff as the linter
- Perform quality checks on the data for continuous integration testing
- Validate the final materialized dimensional tables pushed into the production data warehouse before populating Google Data Studio dashboards
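The yearmonth split above is done in SQL inside the staging models; a minimal Python sketch of the same transformation, assuming the column holds strings in a `YYYY-MM` format (the format is an assumption, not confirmed by the source):

```python
def split_yearmonth(yearmonth: str) -> tuple[int, int]:
    """Split a 'yearmonth' value such as '2021-07' into (year, month).

    Mirrors the split applied in stg_monthly_total.sql and
    stg_monthly_avg.sql; the 'YYYY-MM' input format is an assumption.
    """
    year_str, month_str = yearmonth.split("-")
    return int(year_str), int(month_str)
```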
Overview of Final Project Directory
- Python Scripts
- environment.yml contains the list of dependencies
- requirements.txt and requirements.txt.bak are equivalent environment files that support the yml file
- update_daily_data.py appends the new daily data
- upload_starting_data.py uploads the initial starting data
- Analysis
- Created by dbt
- Architecture_diagrams
- Create architecture diagram for data pipeline
- Finalize architecture diagram draft as png
- Final output
- Final output is a PDF file of a dashboard created by Google's Data Studio Service
- Macros
- Created by dbt
- Models
- SQL transformation in these models
- Marts/Core
- Create the monthly average table and extract data from DataSF
- Create the monthly cumulative table and extract data from DataSF
- Create the monthly boosters table and extract data from DataSF
- Staging
- Create boosters model
- Create cumulative model
- Create San Francisco Covid monthly model
- Seeds
- Created by dbt
- Snapshots
- Created by dbt
- Tests
- Staging tests assert that all data values are non-negative; any value less than zero throws an error when running the command dbt test
- Tests are configured to run on a daily schedule with dbt
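The non-negative rule enforced by `dbt test` can be mirrored in plain Python; this is only an illustration of the check (the column name and sample rows are hypothetical):

```python
def find_negative_values(rows: list[dict], column: str) -> list[dict]:
    """Return rows whose value in `column` is negative.

    Mirrors the dbt staging test: an empty result means the test
    passes; any returned row would make `dbt test` fail.
    """
    return [row for row in rows if row[column] is not None and row[column] < 0]

# Hypothetical sample of staged vaccination counts
staged = [{"new_doses": 120}, {"new_doses": 0}, {"new_doses": -5}]
failures = find_negative_values(staged, "new_doses")
```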
Architecture Diagram
Final Output
Triggering the Pipeline
- The pipeline is triggered on a scheduled nightly run at 11:50 P.M. PST using the Google Cloud Scheduler service
- dbt is also configured to run a handful of commands at 11:59 P.M. PST every night to clean and transform incoming raw data and update the existing tables in BigQuery
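The nightly trigger could be created with a `gcloud` command along these lines (the job name and message body are assumptions; exact flags may vary by `gcloud` version):

```shell
# Publish to the _new_data Pub/Sub topic every night at 11:50 P.M. Pacific
gcloud scheduler jobs create pubsub nightly-covid-refresh \
  --schedule="50 23 * * *" \
  --time-zone="America/Los_Angeles" \
  --topic=_new_data \
  --message-body="refresh"
```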
Database/Data Store/ Data Warehouse
SQL Data Transformation Tools
Linters
- SQLFluff was installed as an extension in Microsoft Visual Studio Code when writing SQL queries
How to deploy pipeline (Manually Set-up)
- Create an account with Google Cloud
- Create an account with dbt
- Create a virtual environment with Python or Conda and install the dependencies inside the file: Python Scripts > environment.yml or requirements.txt
- Run the Python script: _upload_starting_data.py
- Configure Google Cloud Scheduler to run every day at 11:50 P.M. PST - this will send out a Pub/Sub message with the trigger topic: _new_data
- Paste the following environment and code into Google Cloud Functions: Python Scripts > update_daily_data.py and requirements.txt
- Set up security and authentication by adding access and admin privileges to users
- The Python script update_daily_data.py is deployed on Google Cloud Functions and runs daily, appending new rows to an existing table called daily-data in the BigQuery data warehouse
- Set up automated connectors between BigQuery API and dbt
- Create staging models for light transformations on raw data
- Create tests for incoming raw data
- Finalize documentation and schedule daily runs to materialize clean updated data as dimensional table back to BigQuery as new tables
- Launch Google Data Studio and import data directly from BigQuery to create dashboards
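The raw-data tests in the steps above can be declared in a dbt schema file. A minimal sketch, assuming the dbt_utils package is installed and using a hypothetical model and column name:

```yaml
# models/staging/schema.yml (model and column names are assumptions)
version: 2

models:
  - name: stg_monthly_total
    columns:
      - name: new_doses
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0  # any negative value fails `dbt test`
```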
How to Monitor Data Pipeline
- Two Google Cloud Services: Error Reporting and Cloud Monitoring
- Error Reporting sends notifications such as emails or texts about failures in any of the pipeline's services - it also provides a quick report/dashboard of the types of errors detected and their severity
- Cloud Monitoring displays reports or dashboards about the resources used by each service, which helps maximize performance and reduce costs from unused services.
References
CDC Museum COVID-19 Timeline. (2022, January 5). Centers for Disease Control and Prevention. https://www.cdc.gov/museum/timeline/covid19.html
DataSF. (2021, February 11). COVID vaccinations given to SF residents over time. Retrieved February 25, 2022, from https://data.sfgov.org/COVID-19/COVID-Vaccinations-Given-to-SF-Residents-Over-Time/bqge-2y7k
U.S. Food and Drug Administration. (2021, August 23). FDA Approves First COVID-19 Vaccine. U.S. Food and Drug Administration. https://www.fda.gov/news-events/press-announcements/fda-approves-first-covid-19-vaccine