jimmy-nguyen-data-science / SF-COVID19-Vaccinations-pipeline

Data Engineering Project with Python, SQL, dbt, and Google Cloud Services

ADS 507: Practical Data Engineering - Team 5 Final Project

Collaborators

  • Jimmy Nguyen
  • Abby Tan
  • Yi Wang

Topic:

Build an automated ELT data pipeline using Python, Google Cloud services such as BigQuery and Cloud Functions, and dbt for data transformation.

Data Set

San Francisco's daily data on COVID-19 vaccinations

Design Document

The design document illustrates the data pipeline in detail

Code Review

  • Split the "yearmonth" column into "year" and "month" in stg_monthly_total.sql and stg_monthly_avg.sql
  • Create a "marts" folder to match the YAML file path of models > marts > core > dim tables
  • Standardize the code format using SQLFluff as the linter
  • Perform data quality checks for continuous integration testing
  • Validate the final materialized dimensional tables pushed into the production data warehouse before populating the Google Data Studio dashboards

Overview of Final Project Directory

- Python Scripts

  • environment.yml contains the list of dependencies
  • requirements.txt and requirements.txt.bak are equivalent environment files that support the yml file
  • update_daily_data.py appends newly published daily data to the warehouse (see the sketch below)
  • upload_starting_data.py uploads the initial (starting) data set
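
As a rough illustration of what update_daily_data.py does, the sketch below pulls recent rows from the DataSF Socrata endpoint for the vaccination dataset (bqge-2y7k, per the reference list) and appends them to BigQuery. The table ID, request parameters, and helper names are assumptions, not values copied from the repo.

```python
# Hedged sketch of update_daily_data.py: fetch the latest rows from DataSF and
# append them to the existing BigQuery table. The table ID and request
# parameters are illustrative assumptions, not values from the repo.
import requests
from google.cloud import bigquery

# Socrata endpoint for the SF vaccinations-over-time dataset (bqge-2y7k)
DATASF_URL = "https://data.sfgov.org/resource/bqge-2y7k.json"
TABLE_ID = "my-project.covid_vaccinations.daily-data"  # hypothetical project/dataset


def fetch_new_rows(limit=1000):
    """Fetch recently published rows from the DataSF API (date filtering omitted here)."""
    resp = requests.get(DATASF_URL, params={"$limit": limit})
    resp.raise_for_status()
    return resp.json()


def append_to_bigquery(rows):
    """Stream the rows into the existing daily-data table (append-only)."""
    client = bigquery.Client()
    errors = client.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")


if __name__ == "__main__":
    append_to_bigquery(fetch_new_rows())
```

Streaming inserts (insert_rows_json) suit small daily appends; a batch load job is the usual choice for the one-time starting upload.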

- Analysis

  • Created by dbt

- Architecture_diagrams

  • Create the architecture diagram for the data pipeline
  • Finalize the architecture diagram draft as a PNG

- Final output

  • The final output is a PDF file of the dashboard created with Google's Data Studio service

- Macros

  • Created by dbt

- Models

  • SQL transformations are performed in these models

- Marts/Core

  • Create the monthly average model from data extracted from DataSF (see the sketch below)
  • Create the monthly cumulative model from data extracted from DataSF
  • Create the monthly boosters model from data extracted from DataSF
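
The repo performs these aggregations in dbt SQL. Purely to illustrate the kind of logic a marts/core model carries, here is an equivalent monthly-average query run through the BigQuery Python client; the table and column names are assumptions, not the repo's actual schema.

```python
# Illustrative only: the kind of monthly-average aggregation the marts/core
# model performs, expressed as a standalone query via the BigQuery client.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT
      year,
      month,
      AVG(new_doses_administered) AS avg_daily_doses      -- hypothetical column
    FROM `my-project.covid_vaccinations.stg_monthly_avg`  -- hypothetical staging table
    GROUP BY year, month
    ORDER BY year, month
"""
for row in client.query(query).result():
    print(row.year, row.month, row.avg_daily_doses)
```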

- Staging

  • Create the boosters staging model
  • Create the cumulative staging model
  • Create the San Francisco COVID monthly staging model (a sketch of the yearmonth split follows this list)
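
As noted under Code Review, the staging layer splits the "yearmonth" column into "year" and "month". The sketch below shows that light transformation as a query run through the BigQuery Python client, assuming "yearmonth" is a "YYYY-MM" string; in the repo this lives in dbt SQL (e.g. stg_monthly_avg.sql), and the source table name here is hypothetical.

```python
# Illustrative sketch of the staging-layer yearmonth split. Assumes yearmonth
# is a "YYYY-MM" string; the real source table and column format may differ.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT
      CAST(SPLIT(yearmonth, '-')[OFFSET(0)] AS INT64) AS year,
      CAST(SPLIT(yearmonth, '-')[OFFSET(1)] AS INT64) AS month,
      * EXCEPT (yearmonth)
    FROM `my-project.covid_vaccinations.raw_monthly`  -- hypothetical source table
"""
rows = client.query(query).result()
```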

- Seeds

  • Created by dbt

- Snapshots

  • Created by dbt

Tests

  • Tests on the staging files require all data to be non-negative; any value less than zero will throw an error when running dbt test (see the sketch below)
  • Tests are configured to run on a daily schedule with dbt
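
The repo expresses this check as a dbt test. As a rough, standalone stand-in for the same rule, the sketch below counts negative values through the BigQuery Python client and treats a nonzero count as a failure; the table and column names are hypothetical.

```python
# Rough stand-in for the non-negative dbt test: count rows with a value below
# zero; a nonzero count is what would make `dbt test` fail.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT COUNT(*) AS bad_rows
    FROM `my-project.covid_vaccinations.stg_monthly_total`  -- hypothetical table
    WHERE new_doses_administered < 0                        -- hypothetical column
"""
bad_rows = next(iter(client.query(query).result())).bad_rows
assert bad_rows == 0, f"{bad_rows} negative values found in staging data"
```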

(Screenshots: dbt run tests, parts 1 and 2)

Architecture Diagram

Final Output

Triggering the Pipeline

  • The pipeline is triggered on a scheduled nightly run at 11:50 P.M. PST using the Google Cloud Scheduler service (a sketch of the triggered Cloud Function follows this list)

  • dbt is also configured to run a scheduled set of commands at 11:59 P.M. PST every night to clean and transform incoming new raw data and update the existing tables in BigQuery
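
A minimal sketch of how the Cloud Scheduler → Pub/Sub → Cloud Functions hand-off typically looks for a first-generation, Pub/Sub-triggered function is shown below. The entry-point name and the helpers imported from update_daily_data are assumptions; the real update logic lives in update_daily_data.py.

```python
# Sketch of a Pub/Sub-triggered Cloud Function (1st-gen background function)
# that runs the nightly update when Cloud Scheduler publishes to the trigger topic.
import base64

from update_daily_data import fetch_new_rows, append_to_bigquery  # hypothetical helpers


def main(event, context):
    """Entry point: `event` carries the base64-encoded Pub/Sub message from Cloud Scheduler."""
    if "data" in event:
        message = base64.b64decode(event["data"]).decode("utf-8")
        print(f"Triggered by Cloud Scheduler message: {message}")
    append_to_bigquery(fetch_new_rows())
```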

Database / Data Store / Data Warehouse

  • BigQuery, Google's cloud data warehouse

SQL Data Transformation Tools

  • dbt

Linters

  • SQLFluff was installed as an extension in Microsoft Visual Studio Code for writing SQL queries

How to Deploy the Pipeline (Manual Set-up)

  1. Create an account with Google Cloud
  2. Create an account with dbt
  3. Create a virtual environment with Python or Conda and install the dependencies listed in Python Scripts > environment.yml or requirements.txt
  4. Run the Python script upload_starting_data.py to load the starting data (see the sketch after this list)
  5. Configure Google Cloud Scheduler to run every day at 11:50 P.M. PST - this will send out a Pub/Sub message with the trigger topic: _new_data
  6. Paste the following environment file and code into Google Cloud Functions: Python Scripts > update_daily_data.py and requirements.txt
  7. Set up security and authentication by adding access and admin privileges to users
  8. The Python script update_daily_data.py is deployed on Google Cloud Functions daily, appending new rows to an existing table called daily-data in the Google data warehouse BigQuery
  9. Set up automated connectors between BigQuery API and dbt
  10. Create staging models for light transformations on raw data
  11. Create tests for incoming raw data
  12. Finalize documentation and schedule daily runs that materialize the cleaned, updated data back to BigQuery as new dimensional tables
  13. Launch Google Data Studio and import data directly from BigQuery to create dashboards
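
For step 4, a hedged sketch of what a one-time starting load could look like with the BigQuery client, using schema auto-detection and a truncate-then-write disposition; the table ID, the $limit value, and the exact source handling are assumptions.

```python
# Illustrative sketch of upload_starting_data.py (step 4): pull the full
# historical extract and load it into BigQuery, letting the client infer the schema.
import requests
from google.cloud import bigquery

DATASF_URL = "https://data.sfgov.org/resource/bqge-2y7k.json"
TABLE_ID = "my-project.covid_vaccinations.daily-data"  # hypothetical

client = bigquery.Client()
rows = requests.get(DATASF_URL, params={"$limit": 50000}).json()

job_config = bigquery.LoadJobConfig(
    autodetect=True,                     # infer the schema from the JSON rows
    write_disposition="WRITE_TRUNCATE",  # start clean for the initial load
)
client.load_table_from_json(rows, TABLE_ID, job_config=job_config).result()
print(f"Loaded {client.get_table(TABLE_ID).num_rows} starting rows")
```

A batch load job is used here instead of streaming inserts because the starting load is a one-off bulk write rather than an incremental append.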

How to Monitor Data Pipeline

  1. Two Google Cloud Services: Error Reporting and Cloud Monitoring
  2. Error Reporting sends notifications such as emails or texts about failures for services in the pipeline - it also gives a quick report/dashboard of the types of errors detected and their level of severity
  3. Cloud Monitoring displays reports and dashboards about the resources used by each service in order to maximize performance or reduce costs from unused services.

References

CDC Museum COVID-19 Timeline. (2022, January 5). Centers for Disease Control and Prevention. https://www.cdc.gov/museum/timeline/covid19.html#:%7E:text=January%2020%2C%202020%20CDC,18%20in%20Washington%20state

DataSF. (2021, February 11). COVID vaccinations given to SF residents over time. Retrieved February 25, 2022, from https://data.sfgov.org/COVID-19/COVID-Vaccinations-Given-to-SF-Residents-Over-Time/bqge-2y7k

U.S. Food and Drug Administration. (2021, August 23). FDA Approves First COVID-19 Vaccine. U.S. Food and Drug Administration. https://www.fda.gov/news-events/press-announcements/fda-approves-first-covid-19-vaccine
