jimmy-nguyen-data-science / SF-COVID19-Vaccinations-pipeline

Data Engineering Project with Python, SQL, dbt, and Google Cloud Services

ADS 507: Practical Data Engineering - Team 5 Final Project

Collaborators

  • Jimmy Nguyen
  • Abby Tan
  • Yi Wang

Topic:

Build an automated ELT data pipeline using Python, Google Cloud services such as BigQuery and Cloud Functions, and dbt for data transformation.

Data Set

San Francisco's daily data on COVID-19 vaccinations

Design Document

The design document illustrates the data pipeline in detail

Code Review

  • Split the "yearmonth" column into "year" and "month" in stg_monthly_total.sql and stg_monthly_avg.sql
  • Create a "marts" folder to match the YAML file path of models > marts > core > dim tables
  • Standardize the code format using SQLFluff as the linter
  • Perform data quality checks for continuous integration testing
  • Validate the final materialized dimensional tables pushed into the production data warehouse before populating the Google Data Studio dashboards

Overview of Final Project Directory

- Python Scripts

  • environment.yml contains the list of dependencies
  • requirements.txt and requirements.txt.bak are equivalent environment files that support the yml file
  • update_daily_data.py appends newly published daily data to the warehouse (see the sketch below)
  • upload_starting_data.py uploads the initial (starting) data set
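
As a rough illustration of what update_daily_data.py does, the sketch below pulls recent rows from the DataSF Socrata endpoint for the vaccination dataset (bqge-2y7k, per the reference list) and appends them to BigQuery. The table ID, request parameters, and helper names are assumptions, not values copied from the repo.

```python
# Hedged sketch of update_daily_data.py: fetch the latest rows from DataSF and
# append them to the existing BigQuery table. The table ID and request
# parameters are illustrative assumptions, not values from the repo.
import requests
from google.cloud import bigquery

# Socrata endpoint for the SF vaccinations-over-time dataset (bqge-2y7k)
DATASF_URL = "https://data.sfgov.org/resource/bqge-2y7k.json"
TABLE_ID = "my-project.covid_vaccinations.daily-data"  # hypothetical project/dataset


def fetch_new_rows(limit=1000):
    """Fetch recently published rows from the DataSF API (date filtering omitted here)."""
    resp = requests.get(DATASF_URL, params={"$limit": limit})
    resp.raise_for_status()
    return resp.json()


def append_to_bigquery(rows):
    """Stream the rows into the existing daily-data table (append-only)."""
    client = bigquery.Client()
    errors = client.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")


if __name__ == "__main__":
    append_to_bigquery(fetch_new_rows())
```

Streaming inserts (insert_rows_json) suit small daily appends; a batch load job is the usual choice for the one-time starting upload.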

- Analysis

  • Created by dbt

- Architecture_diagrams

  • Create the architecture diagram for the data pipeline
  • Finalize the architecture diagram draft as a PNG

- Final output

  • The final output is a PDF file of the dashboard created with Google's Data Studio service

- Macros

  • Created by dbt

- Models

  • SQL transformations are performed in these models

- Marts/Core

  • Create the monthly average model from data extracted from DataSF (see the sketch below)
  • Create the monthly cumulative model from data extracted from DataSF
  • Create the monthly boosters model from data extracted from DataSF
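
The repo performs these aggregations in dbt SQL. Purely to illustrate the kind of logic a marts/core model carries, here is an equivalent monthly-average query run through the BigQuery Python client; the table and column names are assumptions, not the repo's actual schema.

```python
# Illustrative only: the kind of monthly-average aggregation the marts/core
# model performs, expressed as a standalone query via the BigQuery client.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT
      year,
      month,
      AVG(new_doses_administered) AS avg_daily_doses      -- hypothetical column
    FROM `my-project.covid_vaccinations.stg_monthly_avg`  -- hypothetical staging table
    GROUP BY year, month
    ORDER BY year, month
"""
for row in client.query(query).result():
    print(row.year, row.month, row.avg_daily_doses)
```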

- Staging

  • Create the boosters staging model
  • Create the cumulative staging model
  • Create the San Francisco COVID monthly staging model (a sketch of the yearmonth split follows this list)
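
As noted under Code Review, the staging layer splits the "yearmonth" column into "year" and "month". The sketch below shows that light transformation as a query run through the BigQuery Python client, assuming "yearmonth" is a "YYYY-MM" string; in the repo this lives in dbt SQL (e.g. stg_monthly_avg.sql), and the source table name here is hypothetical.

```python
# Illustrative sketch of the staging-layer yearmonth split. Assumes yearmonth
# is a "YYYY-MM" string; the real source table and column format may differ.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT
      CAST(SPLIT(yearmonth, '-')[OFFSET(0)] AS INT64) AS year,
      CAST(SPLIT(yearmonth, '-')[OFFSET(1)] AS INT64) AS month,
      * EXCEPT (yearmonth)
    FROM `my-project.covid_vaccinations.raw_monthly`  -- hypothetical source table
"""
rows = client.query(query).result()
```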

- Seeds

  • Created by dbt

- Snapshots

  • Created by dbt

Tests

  • Tests on the staging files require all data to be non-negative; any value less than zero will throw an error when running dbt test (see the sketch below)
  • Tests are configured to run on a daily schedule with dbt
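
The repo expresses this check as a dbt test. As a rough, standalone stand-in for the same rule, the sketch below counts negative values through the BigQuery Python client and treats a nonzero count as a failure; the table and column names are hypothetical.

```python
# Rough stand-in for the non-negative dbt test: count rows with a value below
# zero; a nonzero count is what would make `dbt test` fail.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT COUNT(*) AS bad_rows
    FROM `my-project.covid_vaccinations.stg_monthly_total`  -- hypothetical table
    WHERE new_doses_administered < 0                        -- hypothetical column
"""
bad_rows = next(iter(client.query(query).result())).bad_rows
assert bad_rows == 0, f"{bad_rows} negative values found in staging data"
```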

(Screenshots: dbt run tests, parts 1 and 2)

Architecture Diagram

Final Output

Triggering the Pipeline

  • The pipeline is triggered on a scheduled nightly run at 11:50 P.M. PST using the Google Cloud Scheduler service (a sketch of the triggered Cloud Function follows this list)

  • dbt is also configured to run a scheduled set of commands at 11:59 P.M. PST every night to clean and transform incoming new raw data and update the existing tables in BigQuery
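
A minimal sketch of how the Cloud Scheduler → Pub/Sub → Cloud Functions hand-off typically looks for a first-generation, Pub/Sub-triggered function is shown below. The entry-point name and the helpers imported from update_daily_data are assumptions; the real update logic lives in update_daily_data.py.

```python
# Sketch of a Pub/Sub-triggered Cloud Function (1st-gen background function)
# that runs the nightly update when Cloud Scheduler publishes to the trigger topic.
import base64

from update_daily_data import fetch_new_rows, append_to_bigquery  # hypothetical helpers


def main(event, context):
    """Entry point: `event` carries the base64-encoded Pub/Sub message from Cloud Scheduler."""
    if "data" in event:
        message = base64.b64decode(event["data"]).decode("utf-8")
        print(f"Triggered by Cloud Scheduler message: {message}")
    append_to_bigquery(fetch_new_rows())
```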

Database / Data Store / Data Warehouse

  • BigQuery, Google's cloud data warehouse

SQL Data Transformation Tools

  • dbt

Linters

  • SQLFluff was installed as an extension in Microsoft Visual Studio Code for writing SQL queries

How to Deploy the Pipeline (Manual Set-up)

  1. Create an account with Google Cloud
  2. Create an account with dbt
  3. Create a virtual environment with Python or Conda and install the dependencies listed in Python Scripts > environment.yml or requirements.txt
  4. Run the Python script upload_starting_data.py to load the starting data (see the sketch after this list)
  5. Configure Google Cloud Scheduler to run every day at 11:50 P.M. PST - this will send out a Pub/Sub message with the trigger topic: _new_data
  6. Paste the following environment file and code into Google Cloud Functions: Python Scripts > update_daily_data.py and requirements.txt
  7. Set up security and authentication by adding access and admin privileges to users
  8. The Python script update_daily_data.py is deployed on Google Cloud Functions daily, appending new rows to an existing table called daily-data in the Google data warehouse BigQuery
  9. Set up automated connectors between BigQuery API and dbt
  10. Create staging models for light transformations on raw data
  11. Create tests for incoming raw data
  12. Finalize documentation and schedule daily runs that materialize the cleaned, updated data back to BigQuery as new dimensional tables
  13. Launch Google Data Studio and import data directly from BigQuery to create dashboards
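
For step 4, a hedged sketch of what a one-time starting load could look like with the BigQuery client, using schema auto-detection and a truncate-then-write disposition; the table ID, the $limit value, and the exact source handling are assumptions.

```python
# Illustrative sketch of upload_starting_data.py (step 4): pull the full
# historical extract and load it into BigQuery, letting the client infer the schema.
import requests
from google.cloud import bigquery

DATASF_URL = "https://data.sfgov.org/resource/bqge-2y7k.json"
TABLE_ID = "my-project.covid_vaccinations.daily-data"  # hypothetical

client = bigquery.Client()
rows = requests.get(DATASF_URL, params={"$limit": 50000}).json()

job_config = bigquery.LoadJobConfig(
    autodetect=True,                     # infer the schema from the JSON rows
    write_disposition="WRITE_TRUNCATE",  # start clean for the initial load
)
client.load_table_from_json(rows, TABLE_ID, job_config=job_config).result()
print(f"Loaded {client.get_table(TABLE_ID).num_rows} starting rows")
```

A batch load job is used here instead of streaming inserts because the starting load is a one-off bulk write rather than an incremental append.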

How to Monitor Data Pipeline

  1. Two Google Cloud Services: Error Reporting and Cloud Monitoring
  2. Error Reporting sends notifications such as emails or texts about failures for services in the pipeline - it also gives a quick report/dashboard of the types of errors detected and their level of severity
  3. Cloud Monitoring displays reports and dashboards about the resources used by each service in order to maximize performance or reduce costs from unused services.

References

CDC Museum COVID-19 Timeline. (2022, January 5). Centers for Disease Control and Prevention. https://www.cdc.gov/museum/timeline/covid19.html#:%7E:text=January%2020%2C%202020%20CDC,18%20in%20Washington%20state

DataSF. (2021, February 11). COVID vaccinations given to SF residents over time. Retrieved February 25, 2022, from https://data.sfgov.org/COVID-19/COVID-Vaccinations-Given-to-SF-Residents-Over-Time/bqge-2y7k

U.S. Food and Drug Administration. (2021, August 23). FDA Approves First COVID-19 Vaccine. U.S. Food and Drug Administration. https://www.fda.gov/news-events/press-announcements/fda-approves-first-covid-19-vaccine
