James-Wachuka / mentalhealth_analysis-data-pipeline

An end-to-end data pipeline for mental health analysis


MENTAL HEALTH ANALYSIS DATA PIPELINE

An end-to-end data pipeline built with Python, Prefect, GCP and dbt Cloud

PIPELINE ARCHITECTURE

(pipeline architecture diagram)

PREREQUISITES

  • Python
  • Prefect
  • GCP
  • dbt Cloud
  • Terraform

Set up a Python venv and install the required packages
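The environment setup can be sketched as below; a `requirements.txt` listing the pipeline's packages (prefect, prefect-gcp, pandas, kaggle, etc.) is assumed to exist in the repo root:

```shell
# create an isolated virtual environment for the pipeline
python3 -m venv .venv
. .venv/bin/activate

# install the dependencies (requirements.txt assumed)
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
```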

CREATING INFRA ON GCP USING TERRAFORM

  • Configure the gcloud SDK on your machine and set up Terraform
  • Create a terraform directory and initialize the Terraform files there
# download and setup terraform
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform

# set credentials for gcp
export GOOGLE_APPLICATION_CREDENTIALS="yourkeys.json"

# Refresh token/session, and verify authentication
gcloud auth application-default login

# Initialize state file (.tfstate)
terraform init

# Check changes to new infra plan
terraform plan -var="project=<your-gcp-project-id>"

# Create new infra
terraform apply -var="project=<your-gcp-project-id>"
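For orientation, a minimal Terraform config for this kind of pipeline might declare a data-lake bucket and a BigQuery dataset. The resource names, region default, and dataset id below are illustrative assumptions, not the repo's actual files:

```hcl
# variables.tf (illustrative)
variable "project" {
  description = "GCP project id"
}

variable "region" {
  default = "us-central1"
}

# main.tf (illustrative): a data-lake bucket and a BigQuery dataset
provider "google" {
  project = var.project
  region  = var.region
}

resource "google_storage_bucket" "data_lake" {
  name          = "mentalhealth-data-lake-${var.project}"
  location      = var.region
  force_destroy = true
}

resource "google_bigquery_dataset" "mentalhealth" {
  dataset_id = "mentalhealth_data"
  location   = var.region
}
```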

RUNNING PREFECT FLOWS

  • Create a prefect directory and add flows and blocks folders inside it
  • Start the Prefect server
  • Set up Kaggle API access
# set up kaggle api access
mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json

Run the flows to ingest the data and load it into BigQuery:

# start the Prefect server
prefect server start

# register your custom block
prefect block register --file my_block.py

# run your flows 
python3 ./prefect/flows/data_to_gcs.py

python3 ./prefect/flows/gcs_to_bq.py
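The flow internals live in the repo's scripts; as an illustration, the kind of cleaning step a flow like data_to_gcs.py might apply before writing to GCS can be sketched with pandas. The column names and gender categories below are hypothetical, not the survey's actual schema:

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize headers and the gender column before loading to GCS/BigQuery.

    Column names here are hypothetical -- adapt to the actual dataset schema.
    """
    out = df.copy()
    # snake_case the headers so BigQuery accepts them as column names
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    # collapse free-text gender values into a small set of categories
    if "gender" in out.columns:
        out["gender"] = (
            out["gender"].str.strip().str.lower()
            .map({"m": "male", "male": "male", "f": "female", "female": "female"})
            .fillna("other")
        )
    return out


if __name__ == "__main__":
    raw = pd.DataFrame(
        {"Gender ": ["M", "Female", "non-binary"], "Self Employed": ["No", "Yes", "No"]}
    )
    print(clean(raw))
```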

(Prefect flow run output)

RUNNING TRANSFORMATIONS IN DBT CLOUD

  • Set up a dbt Cloud account
  • Clone your GitHub repo and initialize dbt
  • Set up BigQuery database credentials in dbt Cloud
  • Create and configure dbt_project.yml, macros and models accordingly
  • Run the builds in development mode

Example of a macro: get_gender_properties.sql

Example of a staging model (for development): stag_mentalhealth_data.sql

Example of a core model (for production): dim_employee.sql
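For reference, a macro like get_gender_properties.sql might look roughly like the following dbt/Jinja sketch; the category names are assumptions, not the repo's actual logic:

```sql
-- macros/get_gender_properties.sql (illustrative sketch)
{% macro get_gender_properties(gender_column) %}
    case
        when lower(trim({{ gender_column }})) in ('m', 'male') then 'male'
        when lower(trim({{ gender_column }})) in ('f', 'female') then 'female'
        else 'other'
    end
{% endmacro %}
```

A staging model would then call `{{ get_gender_properties('gender') }}` in its select list to produce a cleaned gender column.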

  • IMPORTANT: configure the development environment with the correct target database/dataset for the BigQuery dbt builds

ADDING A DEPLOY ENVIRONMENT ON DBT CLOUD

  • This environment runs jobs that load data into the production tables
  • Set up the deployment environment
  • Add job runs and schedule them in the deployment environment

VISUALIZING THE DATA

  • Looker Studio is used to build the dashboard for analysis

Link: Looker Studio dashboard of the mental health analysis

  • IMPORTANT: configure the deployment environment with the correct target database/dataset for BigQuery

NEXT STEPS

You can customize this project in the following ways:

  • Run the flows in Prefect Cloud

  • Enhance deployment by adding triggers (e.g. on pull request)

CONTRIBUTING

Contributions are welcome! If you have any ideas, improvements, or bug fixes, please open an issue or submit a pull request.

LICENSE

This project is licensed under the MIT License.
