oktavianidewi / github-data-pipeline

Batch Final Project for Data Engineering Zoomcamp Course


End-to-End Github Events Data Pipeline


Problem Description

This project showcases best practices from the Data Engineering Zoomcamp course. I aim to build an end-to-end batch data pipeline and analyze GitHub user activities from the beginning of this year.

You have probably heard of GitHub: it is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 330 million projects.

Interestingly, GitHub user activity is publicly available at GH Archive. The data is grouped on an hourly basis and stored in JSON format. Each user activity is labeled with an event type, such as push, pull request, issues, or commit. There are more than 150K rows of activity recorded in each hourly dataset. On average, the daily data is around 1.4 GB in size, and the dataset is updated daily.
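
As a quick look at the raw data, one hourly GH Archive file can be fetched and inspected with a few lines of Python (a minimal sketch; the URL pattern and the field names shown are GH Archive's public format):

    import gzip
    import json
    import urllib.request

    # GH Archive publishes one gzipped JSON-lines file per hour,
    # named https://data.gharchive.org/YYYY-MM-DD-H.json.gz
    url = "https://data.gharchive.org/2023-01-01-0.json.gz"

    with urllib.request.urlopen(url) as resp:
        with gzip.open(resp, mode="rt", encoding="utf-8") as fh:
            # Each line is one event with id, type, actor, repo, payload, created_at, ...
            for i, line in enumerate(fh):
                event = json.loads(line)
                print(event["id"], event["type"], event["repo"]["name"], event["created_at"])
                if i == 4:  # peek at the first few events only
                    break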

In this project, I am going to implement some data engineering best practices and derive interesting metrics, such as the ones below (a query sketch for the first metric follows the list):

  • daily most active users (count by number of push)
  • daily most active repos (count by number of PR)
  • daily most active organizations (count by number of PR)
  • daily number of event based on its type
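
To make the first metric concrete, here is a rough sketch of the aggregation involved, written as a BigQuery query run through the Python client; the table and column names (`core.fact_events`, `event_type`, `actor_login`, `created_at`) are illustrative placeholders, not the project's actual model names:

    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default GCP project and credentials

    # Daily most active users, counted by number of PushEvents.
    # `core.fact_events` is a hypothetical table name used for illustration.
    query = """
    SELECT
      DATE(created_at) AS event_date,
      actor_login,
      COUNT(*) AS push_count
    FROM `core.fact_events`
    WHERE event_type = 'PushEvent'
    GROUP BY event_date, actor_login
    ORDER BY event_date, push_count DESC
    """

    for row in client.query(query).result():
        print(row.event_date, row.actor_login, row.push_count)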

Technologies

This project utilizes the following tools that I learned from the Data Engineering Zoomcamp:

  • Google Cloud Storage as the datalake to store our raw dataset.
  • Google BigQuery as the data warehouse.
  • dbt core as the transformation tool to implement data modeling.
  • Self-hosted Prefect core to manage and monitor our workflow.
  • Terraform to easily manage the infrastructure setup and changes.
  • Google Compute Engine as the virtual host to host our data pipeline.
  • Looker as the dashboard tool for end users to create reports and visualize insights,

with some improvements to support easy reproducibility, such as:

  • Makefile
  • Containerized environment with docker

Data Pipeline Architecture and Workflow

data pipeline architecture

The workflow is as follows:

(1) Ingest historical and (5) moving-forward data to Google Cloud Storage

The system ingests data from gharchive. Historical data covers events created before the ingestion date, while moving-forward data is the data that arrives daily after that.

Prefect is used to run both types of ingestion and to schedule the daily ingestion of moving-forward data.

Google Cloud Storage is our datalake. The data is stored in the datalake as Parquet files, partitioned by year, month and day.
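
A minimal sketch of what such an ingestion flow can look like (this is not the repo's actual flow code; the task names, bucket and partition layout are illustrative, and it assumes the prefect, pandas, gcsfs and pyarrow packages are installed):

    import pandas as pd
    from prefect import flow, task


    @task(retries=3)
    def fetch_hour(date: str, hour: int) -> pd.DataFrame:
        """Read one hourly GH Archive file into a DataFrame."""
        url = f"https://data.gharchive.org/{date}-{hour}.json.gz"
        return pd.read_json(url, lines=True, compression="gzip")


    @task
    def transform(df: pd.DataFrame) -> pd.DataFrame:
        """Keep a few scalar columns and type-cast them."""
        return pd.DataFrame(
            {
                "id": df["id"].astype("string"),
                "type": df["type"].astype("string"),
                "actor_login": df["actor"].apply(lambda a: a["login"]),
                "repo_name": df["repo"].apply(lambda r: r["name"]),
                "created_at": pd.to_datetime(df["created_at"]),
            }
        )


    @task
    def write_partition(df: pd.DataFrame, bucket: str, date: str, hour: int) -> str:
        """Write the hour as a Parquet file under a year/month/day folder."""
        year, month, day = date.split("-")
        path = f"gs://{bucket}/year={year}/month={month}/day={day}/{date}-{hour}.parquet"
        df.to_parquet(path, index=False)
        return path


    @flow(name="ingest-gharchive-day")
    def ingest_day(date: str, bucket: str = "your-datalake-bucket"):
        """Ingest all 24 hourly files for one day into Cloud Storage."""
        for hour in range(24):
            raw = fetch_hour(date, hour)
            clean = transform(raw)
            write_partition(clean, bucket, date, hour)


    if __name__ == "__main__":
        ingest_day("2023-01-01")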

gcs-partitioned

(2) BigQuery loads data from Cloud Storage

Google BigQuery serves as the data warehouse for the project. I implement several data modeling layers in BigQuery. The first layer of my data warehouse is the raw layer.

In the raw layer, BigQuery uses external tables to read data directly from the datalake. The raw layer may contain duplicate records introduced during the ingestion process (for example, through backfilling).

The external table configurations are defined in the terraform script.
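
Purely as an illustration of what that configuration does, roughly the same setup expressed with the BigQuery Python client looks like this (the project id, dataset, bucket name and the hive-style year=/month=/day= folder layout are assumptions, not the repo's exact values):

    from google.cloud import bigquery

    client = bigquery.Client(project="your-project-id")  # placeholder project

    # An external table only stores the table definition in BigQuery;
    # the Parquet files stay in the datalake and are read at query time.
    external_config = bigquery.ExternalConfig("PARQUET")
    external_config.source_uris = ["gs://your-datalake-bucket/*.parquet"]

    # Derive year/month/day columns from the folder layout
    # (assumes hive-style year=YYYY/month=MM/day=DD paths).
    hive_options = bigquery.HivePartitioningOptions()
    hive_options.mode = "AUTO"
    hive_options.source_uri_prefix = "gs://your-datalake-bucket/"
    external_config.hive_partitioning = hive_options

    table = bigquery.Table("your-project-id.raw.github_events")
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)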

(3) Data Warehouse Transformation with dbt and (6) prefect to schedule incremental transformation

dbt is used on top of BigQuery to implement the data modeling layers.

As previously explained, the first layer of my data warehouse is the raw layer. The second layer is the staging layer, which ensures idempotency by removing duplicate records. The last layer is the core layer, which cleans the data (removing any events generated by bots) and performs the metrics aggregation.

Prefect is used to run ad-hoc dbt data transformation and to schedule daily data transformation for moving-forward data.
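
In the project these layers are implemented as dbt models; purely to illustrate the staging-layer idea, deduplication in BigQuery usually comes down to keeping one row per event id, along these lines (`raw.github_events` and `staging.stg_events` are placeholder names, not the repo's actual models):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Staging-layer idea: keep exactly one row per event id so that re-running
    # an ingestion (e.g. a backfill) never produces duplicates downstream.
    dedup_sql = """
    CREATE OR REPLACE TABLE `staging.stg_events` AS
    SELECT * EXCEPT (row_num)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_at) AS row_num
      FROM `raw.github_events`
    )
    WHERE row_num = 1
    """

    client.query(dedup_sql).result()  # wait for the job to finish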

(4) Data Visualization with Looker

github-dashboard

The link to the dashboard

Reproducibility

Pre-requisites: I use a Makefile to ease the reproducibility process. Please install the make tool in the Ubuntu terminal with this command: sudo apt install make -y.

Step 1: Build GCP Resources from Local Terminal

  1. Clone this [repository](https://github.com/oktavianidewi/github-data-pipeline.git) to your local workspace.

  2. Open the Google Cloud console and create a new GCP project by clicking the New Project button.

new-project

You will be redirected to a new form. Fill in the project details, copy the project_id, and click the Create button.

fill-detail-project-info

  3. Let's create some resources (BigQuery, Compute Engine, Cloud Storage and a Service Account) on top of the newly created project using terraform. You should provide the project_id, region and zone in infra/gcp/terraform.tfvars as per your GCP settings.

  4. Once you've updated the file, run this command to spin up the infrastructure:

    make infra-init-vm
    

    If there is an error saying that the call to enable services (compute, cloud_resource) failed, please re-run the command.

  5. Go to the Google Cloud console dashboard and make sure that all of the resources were built successfully. Then copy the SSH key from the repository's ssh/ directory to ~/.ssh/ (it will be used to connect to the VM later):

    cp ssh/github-pipeline-project ~/.ssh/
    
    
  6. Download the service-account keyfile (previously created via terraform) to a file named sa-github-pipeline-project.json with this command:

    make generate-sa
    

Step 2: Set Up the Working Environment on the VM Terminal

  1. Connect to GCE using SSH

    ssh -i ~/.ssh/github-pipeline-project <your-email-address-for-gcp>@<external-vm-ip>
    
  2. Once you've successfully connected to the VM, clone this repository, then install the make tool and terraform inside the VM:

    git clone https://github.com/oktavianidewi/github-data-pipeline.git
    
    sudo apt install make -y
    
    make install-terraform
    
  3. Go back to the local terminal and upload your service-account keyfile to the VM with this command:

    make copy-service-account-to-vm
    
  4. In the VM terminal, run the initial VM setup with make; this will install the docker, docker-compose, pip and jq tools:

    make initial-setup-vm
    
  5. Change these variables in the .env file to match your GCP account information:

    # CHANGE-THIS
    GCP_PROJECT_ID="pacific-decoder-382709"
    
    # CHANGE-THIS
    GCS_BUCKET="datalake-github-pipeline"
    
    # CHANGE-THIS
    GCP_ZONE="asia-southeast1-b"
    
    # CHANGE-THIS
    GCS_BUCKET_ID="datalake-github-pipeline_pacific-decoder-382709"
    
    # CHANGE-THIS
    GCS_PATH="gs://datalake-github-pipeline_pacific-decoder-382709"
    
    # CHANGE-THIS
    GCP_EMAIL_WITHOUT_DOMAIN_NAME="learndewi"
    

    Also change these variables in infra/gcp/terraform.tfvars to match your GCP settings:

    project_id = "pacific-decoder-382709" # CHANGE-THIS
    region     = "asia-southeast1"        # CHANGE-THIS
    zone       = "asia-southeast1-b"      # CHANGE-THIS
    
  6. Run the prefect server and agent in the VM with these commands:

    export PREFECT_ORION_UI_API_URL=http://<external-vm-ip>:4200/api 
    
    make docker-spin-up
    
    
  7. Open http://<external-vm-ip>:4200 in your browser to access the prefect dashboard. Your prefect core is now up!

    VM-hosted prefect-core on browser

  8. Start the historical data ingestion to Cloud Storage with these commands:

    make block-create
    make ingest-data
    

    These commands create a prefect flow deployment that will:

    • read data from the GitHub events archive
    • chunk the data into batches of rows
    • type-cast the data as needed
    • store the data in Cloud Storage, partitioned by year, month and day.

    ingestion-flow-progress

  9. To ingest moving-forward data, run this command:

    make set-daily-ingest-data
    

    This command will schedule a prefect deployment that runs the daily ingestion at 00:01 (a sketch of how such a schedule is defined follows this list).

    schedule-daily-ingestion

  10. Once all the raw data is stored in the datalake, we can start transforming it into the data warehouse with dbt. Run this command to transform data in the dev environment:

    make transform-data-dev
    
  11. To transform data in the production environment, run this command:

    make transform-data-prod
    
  12. To schedule the daily transformation for the production environment, run this command:

    make set-daily-transform-data-prod
    

    This command will schedule a prefect deployment that runs the daily dbt transformation at 03:01.

    schedule-daily-transformation
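
As referenced in step 9 (and likewise for the dbt run in step 12), each daily schedule boils down to a prefect deployment with a cron schedule. A minimal sketch, assuming the Orion-era Prefect 2.x API that this setup uses; the flow import path, deployment name and work queue are illustrative:

    from prefect.deployments import Deployment
    from prefect.orion.schemas.schedules import CronSchedule

    # `ingest_day` stands in for the project's ingestion flow
    # (hypothetical module path, shown for illustration only).
    from flows.ingest import ingest_day

    deployment = Deployment.build_from_flow(
        flow=ingest_day,
        name="daily-gharchive-ingestion",
        schedule=CronSchedule(cron="1 0 * * *"),  # every day at 00:01
        work_queue_name="default",
    )

    if __name__ == "__main__":
        deployment.apply()  # registers the deployment with the prefect server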

Further Improvements

There are many things that can be improved in this project:

  • Implement CI/CD
  • Do a comprehensive testing
