Elsayed91 / taxi-data-pipeline

Automated data & MLOps pipeline leveraging Kubernetes and Apache Airflow. Integrates Spark, Kafka, and DBT with a focus on data quality. Tailors solutions for diverse user needs.


Overview

This project showcases the orchestration of a scalable data pipeline using Kubernetes and Apache Airflow. It makes use of well-known data solutions like Spark, Kafka, and DBT, while maintaining a strong focus on data quality.
With NYC Taxi data stored in S3 as the starting point, this pipeline goes beyond generating a standard BI dashboard. It also delivers a prediction service and data models tailored for data scientists' needs.

Table of Contents
  1. Explore Some Project Deliverables
  2. Architecture
  3. Extra Notes on Technologies Used
  4. Testing
  5. How to Install

Explore Some Project Deliverables:

**Kindly note that the services are hosted using free GCP credits and may not be available once the credits are exhausted.**

You can find extra documentation in the docs/ directory.

Architecture

Visually Summarized Architecture

[Diagram: project architecture]

Data Stack

[Diagram: data stack]

Airflow Job DAG Flowchart

[Diagram: Airflow job DAG flowchart]

CICD Pipelines

```mermaid
flowchart LR

    A[GitHub Push / Pull Request]
    A --> |Python File?| B(PyApps CI)
    A --> |Docker Folder Content/ K8 Manifests Changed?| C(Build & Deploy Components)
    A --> |Terraform File Changed?| D(Terraform Infrastructure CD)

    B -- Yes --> E[Run Tests]
    C -- Yes --> F[Build Images]
    D -- Yes --> G[Terraform Actions]

    E --> |Test Successful?| H[Format, Lint, and Sort]
    F --> |Deployment Restart Required?| I[Kubectl rollout restart deployment]
    G --> |Terraform Plan Successful?| J[Terraform Apply]

    E --> K[End]
    H --> K
    F --> |No Deployment Restart Required| K
    I --> K
    G --> |Terraform Plan Unsuccessful| K
    J --> K

    B -- No --> K
    C -- No --> K
    D -- No --> K
```

Extra Notes on Technologies Used:

  • Flask: Used to serve static documentation websites for DBT, Great Expectations, and Elementary.
  • Prometheus: Collects metrics from various sources including Airflow (via Statsd), SparkApplications (via JMX Prometheus Java Agent), Postgres (via prometheus-postgres-exporter), and Kubernetes (via kube-state-metrics).
  • Grafana: Utilized to visualize the metrics collected by Prometheus. The Grafana folder contains JSON dashboard definitions created to monitor Airflow and Spark.
  • Kafka: Integrated as a streaming solution in the project; it reads a file and uses its rows as the streaming payload (see the Kafka sketch after this list).
  • Streamlit: Used to serve the best XGBoost model from the MLflow Artifact Repository (see the Streamlit sketch after this list).
  • gitsync: Implemented to allow components to directly read their respective code from GitHub without needing to include the code in the Docker image.
  • Spark-on-K8s: Utilized to run Spark jobs on Kubernetes.
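
As a rough illustration of the Kafka usage described above, the sketch below streams the rows of a local file to a topic with kafka-python; the broker address, topic name, and file path are illustrative placeholders, not the project's actual configuration.

```python
# Hypothetical sketch: stream the rows of a CSV file to a Kafka topic.
# Broker address, topic name, and file path are illustrative placeholders.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka:9092")

with open("yellow_tripdata_sample.csv") as f:
    next(f)  # skip the header row
    for row in f:
        # Each CSV row becomes one message on the topic.
        producer.send("taxi-rides", value=row.strip().encode("utf-8"))

producer.flush()  # ensure all buffered messages are delivered
```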
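
In the same spirit, here is a minimal sketch of how a Streamlit app could load and serve a registered model from MLflow; the tracking URI, model URI, and feature names are assumptions made for illustration and are not the project's actual code.

```python
# Hypothetical sketch: serve an MLflow-registered XGBoost model with Streamlit.
# Tracking URI, model URI, and feature names are illustrative assumptions.
import mlflow
import pandas as pd
import streamlit as st

mlflow.set_tracking_uri("http://mlflow:5000")
model = mlflow.pyfunc.load_model("models:/taxi-fare-xgboost/Production")

st.title("Taxi Fare Prediction")
distance = st.number_input("Trip distance (miles)", min_value=0.0, value=1.0)
passengers = st.number_input("Passenger count", min_value=1, value=1)

if st.button("Predict"):
    features = pd.DataFrame([{"trip_distance": distance, "passenger_count": passengers}])
    prediction = model.predict(features)
    st.write(f"Estimated fare: ${float(prediction[0]):.2f}")
```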

Testing

In addition to standard unit testing, the following tests were also implemented.

Data Quality Tests

  • Data quality is validated on ingestion using Great Expectations (see the sketch after this list).
  • All data sources and models have dbt-expectations data quality tests.
  • There are also unit tests where a sample input and expected output are used to test the model.
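
As a rough illustration of the ingestion-time validation, the sketch below uses Great Expectations' pandas-style API; the file, column names, and thresholds are assumptions for illustration and do not reproduce the project's actual expectation suites.

```python
# Hypothetical sketch: validate a batch of taxi data with Great Expectations'
# pandas-style API. Column names and thresholds are illustrative assumptions.
import great_expectations as ge

df = ge.read_csv("yellow_tripdata_sample.csv")

# Rides should always have a pickup timestamp and a non-negative fare.
df.expect_column_values_to_not_be_null("tpep_pickup_datetime")
df.expect_column_values_to_be_between("fare_amount", min_value=0)

# Evaluate every registered expectation and fail loudly on bad data.
results = df.validate()
if not results["success"]:
    raise ValueError("Data quality validation failed")
```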

Integration Tests

An integration test for the lambda function can be found in the tests/integration directory.
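
As a hypothetical illustration of such a test, the sketch below posts a sample S3-style event to a Lambda container running behind the AWS Lambda Runtime Interface Emulator; the port, endpoint, and event payload are assumptions rather than details taken from the repository.

```python
# Hypothetical sketch: invoke a locally running Lambda container through the
# AWS Lambda Runtime Interface Emulator. Port and event payload are assumptions.
import requests

# Default invocation endpoint exposed by the runtime interface emulator.
LAMBDA_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def test_lambda_handles_s3_event():
    # Minimal S3-style event; bucket and object key are illustrative placeholders.
    event = {
        "Records": [
            {
                "s3": {
                    "bucket": {"name": "example-bucket"},
                    "object": {"key": "yellow_tripdata_sample.csv"},
                }
            }
        ]
    }
    response = requests.post(LAMBDA_URL, json=event, timeout=30)
    assert response.status_code == 200
```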

How to Install:

Prerequisites

Installation

  1. Clone the repository:
git clone https://github.com/Elsayed91/taxi-data-pipeline
  2. Rename template.env to .env and fill out the values; you don't need to fill out the bucket or AUTH_TOKEN values.
  3. Run the project setup script. It will prompt you to log in to your gcloud account; do so and it will handle the rest.
make setup
  • To manually trigger the batch-dag:
make trigger_batch_dag
  • To run Kafka:
make run_kafka
  • To destroy the Kafka instance:
make destroy_kafka
  • To run the lambda integration test:
make run_lambda_integration_test


Note: to use the workflows, a GCP service account as well as the contents of your .env file need to be added to your GitHub Secrets.
