
Pipeline for IMDB Movies and Ratings on GCP

This is the second project in the Data Engineering Zoomcamp curriculum. It demonstrates a pipeline for batch processing of data sourced from the IMDb Non-Commercial Datasets.

IMDb, the Internet Movie Database, is an extensive online repository of information about movies, TV shows, and video games. It covers cast and crew details, plot summaries, ratings, and reviews, making it an indispensable resource for both movie enthusiasts and industry professionals.

Our primary objective is to build a robust data pipeline that retrieves raw data from the IMDb datasets, stores it, improves its quality, and presents it in an intuitive dashboard. The pipeline lets us answer questions such as: how films are distributed across genres, how long the longest movie runs in hours, what proportion of titles falls within each category, how many movies are in the database, and how films are distributed by release year, among others.
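As a taste of how such questions can be answered once the data lands in BigQuery, here is a minimal sketch using the BigQuery Python client to count titles per genre. The imdb_analytics dataset name matches the GCP_IMDB_WH_DATASET value used later in this README; the title_basics table name is only an assumption for illustration.

    from google.cloud import bigquery

    # Uses GOOGLE_APPLICATION_CREDENTIALS for authentication.
    client = bigquery.Client()

    query = """
        SELECT genre, COUNT(*) AS n_titles
        FROM `imdb_analytics.title_basics`,        -- hypothetical table name
             UNNEST(SPLIT(genres, ',')) AS genre   -- IMDb stores genres as a comma-separated list
        GROUP BY genre
        ORDER BY n_titles DESC
    """

    for row in client.query(query).result():
        print(row.genre, row.n_titles)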

According to the IMDb data specifications, the dataset is updated daily. Consequently, our data pipeline is scheduled to run every day at 1 PM UTC so that the latest data is integrated into our system.
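A minimal sketch of how that schedule could be declared in an Airflow DAG is shown below; the DAG id and placeholder task are hypothetical and exist only to illustrate the daily 13:00 UTC cron expression.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    with DAG(
        dag_id="imdb_batch_pipeline",        # hypothetical name
        schedule_interval="0 13 * * *",      # every day at 13:00 UTC
        start_date=datetime(2024, 1, 1),
        catchup=False,
        tags=["imdb"],
    ) as dag:
        # Stand-in for the real download -> GCS -> Dataproc -> BigQuery tasks.
        EmptyOperator(task_id="placeholder")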

Built With

  • Google Cloud Platform (Cloud Storage, BigQuery, Dataproc)
  • Terraform
  • Apache Airflow
  • Docker
  • Metabase

Project Architecture

architecture

The cloud infrastructure is provisioned with Terraform, providing a scalable and reliable foundation for our applications and services, while Airflow, our workflow orchestration tool, runs locally in Docker.

Prerequisites

  1. A Google Cloud Platform account.
  2. Install VSCode or Zed or any other IDE that works for you.
  3. Install Terraform.
  4. Install Docker Desktop.
  5. Install Google Cloud SDK.
  6. Clone this repository onto your local machine.

Create a Google Cloud Project

  • Go to Google Cloud and create a new project.

  • Get the project ID and set the GCP_PROJECT_ID environment variable in the .env file located in the root directory.

  • Create a Service account with the following roles:

    • BigQuery Admin
    • Storage Admin
    • Viewer
    • Compute Admin
    • Dataproc Administrator
  • Download the Service Account credentials file and store it in $HOME/.google/credentials/.

  • You need to enable the following APIs here:

    • Cloud Storage API
    • BigQuery API
  • Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your JSON credentials file, so that GOOGLE_APPLICATION_CREDENTIALS is $HOME/.google/credentials/<authkeys_filename>.json

    • Add this line to the end of your .bashrc file:
    export GOOGLE_APPLICATION_CREDENTIALS=${HOME}/.google/credentials/<authkeys_filename>.json
    • Activate the environment variable by running source ~/.bashrc. An optional check that the credentials are picked up follows below.
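
If the google-cloud-storage Python package is installed, the optional check below (not part of the repo) confirms that the Google client libraries pick up your credentials file before you move on to Terraform.

    from google.cloud import storage

    # Reads GOOGLE_APPLICATION_CREDENTIALS implicitly.
    client = storage.Client()
    print("Authenticated against project:", client.project)
    print("Buckets:", [bucket.name for bucket in client.list_buckets()])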

Set up the infrastructure on GCP with Terraform

  • Using Zed or VSCode, open the cloned project IMDB-pipeline-project.
  • To change the default values of variable "project" and variable "region" to your preferred project ID and region, either edit variables.tf directly and modify the values, or set the environment variables TF_VAR_project and TF_VAR_region.
  • Open a terminal at the root of the project and change into the terraform folder with cd terraform.
  • (Optional) Set an alias: alias tf='terraform'
  • Initialise Terraform: tf init
  • Plan the infrastructure: tf plan
  • Apply the changes: tf apply

Set up Airflow and Metabase

  • Please confirm that the following environment variables are configured in .env in the root directory of the project:
    • AIRFLOW_UID. The default value is 50000.
    • GCP_PROJECT_ID. This should be set in the Create a Google Cloud Project section.
    • GCP_IMDB_BUCKET=imdb_datalake_<GCP project id>
    • GCP_PROJECT_REGION. The region where Dataproc is deployed.
    • GCP_IMDB_WH_DATASET=imdb_analytics
    • GCP_IMDB_DATAPROC_TEMP_BUCKET. Check the Google Cloud Storage (GCS) section of the console for the name of the Dataproc temp bucket; that is the value you need here.
  • Run docker-compose up.
  • Access the Airflow dashboard by visiting http://localhost:8080/ in your web browser. The interface will resemble the following. Log in with airflow as both the username and password. An optional API check is sketched after the screenshot below.

Airflow
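
If you prefer to confirm from the command line that Airflow is up, the snippet below lists the registered DAGs through Airflow's stable REST API. It assumes the basic-auth API backend is enabled, as in the stock Airflow docker-compose setup, and uses the default airflow/airflow credentials.

    import requests

    # Assumes the basic-auth API backend and the default airflow/airflow login.
    response = requests.get(
        "http://localhost:8080/api/v1/dags",
        auth=("airflow", "airflow"),
        timeout=10,
    )
    response.raise_for_status()

    for dag in response.json()["dags"]:
        print(dag["dag_id"], "(paused)" if dag["is_paused"] else "(active)")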

  • Visit http://localhost:1460 in your web browser to access the Metabase dashboard. The interface will resemble the following. You will need to sign up to use the UI.

Metabase

Data Ingestion

Once you've completed all the steps outlined in the previous section, you should be able to view the Airflow dashboard in your web browser. The list of DAGs will be displayed as shown below.

DAGS

Below is the DAG's graph. DAG Graph

To run the DAG, click on the play button. Run Graph

Data Transformation

The IMDB DAG manages this process. All transformation scripts are located in the data_airflow/dags/scripts folder. Airflow copies these scripts to Google Cloud Storage (GCS) and submits transformation jobs to Dataproc, as sketched below.
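
The sketch below is not the repo's actual DAG code; it only illustrates the copy-to-GCS-then-submit-to-Dataproc pattern using the Airflow Google provider operators. The script path, cluster name, and task ids are assumptions, while the project, region, and bucket come from the environment variables described earlier.

    import os
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
    from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator

    PROJECT_ID = os.environ["GCP_PROJECT_ID"]
    REGION = os.environ["GCP_PROJECT_REGION"]
    BUCKET = os.environ["GCP_IMDB_BUCKET"]

    with DAG(
        dag_id="imdb_transform_sketch",          # hypothetical name
        schedule_interval="0 13 * * *",
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        # Copy a local PySpark script into the data lake bucket.
        upload_script = LocalFilesystemToGCSOperator(
            task_id="upload_transform_script",
            src="dags/scripts/transform.py",     # hypothetical script path
            dst="scripts/transform.py",
            bucket=BUCKET,
        )

        # Submit the script that now lives in GCS as a Dataproc PySpark job.
        submit_job = DataprocSubmitJobOperator(
            task_id="submit_transform_job",
            project_id=PROJECT_ID,
            region=REGION,
            job={
                "reference": {"project_id": PROJECT_ID},
                "placement": {"cluster_name": "imdb-dataproc-cluster"},  # hypothetical
                "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/scripts/transform.py"},
            },
        )

        upload_script >> submit_job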

Data Visualization

Please watch the provided video tutorial to configure your Metabase database connection with BigQuery. You are free to customize your dashboard according to your preferences. Additionally, the PDF linked below contains a complete screenshot of the dashboard I created.

Dashboard

Dashboard

Contact

Twitter: @iamraphson

🦅
