sherrytp / overwatch-data-engineer


Overwatch League Stats

Background

Overwatch is a first-person shooter team game with a wide variety of heroes to choose from. Overwatch League (OWL) was the professional esports league of Overwatch. I have really enjoyed watching OWL games since the league started, and have compiled and uploaded the match stats to my Kaggle account. After initially struggling to analyze vast amounts of stock data, I eventually shifted focus to illustrating the data engineering process with smaller datasets. This approach effectively showcases the orchestration of diverse data sources.

The datasets, originally provided by IBM Watson, include players, head-to-head match-ups, and maps. The historical player statistics cover OWL games from 2018 to the present. Each record is centered on a player and includes the picked hero, team name, performance stats, match IDs, etc.

Table of Contents

  • Background
  • Problem Statement
  • Data Architecture
  • Getting Started
  • Dashboard
  • Future Work
  • Contributing

Problem Statement

The tricky part of analyzing e-sports matches is that you cannot really measure a player's skill solely with metrics and numbers. Granted, Overwatch is an FPS game, so aim and reaction time matter, but not to the degree they do in short-TTK (time-to-kill) FPS games, since team cooperation, hero picks, and game mechanics also come into play. Therefore, I intended to start the analysis with simple metrics of eliminations, deaths, and damage, but extended it into map analysis. More in-depth thoughts and analyses are welcome.

Data Architecture

The project is designed around a data pipeline that is expected to batch-process the OWL match data weekly. A few of the technologies used:

(Architecture diagram)

Getting Started

Prerequisites

I created this project in WSL 2 (Windows Subsystem for Linux) on Windows 10. To get a local copy up and running in the same environment, you'll need to:

Create a Google Cloud Project

  1. Go to Google Cloud and create a new project. The default project id is project-stocks.

  2. Go to IAM and create a Service Account with these roles:

    • BigQuery Admin
    • Compute Admin
    • Storage Admin
    • Storage Object Admin
    • Viewer

    WARNING: As a proof of concept, the project creates a service account with the BigQuery Admin, Service Account Key Admin, Storage Insights Collector Service, Storage Object Creator, and Storage Object Viewer permissions, which may not be the best security practice. Any suggestions are welcome on connecting GCE with Airflow in a dockerised Terraform setup for this specific use case.

  3. Download the Service Account credentials (JSON key) and place the file inside the terraform folder (a quick sanity check for the key is sketched after this list).

  4. In the Google Cloud console, enable the following APIs:

    • IAM API
    • IAM Service Account Credentials API
    • Cloud Dataproc API
    • Compute Engine API
    • Looker Studio API
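
Optionally, you can confirm that the downloaded key parses and points at the right project with a few lines of Python (a minimal sketch, assuming the google-auth package is installed; the key path below is a hypothetical name, so adjust it to wherever you saved the JSON file):

```python
# Optional sanity check: confirm the downloaded service-account key parses and
# identifies the expected project. Requires the google-auth package.
from google.oauth2 import service_account

KEY_PATH = "terraform/service-account.json"  # hypothetical filename/location

creds = service_account.Credentials.from_service_account_file(KEY_PATH)
print("service account:", creds.service_account_email)
print("project:", creds.project_id)
```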

Set up the infrastructure with Terraform on Google Cloud Platform

  1. Open the project folder in VSCode with WSL
  2. Open variables.tf and modify:
    • variable "project" to your own project id, maybe not neccessary
    • variable "region" to your project region
    • variable "credentials" to your credentials path
  3. Open the terminal in VSCode and change directory to terraform folder: cd terraform
  4. Initialize Terraform: terraform init
  5. Plan the infrastructure: terraform plan
  6. Apply the changes: terraform apply
  7. Refer to the detailed notes if you are not fully familiar with the Terraform setup.

If everything goes right, you now have a bucket on Google Cloud Storage called '<your_project>' and a dataset on BigQuery with the name defined in variables.tf.
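
As a quick check (not part of the project itself), the Google Cloud client libraries can confirm the bucket and dataset were created. This is a sketch only; the key path, bucket, and dataset names are assumptions and should be replaced with the values from your variables.tf:

```python
# Hypothetical post-`terraform apply` check: verify the GCS bucket and BigQuery
# dataset exist. Requires google-cloud-storage and google-cloud-bigquery.
# Names and paths below are assumptions -- use your own values.
from google.cloud import bigquery, storage

KEY_PATH = "terraform/service-account.json"   # assumed key location
BUCKET_NAME = "your_project_bucket"           # assumption: bucket created by Terraform
DATASET_ID = "your_project.owl_dataset"       # assumption: <project>.<dataset> from variables.tf

gcs = storage.Client.from_service_account_json(KEY_PATH)
print("bucket exists:", gcs.lookup_bucket(BUCKET_NAME) is not None)

bq = bigquery.Client.from_service_account_json(KEY_PATH)
try:
    bq.get_dataset(DATASET_ID)
    print("dataset exists:", DATASET_ID)
except Exception as exc:  # raises NotFound if the dataset is missing
    print("dataset not found:", exc)
```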

Run the Scheduled Pipeline with Mage

  1. I have already cloned the mage-zoomcamp folder into this repo, so change into it: cd mage-zoomcamp
  2. Rename the dev.env file to .env to set up the environment
  3. Update GOOGLE_PROJECT_ID and the other project ID and bucket settings in .env to match your setup
  4. Run docker-compose build
  5. Run docker-compose up and agree to the updates
  6. Go to http://localhost:6789/ and run the scheduled Mage pipeline owl_pipeline (a sketch of a GCS exporter block is shown after this list).
  7. Once the match_stats and map_stats datasets are successfully uploaded to GCS and BigQuery, close Mage and shut everything down with docker-compose down.
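
For orientation, a GCS export step in a Mage pipeline looks roughly like the sketch below. It is modelled on Mage's stock Google Cloud Storage data-exporter template rather than the actual owl_pipeline code, so import paths may differ slightly between Mage versions, and the bucket/object names are placeholders:

```python
# Sketch of a Mage data-exporter block, based on Mage's stock GCS exporter template
# (illustrative, not the actual owl_pipeline code). Credentials are read from
# io_config.yaml, which the .env file feeds via docker-compose.
from os import path

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.google_cloud_storage import GoogleCloudStorage
from pandas import DataFrame

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_match_stats_to_gcs(df: DataFrame, **kwargs) -> None:
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    bucket_name = 'your_bucket_name'          # placeholder
    object_key = 'owl/match_stats.parquet'    # placeholder

    GoogleCloudStorage.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        bucket_name,
        object_key,
    )
```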

(Optional) Airflow

Use the dockerised files (docker-compose up) to set up Airflow orchestration as a replacement for Mage. #WIP

(Optional) DBT

Refer to the instructions to set up dbt with BigQuery on Docker. The transformation can also be done within Mage or Spark; the PySpark transformation scripts are included below.
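
If you prefer to keep the transformation inside Mage instead of dbt or Spark, a transformer block is just a decorated Python function. This is an illustrative sketch only; the column names are assumptions:

```python
# Sketch of a Mage transformer block for cleaning the match stats inside Mage
# (illustrative only; column names are assumptions, not the project's schema).
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def clean_match_stats(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Drop rows without a player and normalise hero name casing (assumed columns).
    df = df.dropna(subset=['player_name'])
    df['hero_name'] = df['hero_name'].str.strip().str.title()
    return df
```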

Spark ETL Jobs

Follow along with the Spark notes if you are not sure how to run PySpark against Google Cloud Storage. You need to configure a few environment variables before it will run successfully; one possible local setup is sketched below.
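
A hypothetical local SparkSession configuration for reading gs:// paths outside Dataproc might look like the following; the GCS connector jar location and the key path are assumptions, so adjust them to your machine:

```python
# Hypothetical local SparkSession setup for reading gs:// paths outside Dataproc.
# The GCS connector jar location and key path are assumptions.
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

credentials_location = "terraform/service-account.json"  # assumed key path

conf = (
    SparkConf()
    .setMaster("local[*]")
    .setAppName("owl-local")
    .set("spark.jars", "lib/gcs-connector-hadoop3-2.2.5.jar")  # assumed jar location
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", credentials_location)
)

sc = SparkContext(conf=conf)
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")

spark = SparkSession.builder.config(conf=sc.getConf()).getOrCreate()
```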

  1. Follow the script spark_sql.py to load the data stored in the GCS bucket and BigQuery into a PySpark dataframe.
  2. Transform and clean the data using PySpark functions and write it back to BigQuery (a minimal sketch of steps 1–2 follows this list).
  3. Create a Dataproc cluster named owl-analysis-cluster in the same region as your GCS bucket, and use the spark_local.py script to submit jobs.
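
As a rough illustration of steps 1 and 2 (not the exact spark_sql.py), a PySpark job that reads from the GCS bucket, aggregates per player, and writes the result to BigQuery through the spark-bigquery connector could look like this; bucket, table, and column names are assumptions:

```python
# Minimal PySpark sketch of the GCS -> transform -> BigQuery flow (illustrative,
# not the exact spark_sql.py). On Dataproc the GCS and BigQuery connectors are
# preinstalled; locally, configure the GCS connector and credentials first.
# Bucket, table, and column names below are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("owl-analysis").getOrCreate()

# 1. Load the match stats previously uploaded to the GCS bucket.
match_stats = spark.read.parquet("gs://your_bucket_name/owl/match_stats.parquet")

# 2. A simple transformation: total eliminations, deaths, and damage per player.
player_totals = (
    match_stats
    .groupBy("player_name")
    .agg(
        F.sum("eliminations").alias("total_eliminations"),
        F.sum("deaths").alias("total_deaths"),
        F.sum("hero_damage_dealt").alias("total_damage"),
    )
)

# 3. Write the result to BigQuery via the spark-bigquery connector,
#    staging through a temporary GCS bucket.
(
    player_totals.write
    .format("bigquery")
    .option("table", "your_dataset.player_totals")
    .option("temporaryGcsBucket", "your_bucket_name")
    .mode("overwrite")
    .save()
)
```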

(Screenshots: Dataproc cluster and GCS bucket)

Dashboard

Looker Studio is a cloud-based business intelligence and data analytics platform, used here to visualize insights from the data.

Link to the dashboard: https://lookerstudio.google.com/reporting/ac7d4497-216a-4924-8755-68d058dd129e

  1. Pie chart of the maps played throughout the season - distribution of categorical data
  2. Total damage by player, sorted from highest to lowest
  3. Total eliminations versus deaths per game by hero - distribution of the data over time (rough pandas equivalents of these aggregations are sketched below)
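
For anyone reproducing these charts outside Looker Studio, the underlying aggregations roughly correspond to the group-bys below. This is a pandas sketch only; the file path and column names are assumptions based on the dataset description:

```python
# Rough pandas equivalents of the dashboard aggregations (illustrative only;
# the file path and column names are assumptions, not the project's schema).
import pandas as pd

stats = pd.read_csv("data/match_stats.csv")  # hypothetical local export

# 1. Distribution of maps played throughout the season (pie chart).
map_counts = (
    stats.drop_duplicates(subset=["esports_match_id", "map_name"])["map_name"].value_counts()
)

# 2. Total damage by player, sorted descending (bar chart).
damage_by_player = (
    stats.groupby("player_name")["hero_damage_dealt"].sum().sort_values(ascending=False)
)

# 3. Eliminations versus deaths per game by hero.
elim_death_by_hero = (
    stats.groupby(["hero_name", "esports_match_id"])[["eliminations", "deaths"]].sum()
)

print(map_counts.head(), damage_by_player.head(), elim_death_by_hero.head(), sep="\n\n")
```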

Future Work

  • More unit tests and automation tests on Mage
  • CI/CD
  • Machine Learning predictions on game matches

Contributing

Feel free to comment or contribute to this project or my dataset on Kaggle.

Much appreciation to Data Engineering Zoomcamp by DataTalksClub for the amazing course.


Languages

Python 76.7% · Dockerfile 16.4% · HCL 5.5% · Shell 1.2% · Jupyter Notebook 0.2%