sherrytp / overwatch-data-engineer


Overwatch League Stats

Background

Overwatch is a first-person shooter team game with a wide variety of heroes to choose from. Overwatch League (OWL) was the professional esports league of Overwatch. I have really enjoyed watching OWL games since the league started, and have compiled and uploaded the match stats to my Kaggle account. After initially struggling to analyze vast amounts of stock data, I eventually shifted focus to illustrating the data engineering process with smaller datasets. This approach effectively showcases the orchestration of diverse data sources.

The datasets, originally provided by IBM Watson, include players, head-to-head match-ups, and maps. The historical player statistics cover OWL games from 2018 to the present. Each record is centered on a player and includes the picked hero, team name, performance stats, match IDs, etc.

Table of Contents

  • Background
  • Problem Statement
  • Data Architecture
  • Getting Started
  • Dashboard
  • Future Work
  • Contributing

Problem Statement

The tricky part of analyzing e-sports matches is that you cannot really measure a player's skill solely with metrics and numbers. Granted, Overwatch is an FPS game, so aim and reaction time matter, but not to the degree they do in short-TTK (time-to-kill) FPS games, since team cooperation, hero picks, and game mechanics also come into play. Therefore, I intended to start the analysis with simple metrics of eliminations, deaths, and damage, but extended it into map analysis. More in-depth thoughts and analyses are welcome.

Data Architecture

The project is designed around a data pipeline that is expected to batch-process the OWL match data weekly. A few of the technologies used:

(Architecture diagram)

Getting Started

Prerequisites

I created this project in WSL 2 (Windows Subsystem for Linux) on Windows 10. To get a local copy up and running in the same environment, you'll need to:

Create a Google Cloud Project

  1. Go to Google Cloud and create a new project. The default project id is project-stocks.

  2. Go to IAM and create a Service Account with these roles:

    • BigQuery Admin
    • Compute Admin
    • Storage Admin
    • Storage Object Admin
    • Viewer

    WARNING: As a proof of concept, the project creates a service account with the BigQuery Admin, Service Account Key Admin, Storage Insights Collector Service, Storage Object Creator, and Storage Object Viewer permissions, which may not be the best security practice. Any suggestions are welcome on connecting GCE with Airflow in a dockerised Terraform setup for this specific use case.

  3. Download the Service Account credentials (JSON key) and place the file inside the terraform folder (a quick sanity check for the key is sketched after this list).

  4. In the Google Cloud console, enable the following APIs:

    • IAM API
    • IAM Service Account Credentials API
    • Cloud Dataproc API
    • Compute Engine API
    • Looker Studio API
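
Optionally, you can confirm that the downloaded key parses and points at the right project with a few lines of Python (a minimal sketch, assuming the google-auth package is installed; the key path below is a hypothetical name, so adjust it to wherever you saved the JSON file):

```python
# Optional sanity check: confirm the downloaded service-account key parses and
# identifies the expected project. Requires the google-auth package.
from google.oauth2 import service_account

KEY_PATH = "terraform/service-account.json"  # hypothetical filename/location

creds = service_account.Credentials.from_service_account_file(KEY_PATH)
print("service account:", creds.service_account_email)
print("project:", creds.project_id)
```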

Set up the infrastructure with Terraform on Google Cloud Platform

  1. Open the project folder in VSCode with WSL
  2. Open variables.tf and modify:
    • variable "project" to your own project id, maybe not neccessary
    • variable "region" to your project region
    • variable "credentials" to your credentials path
  3. Open the terminal in VSCode and change directory to terraform folder: cd terraform
  4. Initialize Terraform: terraform init
  5. Plan the infrastructure: terraform plan
  6. Apply the changes: terraform apply
  7. Refer to the detailed notes if you are not fully familiar with the Terraform setup.

If everything goes right, you now have a bucket on Google Cloud Storage called '<your_project>' and a dataset on BigQuery with the name defined in variables.tf.
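
As a quick check (not part of the project itself), the Google Cloud client libraries can confirm the bucket and dataset were created. This is a sketch only; the key path, bucket, and dataset names are assumptions and should be replaced with the values from your variables.tf:

```python
# Hypothetical post-`terraform apply` check: verify the GCS bucket and BigQuery
# dataset exist. Requires google-cloud-storage and google-cloud-bigquery.
# Names and paths below are assumptions -- use your own values.
from google.cloud import bigquery, storage

KEY_PATH = "terraform/service-account.json"   # assumed key location
BUCKET_NAME = "your_project_bucket"           # assumption: bucket created by Terraform
DATASET_ID = "your_project.owl_dataset"       # assumption: <project>.<dataset> from variables.tf

gcs = storage.Client.from_service_account_json(KEY_PATH)
print("bucket exists:", gcs.lookup_bucket(BUCKET_NAME) is not None)

bq = bigquery.Client.from_service_account_json(KEY_PATH)
try:
    bq.get_dataset(DATASET_ID)
    print("dataset exists:", DATASET_ID)
except Exception as exc:  # raises NotFound if the dataset is missing
    print("dataset not found:", exc)
```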

Run the Scheduled Pipeline with Mage

  1. I have already cloned the mage-zoomcamp folder into this repo, so change into it: cd mage-zoomcamp
  2. Rename the dev.env file to .env to set up the environment
  3. Update GOOGLE_PROJECT_ID and the other project ID and bucket settings in .env to match your setup
  4. Run docker-compose build
  5. Run docker-compose up and agree to the updates
  6. Go to http://localhost:6789/ and run the scheduled Mage pipeline owl_pipeline (a sketch of a GCS exporter block is shown after this list).
  7. Once the match_stats and map_stats datasets are successfully uploaded to GCS and BigQuery, close Mage and shut everything down with docker-compose down.
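
For orientation, a GCS export step in a Mage pipeline looks roughly like the sketch below. It is modelled on Mage's stock Google Cloud Storage data-exporter template rather than the actual owl_pipeline code, so import paths may differ slightly between Mage versions, and the bucket/object names are placeholders:

```python
# Sketch of a Mage data-exporter block, based on Mage's stock GCS exporter template
# (illustrative, not the actual owl_pipeline code). Credentials are read from
# io_config.yaml, which the .env file feeds via docker-compose.
from os import path

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.google_cloud_storage import GoogleCloudStorage
from pandas import DataFrame

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_match_stats_to_gcs(df: DataFrame, **kwargs) -> None:
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    bucket_name = 'your_bucket_name'          # placeholder
    object_key = 'owl/match_stats.parquet'    # placeholder

    GoogleCloudStorage.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        bucket_name,
        object_key,
    )
```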

(Optional) Airflow

Use the dockerised files (docker-compose up) to set up Airflow orchestration as a replacement for Mage. #WIP

(Optional) DBT

Refer to the instructions to set up dbt with BigQuery on Docker. The transformation can also be done within Mage or Spark; the PySpark transformation scripts are included below.
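
If you prefer to keep the transformation inside Mage instead of dbt or Spark, a transformer block is just a decorated Python function. This is an illustrative sketch only; the column names are assumptions:

```python
# Sketch of a Mage transformer block for cleaning the match stats inside Mage
# (illustrative only; column names are assumptions, not the project's schema).
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def clean_match_stats(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Drop rows without a player and normalise hero name casing (assumed columns).
    df = df.dropna(subset=['player_name'])
    df['hero_name'] = df['hero_name'].str.strip().str.title()
    return df
```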

Spark ETL Jobs

Follow along with the Spark notes if you are not sure how to run PySpark against Google Cloud Storage. You need to configure a few environment variables before it will run successfully; one possible local setup is sketched below.
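
A hypothetical local SparkSession configuration for reading gs:// paths outside Dataproc might look like the following; the GCS connector jar location and the key path are assumptions, so adjust them to your machine:

```python
# Hypothetical local SparkSession setup for reading gs:// paths outside Dataproc.
# The GCS connector jar location and key path are assumptions.
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

credentials_location = "terraform/service-account.json"  # assumed key path

conf = (
    SparkConf()
    .setMaster("local[*]")
    .setAppName("owl-local")
    .set("spark.jars", "lib/gcs-connector-hadoop3-2.2.5.jar")  # assumed jar location
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", credentials_location)
)

sc = SparkContext(conf=conf)
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")

spark = SparkSession.builder.config(conf=sc.getConf()).getOrCreate()
```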

  1. Follow the script spark_sql.py to load the data stored in the GCS bucket and BigQuery into a PySpark dataframe.
  2. Transform and clean the data using PySpark functions and write it back to BigQuery (a minimal sketch of steps 1–2 follows this list).
  3. Create a Dataproc cluster named owl-analysis-cluster in the same region as your GCS bucket, and use the spark_local.py script to submit jobs.
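
As a rough illustration of steps 1 and 2 (not the exact spark_sql.py), a PySpark job that reads from the GCS bucket, aggregates per player, and writes the result to BigQuery through the spark-bigquery connector could look like this; bucket, table, and column names are assumptions:

```python
# Minimal PySpark sketch of the GCS -> transform -> BigQuery flow (illustrative,
# not the exact spark_sql.py). On Dataproc the GCS and BigQuery connectors are
# preinstalled; locally, configure the GCS connector and credentials first.
# Bucket, table, and column names below are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("owl-analysis").getOrCreate()

# 1. Load the match stats previously uploaded to the GCS bucket.
match_stats = spark.read.parquet("gs://your_bucket_name/owl/match_stats.parquet")

# 2. A simple transformation: total eliminations, deaths, and damage per player.
player_totals = (
    match_stats
    .groupBy("player_name")
    .agg(
        F.sum("eliminations").alias("total_eliminations"),
        F.sum("deaths").alias("total_deaths"),
        F.sum("hero_damage_dealt").alias("total_damage"),
    )
)

# 3. Write the result to BigQuery via the spark-bigquery connector,
#    staging through a temporary GCS bucket.
(
    player_totals.write
    .format("bigquery")
    .option("table", "your_dataset.player_totals")
    .option("temporaryGcsBucket", "your_bucket_name")
    .mode("overwrite")
    .save()
)
```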

(Screenshots: Dataproc cluster and GCS bucket)

Dashboard

Looker Studio is a cloud-based business intelligence and data analytics platform, used here to visualize insights from the data.

Link to the dashboard: https://lookerstudio.google.com/reporting/ac7d4497-216a-4924-8755-68d058dd129e

  1. Pie chart of the maps played throughout the season - distribution of categorical data
  2. Total damage by player, sorted from highest to lowest
  3. Total eliminations versus deaths per game by hero - distribution of the data over time (rough pandas equivalents of these aggregations are sketched below)
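
For anyone reproducing these charts outside Looker Studio, the underlying aggregations roughly correspond to the group-bys below. This is a pandas sketch only; the file path and column names are assumptions based on the dataset description:

```python
# Rough pandas equivalents of the dashboard aggregations (illustrative only;
# the file path and column names are assumptions, not the project's schema).
import pandas as pd

stats = pd.read_csv("data/match_stats.csv")  # hypothetical local export

# 1. Distribution of maps played throughout the season (pie chart).
map_counts = (
    stats.drop_duplicates(subset=["esports_match_id", "map_name"])["map_name"].value_counts()
)

# 2. Total damage by player, sorted descending (bar chart).
damage_by_player = (
    stats.groupby("player_name")["hero_damage_dealt"].sum().sort_values(ascending=False)
)

# 3. Eliminations versus deaths per game by hero.
elim_death_by_hero = (
    stats.groupby(["hero_name", "esports_match_id"])[["eliminations", "deaths"]].sum()
)

print(map_counts.head(), damage_by_player.head(), elim_death_by_hero.head(), sep="\n\n")
```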

Future Work

  • More unit tests and automation tests on Mage
  • CI/CD
  • Machine Learning predictions on game matches

Contributing

Feel free to comment or contribute to this project or my dataset on Kaggle.

Much appreciation to Data Engineering Zoomcamp by DataTalksClub for the amazing course.


Languages

Python 76.7% · Dockerfile 16.4% · HCL 5.5% · Shell 1.2% · Jupyter Notebook 0.2%