seokyim8/Steam_data_pipeline

aws data-science database pipeline python scrapy scrapy-crawler steam-api apache-superset ec2 mysql rds selenium docker

Creator: Seok Yim (Noah)

Do you want to be the PIONEER of soon-to-POP-OFF games? Then you're gonna like this...

Title: Steam Data Pipeline

Project Summary:

A data pipeline that regularly scrapes, cleans, stores, and publishes data for newly released games on Steam. Visualization is handled by a publicly accessible Apache Superset dashboard.
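To give a rough idea of what the "cleans" step could look like, here is a minimal sketch. The field names (`title`, `price`) and the US-dollar price format are assumptions made for illustration; the real pipeline may use different fields and rules.

```python
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[float]:
    """Turn a display price such as '$9.99' into a float.

    Returns None when the text has no numeric price (e.g. 'Free To Play').
    Assumes US-style formatting with '.' as the decimal separator.
    """
    if not text:
        return None
    # Keep only digits and the decimal point; this drops '$' and ','.
    digits = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    try:
        return float(digits)
    except ValueError:
        return None


def clean_record(raw: dict) -> dict:
    """Normalize one scraped row. The keys used here are hypothetical."""
    # Collapse runs of whitespace/newlines left over from the HTML.
    title = " ".join((raw.get("title") or "").split())
    return {"title": title, "price": parse_price(raw.get("price"))}
```

A cleaned record like this is easier to insert into a typed MySQL column than the raw scraped strings.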

*** Preview ***

Website link:
http://18.212.126.33:8080/superset/dashboard/1/?standalone=3&show_filters=1

Authentication for anonymous users (Anyone can view it with these credentials):
ID: public
password: public

Description:

I frequently saw websites and projects with Steam-related data for popular (top 100) games, but never one focused primarily on new releases on Steam. So I decided to make one myself.

Technologies Used:

Python, MySQL, AWS (EC2, RDS), Docker, Scrapy, Apache Superset, Selenium

Steps Taken:

Created a Scrapy project that scrapes data from the official Steam website (https://store.steampowered.com/search/?sort_by=Released_DESC&supportedlang=english).
Added Selenium to deal with infinite scrolling. Created a Python scheduler with APScheduler along with Python asyncio.
Launched an EC2 instance to keep the program running and an RDS instance to host the MySQL database.
Created a Docker image that downloads the Python dependencies along with the Chrome browser.
On EC2, initialized the containerized project along with the containerized Apache Superset image.
Made the dashboard publicly available.
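
The infinite-scrolling step above can be sketched roughly as follows. This is not the project's actual code, just one common way to do it: keep scrolling to the bottom until the page height stops growing. The function name, `pause_seconds`, and `max_rounds` are made up for this example; `driver` is assumed to be a Selenium WebDriver (anything with an `execute_script` method works).

```python
import time


def scroll_to_bottom(driver, pause_seconds=1.0, max_rounds=50):
    """Scroll until the page stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom so the page loads its next batch of results.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give the new results time to load
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we are at the end
        last_height = new_height
    return rounds
```

With Selenium installed, this would be called right after `driver.get(url)` on a `webdriver.Chrome()` instance. The `max_rounds` cap keeps the loop from running forever on a page that never stops loading.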

Final Product:

- A dashboard/BI tool that updates every day at 7:30 am EST (with a couple of extra updates during the day) with 1,000 entries from Steam.
- Contains visualizations of the data that help people understand the latest trends in games.
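
For the curious, the daily 7:30 am timing can be illustrated with plain standard-library Python. The project itself uses APScheduler for this, so the snippet below is only a stand-in to show the timing logic, and `next_daily_run` is a made-up name. To get Eastern time, `now` would need to be a timezone-aware datetime (for example one built with `zoneinfo.ZoneInfo("America/New_York")`, which is an assumption on my part).

```python
from datetime import datetime, time, timedelta


def next_daily_run(now: datetime, run_at: time = time(7, 30)) -> datetime:
    """Return the next datetime at which the daily job should fire."""
    # Today's run time, in the same timezone as `now` (if it has one).
    candidate = datetime.combine(now.date(), run_at, tzinfo=now.tzinfo)
    if candidate <= now:
        # Today's slot has already passed, so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


# Example: at 6:00 the next run is the same day; at 23:00 it is the next day.
print(next_daily_run(datetime(2024, 1, 1, 6, 0)))
print(next_daily_run(datetime(2024, 1, 1, 23, 0)))
```

A scheduler library does the same calculation internally and then sleeps until that moment, which is why the dashboard refreshes at a fixed time each day.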

About

A data pipeline that provides web-scraped information from Steam


Languages

Python 96.7%, Dockerfile 3.3%