victoray / DataWarehouse

Data Modeling

Building an ETL pipeline with Python.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

To run and preview the dataset you will need:

Python 3.6 or later
Amazon Redshift
Psycopg2

Installing

To create the tables and load the data from the JSON files into the Redshift data warehouse, run the following commands in order:

python create_tables.py
python etl.py
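
The two steps above (create the tables, then load the JSON records) can be sketched roughly as follows. This is only an illustration, not the project's actual code: the table, the column names (song_id, title), and the helper functions are hypothetical, and sqlite3 stands in for psycopg2 so the sketch runs without a Redshift cluster. With psycopg2 the flow of cursor, execute, and commit is the same, but placeholders are written %s instead of ?.

```python
import json
import sqlite3  # stand-in for psycopg2; assumption made so the sketch runs locally

# Hypothetical queries for illustration; the project's real queries live in its own module.
CREATE_QUERIES = [
    "CREATE TABLE IF NOT EXISTS songs (song_id TEXT PRIMARY KEY, title TEXT)",
]
INSERT_SONG = "INSERT INTO songs (song_id, title) VALUES (?, ?)"  # psycopg2 would use %s


def create_tables(conn):
    """Run every CREATE statement, then commit once."""
    cur = conn.cursor()
    for query in CREATE_QUERIES:
        cur.execute(query)
    conn.commit()


def load_songs(conn, json_lines):
    """Parse one JSON object per line and insert it into the songs table."""
    cur = conn.cursor()
    for line in json_lines:
        record = json.loads(line)
        cur.execute(INSERT_SONG, (record["song_id"], record["title"]))
    conn.commit()


conn = sqlite3.connect(":memory:")
create_tables(conn)  # step 1: what create_tables.py is responsible for
load_songs(conn, ['{"song_id": "S1", "title": "Example"}'])  # step 2: what etl.py is responsible for
rows = conn.execute("SELECT song_id, title FROM songs").fetchall()
```

The order matters: the load step assumes the tables already exist, which is why create_tables.py runs before etl.py.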

Running the tests

To test the pipeline, run:

jupyter notebook
# open test.ipynb, then run the cells

Python Scripts

  • subqueries.py: contains all the queries used by the ETL pipeline

Datasets

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID that follow the "TR" prefix. For example, here are the filepaths to two files in this dataset:

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
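
As a small sketch of that layout, the function below rebuilds a file's path from its track ID. The function name song_path and its root parameter are hypothetical and not part of the project; the rule itself (skip "TR", then use the next three letters as nested directories) is read off the two example paths above.

```python
from pathlib import PurePosixPath


def song_path(track_id, root="song_data"):
    """Return the partitioned path for a track ID such as 'TRABCEI128F424C983'."""
    # Characters 2..4 (after the "TR" prefix) name the three nested directories.
    first, second, third = track_id[2:5]
    return str(PurePosixPath(root, first, second, third, f"{track_id}.json"))


example = song_path("TRABCEI128F424C983")
```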

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These files simulate activity logs from a music streaming app according to specified configurations.

The log files in this dataset are partitioned by year and month. For example, here are the filepaths to two files in this dataset:

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
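
The same idea applies to the log files. The sketch below builds the path for a given day; log_path and its root parameter are hypothetical names used only for illustration, and the pattern is inferred from the two example paths above.

```python
from datetime import date


def log_path(day, root="log_data"):
    """Return the year/month-partitioned path of the log file for one day."""
    # Directories are the year and the zero-padded month; the file name is the ISO date.
    return f"{root}/{day.year}/{day.month:02d}/{day.isoformat()}-events.json"


example = log_path(date(2018, 11, 12))
```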

Database Design

The design adopts a star schema with four dimension tables (users, songs, artists, time) and one fact table (songplays). The tables are denormalized so that analytical queries stay simple and need few joins. The database is designed to help analyze user activity, such as the songs users play and the artists they listen to.

  • songplays: Records in log data associated with song plays
  • users: Users in the app
  • songs: Songs in database
  • artists: Artists in database
  • time: Timestamps of events broken down into specific units
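
To make the star schema concrete, here is a reduced sketch with one fact table and two of the dimensions. It is not the project's actual DDL: every column name is an assumption, sqlite3 stands in for Redshift, and the choice of time units (hour, day, week, month, year, weekday) and of epoch milliseconds as the timestamp format is a guess at what "specific units" means.

```python
import sqlite3
from datetime import datetime, timezone


def time_row(ts_ms):
    """Break an epoch-milliseconds timestamp into calendar units (UTC)."""
    t = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return (ts_ms, t.hour, t.day, t.isocalendar()[1], t.month, t.year, t.weekday())


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE time (start_time INTEGER PRIMARY KEY, hour INTEGER, day INTEGER,
                   week INTEGER, month INTEGER, year INTEGER, weekday INTEGER);
CREATE TABLE songplays (songplay_id INTEGER PRIMARY KEY, start_time INTEGER,
                        user_id INTEGER, song_id TEXT);
""")

# One illustrative event: a hypothetical user plays a hypothetical song.
ts = int(datetime(2018, 11, 12, 10, 30, tzinfo=timezone.utc).timestamp() * 1000)
conn.execute("INSERT INTO users VALUES (?, ?)", (1, "Ada"))
conn.execute("INSERT INTO time VALUES (?, ?, ?, ?, ?, ?, ?)", time_row(ts))
conn.execute("INSERT INTO songplays VALUES (?, ?, ?, ?)", (1, ts, 1, "S1"))

# Each dimension is a single join away from the fact table.
plays_per_hour = conn.execute("""
    SELECT u.first_name, t.hour, COUNT(*)
    FROM songplays AS sp
    JOIN users AS u ON u.user_id = sp.user_id
    JOIN time AS t ON t.start_time = sp.start_time
    GROUP BY u.first_name, t.hour
""").fetchall()
```

Because the fact table carries the keys of every dimension, a question such as "how many plays per user per hour" needs only direct joins from songplays, with no chains through intermediate tables.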

Authors

  • Victor A.
