Building an ETL pipeline with Python.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
To run the pipeline and preview the dataset, you will need:
Python 3.6 or later
Amazon Redshift
psycopg2
To create the tables and load the data from the JSON files into the Redshift data warehouse, run the following commands:
python create_tables.py
python etl.py
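Under the hood, both scripts follow the same pattern: read the cluster connection settings, open a psycopg2 connection, and execute the statements defined in subqueries.py. Below is a minimal sketch of that pattern, assuming a dwh.cfg file whose [CLUSTER] section lists host, dbname, user, password and port in that order; the config file layout and the create_table_queries name are illustrative assumptions, not the project's exact code.
# connect to Redshift and run the CREATE TABLE statements -- illustrative sketch
import configparser
import psycopg2
from subqueries import create_table_queries  # hypothetical list of CREATE statements

config = configparser.ConfigParser()
config.read('dwh.cfg')

conn = psycopg2.connect(
    'host={} dbname={} user={} password={} port={}'.format(*config['CLUSTER'].values())
)
cur = conn.cursor()

for query in create_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()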
To test the pipeline, run:
jupyter notebook
# then open test.ipynb and run its cells
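A typical check inside the notebook is a simple row count per table to confirm the load succeeded. The cell below is illustrative and reuses the same dwh.cfg connection settings assumed in the sketch above.
# count the rows in each table -- illustrative sanity check
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read('dwh.cfg')
conn = psycopg2.connect(
    'host={} dbname={} user={} password={} port={}'.format(*config['CLUSTER'].values())
)
cur = conn.cursor()
for table in ['songplays', 'users', 'songs', 'artists', 'time']:
    cur.execute('SELECT COUNT(*) FROM {}'.format(table))
    print(table, cur.fetchone()[0])
conn.close()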
- subqueries.py: contains all the ETL pipeline queries; an illustrative example follows
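As an illustration of the kind of statement kept in subqueries.py, a CREATE TABLE statement for the fact table might look like the following; the actual column definitions in the project may differ.
# illustrative example of a statement kept in subqueries.py
songplay_table_create = '''
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0,1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
'''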
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
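Each song file holds a single JSON object. A quick way to inspect one is shown below; the field names used (song_id, title, artist_name) are typical of the Million Song Dataset subset and appear here only for illustration.
# load one song file and print a few fields -- illustrative
import json

with open('song_data/A/B/C/TRABCEI128F424C983.json') as f:
    song = json.load(f)
print(song.get('song_id'), song.get('title'), song.get('artist_name'))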
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset:
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
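Assuming each line of a log file is a separate JSON event (the usual layout for these simulator logs), one day of events can be read as shown below; the page field is used only as an example.
# read one day of events -- illustrative
import json

with open('log_data/2018/11/2018-11-12-events.json') as f:
    events = [json.loads(line) for line in f]
print(len(events), 'events; first page:', events[0].get('page'))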
The design adopts a star schema with dimension tables (users, songs, artists, time) and a fact table (songplays). Denormalizing the tables in this way simplifies analytical queries and reduces the number of joins. The database is designed to answer questions about user activity, such as which songs users play and which artists they listen to; an illustrative query follows the table list below.
- songplays: Records in log data associated with song plays
- users: Users in the app
- songs: Songs in the database
- artists: Artists in the database
- time: Timestamps of events broken down into specific units
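As an example of the kind of question the star schema makes cheap to answer, a single join against the fact table returns the most-played songs. The query below is illustrative and not part of the project scripts; run it with a psycopg2 cursor set up as in the earlier sketches.
# top ten most-played songs -- illustrative analytical query
top_songs_sql = '''
SELECT s.title, COUNT(*) AS plays
FROM songplays sp
JOIN songs s ON sp.song_id = s.song_id
GROUP BY s.title
ORDER BY plays DESC
LIMIT 10;
'''
cur.execute(top_songs_sql)
for title, plays in cur.fetchall():
    print(title, plays)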
- Victor A.