Data-modeling-Postgres

Introduction

The project is one of the many projects I worked on through my training at Udacity as a data engineer. In this project, I'll model user activity data for a music streaming app called Sparkify. I’ll create a relational database and ETL pipeline designed to optimize queries for understanding what songs users are listening to. In PostgreSQL, I will also define Fact and dimension tables and insert data into the new tables.

Dataset & Data model

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset: song_data/A/B/C/TRABCEI128F424C983.json song_data/A/A/B/TRAABJL12903CDCF1A.json And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.

The log files in the dataset I'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset :

log_data/2018/11/2018-11-12-events.json log_data/2018/11/2018-11-13-events.json And below is an example of what the data in a log file, 2018-11-12-events.json, looks like:

Data model

ERD

Data Dictionary

songs : dimension table about songs

Column's name	type	example
song_id	varchar `PRIMARY KEY`	"SONHOTT12A8C13493C"
title	varchar	"Something Girls"
artist_id	varchar	"AR7G5I41187FB4CE6C"
year	integer	1982
duration	float	233.40363

artists : dimension table about artists

Column's name	type	example
artist_id	varchar `PRIMARY KEY`	"ARGSJW91187B9B1D6B"
name	varchar	"JennyAnyKind"
location	varchar	"North Carolina"
latitude	float	35.21962
longitude	float	-80.01955

time : dimension table about the date-time

Column's name	type	example
start_time	timestamp `PRIMARY KEY`	"2018-11-11 02:33:56.796"
hour	integer	2
day	varchar	11
week	integer	45
month	integer	11
year	integer	2018
dayofweek	integer	6

users : dimension table about users

Column's name	type	example
userId	integer `PRIMARY KEY`	69
firstName	varchar	"Anabelle"
lastName	varchar	"Simpson"
gender	varchar	"F"
lever	varchar	"free"

songplays : fact table about songplays

Column's name	type	example
songplay_id	serial `PRIMARY KEY`	1
start_time	timestamp `FOREIGN KEY` references start_time in time table	"2018-11-11 02:33:56.796"
userId	integer `FOREIGN KEY` references userId in users table	69
song_id	varchar `FOREIGN KEY` references song_id in songs table	null
artist_id	varchar `FOREIGN KEY` references artist_id in artists table	null
level	varchar	"free"
session_id	integer	455
location	varchar	"Philadelphia-Camden-Wilmington, PA-NJ-DE-MD"
user_agent	varchar	"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"""

ETL process

To fill in the database created above with these local JSON files, we need to create an ETL pipeline using pandas to parse files and extract the needed fields for each table. Steps to build ETL pipeline for to extract 1 file and fill in 1 table:

Create function get_files to return json song files
Use pandas to read the json file
Select columns for song ID, title, artist ID, year, and duration
Select the record in the dataframe and convert it to a list
Query the insert statement with that one record to insert the song into the song table

The same process is applied to the other tables with other fields extracted.

Note about the songplays table :

This one is a little more complicated since information from the songs table, artists table, and original log file are all needed for the songplays table. Since the log file does not specify an ID for either the song or the artist, you'll need to get the song ID and artist ID by querying the songs and artists tables to find matches based on song title, artist name, and song duration time. More details are within the notebook ( etl-proto.ipynb )

Code execution steps:

Run create_tables.py to drop if exists then create your database and tables.
Run etl.py to execute the ETL process on the whole datasets and load data into the different tables
Run test.ipynb to confirm the creation of the tables with the correct columns. Make sure to click "Restart kernel" to close the connection to the database after running this notebook, and the etl-proto.ipynb as well
EXAMPLE-QUERIES.ipynb to execute some queries for analysis purpose

omarfessi / data-modeling-Postgres