A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app.
Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in their app.
This project models the data with Postgres and builds an ETL pipeline in Python.
Prerequisites:
- Python 3
- PostgreSQL
- virtualenv
First, create a virtualenv, activate it, and install the requirements:

```bash
virtualenv .venv --python=`which python3`
source .venv/bin/activate
pip install -r requirements.txt
```
The first dataset is a subset of real data from the Million Song Dataset. Each file contains JSON metadata about one song and its artist, for example:
```json
{
    "num_songs": 1,
    "artist_id": "AR36F9J1187FB406F1",
    "artist_latitude": 56.27609,
    "artist_longitude": 9.51695,
    "artist_location": "Denmark",
    "artist_name": "Bombay Rockers",
    "song_id": "SOBKWDJ12A8C13B2F3",
    "title": "Wild Rose (Back 2 Basics Mix)",
    "duration": 230.71302,
    "year": 0
}
```
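For illustration, a song file like this can be read with pandas and split into the records that feed the songs and artists tables defined below (a minimal sketch; the file path is a placeholder):

```python
import pandas as pd

# Each song file holds a single JSON object on one line.
df = pd.read_json("data/song_data/path/to/song_file.json", lines=True)

# Columns that feed the songs and artists dimension tables.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
artist_data = df[["artist_id", "artist_name", "artist_location",
                  "artist_latitude", "artist_longitude"]].values[0].tolist()
```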
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. Each line of a log file is one event, for example:
```json
{
    "artist": "Mr Oizo",
    "auth": "Logged In",
    "firstName": "Kaylee",
    "gender": "F",
    "itemInSession": 3,
    "lastName": "Summers",
    "length": 144.03873,
    "level": "free",
    "location": "Phoenix-Mesa-Scottsdale, AZ",
    "method": "PUT",
    "page": "NextSong",
    "registration": 1540344794796.0,
    "sessionId": 139,
    "song": "Flat 55",
    "status": 200,
    "ts": 1541106352796,
    "userAgent": "\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"",
    "userId": "8"
}
```
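Only NextSong events describe song plays, so a natural first transformation is to filter on page and convert the millisecond ts field into a timestamp for the time table (a sketch, assuming pandas ≥ 1.1 for isocalendar; the file path is a placeholder):

```python
import pandas as pd

# Log files are JSON-lines: one event object per line.
df = pd.read_json("data/log_data/path/to/log_file.json", lines=True)

# Keep only song plays; other pages (Home, Login, ...) are not loaded.
df = df[df["page"] == "NextSong"]

# ts is in milliseconds since the epoch.
t = pd.to_datetime(df["ts"], unit="ms")
time_data = list(zip(t, t.dt.hour, t.dt.day, t.dt.isocalendar().week,
                     t.dt.month, t.dt.year, t.dt.day_name()))
```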
These tables form a star schema: songplays is the fact table, and users, songs, artists, and time are the dimension tables.

```sql
songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time timestamp NOT NULL,
    user_id int NOT NULL,
    level varchar,
    song_id varchar NOT NULL,
    artist_id varchar NOT NULL,
    session_id int,
    location varchar,
    user_agent varchar
);

users (
    user_id int PRIMARY KEY,
    first_name varchar,
    last_name varchar,
    gender varchar,
    level varchar
);

songs (
    song_id varchar PRIMARY KEY,
    title varchar NOT NULL,
    artist_id varchar NOT NULL,
    year integer,
    duration numeric
);

artists (
    artist_id varchar PRIMARY KEY,
    name varchar NOT NULL,
    location varchar,
    latitude numeric,
    longitude numeric
);

time (
    time_id SERIAL PRIMARY KEY,
    start_time timestamp,
    hour integer,
    day integer,
    week integer,
    month integer,
    year integer,
    weekday varchar
);
```
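Two queries carry the interesting logic: the log files identify a song only by title, artist name, and duration, so songplays rows need a lookup against songs and artists, and a returning user may switch level, which an upsert on users can handle. The exact statements live in sql_queries.py; the versions below are assumptions about how they could look:

```python
# Hypothetical contents of sql_queries.py (names and conflict handling
# are assumptions, not the project's confirmed statements).
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
"""

# Logs carry no song_id/artist_id, so match on title, name, and duration.
song_select = """
    SELECT s.song_id, a.artist_id
    FROM songs s
    JOIN artists a ON s.artist_id = a.artist_id
    WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""
```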
The first step is to run the script create_tables.py, which imports the queries stored in sql_queries.py and executes them to drop old tables and create the new schema.
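In outline, create_tables.py does something like this (a sketch: the query-list names and connection credentials are assumptions, and the sparkifydb database is assumed to exist):

```python
import psycopg2

# Assumed names for the query lists in sql_queries.py.
from sql_queries import create_table_queries, drop_table_queries

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()

# Drop any old tables, then create the schema fresh.
for query in drop_table_queries + create_table_queries:
    cur.execute(query)

conn.close()
```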
The script etl.py then collects the raw data from both datasets and populates the new database with the transformed data.
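The driver logic in etl.py amounts to walking each data directory and applying a per-file processing function; a sketch of that shape (function names are stand-ins, not the script's confirmed API):

```python
import glob
import os

def get_files(filepath):
    """Collect every .json file under the given directory tree."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(os.path.abspath(f)
                         for f in glob.glob(os.path.join(root, "*.json")))
    return all_files

def process_data(cur, conn, filepath, func):
    """Apply func to every file under filepath, committing as we go."""
    for i, path in enumerate(get_files(filepath), start=1):
        func(cur, path)
        conn.commit()
        print(f"{i} files processed in {filepath}")
```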
The project contains the following files:

- create_tables.py: drops old tables and creates the new schema
- sql_queries.py: SQL queries used by create_tables.py
- a Jupyter notebook used to validate the data
- a Jupyter notebook used to develop the ETL process
- etl.py: script that executes the whole ETL process
- requirements.txt: requirements to run the project
First, let's activate the virtualenv:

```bash
source .venv/bin/activate
```
The create_tables.py script drops and recreates the tables. Run it to reset the tables before each run of the ETL script:

```bash
python create_tables.py
```
The etl.py script reads and processes the files from song_data and log_data and loads them into the tables:

```bash
python etl.py
```
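To sanity-check the load, a few row counts go a long way (connection details are placeholders matching the earlier sketch):

```python
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Table names are fixed strings from the schema, so interpolation is safe here.
for table in ("songplays", "users", "songs", "artists", "time"):
    cur.execute(f"SELECT COUNT(*) FROM {table};")
    print(table, cur.fetchone()[0])

conn.close()
```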