A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app.
Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in their app.
This project models the data with Postgres and builds an ETL pipeline in Python.
Prerequisites:
- Python 3
- PostgreSQL
- virtualenv
First, create a virtualenv, activate it, and install the requirements:

```bash
virtualenv .venv --python=`which python3`
source .venv/bin/activate
pip install -r requirements.txt
```
The first dataset is a subset of real data from the Million Song Dataset. Each file contains JSON metadata about one song and its artist, for example:
```json
{
    "num_songs": 1,
    "artist_id": "AR36F9J1187FB406F1",
    "artist_latitude": 56.27609,
    "artist_longitude": 9.51695,
    "artist_location": "Denmark",
    "artist_name": "Bombay Rockers",
    "song_id": "SOBKWDJ12A8C13B2F3",
    "title": "Wild Rose (Back 2 Basics Mix)",
    "duration": 230.71302,
    "year": 0
}
```
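For illustration, a song file like this can be read with pandas and split into the records that feed the songs and artists tables defined below (a minimal sketch; the file path is a placeholder):

```python
import pandas as pd

# Each song file holds a single JSON object on one line.
df = pd.read_json("data/song_data/path/to/song_file.json", lines=True)

# Columns that feed the songs and artists dimension tables.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
artist_data = df[["artist_id", "artist_name", "artist_location",
                  "artist_latitude", "artist_longitude"]].values[0].tolist()
```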
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. Each line of a log file is one event, for example:
```json
{
    "artist": "Mr Oizo",
    "auth": "Logged In",
    "firstName": "Kaylee",
    "gender": "F",
    "itemInSession": 3,
    "lastName": "Summers",
    "length": 144.03873,
    "level": "free",
    "location": "Phoenix-Mesa-Scottsdale, AZ",
    "method": "PUT",
    "page": "NextSong",
    "registration": 1540344794796.0,
    "sessionId": 139,
    "song": "Flat 55",
    "status": 200,
    "ts": 1541106352796,
    "userAgent": "\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"",
    "userId": "8"
}
```
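Only NextSong events describe song plays, so a natural first transformation is to filter on page and convert the millisecond ts field into a timestamp for the time table (a sketch, assuming pandas ≥ 1.1 for isocalendar; the file path is a placeholder):

```python
import pandas as pd

# Log files are JSON-lines: one event object per line.
df = pd.read_json("data/log_data/path/to/log_file.json", lines=True)

# Keep only song plays; other pages (Home, Login, ...) are not loaded.
df = df[df["page"] == "NextSong"]

# ts is in milliseconds since the epoch.
t = pd.to_datetime(df["ts"], unit="ms")
time_data = list(zip(t, t.dt.hour, t.dt.day, t.dt.isocalendar().week,
                     t.dt.month, t.dt.year, t.dt.day_name()))
```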
These tables form a star schema: songplays is the fact table, and users, songs, artists, and time are the dimension tables.

```sql
songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time timestamp NOT NULL,
    user_id int NOT NULL,
    level varchar,
    song_id varchar NOT NULL,
    artist_id varchar NOT NULL,
    session_id int,
    location varchar,
    user_agent varchar
);

users (
    user_id int PRIMARY KEY,
    first_name varchar,
    last_name varchar,
    gender varchar,
    level varchar
);

songs (
    song_id varchar PRIMARY KEY,
    title varchar NOT NULL,
    artist_id varchar NOT NULL,
    year integer,
    duration numeric
);

artists (
    artist_id varchar PRIMARY KEY,
    name varchar NOT NULL,
    location varchar,
    latitude numeric,
    longitude numeric
);

time (
    time_id SERIAL PRIMARY KEY,
    start_time timestamp,
    hour integer,
    day integer,
    week integer,
    month integer,
    year integer,
    weekday varchar
);
```
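Two queries carry the interesting logic: the log files identify a song only by title, artist name, and duration, so songplays rows need a lookup against songs and artists, and a returning user may switch level, which an upsert on users can handle. The exact statements live in sql_queries.py; the versions below are assumptions about how they could look:

```python
# Hypothetical contents of sql_queries.py (names and conflict handling
# are assumptions, not the project's confirmed statements).
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
"""

# Logs carry no song_id/artist_id, so match on title, name, and duration.
song_select = """
    SELECT s.song_id, a.artist_id
    FROM songs s
    JOIN artists a ON s.artist_id = a.artist_id
    WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""
```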
The first step is to run the script create_tables.py, which imports the queries stored in sql_queries.py and executes them to drop old tables and create the new schema.
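In outline, create_tables.py does something like this (a sketch: the query-list names and connection credentials are assumptions, and the sparkifydb database is assumed to exist):

```python
import psycopg2

# Assumed names for the query lists in sql_queries.py.
from sql_queries import create_table_queries, drop_table_queries

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()

# Drop any old tables, then create the schema fresh.
for query in drop_table_queries + create_table_queries:
    cur.execute(query)

conn.close()
```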
The script etl.py then collects the raw data from both datasets and populates the new database with the transformed data.
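The driver logic in etl.py amounts to walking each data directory and applying a per-file processing function; a sketch of that shape (function names are stand-ins, not the script's confirmed API):

```python
import glob
import os

def get_files(filepath):
    """Collect every .json file under the given directory tree."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(os.path.abspath(f)
                         for f in glob.glob(os.path.join(root, "*.json")))
    return all_files

def process_data(cur, conn, filepath, func):
    """Apply func to every file under filepath, committing as we go."""
    for i, path in enumerate(get_files(filepath), start=1):
        func(cur, path)
        conn.commit()
        print(f"{i} files processed in {filepath}")
```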
The project contains the following files:

- create_tables.py: drops old tables and creates the new schema
- sql_queries.py: SQL queries used by create_tables.py
- a Jupyter notebook used to validate the data
- a Jupyter notebook used to develop the ETL process
- etl.py: script that executes the whole ETL process
- requirements.txt: requirements to run the project
First, let's activate the virtualenv:

```bash
source .venv/bin/activate
```
The create_tables.py script drops and recreates the tables. Run it to reset the tables before each run of the ETL script:

```bash
python create_tables.py
```
The etl.py script reads and processes the files from song_data and log_data and loads them into the tables:

```bash
python etl.py
```
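To sanity-check the load, a few row counts go a long way (connection details are placeholders matching the earlier sketch):

```python
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Table names are fixed strings from the schema, so interpolation is safe here.
for table in ("songplays", "users", "songs", "artists", "time"):
    cur.execute(f"SELECT COUNT(*) FROM {table};")
    print(table, cur.fetchone()[0])

conn.close()
```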