Udacity Data Engineering Nanodegree Project 01 - ETL with Postgres

SparkifyDB ETL Pipeline

Purpose: to simplify analytics of Sparkify ("app") user song play activities with a Postgres database, ETL pipeline solution, and the Million Song Dataset ("music database").

Disclaimer: Sparkify is a fictional music streaming app. The Million Song Dataset however is real: http://millionsongdataset.com/ (freely-available collection of audio features and metadata for a million contemporary popular music tracks.)

Table of Contents

  1. Problem statement and proposal
  2. How to run the Python Scripts
  3. Explanation of the files in the repository
  4. Rationale of database schema design and ETL pipeline
  5. Example queries and results for the song play analysis
  6. Appendix

1. Problem statement and proposal

Problem statement: a startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

Proposed solution: create a Postgres database with tables designed to optimize queries on song play analysis, along with a database schema and ETL pipeline. The database and ETL pipeline can be tested by running queries provided by the Sparkify analytics team and comparing the results against expected output.

2. How to run the Python Scripts

The following processes are repeatable and re-runnable.

  1. Set up (or re-set up) the Postgres environment. This step creates the sparkifydb (STAR schema) database and the initial empty skeleton tables inside it: a fact table songplays, and 4 dimension tables: users, songs, artists, time.
python create_tables.py
  2. Run the ETL pipeline to populate the sparkifydb tables. This step extracts the music app data (data/song_data) and the Sparkify event log data (data/log_data), applies transformations, and loads the results into the sparkifydb database (our ultimate output).
python etl.py
  3. Run basic tests against sparkifydb by running the Jupyter Notebook test.ipynb from top to bottom. Any warnings or errors should be fixed; in the end we should no longer get any. (Remember to restart the kernel at the end to free up the session.) If we have done our job correctly, we should expect to see one record with non-null song_id and artist_id within the songplays table:
songplay_id: NextSong
start_time: 2018-11-21 21:56:47
user_id: 15
level: paid
song_id: SOZCTXZ12AB0182364
artist_id: AR5KOSW1187FB35FF4
session_id: 818
location: Chicago-Naperville-Elgin, IL-IN-WI
user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36"
  4. Run the example query notebook example_queries.ipynb to perform analysis (it contains sample SQL queries). At the bottom of that notebook you will see a SQL query that joins across all the normalised tables and generates the eventual denormalised view. Make sure to restart the kernel when you are done.

3. Explanation of the files in the repository

Song Dataset

Location: data/song_data.

This is a subset of real data from the Million Song Dataset (http://millionsongdataset.com/). Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are file paths to two files in this dataset.

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

I've also developed a JSON schema (at data_schema/song_schema.json) to help visualise the structure of a JSON record in data/song_data, plus potential JSON validation improvement ideas (e.g. data quality checks / bad data quarantine, etc.).
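For illustration, below is a minimal sketch (not the project's etl.py itself) of how one of these files could be loaded and split into the records that feed the songs and artists tables described in section 4, assuming pandas is available and each file holds a single JSON object on one line:

import pandas as pd

# Each song file holds one JSON object on a single line, so lines=True
# yields a one-row DataFrame.
df = pd.read_json("data/song_data/A/B/C/TRABCEI128F424C983.json", lines=True)

# Columns feeding the songs dimension table
song_record = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()

# Columns feeding the artists dimension table
artist_record = df[["artist_id", "artist_name", "artist_location",
                    "artist_latitude", "artist_longitude"]].values[0].tolist()

print(song_record)
print(artist_record)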

Log Dataset

Location: data/log_data.

Sparkify app log files in JSON format, generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.

The (daily event) log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

Each daily event log file contains multiple events within that day. Each event is represented by one JSON record (compacted into one line). Below is what a line of JSON may look like (displayed in a pretty format for ease of visualisation)

{
    "artist": null,
    "auth":"LoggedIn",
    "firstName":"Walter",
    "gender":"M",
    "itemInSession":0,
    "lastName":"Frye",
    "length":null,
    "level":"free",
    "location":"San Francisco-Oakland-Hayward, CA",
    "method":"GET",
    "page":"Home",
    "registration":1540919166796.0,
    "sessionId":38,
    "song":null,
    "status":200,
    "ts":1541105830796,
    "userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
    "userId":"39"}

I've also developed a JSON schema (at data_schema/log_schema.json) to help visualise the structure of a JSON record in data/log_data, plus potential JSON validation improvement ideas (e.g. data quality checks / bad data quarantine, etc.).
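Similarly, a minimal sketch of how one of these daily files could be loaded and filtered down to actual song-play events (assuming pandas; the file path is just the example above):

import pandas as pd

# Each line of a daily log file is one JSON event, hence lines=True.
df = pd.read_json("data/log_data/2018/11/2018-11-12-events.json", lines=True)

# Only "NextSong" page events represent song plays; navigation events
# (Home, Login, ...) are dropped.
df = df[df["page"] == "NextSong"]

# ts is a Unix epoch timestamp in milliseconds.
df["start_time"] = pd.to_datetime(df["ts"], unit="ms")

print(df[["start_time", "userId", "level", "song", "artist", "sessionId"]].head())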

4. Rationale of database schema design and ETL pipeline

At a high level, we use a STAR schema to model our data. A STAR schema tends to promote data integrity and minimise data duplication.

Fact Table: songplays - this table is at event level, i.e. one row per log event.

Dimension tables: these tables are in general more static. They provide additional static-like attributes to enrich the fact table. We may query these tables individually, or join them with other tables for more joined-up analysis.

  • users: one row per unique Sparkify user, for us to build user-centric queries, e.g. what artists or songs a particular user listens to.
  • artists: one row per unique artist, for us to build artist-centric queries, e.g. finding the most popular artists of all time, or for a particular period.
  • time: one row per unique timestamp, for us to build time-centric queries, e.g. user volume per weekday, or per month, etc.
  • songs: one row per unique song, for us to build song-centric queries, e.g. how long a song is, who created it, and which year it was released.

How we map JSON fields to the STAR schema table (the big picture)

This section describes how the raw upstream JSON files are mapped to the eventual Postgres sparkifydb tables at field level. This part has helped me develop the initial ETL prototype.

  • <postgres-table-field-name> : <postgres-table-field-type> <-- <raw-json-file>.<json-field-name>: <raw-json-field-type>

Notes: we use the following aliases for documentation purposes below.

  • log: means this comes from the log JSON files (from data/log_data). log.userId refers to the userId field of the log JSON file. And so on.
  • song: means this comes from the song JSON files (from data/song_data). song.song_id refers to the song_id field of the song JSON file. And so on.
  • Including psql constraint types and primary key indications (following iteration using test.ipynb)
  • Note to self: we ought to document the Postgres-JSON mapping via configuration files instead (rather than the unwieldy markdown syntax below), and dynamically generate the SQL create-table queries that way (Infrastructure as Code). Probably in a more JSON-schema fashion. For the future!

Fact Table

  1. songplays - records in log data associated with song plays i.e. records with page NextSong. Fields:

    • songplay_id: SERIAL PRIMARY KEY (auto-generated by Postgres; not sourced from the JSON files)
    • start_time: TIMESTAMP (NOT NULL) <-- log.ts: int (need to convert Unix epoch time in milliseconds into the PSQL TIMESTAMP type. Transform as string "yyyy-mm-dd HH:MM:SS" so we can parse it into the PSQL table smoothly)
    • user_id: BIGINT <-- log.userId: integer
    • level: VARCHAR <-- log.level: string
    • song_id: VARCHAR <-- song.song_id: string
    • artist_id: VARCHAR <-- song.artist_id: string
    • session_id: INT <-- log.sessionId: integer
    • location: VARCHAR <-- log.location: string
    • user_agent: VARCHAR <-- log.userAgent: string

    ... ON CONFLICT (songplay_id) DO NOTHING
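To make the mapping above concrete, here is a hedged sketch of what the corresponding create/insert statements could look like, written as Python strings in the style of sql_queries.py (an illustration of the mapping, not necessarily the exact contents of that module):

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     BIGINT,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

# songplay_id is generated by Postgres (SERIAL), so it is not listed here;
# the ON CONFLICT clause mirrors the note above.
songplay_table_insert = """
INSERT INTO songplays
    (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (songplay_id) DO NOTHING;
"""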

Dimension Tables

  1. users - users in the app. Fields:

    • user_id: BIGINT (PRIMARY KEY) <-- log.userId: integer
    • first_name: VARCHAR <-- log.firstName: string
    • last_name: VARCHAR <-- log.lastName: string
    • gender: VARCHAR <-- log.gender: string
    • level: VARCHAR <-- log.level: string

    ... ON CONFLICT (user_id) DO <replace old values with new values> (see the upsert sketch after this list)

  2. songs - songs in music database. Fields:

    • song_id: VARCHAR (PRIMARY KEY) <-- song.song_id: string
    • title: VARCHAR (NOT NULL) <-- song.title: string
    • artist_id: VARCHAR <-- song.artist_id: string
    • year: INT (NOT NULL) <-- song.year: integer
    • duration: DECIMAL <-- song.duration: number

    ... ON CONFLICT (song_id) DO NOTHING

  3. artists - artists in music database. Fields:

    • artist_id: VARCHAR (PRIMARY KEY) <-- song.artist_id: string
    • name: VARCHAR (NOT NULL) <-- song.artist_name: string
    • location: VARCHAR <-- song.artist_location: string
    • latitude: DOUBLE PRECISION <-- song.artist_latitude: number
    • longitude: DOUBLE PRECISION <-- song.artist_longitude: number

    ... ON CONFLICT (artist_id) DO NOTHING

  4. time - timestamps of records in songplays broken down into specific units. Fields:

    • start_time: TIMESTAMP (PRIMARY KEY) <-- log.ts: int (need to convert Unix epoch time in milliseconds into the PSQL TIMESTAMP type. Transform as string "yyyy-mm-dd HH:MM:SS" so we can parse it into the PSQL table smoothly)
    • hour: INT <-- derived from start_time Get Hour (0-23).
    • day: INT <-- derived from start_time Get Day of Month (1-31).
    • week: INT <-- derived from start_time Get Week Number (0-53).
    • month: INT <-- derived from start_time Get Month Number (1-12).
    • year: INT <-- derived from start_time Get Year (yyyy).
    • weekday: INT <-- derived from start_time Get Weekday (1-7).

    ... ON CONFLICT (start_time) DO NOTHING
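As referenced above, here is a hedged sketch of the users upsert (replacing old values with new ones, so that e.g. a free-to-paid upgrade is reflected) and of how the time fields could be derived with pandas; the names are illustrative rather than a verbatim copy of the project code:

import pandas as pd

# users: on a duplicate user_id, overwrite with the latest values.
user_table_insert = """
INSERT INTO users (user_id, first_name, last_name, gender, level)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT (user_id) DO UPDATE SET
    first_name = EXCLUDED.first_name,
    last_name  = EXCLUDED.last_name,
    gender     = EXCLUDED.gender,
    level      = EXCLUDED.level;
"""

# time: break each epoch-millisecond timestamp down into the units listed above.
def time_breakdown(ts_ms: pd.Series) -> pd.DataFrame:
    t = pd.to_datetime(ts_ms, unit="ms")
    return pd.DataFrame({
        "start_time": t,
        "hour": t.dt.hour,
        "day": t.dt.day,                    # day of month
        "week": t.dt.isocalendar().week,    # ISO week number
        "month": t.dt.month,
        "year": t.dt.year,
        "weekday": t.dt.weekday,            # 0 (Mon) .. 6 (Sun) in pandas
    })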

5. Example queries and results for the song play analysis

Run the example query notebook example_queries.ipynb to perform analysis. At the bottom of that notebook you will see a SQL query that joins across all the normalised tables and generates the eventual denormalised view. Make sure to restart the kernel when you are done.

The following SQL query will create a de-normalised view:

SELECT \
    songplays.start_time AS event_start_time \
    ,songplays.songplay_id AS songplay_id \
    ,time.year AS event_year \
    ,time.month AS event_month \
    ,time.day AS event_day \
    ,time.hour AS event_hour \
    ,time.week AS event_week \
    ,time.weekday AS event_weekday \
    ,users.user_id AS user_id \
    ,users.first_name AS user_first_name \
    ,users.last_name AS user_last_name \
    ,users.gender AS user_gender \
    ,users.level AS user_level \
    ,songs.song_id AS song_id \
    ,songs.title AS song_title \
    ,songs.year AS song_release_year \
    ,songs.duration AS song_duration \
    ,artists.artist_id AS artist_id \
    ,artists.location AS artist_location \
    ,artists.latitude AS artist_latitude \
    ,artists.longitude AS artist_longitude \
FROM songplays \
INNER JOIN users \
    ON users.user_id = songplays.user_id \
INNER JOIN songs \
    ON songs.song_id = songplays.song_id \
INNER JOIN artists \
    ON songplays.artist_id = artists.artist_id \
INNER JOIN time \
    ON songplays.start_time = time.start_time \
;
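On top of these joins, aggregate queries become straightforward. For instance, a "most played artists" query (an illustrative example, not necessarily the one in example_queries.ipynb; the connection string is an assumption to be adjusted to your local setup) could be run like this:

import psycopg2

# Connection string is an assumption; adjust to your local setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

cur.execute("""
    SELECT artists.name, COUNT(*) AS plays
    FROM songplays
    JOIN artists ON artists.artist_id = songplays.artist_id
    GROUP BY artists.name
    ORDER BY plays DESC
    LIMIT 10;
""")
for name, plays in cur.fetchall():
    print(name, plays)

conn.close()

Note that with the sample data this returns a single row, since only one songplays record carries a non-null artist_id.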

6. Appendix

Additional texts are included below. These helped me develop the initial prototypes.

Appendix 1: Detail Development Workflow

Phase 1: experiment

Review and revise the following components (listed in an order that more or less represents the pipeline workflow):

  • psql_sandpit.ipynb: got a bit rusty using Postgres via the terminal commands? Use this notebook (in conjunction with a terminal) to have a play. e.g. invoke the psql command line interface (CLI); list databases / tables; create/drop database and tables; insert values to tables, etc.

  • sql_queries.py: a helper module imported by the main create_tables.py and etl.py module.

  • create_tables.py: This step (re-)generates the sparkifydb database, with the blank skeleton tables created (zero rows populated); a simplified sketch of this flow appears after this list:

    • songplays
    • users
    • songs
    • artists
    • time
  • etl.py: This step loads raw data (from data) into the sparkifydb Postgres database. During the development phase, developers may use the Jupyter Notebook etl.ipynb for experimentation and to iterate on the etl.py module.
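To illustrate the (re-)generation step, below is a simplified sketch of the general flow; the connection strings and the sql_queries import reflect assumptions about this repo's layout rather than a verbatim copy of create_tables.py:

import psycopg2

from sql_queries import create_table_queries  # assumed to export the CREATE TABLE statements

# 1. Connect to a default database so sparkifydb itself can be dropped and recreated.
conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()
cur.execute("DROP DATABASE IF EXISTS sparkifydb")
cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")
conn.close()

# 2. Connect to the fresh sparkifydb and create the empty skeleton tables.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
for query in create_table_queries:
    cur.execute(query)
conn.commit()
conn.close()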

Phase 2: test our ETL pipeline

To test out the eventual ETL pipeline, perform the following steps (all re-runnable):

Setup (or re-setup) the Postgres Environment:

python create_tables.py

Run the ETL script to populate all the postgres tables:

python etl.py

Run test.ipynb from top to bottom to ensure we are not getting any warnings or errors (remember to restart the kernel at the end to free up the session). Note that the bottom part of the notebook contains a final sanity check:

%sql SELECT * FROM songplays WHERE song_id IS NOT NULL OR artist_id IS NOT NULL;

The project rubric suggests that there should be one record with a non-null song_id and a non-null artist_id in the songplays table (i.e. based on the song and log samples, only one Sparkify song-play event overlaps with the Million Song Dataset - joined via song title, artist name, and song duration). If we have done our job correctly, we should expect to see one record as follows:

songplay_id: NextSong
start_time: 2018-11-21 21:56:47
user_id: 15
level: paid
song_id: SOZCTXZ12AB0182364
artist_id: AR5KOSW1187FB35FF4
session_id: 818
location: Chicago-Naperville-Elgin, IL-IN-WI
user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36"

Appendix 3: Potential Future Opportunities

Opportunity 1: test upstream raw JSON records.

Run a JSON schema check to ensure each raw JSON record (in data) aligns to the JSON schema files (in data_schema). Ingest the good records and quarantine the bad records.
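A hedged sketch of what such a check could look like using the third-party jsonschema package; the schema path comes from this repo, while the quarantine directory is just an illustrative name:

import json
from pathlib import Path

from jsonschema import validate, ValidationError  # pip install jsonschema

with open("data_schema/song_schema.json") as f:
    song_schema = json.load(f)

quarantine_dir = Path("data_quarantine/song_data")  # illustrative location
quarantine_dir.mkdir(parents=True, exist_ok=True)

for path in Path("data/song_data").glob("**/*.json"):
    record = json.loads(path.read_text())
    try:
        validate(instance=record, schema=song_schema)
        # good record: continue into the normal ETL path
    except ValidationError as err:
        # bad record: set aside for inspection instead of loading it
        quarantine_dir.joinpath(path.name).write_text(json.dumps(record))
        print(f"quarantined {path.name}: {err.message}")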

Opportunity 2: test the ETL process.

Test on a record-by-record basis and confirm that each raw JSON record is parsed into the sparkifydb tables correctly. Compare expected vs actual values, and log all failed ETL records.

Opportunity 3: mini-batch insert multiple rows rather than 1 row at a time (with SQL COPY).
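A minimal sketch of how that could work with psycopg2's COPY support; the table and column names match the time table above, and the DataFrame is assumed to already hold the transformed rows:

import io

import pandas as pd

def copy_dataframe(cur, df: pd.DataFrame, table: str, columns: list) -> None:
    """Bulk-load a DataFrame into a table via COPY instead of row-by-row INSERTs."""
    buf = io.StringIO()
    df[columns].to_csv(buf, index=False, header=False)
    buf.seek(0)
    cur.copy_expert(
        f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)",
        buf,
    )

# Example usage (cur is an open psycopg2 cursor; time_df holds the transformed time rows):
# copy_dataframe(cur, time_df, "time",
#                ["start_time", "hour", "day", "week", "month", "year", "weekday"])

Note that COPY has no ON CONFLICT handling, so duplicates would need to be dealt with separately (e.g. by copying into a staging table first and then inserting from it with ON CONFLICT).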

Appendix 4: Other Useful References

I have saved a copy of the original Udacity etl.py as etl_template.py for future reference. Maybe our etl.py can be improved further in future.

