
Data Modeling with Postgres

Stephanie Anderton | DEND Project #1 | April 20, 2019

Sparkify Songplay Database

The purpose of this database is to make it easier for Sparkify to query their data; the main goal is to understand which songs users are listening to. It is populated with data extracted from JSON song metadata files and JSON user-activity log files from the Sparkify music streaming app.

Datasets

Song Dataset

The song dataset consists of files in JSON format, each containing metadata about a single song and that song's artist. The files are partitioned into subdirectories named for the three letters that follow the TR prefix of each song's track ID. For example, here are the file paths for two files in this dataset.

song_data/A/B/C/TRABCAJ12903CDFCC2.json
song_data/A/B/A/TRABAVQ12903CBF7E0.json

Here is an example of what a single song file, TRABCAJ12903CDFCC2.json, looks like in JSON format.

{"num_songs": 1, "artist_id": "ARULZCI1241B9C8611", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Luna Orbit Project", "song_id": "SOSWKAV12AB018FC91", "title": "Midnight Star", "duration": 335.51628, "year": 0}

Log Dataset

The log dataset consists of files in JSON format, each containing event records of user activity in the music streaming app. These files are partitioned into subdirectories by year and month. For example, here are the file paths for two files in this dataset.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

Here is an example of what the first line of data (a single event record in JSON format) looks like in the file labelled 2018-11-23-events.json.

{"artist":"Great Lake Swimmers","auth":"Logged In","firstName":"Kevin","gender":"M","itemInSession":0,"lastName":"Arellano","length":215.11791,"level":"free","location":"Harrisburg-Carlisle, PA","method":"PUT","page":"NextSong","registration":1540006905796.0,"sessionId":815,"song":"Your Rocky Spine","status":200,"ts":1542931645796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"66"}
...

Note: Only log records whose page field is “NextSong” are associated with song plays, and only these records are loaded into the database.
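
One plausible way to apply this filter with pandas, converting the millisecond ts field to a timestamp along the way (the start_time column name is illustrative):

import pandas as pd

# Read one log file (one JSON event per line) and keep only song plays.
df = pd.read_json('log_data/2018/11/2018-11-23-events.json', lines=True)
df = df[df['page'] == 'NextSong'].copy()

# ts is a Unix timestamp in milliseconds; convert it for the time table.
df['start_time'] = pd.to_datetime(df['ts'], unit='ms')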

DB Schema

The sparkifydb data model is essentially a star schema (or a minimal snowflake schema because of the relationship between songs and artists), and is implemented in a Postgres database. It contains one fact table of songplays and four dimension tables for users, songs, artists, and time.

This schema is not fully normalized: the level field appears in both the songplays fact table and the users dimension table. Its structure allows queries for song play analysis to be optimized, with simpler joins and aggregations, since all data essential to song plays and user level is contained in the songplays fact table.
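
As a sketch of the fact table's shape, here is hypothetical DDL in the style of sql_queries.py; the column names are inferred from the datasets and the sample query below, and the actual types and constraints may differ:

# Hypothetical DDL string, as it might appear in sql_queries.py.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""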

ETL Pipeline

The ETL pipeline extracts data from files in two local directories, /data/log_data and /data/song_data, transforms the data, and loads it into the five tables of the sparkifydb database. This is handled by three files using Python and SQL.

Step  File              Purpose
1     create_tables.py  Creates and initializes the tables for the sparkifydb database.
2     etl.py            Reads and processes files from the song_data and log_data directories, and loads them into the sparkifydb database tables.
-     sql_queries.py    Contains all SQL queries. This file is imported into etl.py.

Note: These three Python files live in the same directory as the data directory, which in turn contains the song_data and log_data directories to be processed.
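
For reference, etl.py typically discovers the JSON files with a recursive directory walk; a minimal sketch (the function name get_files is illustrative):

import glob
import os

def get_files(filepath):
    """Return the absolute paths of all .json files found under filepath."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        for match in glob.glob(os.path.join(root, '*.json')):
            all_files.append(os.path.abspath(match))
    return all_files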

Steps to Run the ETL

  1. In a terminal, run python create_tables.py to reset the tables in the sparkifydb database:
root@68edcb7a4e06:/home/workspace# python create_tables.py
Connected to default database
Dropped the sparkifydb database
Created the sparkifydb database
Connected to sparkifydb database
root@68edcb7a4e06:/home/workspace#
  2. Then, in the same terminal, run python etl.py to process the datasets (some lines in the example output below have been removed for readability; a sketch of the loop that prints this output follows it):
root@68edcb7a4e06:/home/workspace# python etl.py
Connected the sparkifydb database
71 files found in data/song_data
1/71 files processed.
2/71 files processed.
3/71 files processed.
...
68/71 files processed.
69/71 files processed.
70/71 files processed.
71/71 files processed.
30 files found in data/log_data
1/30 files processed.
2/30 files processed.
3/30 files processed.
...
28/30 files processed.
29/30 files processed.
30/30 files processed.
root@68edcb7a4e06:/home/workspace#
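
The progress lines above come from a per-directory driver loop in etl.py; a minimal sketch with illustrative names (process_data, func), which may not match the actual code:

import glob
import os

def process_data(cur, conn, filepath, func):
    """Apply func to every JSON file under filepath, printing progress."""
    all_files = glob.glob(os.path.join(filepath, '**', '*.json'), recursive=True)
    print(f'{len(all_files)} files found in {filepath}')

    for i, datafile in enumerate(all_files, 1):
        func(cur, datafile)   # insert this file's records into the tables
        conn.commit()         # commit after each file so progress is durable
        print(f'{i}/{len(all_files)} files processed.')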

Sample Query

Note: The following query was run in a Jupyter Notebook; the Python code needed to execute the SQL statement has been omitted for readability. The query can be run as-is from pgAdmin 4 or a psql terminal.

SELECT DISTINCT p.user_id, p.song_id, p.artist_id
FROM   songplays p
WHERE  p.level = 'paid';

Output:

 * postgresql://student:***@127.0.0.1/sparkifydb
23 rows affected.
user_id	song_id	artist_id
70	None	None
85	None	None
82	None	None
25	None	None
58	None	None
36	None	None
15	None	None
88	None	None
42	None	None
80	None	None
30	None	None
73	None	None
15	SOZCTXZ12AB0182364	AR5KOSW1187FB35FF4
95	None	None
16	None	None
29	None	None
97	None	None
72	None	None
65	None	None
20	None	None
49	None	None
44	None	None
24	None	None
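
For completeness, the omitted glue code could look like the following psycopg2 sketch; the user and host are taken from the connection string shown above, and the password is an assumption:

import psycopg2

# Connect with the credentials from the notebook's connection string above
# (the password is assumed; substitute your own).
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

cur.execute("""
    SELECT DISTINCT p.user_id, p.song_id, p.artist_id
    FROM   songplays p
    WHERE  p.level = 'paid';
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()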
