ETL & Cloud Data Warehouse in AWS
Stephanie Anderton DEND Project #3 May 29, 2019
Sparkify Songplay Data Warehouse
A music streaming startup, Sparkify, wants to move their processes and data onto the cloud. Their data resides in S3, in a directory of JSON logs on user activity in the Sparkify music streaming app, as well as a directory with JSON metadata on the songs in their app.
This ETL pipeline will load their data from S3 to the staging tables on Redshift, and transform the data into a set of dimensional tables for their analytics team to continue finding insights in what songs their users are listening to.
Datasets
The two datasets reside in S3:
- Song data:
s3://udacity-dend/song_data
- Log data:
s3://udacity-dend/log_data
Log data JSON path: s3://udacity-dend/log_json_path.json
Song Dataset
The song dataset consists of files in JSON format, each containing metadata about a single song and the artist of that song. The files are partitioned in subdirectories, organized by the first three letters after TR
of each song's track ID. For example, these are the file paths for two files in this dataset.
song_data/A/B/C/TRABCAJ12903CDFCC2.json
song_data/A/B/A/TRABAVQ12903CBF7E0.json
Here is an example of what a single song file, TRABCAJ12903CDFCC2.json, looks like in JSON format.
{"num_songs": 1, "artist_id": "ARULZCI1241B9C8611", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Luna Orbit Project", "song_id": "SOSWKAV12AB018FC91", "title": "Midnight Star", "duration": 335.51628, "year": 0}
Log Dataset
The log dataset consists of files in JSON format, each containing metadata about event activity in the music streaming app. These files are partitioned in subdirectories, organized by year and month. For example, these are two files in this dataset.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
Here is an example of what the first line of data (a single event record in JSON format) looks like in the file labelled 2018-11-23-events.json.
{"artist":"Great Lake Swimmers","auth":"Logged In","firstName":"Kevin","gender":"M","itemInSession":0,"lastName":"Arellano","length":215.11791,"level":"free","location":"Harrisburg-Carlisle, PA","method":"PUT","page":"NextSong","registration":1540006905796.0,"sessionId":815,"song":"Your Rocky Spine","status":200,"ts":1542931645796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"66"}
...
Note: Only log records with page
feature as “NextSong” are associated with song plays and loaded to the final database.
DB Schema
The sparkify data model is essentially a star schema (or a minimal snowflake schema because of the relationship between songs and artists), and is implemented in a Postgres database on Redshift. It contains one fact table of songplays and four dimension tables for users, songs, artists, and time.
This schema is not fully normalized, as the level
feature is replicated in the songplays fact table as well as the users dimension table. It’s structure will allow queries to be optimized for song play analysis, with simpler joins and aggregations. All essential data for songplays and user level is contained in the songplays fact table.
ETL Pipeline
The ETL pipeline extracts data from directories in S3, stages them on Redshift, and then transforms and loads the data into the five tables of the sparkify database. This is handled by three files written in Python and SQL.
Step | File | Purpose |
---|---|---|
1 | create_tables.py | Creates and initializes the staging tables and final dimensional tables for the sparkify database. |
2 | etl.py | Reads and processes files from the song_data and log_data directories on S3, and loads them into the sparkify database tables. |
- | sql_queries.py | Contains all SQL queries. This file is imported into create_tables.py and etl.py. |
- | dwh.cfg | Configuration file required for launching Redshift cluster and accessing datasets on S3. |
* | mylib.py | Library with methods for logging events during the ETL process. |
* Additional code, not part of the project requirements.
Steps to Run the ETL
In a terminal, run the following commands to create (or reset) the tables in the sparkify database and to process the datasets:
- python create_tables.py
- python etl.py,
The following is an example of the commands and output generated when running the scripts.
steph@STEPH-LAPTOP MINGW64 ~/Udacity/DEND/PROJECT_3/ETL-Cloud-Data-Warehouse (master)
$ python create_tables.py
Logfile : ./logs/etl-20190529.log
host=dwhcluster.cbsjbxldkge8.us-west-2.redshift.amazonaws.com dbname=sparkify user=dwhuser password=Passw0rd port=5439
(base)
steph@STEPH-LAPTOP MINGW64 ~/Udacity/DEND/PROJECT_3/ETL-Cloud-Data-Warehouse (master)
$ python etl.py
Logfile: ./logs/etl-20190529.log
host=dwhcluster.cbsjbxldkge8.us-west-2.redshift.amazonaws.com dbname=sparkify user=dwhuser password=Passw0rd port=5439
Load Staging tables...
Insert into Final tables...
Check table counts...
(base)
steph@STEPH-LAPTOP MINGW64 ~/Udacity/DEND/PROJECT_3/ETL-Cloud-Data-Warehouse (master)
$
Logfile Output
Starting the Redshift Cluster
09:42:08 AM : ===[ Inititate Cluster ]===
09:42:08 AM : 2019-05-29 09:42:08 AM
09:42:09 AM : DWH_CLUSTER_TYPE: multi-node
09:42:09 AM : DWH_NUM_NODES: 4
09:42:09 AM : DWH_NODE_TYPE: dc2.large
09:42:09 AM : DWH_CLUSTER_IDENTIFIER: dwhCluster
09:42:09 AM : DWH_REGION: us-west-2
09:42:09 AM : HOST: dwhcluster.cbsjbxldkge8.us-west-2.redshift.amazonaws.com
09:42:09 AM : DB_NAME: sparkify
09:42:09 AM : DB_USER: dwhuser
09:42:09 AM : DB_PASSWORD: Passw0rd
09:42:09 AM : DB_PORT: 5439
09:42:09 AM : IAM_ROLE_NAME: dwhRole
09:42:09 AM : IAM_POLICY_ARN: arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
09:42:09 AM : ARN: 'arn:aws:iam::376450510082:role/dwhRole'
09:42:09 AM : Creating a new IAM Role
09:42:10 AM : New IAM Role ARN: arn:aws:iam::376450510082:role/dwhRole
09:42:11 AM : Create cluster successful
09:42:12 AM : ClusterIdentifier: dwhcluster
09:42:12 AM : NodeType: dc2.large
09:42:12 AM : ClusterStatus: creating
09:42:12 AM : MasterUsername: dwhuser
09:42:12 AM : DBName: sparkify
09:42:12 AM : VpcId: vpc-aa278fd2
09:42:12 AM : NumberOfNodes: 4
09:48:15 AM : ClusterStatus: available
09:48:15 AM : time to spin cluster: 6 minutes
09:48:15 AM : HOST: dwhcluster.cbsjbxldkge8.us-west-2.redshift.amazonaws.com
09:48:15 AM : ARN: arn:aws:iam::376450510082:role/dwhRole
09:48:16 AM : Opened TCP port on cluster endpoint
09:48:17 AM : host=dwhcluster.cbsjbxldkge8.us-west-2.redshift.amazonaws.com dbname=sparkify user=dwhuser password=Passw0rd port=5439
09:48:17 AM : connected to database: sparkify
09:48:17 AM : Ready for ETL...
Create Tables & ETL
05:28:16 PM : ---[ Create Tables ]---
05:28:16 PM : 2019-05-29 05:28:16 PM
05:28:17 PM : DB connection : open
05:28:17 PM : Drop existing tables...
05:28:17 PM : delete table [ staging_events ]
05:28:17 PM : delete table [ staging_songs ]
05:28:18 PM : delete table [ songplays ]
05:28:18 PM : delete table [ users ]
05:28:18 PM : delete table [ songs ]
05:28:18 PM : delete table [ artists ]
05:28:19 PM : delete table [ time ]
05:28:19 PM : Create tables...
05:28:19 PM : create table [ staging_events ]
05:28:19 PM : create table [ staging_songs ]
05:28:20 PM : create table [ songplays ]
05:28:20 PM : create table [ users ]
05:28:20 PM : create table [ songs ]
05:28:21 PM : create table [ artists ]
05:28:21 PM : create table [ time ]
05:28:21 PM : DB connection : closed
05:29:24 PM : ---[ Begin ETL ]---
05:29:24 PM : 2019-05-29 05:29:24 PM
05:29:24 PM : LOG_DATA: 's3://udacity-dend/log_data'
05:29:24 PM : LOG_JSONPATH: 's3://udacity-dend/log_json_path.json'
05:29:24 PM : SONG_DATA: 's3://udacity-dend/song_data'
05:29:24 PM : DB connection : open
05:29:24 PM : Disable result cache for session
05:29:24 PM : Load staging tables...
05:29:24 PM : load staging table [ staging_events ]...
05:29:28 PM : load staging table [ staging_songs ]...
05:33:12 PM : Load final tables...
05:33:12 PM : insert to table [ songplays ]
05:33:12 PM : insert to table [ users ]
05:33:13 PM : insert to table [ songs ]
05:33:13 PM : insert to table [ artists ]
05:33:13 PM : insert to table [ time ]
05:33:14 PM : Check table counts...
05:33:14 PM : table count [ staging_events ] : 8056
05:33:14 PM : table count [ staging_songs ] : 14896
05:33:15 PM : table count [ songplays ] : 333
05:33:15 PM : table count [ users ] : 104
05:33:15 PM : table count [ songs ] : 14896
05:33:15 PM : table count [ artists ] : 10025
05:33:16 PM : table count [ time ] : 6813
05:33:16 PM : DB connection : closed
Sample Queries
Here are some sample queries that can be run to test out the final dimensional tables:
- Return the top 10 most frequently played songs, based on the song ID.
- Return the top 10 users based on the total number of songs they have listened to in the app.
- Get the user ID for the user who has listened to the most number of songs in the app.
- Return the top 5 sessions with the most number of songs, for the top user (found by the previous query) with ID = 49, the user who has listened to the most number of songs.
Top 10 Songs in songplays
Return the top 10 most frequently played songs, based on the song ID.
Note: some song titles appear more than once if there are multiple versions associated with variations in the artist ID.
top_10_songs = ("""
WITH songplays_ext AS (
SELECT *
FROM songplays
JOIN songs
ON sp_song_id = s_song_id
JOIN artists
ON sp_artist_id = a_artist_id
)
SELECT s_title AS "song title",
a_name AS "artist name",
COUNT(*) AS count
FROM songplays_ext
GROUP BY s_title, a_name
ORDER BY count DESC, s_title, a_name
LIMIT 10;
""")
Output: 10 rows
song title | artist name | count |
---|---|---|
You're The One | Dwight Yoakam | 37 |
Catch You Baby (Steve Pitron & Max Sanna Radio Edit) | Lonnie Gordon | 9 |
I CAN'T GET STARTED | Ron Carter | 9 |
Nothin' On You [feat. Bruno Mars] (Album Version) | B.o.B | 8 |
Hey Daddy (Daddy's Home) | Usher | 6 |
Hey Daddy (Daddy's Home) | Usher featuring Jermaine Dupri | 6 |
Make Her Say | Kid Cudi | 5 |
Make Her Say | Kid Cudi / Kanye West / Common | 5 |
Up Up & Away | Kid Cudi | 5 |
Up Up & Away | Kid Cudi / Kanye West / Common | 5 |
The output shows how there is a real need to clean the data; there are many songs that have duplicates with variations on the artist name.
Top 10 Users in songplays
Return the top 10 users based on the total number of songs they have listened to in the app.
top_10_users = ("""
WITH songplays_ext AS (
SELECT sp_songplay_id, u_first_name, u_last_name, u_user_id
FROM songplays
JOIN users
ON sp_user_id = u_user_id AND
sp_level = u_level
)
SELECT DISTINCT( u_first_name || ' ' || u_last_name ) AS "user name",
u_user_id AS "user ID",
COUNT(*) AS "song count"
FROM songplays_ext
GROUP BY "user ID", "user name"
ORDER BY "song count" DESC, "user name"
LIMIT 10;
""")
Output: 10 rows
user name | user id | song count |
---|---|---|
Chloe Cuevas | 49 | 42 |
Kate Harrell | 97 | 32 |
Tegan Levine | 80 | 31 |
Aleena Kirby | 44 | 21 |
Jacob Klein | 73 | 18 |
Mohammad Rodriguez | 88 | 17 |
Lily Koch | 15 | 15 |
Jacqueline Lynch | 29 | 13 |
Layla Griffin | 24 | 13 |
Matthew Jones | 36 | 13 |
Chloe has listened to a total of 42 songs, and Kate a total of 32.
ID for user with most songs
Get the user ID for the user who has listened to the most number of songs in the app.
top_user_id = ("""
WITH songplays_ext AS (
SELECT sp_session_id, u_user_id
FROM songplays
JOIN users
ON sp_user_id = u_user_id AND
sp_level = u_level
),
session_counts AS (
SELECT u_user_id,
COUNT( sp_session_id ) AS count
FROM songplays_ext
GROUP BY u_user_id
),
max_session AS (
SELECT MAX(count) AS max_count
FROM session_counts
)
SELECT u_user_id AS "top user id"
FROM session_counts
WHERE count = (
SELECT max_count
FROM max_session
);
""")
Output: 1 rows
top user id |
---|
49 |
Top 5 sessions with most songs for Top User (ID = 49)
Return the top 5 sessions with the most number of songs, for the top user with ID = 49, the user who has listened to the most number of songs.
top_5_sessions_top_user_49 = ("""
WITH songplays_user AS (
SELECT *
FROM songplays
WHERE sp_user_id = 49
),
user_sessions AS (
SELECT u_first_name, u_last_name,
sp_session_id, sp_start_time, s_title
FROM songplays_user
JOIN users
ON sp_user_id = u_user_id AND
sp_level = u_level
JOIN songs
ON sp_song_id = s_song_id
)
SELECT (u_first_name || ' ' || u_last_name) AS "user name",
sp_session_id AS "session ID",
(DATE_PART('year',
sp_start_time) || '-' || DATE_PART('month',
sp_start_time) || '-' || DATE_PART('day',
sp_start_time)) AS date,
COUNT(s_title) AS "song count"
FROM user_sessions
GROUP BY sp_session_id, date, "user name"
ORDER BY "song count" DESC, date
LIMIT 5;
""")
Output: 5 rows
user name | session id | date | song count |
---|---|---|---|
Chloe Cuevas | 1041 | 2018-11-29 | 11 |
Chloe Cuevas | 1079 | 2018-11-30 | 5 |
Chloe Cuevas | 816 | 2018-11-21 | 3 |
Chloe Cuevas | 576 | 2018-11-14 | 2 |
Chloe Cuevas | 758 | 2018-11-20 | 2 |
In her longest session, on November 29 in 2018, Chloe listened to 11 songs.