By Mike Ajala.
Sparkify is a music streaming startup with a service similar to Spotify. Over the years, it has gathered a large volume of relational data about artists, songs, and users, along with the associated metadata.
While Sparkify has a data analytics team to routinely merge and conform their data sources, the data has become too mangled and unwieldy for their level of expertise. Yet Sparkify's business needs require them to be able to query a source of truth for OLAP-based information at will.
As their new Data Engineer, it is my job to build an ETL pipeline that extracts their data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables, so that their analytics team can continue finding insights into what songs their users are listening to. I'll test my database and ETL pipeline by running queries given by Sparkify's analytics team and comparing my results with their expected results.
My data sources reside in two locations on S3, namely:
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.
data/song_data/A/B/C/TRABCEI128F424C983.json
data/song_data/A/A/B/TRAABJL12903CDCF1A.json
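As a small illustration of this layout, the sketch below rebuilds a song file's path from its track ID. It assumes, based only on the two example paths above, that the three partition letters are the ones following the leading "TR" prefix; the function name `song_file_path` is hypothetical.

```python
from pathlib import PurePosixPath


def song_file_path(track_id: str, root: str = "data/song_data") -> str:
    """Build the partitioned path for a song file from its track ID."""
    # Assumption: the partition letters are characters 3-5 of the ID,
    # i.e. the three letters right after the "TR" prefix.
    if not track_id.startswith("TR") or len(track_id) < 5:
        raise ValueError(f"unexpected track ID: {track_id!r}")
    first, second, third = track_id[2:5]
    return str(PurePosixPath(root, first, second, third, f"{track_id}.json"))


print(song_file_path("TRABCEI128F424C983"))
print(song_file_path("TRAABJL12903CDCF1A"))
```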
And below is an example of what a single song file, TRAAAAW128F429D538.json, looks like.
{"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To", "duration": 218.93179, "year": 0}
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate app activity logs from a music streaming app based on specified configurations.
The log files in the dataset we'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.
data/log_data/2018/11/2018-11-12-events.json
data/log_data/2018/11/2018-11-13-events.json
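The same idea applies to the log files. The sketch below builds the path for a given day. The zero-padded month folder is an assumption (only November appears in the examples), and `log_file_path` is a hypothetical helper name.

```python
from datetime import date


def log_file_path(day: date, root: str = "data/log_data") -> str:
    """Build the year/month-partitioned path of the log file for one day."""
    # Assumption: month folders are zero-padded to two digits.
    return f"{root}/{day.year}/{day.month:02d}/{day.isoformat()}-events.json"


print(log_file_path(date(2018, 11, 12)))
```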
And below is an example of what the data in a log file, 2018-11-01-events.json, looks like (a subset of the file).
{"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
{"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
Using the song and log datasets, I have created a star schema optimized for queries on song play analysis. This includes the following tables.
- songplays - records in log data associated with song plays, i.e., records with page NextSong: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - information about users: user_id, first_name, last_name, gender, level
- songs - information about songs: song_id, title, artist_id, year, duration
- artists - information about artists: artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays broken down into specific units: start_time, hour, day, week, month, year, weekday
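To make the mapping from raw records to table rows concrete, here is a minimal plain-Python sketch (not the Spark code in etl.py). It uses the sample song record and a timestamp from the sample log records shown earlier. Several things are assumptions: that ts is milliseconds since the Unix epoch in UTC, that week means the ISO week number, that weekday counts from Monday as 0, and that the artist_* source fields map onto the artists columns as shown. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# The sample song record shown earlier in this README.
SONG = {
    "num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1",
    "artist_latitude": None, "artist_longitude": None,
    "artist_location": "California - LA", "artist_name": "Casual",
    "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To",
    "duration": 218.93179, "year": 0,
}


def song_row(rec: dict) -> dict:
    """Pick the songs-table columns out of a raw song record."""
    return {k: rec[k] for k in ("song_id", "title", "artist_id", "year", "duration")}


def artist_row(rec: dict) -> dict:
    """Rename the artist_* fields to the artists-table column names."""
    return {
        "artist_id": rec["artist_id"],
        "name": rec["artist_name"],
        "location": rec["artist_location"],
        "latitude": rec["artist_latitude"],
        "longitude": rec["artist_longitude"],
    }


def time_row(ts_ms: int) -> dict:
    """Break a millisecond timestamp down into the time-table columns."""
    start = EPOCH + timedelta(milliseconds=ts_ms)  # exact integer arithmetic
    return {
        "start_time": start,
        "hour": start.hour,
        "day": start.day,
        "week": start.isocalendar()[1],  # ISO week number (assumption)
        "month": start.month,
        "year": start.year,
        "weekday": start.weekday(),      # Monday == 0 (assumption)
    }


print(song_row(SONG))
print(artist_row(SONG))
print(time_row(1541106106796))
```

Spark's own date functions use slightly different conventions (for example, its dayofweek counts from Sunday as 1), so the weekday values here are illustrative only.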
Below are the steps I followed to complete the project:
- I connected to Amazon S3 with my administrative credentials
- I initiated a new Spark session
- I loaded the dataset from a publicly accessible S3 bucket into Spark
- I extracted several fields from the Spark dataframe to create the tables relevant to the company's use case
- I wrote these extracted tables to another writable S3 bucket in Parquet format
- I applied this process to both the songs dataset and the logs dataset
- The entire pipeline can be found in the etl.py file
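The steps above can be sketched roughly as follows. This is not the contents of etl.py, only a minimal outline under stated assumptions: the s3a:// scheme (which needs the Hadoop AWS connector), the output bucket name, the function names, and the use of Spark SQL's timestamp_millis (available from Spark 3.1) are all choices made for illustration. The songplays fact table is left out because it needs a join between the two sources, and this README does not describe that join.

```python
# Spark SQL column expressions for each table, following the schema listed earlier.
SONG_EXPRS = ["song_id", "title", "artist_id", "year", "duration"]
ARTIST_EXPRS = [
    "artist_id",
    "artist_name AS name",
    "artist_location AS location",
    "artist_latitude AS latitude",
    "artist_longitude AS longitude",
]
USER_EXPRS = [
    "userId AS user_id",
    "firstName AS first_name",
    "lastName AS last_name",
    "gender",
    "level",
]
TIME_EXPRS = [
    "start_time",
    "hour(start_time) AS hour",
    "dayofmonth(start_time) AS day",
    "weekofyear(start_time) AS week",
    "month(start_time) AS month",
    "year(start_time) AS year",
    "dayofweek(start_time) AS weekday",
]


def process_song_data(spark, input_root, output_root):
    """Read the song files and write the songs and artists tables as Parquet."""
    # Three wildcard levels match the A/B/C partition folders.
    song_df = spark.read.json(f"{input_root}/song_data/*/*/*/*.json")
    for name, exprs, key in (("songs", SONG_EXPRS, "song_id"),
                             ("artists", ARTIST_EXPRS, "artist_id")):
        (song_df.selectExpr(*exprs)
                .dropDuplicates([key])
                .write.mode("overwrite")
                .parquet(f"{output_root}/{name}"))


def process_log_data(spark, input_root, output_root):
    """Read the log files and write the users and time tables as Parquet."""
    # Two wildcard levels match the year/month partition folders.
    log_df = spark.read.json(f"{input_root}/log_data/*/*/*.json")
    plays = log_df.where("page = 'NextSong'")  # keep song-play events only
    (plays.selectExpr(*USER_EXPRS)
          .dropDuplicates(["user_id"])  # keeps one arbitrary row per user
          .write.mode("overwrite")
          .parquet(f"{output_root}/users"))
    (plays.selectExpr("timestamp_millis(ts) AS start_time")
          .dropDuplicates(["start_time"])
          .selectExpr(*TIME_EXPRS)
          .write.mode("overwrite")
          .parquet(f"{output_root}/time"))


def main():
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()
    input_root = "s3a://udacity-dend"       # source bucket from this README; s3a scheme is an assumption
    output_root = "s3a://my-output-bucket"  # hypothetical; use a bucket you can write to
    process_song_data(spark, input_root, output_root)
    process_log_data(spark, input_root, output_root)
    spark.stop()
```

Calling main() runs both stages. Passing the Spark session into each function keeps the two stages independent, so either can be re-run on its own.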
To execute this pipeline, follow the steps below:
- Obtain administrative credentials to your AWS account
- Save your credentials (access key and secret access key) into the dl.cfg file under the [Credentials] section. Be sure not to enclose your credentials in quotes.
- Open a terminal in the root of this project directory
- Run python etl.py and wait.
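The note about quotes matters because Python's configparser keeps quote characters as part of the value. The sketch below shows this with a sample config string. The key names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are assumptions (check dl.cfg for the names etl.py actually expects), and the values are placeholders.

```python
import configparser

# Placeholder config text; the key names are assumptions, the values are fake.
SAMPLE_CFG = """
[Credentials]
AWS_ACCESS_KEY_ID = EXAMPLE_ACCESS_KEY
AWS_SECRET_ACCESS_KEY = "quoted-secret"
"""


def load_credentials(text: str) -> tuple:
    """Return (access key, secret key) from the [Credentials] section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["Credentials"]
    return section["AWS_ACCESS_KEY_ID"], section["AWS_SECRET_ACCESS_KEY"]


access_key, secret_key = load_credentials(SAMPLE_CFG)
print(access_key)  # unquoted value is read as-is
print(secret_key)  # the surrounding quotes are kept in the value
```

A quoted secret would therefore be sent to AWS with the quote characters included, and authentication would fail.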
The Million Song Dataset provides a real-world scenario for data and metadata analysis in a data mart or data warehouse. The work I have performed in this project can serve as an example for any company with similarly complicated data workloads looking to optimize their data access for transactions or analytics.