Project: Data Lakes with Spark

Synopsis

Sparkify is a music streaming startup with a service similar to Spotify. Over the years, it has gathered a large amount of relational data about artists, songs, users, and associated metadata.

While Sparkify has a data analytics team that routinely merges and conforms its data sources, the data has become too mangled and unwieldy for the team's level of expertise. Yet Sparkify's business needs require the ability to query a single source of truth for OLAP-based information at will.

As their new Data Engineer, it is my job to build an ETL pipeline that extracts their data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables, so their analytics team can continue finding insights into what songs their users are listening to. I'll test my database and ETL pipeline by running queries given by Sparkify's analytics team and comparing my results with their expected results.

The Data

My data sources reside in two locations on S3, namely:

Song data s3://udacity-dend/song_data
Log data s3://udacity-dend/log_data

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID (in the examples below, the three letters that follow the TR prefix). For example, here are file paths to two files in this dataset.

data/song_data/A/B/C/TRABCEI128F424C983.json
data/song_data/A/A/B/TRAABJL12903CDCF1A.json
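
The partitioning rule can be illustrated with a small plain-Python sketch. It assumes, based only on the two example paths above, that the partition letters are the three characters following the "TR" prefix; the function name song_file_path and its root parameter are hypothetical and not taken from etl.py.

```python
from pathlib import PurePosixPath


def song_file_path(track_id, root="data/song_data"):
    """Build the partitioned path for a track ID such as 'TRABCEI128F424C983'."""
    # Assumption: the partition letters are the three characters after "TR".
    if not track_id.startswith("TR") or len(track_id) < 5:
        raise ValueError(f"unexpected track ID: {track_id!r}")
    first, second, third = track_id[2:5]
    return str(PurePosixPath(root, first, second, third, f"{track_id}.json"))


print(song_file_path("TRABCEI128F424C983"))
print(song_file_path("TRAABJL12903CDCF1A"))
```

Because every file sits exactly three directories deep, a wildcard pattern with three directory levels is enough to match the whole dataset.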

And below is an example of what a single song file, TRAAAAW128F429D538.json, looks like.

{"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To", "duration": 218.93179, "year": 0}
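
As a rough illustration of how one song record carries both song and artist attributes, the sketch below splits the sample record into two rows using plain Python. The target column names follow the songs and artists tables of this project's schema; the mapping from artist_name, artist_location, artist_latitude, and artist_longitude to the shorter column names is an assumption, and split_song_record is a hypothetical helper rather than code from etl.py.

```python
import json

# The sample record shown above, copied verbatim.
SAMPLE_SONG = """{"num_songs": 1, "artist_id": "ARD7TVE1187B99BFB1", "artist_latitude": null, "artist_longitude": null, "artist_location": "California - LA", "artist_name": "Casual", "song_id": "SOMZWCG12A8C13C480", "title": "I Didn't Mean To", "duration": 218.93179, "year": 0}"""


def split_song_record(record):
    """Return (song_row, artist_row) built from one parsed song record."""
    song_row = {
        key: record[key]
        for key in ("song_id", "title", "artist_id", "year", "duration")
    }
    # Assumed renaming: drop the "artist_" prefix for the artists table.
    artist_row = {
        "artist_id": record["artist_id"],
        "name": record["artist_name"],
        "location": record["artist_location"],
        "latitude": record["artist_latitude"],
        "longitude": record["artist_longitude"],
    }
    return song_row, artist_row


song, artist = split_song_record(json.loads(SAMPLE_SONG))
print(song)
print(artist)
```

Note that JSON null values, such as the latitude and longitude here, become None in Python, so downstream code should expect missing coordinates.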

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate app activity logs from a music streaming app based on specified configurations.

The log files in the dataset we'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.

data/log_data/2018/11/2018-11-12-events.json
data/log_data/2018/11/2018-11-13-events.json

And below is an example of what the data in a log file (subset of the file), 2018-11-01-events.json, looks like.

{"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
{"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
{"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
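
Only some log events represent song plays. The following plain-Python sketch shows the idea of keeping just the events whose page is NextSong, using the three sample events above trimmed to a handful of fields for readability (the trimming is mine; the values are from the sample). In Spark, the same selection would typically be expressed as a filter on the page column of the DataFrame.

```python
import json

# Sample events from above, trimmed to a few fields for readability.
RAW_LOG = """\
{"artist": null, "page": "Home", "sessionId": 38, "song": null, "ts": 1541105830796, "userId": "39"}
{"artist": null, "page": "Home", "sessionId": 139, "song": null, "ts": 1541106106796, "userId": "8"}
{"artist": "Des'ree", "page": "NextSong", "sessionId": 139, "song": "You Gotta Be", "ts": 1541106106796, "userId": "8"}
"""

# Each line of a log file is one standalone JSON object.
events = [json.loads(line) for line in RAW_LOG.splitlines() if line.strip()]

# Keep only the events that correspond to an actual song play.
song_plays = [event for event in events if event["page"] == "NextSong"]

for play in song_plays:
    print(play["userId"], play["artist"], play["song"])
```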

Schema for Song Play analysis

Using the song and log datasets, I have created a star schema optimized for queries on song play analysis. This includes the following tables.

Fact Table

songplays - records in log data associated with song plays, i.e., records with page NextSong
- songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

users - information about users
- user_id, first_name, last_name, gender, level
songs - information about songs
- song_id, title, artist_id, year, duration
artists - information about artists
- artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units
- start_time, hour, day, week, month, year, weekday
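
To make the time table concrete, here is a minimal plain-Python sketch that breaks one log timestamp into the listed units. It assumes the ts field is milliseconds since the Unix epoch, interprets it in UTC, uses ISO week numbers, and counts weekdays from Monday = 0; none of these conventions are stated in this README, and time_row is a hypothetical helper.

```python
from datetime import datetime, timezone


def time_row(ts_ms):
    """Break a millisecond epoch timestamp into the time table's columns."""
    # Assumption: ts is in milliseconds and should be interpreted as UTC.
    start = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start,
        "hour": start.hour,
        "day": start.day,
        "week": start.isocalendar()[1],  # ISO week number
        "month": start.month,
        "year": start.year,
        "weekday": start.weekday(),  # Monday = 0 ... Sunday = 6
    }


# ts taken from the NextSong event in the sample log data.
row = time_row(1541106106796)
print(row)
```

Whatever conventions the pipeline uses for weeks and weekdays, they should be documented next to the table, since different tools number them differently.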

The Steps

Below are the steps I followed to complete the project:

I connected to Amazon S3 with my administrative credentials
I initiated a new Spark session
I loaded the dataset from a publicly accessible S3 bucket into Spark
I selected fields from the Spark DataFrames to create the fact and dimension tables described above
I wrote these tables to a separate, writable S3 bucket in Parquet format
I adapted this process both for the songs dataset and the logs dataset
The entire pipeline can be found in the etl.py file
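
The steps above can be sketched for the song dataset roughly as follows. This is a minimal illustration and not the contents of etl.py: the function name process_song_data, its parameters, the de-duplication on the ID columns, the overwrite mode, and the output folder names are all assumptions. It uses standard PySpark DataFrame calls (read.json, select, selectExpr, dropDuplicates, write.parquet). Reading from S3 additionally assumes the cluster has an S3 connector configured.

```python
def process_song_data(spark, input_data, output_data):
    """Read the song JSON files and write the songs and artists tables as Parquet.

    `spark` is an existing SparkSession; `input_data` and `output_data` are
    base paths (hypothetical parameters chosen for this sketch).
    """
    # Three wildcard levels match the three partition directories.
    song_df = spark.read.json(f"{input_data}/song_data/*/*/*/*.json")

    # songs dimension: one row per song_id (de-duplication is an assumption).
    songs = song_df.select(
        "song_id", "title", "artist_id", "year", "duration"
    ).dropDuplicates(["song_id"])
    songs.write.mode("overwrite").parquet(f"{output_data}/songs")

    # artists dimension: rename the artist_* columns to the schema's names.
    artists = song_df.selectExpr(
        "artist_id",
        "artist_name AS name",
        "artist_location AS location",
        "artist_latitude AS latitude",
        "artist_longitude AS longitude",
    ).dropDuplicates(["artist_id"])
    artists.write.mode("overwrite").parquet(f"{output_data}/artists")


def main():
    # Imported here so the module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sparkify-etl-sketch").getOrCreate()
    # Both paths are placeholders; substitute your own buckets.
    process_song_data(spark, "<input-base-path>", "<output-base-path>")
```

The log dataset would follow the same read, select, and write pattern, with the additional step of keeping only NextSong events before building the songplays, users, and time tables.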

To run the program

To execute this pipeline, follow the steps below:

Obtain administrative credentials to your AWS account
Save your credentials (access key and secret access key) into the dl.cfg file under the [Credentials] section. Be sure not to enclose your credentials in quotes.
Open a terminal in the root of this project directory
Run python etl.py and wait for it to finish.
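
To show why the credentials must not be quoted, here is a small sketch that parses a dl.cfg-style file with Python's standard configparser module. The option names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and the placeholder values are assumptions for illustration; only the [Credentials] section name comes from the instructions above. The helper load_credentials is hypothetical.

```python
import configparser

# Hypothetical dl.cfg contents with placeholder values (not real credentials).
SAMPLE_CFG = """\
[Credentials]
AWS_ACCESS_KEY_ID = EXAMPLEKEYID
AWS_SECRET_ACCESS_KEY = exampleSecretValue
"""


def load_credentials(cfg_text):
    """Return (access_key_id, secret_access_key) from dl.cfg-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["Credentials"]
    return section["AWS_ACCESS_KEY_ID"], section["AWS_SECRET_ACCESS_KEY"]


key_id, secret = load_credentials(SAMPLE_CFG)
print(key_id)

# configparser does not strip quotes, so a quoted value keeps them literally
# and would no longer match the real credential.
quoted_cfg = SAMPLE_CFG.replace("EXAMPLEKEYID", '"EXAMPLEKEYID"')
quoted_key_id, _ = load_credentials(quoted_cfg)
print(quoted_key_id)
```

To read an actual file instead of a string, parser.read("dl.cfg") can replace the read_string call.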

Relevance to today's workload

The Million Song Dataset provides a realistic scenario for data and metadata analysis in a data mart or data warehouse. The work in this project can serve as a template for any company with similarly complicated data workloads that wants to optimize its data access for analytics.
