A music streaming startup, Sparkify, has grown their user base and song database and wants to move their processes and data onto the cloud. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

The goal is to build an ETL pipeline that:

- extracts their data from S3;
- stages it in Redshift; and
- transforms the data into a set of dimensional tables for their analytics team to continue finding insights into what songs their users are listening to.
There are two datasets that reside in S3:

- Song data: `s3://udacity-dend/song_data`
- Log data: `s3://udacity-dend/log_data`
- Log data JSON path: `s3://udacity-dend/log_json_path.json`
- The song dataset is a subset of real data from the Million Song Dataset.
- Each file is in JSON format and contains metadata about a song and the artist of that song.
- The files are partitioned by the first three letters of each song's track ID. For example, here are the filepaths to two files in this dataset:
  - `song_data/A/B/C/TRABCEI128F424C983.json`
  - `song_data/A/A/B/TRAABJL12903CDCF1A.json`
- Example of the song file `song_data/A/A/B/TRAABCL128F4286650.json` (it can be viewed in a web browser at http://udacity-dend.s3.amazonaws.com/song_data/A/A/B/TRAABCL128F4286650.json):

  ```json
  {
    "artist_id": "ARC43071187B990240",
    "artist_latitude": null,
    "artist_location": "Wisner, LA",
    "artist_longitude": null,
    "artist_name": "Wayne Watson",
    "duration": 245.21098,
    "num_songs": 1,
    "song_id": "SOKEJEJ12A8C13E0D0",
    "title": "The Urgency (LP Version)",
    "year": 0
  }
  ```
- The log files are in JSON format and were generated by an event simulator based on the songs in the dataset above. They simulate app activity logs from an imaginary music streaming app based on configuration settings.
- The log files in the dataset are partitioned by year and month. For example, here are the filepaths to two files in this dataset:
  - `log_data/2018/11/2018-11-12-events.json`
  - `log_data/2018/11/2018-11-13-events.json`
- Example of two lines in the log file `log_data/2018/11/2018-11-01-events.json` (it can be viewed in a web browser at http://udacity-dend.s3.amazonaws.com/log_data/2018/11/2018-11-01-events.json):

  ```json
  {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":2,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Upgrade","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106132796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
  {"artist":"Mr Oizo","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":3,"lastName":"Summers","length":144.03873,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Flat 55","status":200,"ts":1541106352796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
  ```
The goal is to create a star schema optimized for queries on song play analysis, using the song and event datasets. The schema includes the staging, fact, and dimension tables described below.
The following AWS Redshift data types were used in the database schema:

- `SMALLINT` = `INT2`: signed two-byte integer; range -32,768 (-2^15) to +32,767 (2^15 - 1)
- `INTEGER` = `INT`, `INT4`: signed four-byte integer; range -2,147,483,648 (-2^31) to +2,147,483,647 (2^31 - 1)
- `BIGINT` = `INT8`: signed eight-byte integer; range -9,223,372,036,854,775,808 (-2^63) to +9,223,372,036,854,775,807 (2^63 - 1)
- `DOUBLE PRECISION` = `FLOAT8`, `FLOAT`: double-precision floating-point number (15 significant digits of precision)
- `CHAR` = `CHARACTER`, `NCHAR`, `BPCHAR`: fixed-length character string; up to 4,096 bytes
- `VARCHAR` = `CHARACTER VARYING`, `NVARCHAR`, `TEXT`: variable-length character string with a user-defined limit; up to 65,535 bytes (2^16 - 1)
- `TIMESTAMP` = `TIMESTAMP WITHOUT TIME ZONE`: date and time (without time zone); resolution of 1 microsecond
These types were chosen after inspecting the content of the raw JSON files. The data was also loaded and inspected to check that the types were correct.
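For illustration, here is a minimal DDL sketch that combines several of these types, based on the `users` dimension table described below (sort and distribution settings omitted; the actual statements are generated in `create_sql.py` / `sql_queries.py`):

```sql
CREATE TABLE IF NOT EXISTS users (
    user_id    INT4 PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     CHAR(1),  -- "M" or "F"
    level      CHAR(4)   -- "free" or "paid"
);
```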
The staging tables are temporary tables used to stage the data before loading it into the star schema tables. They shouldn't be used for analytical purposes. The tables are:

- `staging_events` - from `s3://udacity-dend/log_data`, with columns: `artist`, `auth`, `firstname`, `gender`, `iteminsession`, `lastname`, `length`, `level`, `location`, `method`, `page`, `registration`, `sessionid`, `song`, `status`, `ts`, `useragent`, `userid`
- `staging_songs` - from `s3://udacity-dend/song_data`, with columns: `num_songs`, `artist_id`, `artist_latitude`, `artist_longitude`, `artist_location`, `artist_name`, `song_id`, `title`, `duration`, `year`
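Both tables are loaded with Redshift `COPY` commands along these lines (a minimal sketch: the IAM role ARN is a placeholder, the region is assumed to be `us-west-2`, and the actual statements live in `sql_queries.py`):

```sql
-- Log events: the JSONPaths file maps the JSON attributes to table columns.
COPY staging_events
FROM 's3://udacity-dend/log_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'  -- placeholder ARN
REGION 'us-west-2'
FORMAT AS JSON 's3://udacity-dend/log_json_path.json';

-- Song metadata: 'auto' matches JSON keys to column names.
COPY staging_songs
FROM 's3://udacity-dend/song_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'  -- placeholder ARN
REGION 'us-west-2'
FORMAT AS JSON 'auto';
```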
- `songplays`: records in event data associated with song plays, i.e., records with page `NextSong`
  - `songplay_id` (primary key): `INT4`; generated automatically
  - `start_time`: `TIMESTAMP`; cannot be `NULL`
  - `user_id`: `INT4`; cannot be `NULL`
  - `level`: `CHAR(4)` (the possible values are `"free"` or `"paid"`)
  - `song_id`: `VARCHAR(64)`
  - `artist_id`: `VARCHAR(64)`
  - `session_id`: `INT4`
  - `location`: `VARCHAR`
  - `user_agent`: `VARCHAR`
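  For illustration, a minimal DDL sketch for the fact table, assuming an `IDENTITY` column for the auto-generated key (sort and distribution settings omitted; see `sql_queries.py` for the actual statement):

  ```sql
  CREATE TABLE IF NOT EXISTS songplays (
      songplay_id INT4 IDENTITY(0, 1) PRIMARY KEY,  -- generated automatically
      start_time  TIMESTAMP   NOT NULL,
      user_id     INT4        NOT NULL,
      level       CHAR(4),
      song_id     VARCHAR(64),
      artist_id   VARCHAR(64),
      session_id  INT4,
      location    VARCHAR,
      user_agent  VARCHAR
  );
  ```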
To match the events with song data, we must find a way to join both staging tables:

- `staging_events` has columns for:
  - artist information: name
  - song information: title, duration
- `staging_songs` has columns for:
  - artist information: ID, name, latitude, longitude, location
  - song information: ID, title, duration, year

The potential columns for the join are:

- artist name
- song title
- song duration

The join was made using only artist name and song title, because 9 songs don't have a matching duration (5 of them differed by less than 4 seconds).
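A minimal sketch of the resulting insert into `songplays`, assuming a `LEFT JOIN` so that `NextSong` events without a matching song are kept (with `song_id` and `artist_id` left as `NULL`), and assuming `ts` holds epoch milliseconds and `userid` is staged as text (hence the cast); the actual statement is in `sql_queries.py`:

```sql
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id, session_id, location, user_agent)
SELECT
    TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,  -- ts is in epoch milliseconds
    e.userid::INT4                                        AS user_id,
    e.level,
    s.song_id,
    s.artist_id,
    e.sessionid                                           AS session_id,
    e.location,
    e.useragent                                           AS user_agent
FROM staging_events AS e
LEFT JOIN staging_songs AS s
    ON e.artist = s.artist_name
   AND e.song = s.title
WHERE e.page = 'NextSong';
```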
- `users`: users in the app
  - `user_id` (primary key): `INT4`
  - `first_name`, `last_name`: `VARCHAR`
  - `gender`: `CHAR(1)` (the possible values are `"M"` or `"F"`)
  - `level`: `CHAR(4)` (the possible values are `"free"` or `"paid"`)

  Table `staging_events` contains users that didn't play any song, as well as users who changed levels (free and paid). We chose to keep all users, even those who didn't play any song, and to take each user's latest values to avoid duplicates (see the sketch below).
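  A minimal sketch of this deduplication, assuming a window function ordered by the event timestamp `ts` and assuming `userid` is staged as text (hence the cast and the empty-string filter); the actual query is in `sql_queries.py`:

  ```sql
  INSERT INTO users (user_id, first_name, last_name, gender, level)
  SELECT user_id, first_name, last_name, gender, level
  FROM (
      SELECT
          userid::INT4 AS user_id,
          firstname    AS first_name,
          lastname     AS last_name,
          gender,
          level,
          -- keep only the most recent event per user, so the latest level wins
          ROW_NUMBER() OVER (PARTITION BY userid ORDER BY ts DESC) AS rn
      FROM staging_events
      WHERE userid IS NOT NULL AND userid <> ''
  ) AS latest_events
  WHERE rn = 1;
  ```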
- `songs`: songs in the music database
  - `song_id` (primary key): `VARCHAR(64)`
  - `title`: `VARCHAR`
  - `artist_id`: `VARCHAR(64)`
  - `year`: `INT2`
  - `duration`: `FLOAT8`

  Table `staging_songs` has unique values of song ID, so we can safely select them all. As a safeguard, however, we group by song ID and take the maximum value of the remaining attributes - there is no strong reason for choosing the maximum rather than another aggregator (see the sketch below).
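  A minimal sketch of this load (the actual query is in `sql_queries.py`):

  ```sql
  INSERT INTO songs (song_id, title, artist_id, year, duration)
  SELECT
      song_id,
      MAX(title)     AS title,
      MAX(artist_id) AS artist_id,
      MAX(year)      AS year,
      MAX(duration)  AS duration
  FROM staging_songs
  WHERE song_id IS NOT NULL
  GROUP BY song_id;  -- song_id is already unique; the GROUP BY is just a safeguard
  ```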
- `artists`: artists in the music database
  - `artist_id` (primary key): `VARCHAR(64)`
  - `name`: `VARCHAR`
  - `location`: `VARCHAR`
  - `latitude`, `longitude`: `FLOAT8`

  In table `staging_songs`, the same artist (same artist ID) may appear under different artist names, usually on songs with other invited artists. We chose the minimum (lexicographically smallest) name: since the featured-artist variants usually append to the original name, the original name tends to sort first. Regarding artist latitude, longitude and location, some artists have both missing and non-missing values, and some have several distinct values, so we chose the maximum value (aggregates skip missing values, so this selects one of the existing values); see the sketch below.
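  A minimal sketch of this load (the actual query is in `sql_queries.py`):

  ```sql
  INSERT INTO artists (artist_id, name, location, latitude, longitude)
  SELECT
      artist_id,
      MIN(artist_name)      AS name,      -- featured-artist variants sort after the original name
      MAX(artist_location)  AS location,  -- aggregates ignore NULLs, so non-missing values win
      MAX(artist_latitude)  AS latitude,
      MAX(artist_longitude) AS longitude
  FROM staging_songs
  WHERE artist_id IS NOT NULL
  GROUP BY artist_id;
  ```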
- `time`: timestamps of records in `songplays` broken down into specific units
  - `start_time` (primary key): `TIMESTAMP`
  - `hour`, `day`, `week`, `month`, `year`, `weekday`: `INT2`; cannot be `NULL`

  Since the fact table `songplays` has only timestamps of when a song was played, we choose only those instants from `staging_events`, making sure they are unique and not `NULL` (see the sketch below).
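  A minimal sketch of this load, assuming the same epoch-millisecond conversion used for `songplays` (the actual query is in `sql_queries.py`):

  ```sql
  INSERT INTO time (start_time, hour, day, week, month, year, weekday)
  SELECT DISTINCT
      start_time,
      EXTRACT(hour  FROM start_time) AS hour,
      EXTRACT(day   FROM start_time) AS day,
      EXTRACT(week  FROM start_time) AS week,
      EXTRACT(month FROM start_time) AS month,
      EXTRACT(year  FROM start_time) AS year,
      EXTRACT(dow   FROM start_time) AS weekday
  FROM (
      SELECT TIMESTAMP 'epoch' + ts / 1000 * INTERVAL '1 second' AS start_time
      FROM staging_events
      WHERE page = 'NextSong' AND ts IS NOT NULL
  ) AS event_times;
  ```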
```
├── README.md: this file
├── create_sql.py: contains functions to create all SQL statements (DROP TABLE, CREATE TABLE, ...)
├── create_tables.py: script that drops and creates all tables (staging tables, and fact and dimension tables) on Redshift
├── db_utils.py: auxiliary functions related to the database (read config file, connect to the database, ...)
├── docs
│   ├── aws_redshift.md: instructions to start the Redshift cluster
│   └── tests_debug.md: instructions to debug and run tests
├── dwh.cfg: configuration information (cluster and data source)
├── etl.py: script that copies data from S3 into staging tables on Redshift and processes that data into analytics tables (fact and dimension tables) on Redshift
├── requirements
│   ├── requirements.txt: project requirements (Python libraries)
│   ├── requirements_dev.txt: additional requirements used for development
│   └── requirements_test.txt: additional requirements to run unit tests
├── sql_queries.py: SQL statements (DROP TABLE, CREATE TABLE, COPY, INSERT)
├── test_create_sql.py: unit tests for functions in create_sql.py
├── test_data_sanity_checks.ipynb: jupyter notebook for data sanity checks
├── test_db_utils.py: unit tests for functions in db_utils.py
└── test_sql_queries_debug.ipynb: jupyter notebook to view the queries generated programmatically
```
- Add the Redshift cluster and IAM role information to `dwh.cfg`.
- Run the Python scripts:

  ```sh
  conda create -yn etl-env-redshift python=3.7 --file requirements/requirements.txt
  conda activate etl-env-redshift
  python create_tables.py
  python etl.py
  conda deactivate
  ```
  The script `create_tables.py` takes less than 30 seconds, and the script `etl.py` takes about 5 minutes (mostly due to the copy into table `staging_songs`, which takes between 4 and 5 minutes).
- When finished, delete the Redshift cluster and remove the Python environment: `conda env remove -n etl-env-redshift`.