gcbeltramini / etl-data-warehouse

ETL pipeline that transfers data from S3 into Redshift tables.

ETL - Data Warehouse

Introduction

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3: a directory of JSON logs of user activity in the app, and a directory of JSON metadata about the songs in the app.

The goal is to build an ETL pipeline that:

  1. extracts the data from S3;
  2. stages it in Redshift; and
  3. transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs users are listening to.

Project datasets

There are two datasets that reside in S3:

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data

Log data JSON path: s3://udacity-dend/log_json_path.json (JSONPaths file used when copying the log data into Redshift).

Song dataset

  • Subset of real data from the Million Song Dataset.

  • Each file is in JSON format and contains metadata about a song and the artist of that song.

  • The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:

    • song_data/A/B/C/TRABCEI128F424C983.json
    • song_data/A/A/B/TRAABJL12903CDCF1A.json
  • Example of the song file song_data/A/A/B/TRAABCL128F4286650.json (it's possible to view it in the web browser: http://udacity-dend.s3.amazonaws.com/song_data/A/A/B/TRAABCL128F4286650.json):

    {
      "artist_id": "ARC43071187B990240",
      "artist_latitude": null,
      "artist_location": "Wisner, LA",
      "artist_longitude": null,
      "artist_name": "Wayne Watson",
      "duration": 245.21098,
      "num_songs": 1,
      "song_id": "SOKEJEJ12A8C13E0D0",
      "title": "The Urgency (LP Version)",
      "year": 0
    }

Log dataset

  • Log files in JSON format generated by an event simulator based on the songs in the dataset above. They simulate activity logs from an imaginary music streaming app, based on configuration settings.

  • The log files in the dataset are partitioned by year and month. For example, here are filepaths to two files in this dataset:

    • log_data/2018/11/2018-11-12-events.json
    • log_data/2018/11/2018-11-13-events.json
  • Example of two lines in log file log_data/2018/11/2018-11-01-events.json (it's possible to view it in the web browser: http://udacity-dend.s3.amazonaws.com/log_data/2018/11/2018-11-01-events.json):

    {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":2,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Upgrade","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106132796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
    {"artist":"Mr Oizo","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":3,"lastName":"Summers","length":144.03873,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"Flat 55","status":200,"ts":1541106352796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
    

Schema for song play analysis

The goal is to create a star schema optimized for queries on song play analysis using the song and event datasets. This includes the following tables.

The following AWS Redshift types were used in the database schema:

  • SMALLINT = INT2: Signed two-byte integer; range -32,768 (-2^15) to +32,767 (2^15 - 1)
  • INTEGER = INT, INT4: Signed four-byte integer; range -2,147,483,648 (-2^31) to +2,147,483,647 (2^31 - 1)
  • BIGINT = INT8: Signed eight-byte integer; range -9,223,372,036,854,775,808 (-2^63) to +9,223,372,036,854,775,807 (2^63 - 1)
  • DOUBLE PRECISION = FLOAT8, FLOAT: Double precision floating-point number (15 significant digits of precision)
  • CHAR = CHARACTER, NCHAR, BPCHAR: Fixed-length character string; up to 4,096 bytes
  • VARCHAR = CHARACTER VARYING, NVARCHAR, TEXT: Variable-length character string with a user-defined limit; up to 65,535 bytes (2^16 - 1)
  • TIMESTAMP = TIMESTAMP WITHOUT TIME ZONE: Date and time (without time zone); resolution = 1 microsecond

These types were defined after inspecting the content of the raw JSON files, following the AWS Redshift documentation on data types. Data was also loaded and inspected to check whether the types seemed correct.
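
For example, the users dimension table described below only needs a few of these types. The DDL below is a minimal sketch of how they are used; the actual statements are generated programmatically (see create_sql.py and sql_queries.py) and may differ in details such as constraints and distribution/sort keys:

    CREATE TABLE IF NOT EXISTS users (
        user_id    INT4 PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     CHAR(1),  -- 'M' or 'F'
        level      CHAR(4)   -- 'free' or 'paid'
    );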

Staging tables

These are temporary tables used to stage the data before loading it into the star schema tables; they shouldn't be used for analytical purposes. The tables are listed below, followed by a sketch of how they are loaded:

  • staging_events - from s3://udacity-dend/log_data, with columns:
    • artist, auth, firstname, gender, iteminsession, lastname, length, level, location, method, page, registration, sessionid, song, status, ts, useragent, userid
  • staging_songs - from s3://udacity-dend/song_data, with columns:
    • num_songs, artist_id, artist_latitude, artist_longitude, artist_location, artist_name, song_id, title, duration, year
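
The staging tables are loaded with Redshift's COPY command. A minimal sketch follows, assuming the IAM role ARN comes from dwh.cfg and that the source bucket is in region us-west-2 (both are placeholders/assumptions here); the actual statements are built in sql_queries.py:

    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '<redshift-iam-role-arn>'  -- placeholder, taken from dwh.cfg
    REGION 'us-west-2'                  -- assumed region of the source bucket
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json';

    COPY staging_songs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE '<redshift-iam-role-arn>'  -- placeholder, taken from dwh.cfg
    REGION 'us-west-2'                  -- assumed region of the source bucket
    FORMAT AS JSON 'auto';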

Fact table

  1. songplays: records in event data associated with song plays, i.e., records with page NextSong

    • songplay_id (primary key): INT4; generated automatically
    • start_time: TIMESTAMP; cannot be NULL
    • user_id: INT4; cannot be NULL
    • level: CHAR(4) (the possible values are "free" or "paid")
    • song_id: VARCHAR(64)
    • artist_id: VARCHAR(64)
    • session_id: INT4
    • location: VARCHAR
    • user_agent: VARCHAR

    To match the events with song data, we must find a way to join both staging tables:

    • staging_events has columns:
      • Artist information: name
      • Song information: title, duration
    • staging_songs has columns:
      • Artist information: ID, name, latitude, longitude, location
      • Song information: ID, title, duration, year

    The potential columns for the join are:

    • artist name
    • song title
    • song duration

    The join was made using only artist name and song title, because 9 songs don't have a matching duration (5 of them differed by less than 4 seconds).
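
    A minimal sketch of that INSERT, assuming the staging column names listed above: ts is a Unix epoch in milliseconds, so it is converted to a TIMESTAMP, and userid is cast from text. Whether unmatched events are kept (LEFT JOIN) or dropped is a design choice, and the actual query in sql_queries.py may differ:

        INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                               session_id, location, user_agent)
        SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,
               e.userid::INT4,
               e.level,
               s.song_id,
               s.artist_id,
               e.sessionid,
               e.location,
               e.useragent
        FROM staging_events AS e
        LEFT JOIN staging_songs AS s
               ON e.artist = s.artist_name AND e.song = s.title
        WHERE e.page = 'NextSong';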

Dimension tables

  1. users: users in the app

    • user_id (primary key): INT4
    • first_name, last_name: VARCHAR
    • gender: CHAR(1) (the possible values are "M" or "F")
    • level: CHAR(4) (the possible values are "free" or "paid")

    Table staging_events contains users who didn't play any song and users who changed levels (between paid and free). We chose to keep all users, even those who didn't play any song, and to take the latest values for each user to avoid duplicates (see the SQL sketch after this list).

  2. songs: songs in music database

    • song_id (primary key): VARCHAR(64)
    • title: VARCHAR
    • artist_id VARCHAR(64)
    • year: INT2
    • duration: FLOAT8

    Table staging_songs has unique values of song ID, so we can safely select them all. As a safeguard, we aggregate by song ID and select the maximum value of each attribute; there is no strong reason to prefer the maximum over any other aggregator.

  3. artists: artists in music database

    • artist_id (primary key): VARCHAR(64)
    • name: VARCHAR
    • location: VARCHAR
    • latitude, longitude: FLOAT8

    In table staging_songs, the same artist (same artist ID) may appear under different names, usually on songs featuring guest artists. We chose the minimum value of the name, which hopefully keeps only the original artist, since that name tends to be shorter.

    Regarding artist latitude, longitude and location, some artists have both missing and non-missing values, and some have several distinct values. So we chose the maximum value, which discards the missing values and keeps one of the remaining values.

  4. time: timestamps of records in songplays broken down into specific units

    • start_time (primary key): TIMESTAMP
    • hour, day, week, month, year, weekday: INT2; cannot be NULL

    Since the fact table songplays only has timestamps of when a song was played, we select only those instants from staging_events, making sure they are unique and not NULL (see the SQL sketch below).
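
The SQL below sketches the deduplication strategies described above for two of the dimension tables: users (latest values per user) and time (unique, non-NULL timestamps broken into units); songs and artists follow the same GROUP BY pattern with MAX/MIN aggregates. Column handling in the actual queries (sql_queries.py) may differ:

    -- users: keep only the most recent values for each user
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT userid::INT4, firstname, lastname, gender, level
    FROM (SELECT userid, firstname, lastname, gender, level,
                 ROW_NUMBER() OVER (PARTITION BY userid ORDER BY ts DESC) AS rn
          FROM staging_events
          WHERE userid IS NOT NULL AND userid <> ''  -- skip events without a user ID
         ) AS ranked
    WHERE rn = 1;

    -- time: unique timestamps of song plays, broken down into units
    INSERT INTO time (start_time, hour, day, week, month, year, weekday)
    SELECT DISTINCT start_time,
           EXTRACT(HOUR FROM start_time),
           EXTRACT(DAY FROM start_time),
           EXTRACT(WEEK FROM start_time),
           EXTRACT(MONTH FROM start_time),
           EXTRACT(YEAR FROM start_time),
           EXTRACT(DOW FROM start_time)
    FROM (SELECT TIMESTAMP 'epoch' + ts / 1000 * INTERVAL '1 second' AS start_time
          FROM staging_events
          WHERE page = 'NextSong' AND ts IS NOT NULL) AS t;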

Project structure

├── README.md: this file
├── create_sql.py: contains functions to create all SQL statements (DROP TABLE, CREATE TABLE, ...)
├── create_tables.py: script that drops and creates all tables (staging tables, and fact and
                      dimension tables) on Redshift
├── db_utils.py: auxiliary functions related to the database (read config file, connect to the
                 database, ...)
├── docs
│   ├── aws_redshift.md: instructions to start the Redshift cluster
│   └── tests_debug.md: instructions to debug and run tests
├── dwh.cfg: configuration information (cluster and data source)
├── etl.py: script that copies data from S3 into staging tables on Redshift and processes that data
            into analytics tables (fact and dimension tables) on Redshift
├── requirements
│   ├── requirements.txt: project requirements (Python libraries)
│   ├── requirements_dev.txt: additional requirements used for development
│   └── requirements_test.txt: additional requirements to run unit tests
├── sql_queries.py: SQL statements (DROP TABLE, CREATE TABLE, COPY, INSERT)
├── test_create_sql.py: unit tests for functions in create_sql.py
├── test_data_sanity_checks.ipynb: jupyter notebook for data sanity checks
├── test_db_utils.py: unit tests for functions in db_utils.py
└── test_sql_queries_debug.ipynb: jupyter notebook to view the queries generated programmatically

Run the ETL

  1. Create an AWS Redshift cluster.

  2. Add the Redshift cluster and IAM role information to dwh.cfg.

  3. Run the Python scripts:

    conda create -yn etl-env-redshift python=3.7 --file requirements/requirements.txt
    conda activate etl-env-redshift
    python create_tables.py
    python etl.py
    conda deactivate

    The script create_tables.py takes less than 30 seconds, and the script etl.py takes about 5 minutes (mainly due to the COPY into table staging_songs, which takes between 4 and 5 minutes).

  4. Run the Python unit tests and queries, and debug problems (see docs/tests_debug.md).

  5. When finished, delete the Redshift cluster and remove the Python environment: conda env remove -n etl-env-redshift.
