Redshift - Data warehouse project

This project exercises the use of:

Python/ETL queries to turn a 3NF schema into a star schema
AWS S3 integration to AWS Redshift for parallel import of staging data
Redshift for querying the imported data

Pre-requisites

To satisfy the pre-requisites, we recommend that you use Infrastructure as Code to manage all AWS resources. Tools like AWS Cloudformation, Terraform or Pulumi will make the provisioning/decomissioning of the Redshift cluster, s3 user and IAM user/roles easier to reproduce in the future.

AWS CLI

This will enable us to be authenticated in AWS. The installation guide can be found here.

AWS IAM role

For the Redshift cluster to read file from S3 buckets, you need to create a new IAM role that has AmazonS3ReadOnlyAccess.

This role will delegate S3 access to Redshift.

Take note of the ARN (arn:aws:iam::[AWS ACCOUNT ID]:role/[ROLE NAME]) as you will need it when creating the cluster.

Note: this permission must be further restricted in terms of targeted resources but you can temporarily use that as a baseline if you accept not spending time on building fine-grained IAM policies and are comfortable with this temporary setup/cluster.

AWS IAM User

A new IAM user with programmatic access keys and console access is needed for this project.

The user will need the permission AmazonRedshiftFullAccess and a new policy.

Create a new policy redshift-passrole that has the permission on IAM::PassRole. Here, the PassRole permission for this new user is needed in order to authorize Redshift to assume the previously created IAM role.

Do not use your root AWS account or an existing user to run this project as a security mistake can open access to someone else to your account.

Now that you've given console access to this new user, you can login in with it for the next pre-requirements.

/!\ Keep this information private and never expose it to anyone or in Git.

Note: this permission must be further restricted in terms of target resources and permission scope but you can temporarily use that as a baseline if you accept not spending time on building fine-grained IAM policies and are comfortable with this temporary setup/cluster.

AWS Security Group

We will restrict access to Redshift on our personal IP with a Security Group.

In the EC2 console, create a new security group. It's not ideal but for simplicity, put the security group in the default VPC that is already attached to your AWS account.

Add a rule to the security group as follows:

Type: Custom TCP Rule.
Protocol: TCP.
Port Range: 5439. The default port for Amazon Redshift is 5439, but your port might be different. See note on determining your firewall rules on the earlier "AWS Setup Instructions" page in this lesson.
Source: select Custom IP, then type your own IP with a /32 at the end. You can find your own public IP here. You may share the same public IP in the same service area where customers of your Internet provider are, but this will limit who can access it..

AWS Redshift cluster

Create a Redshift cluster with the following information:

Cluster identifier: redshift-cluster.
Database name: dev.
Database port: 5439.
Master user name: awsuser.
Master user password and Confirm password: password for the master user account.

4 nodes is recommended for this project. You can use the dc2.large EC2 instance as the cheapest option ($0.33/node/hour). This will allow for 2 slices per node for parallelising ETL ingestion.

/!\ Keep this information private and never expose it to anyone or in Git.

In the additional configurations,

Network and security > VPC security groups: select the VPC where your newly created Security group is.
Cluster permissions > Role: specify the role ARN of the newly created role. VPC security groups: redshift_security_group Available IAM roles: myRedshiftRole

If you can't connect to the cluser, follow this support guide

AWS S3 bucket

A new S3 bucket is needed for storing the datasets.

Usage

Structure & configuration

The project includes 3 main files:

create_table.py: Python functions used to connect to Redshift and create, drop
etl.py: Python functions used to connect to Redshift and load staging tables in Redshift and then process that data into your analytics tables on Redshift.
sql_queries.py: provides all SQL statements to DROP, CREATE, and INSERT INTO the fact and dimension tables used in the star schema

To interact with Redshift and S3, you will need to create a configuration file dwh.cfg and supply:

the credentials to access Redshift
AWS role that allows to read S3 buckets from Redshift
S3 locations used for the datasets:

Rename dwh.cfg-dist into dwh.cfg in the root folder and provide the following configurations:

[CLUSTER]
HOST=
DB_NAME=
DB_USER=
DB_PASSWORD=
DB_PORT=

[IAM_ROLE]
ARN=''

[S3]
LOG_DATA='s3://mybucket/log_data'
LOG_JSONPATH='s3://mybucket/log_json_path.json'
SONG_DATA='s3://mybucket/song_data'

Redshift ETL-ing

Once configured, execute the following commands to load the datasets and create the star schema:

pip3 install -r requirements.txt
python3 create_tables.py
python3 etl.py

Datasets

The project datasets are based on a subset from the Million Song Dataset.

A small subset of the data is available in the data folder.

Songs data

You will need to store the data in a S3 bucket with the following structure:

Song data: s3://mybucket/song_data
Log data: s3://mybucket/log_data
Log data json path: s3://mybucket/log_json_path.json

Each file is in JSON format and contains metadata about a song and the artist of that song.

The files are structured in a hierarchy of folders that follow the alphabetical order of the first three letters of each song's track ID.

For example:

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

A single song file has the following form:

{
    "artist_id": "ARJNIUY12298900C91",
    "artist_latitude": null,
    "artist_location": "",
    "artist_longitude": null,
    "artist_name": "Adelitas Way",
    "duration": 213.9424,
    "num_songs": 1,
    "song_id": "SOBLFFE12AF72AA5BA",
    "title": "Scream",
    "year": 2009
}

Log Dataset

The logs dataset is generated from this event simulator based on the songs in the song dataset. It will be used as our simulation of activity logs from users listening to songs.

The log files are partitioned by year and month. For example,

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

A single log file has the following form:

{
...
}
{
    "artist": null,
    "auth": "Logged In",
    "firstName": "Walter",
    "gender": "M",
    "itemInSession": 0,
    "lastName": "Frye",
    "length": null,
    "level": "free",
    "location": "San Francisco-Oakland-Hayward, CA",
    "method": "GET",
    "page": "Home",
    "registration": 1540919166796.0,
    "sessionId": 38,
    "song": null,
    "status": 200,
    "ts": 1541105830796,
    "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
    "userId": "39"
}
....
{
...
}
....

We will use the log_json_path.json JSONPath manifest to inform Redshift on how to import the JSON data files with a COPY from S3 command.

{
    "jsonpaths": [
        "$['artist']",
        "$['auth']",
        "$['firstName']",
        "$['gender']",
        "$['itemInSession']",
        "$['lastName']",
        "$['length']",
        "$['level']",
        "$['location']",
        "$['method']",
        "$['page']",
        "$['registration']",
        "$['sessionId']",
        "$['song']",
        "$['status']",
        "$['ts']",
        "$['userAgent']",
        "$['userId']"
    ]
}

Notes (resources):

Data model

3NF and star schema

This a represenation of the final schema. Usually these tables live in different schemas but for simplicity, staging and fact/dimensions table are all in the same schema ("public" default Redshift/postgres schema).

Redshit specificities

You can learn more about Redshift specificities here:

Redshift data types. Even if Redshift is based on Postgres, some Redshift data types may differ from Postgres data types.
Converting Redshift epochs and timestamps
Defining constraints in Redshift. As per the documentation, "uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift.". Primary keys and not null constraints are recommended only when you can guarantee their validity.
Unique autogenerated values with IDENTITY for columns candidates to auto increments

A quick refresher on the main data types in play in this project:

Numeric data types and their ranges:

Name      Size     Minimum               Maximum
smallint  2 bytes  -32768                +32767
integer   4 bytes  -2147483648           +2147483647
bigint    8 bytes  -9223372036854775808  +9223372036854775807

Choice between BIGINT AND INT for epoch to date conversions:

Name      Size     Minimum Date      Maximum Date
smallint  2 bytes  1969-12-31        1970-01-01
integer   4 bytes  1901-12-13        2038-01-18
bigint    8 bytes  -292275055-05-16  292278994-08-17

willgarcia / redshift-etl-star-schema