ETL on Cloud Data Warehouse for Song Play Analysis

This project transforms raw song and user activity data and loads it into an Amazon Redshift cluster for later analysis. It also fulfills the Data Warehouse project requirements of the Udacity Data Engineering Nanodegree Program.

Prerequisites

  • Python 3 (the steps below create a virtual environment with python3 -m venv)
  • An AWS account with credentials that can create and delete Redshift clusters, IAM roles, and S3 objects

Steps

  1. Bootstrap a virtual environment with the project dependencies.
    $ python3 -m venv ./venv
    $ source ./venv/bin/activate
    $ pip install -r requirements.txt
  2. Copy the config template template.dwh.cfg to dwh.cfg.
    $ cp ./template.dwh.cfg ./dwh.cfg
  3. Fill in the CLUSTER and MANIFEST sections of dwh.cfg.
    • The CLUSTER section is used to create the Redshift cluster from scratch. The values can be chosen freely; here is one possible configuration.
    [CLUSTER]
    DB_NAME=dwh
    DB_USER=dwhuser
    DB_PASSWORD=<choose_whatever_you_want>
    DB_PORT=5439
    CLUSTER_TYPE=multi-node
    NUM_NODES=4
    NODE_TYPE=dc2.large
    CLUSTER_IDENTIFIER=dwhCluster
    IAM_ROLE_NAME=dwhRole
    • The MANIFEST section refers to a separate S3 bucket that stores the Redshift manifest files created in step 4 (the manifest format is shown under Reference sketches below). Here are possible values.
    [MANIFEST]
    BUCKET_NAME=sample-bucket-for-udacity-data-warehouse-project
    EVENT_DATA_KEY=sample-path/sample-log-data-manifest.json
    SONG_DATA_KEY=sample-path/sample-song-data-manifest.json
  4. Prepare the manifest files (see the prepare_manifest.py sketch under Reference sketches below).
    $ python prepare_manifest.py
  5. Spin up the Redshift cluster (see the spin_dwh_up.py sketch under Reference sketches below).
    $ python spin_dwh_up.py
  6. Create the tables and run the ETL (see the COPY/INSERT sketch under Reference sketches below).
    $ python create_tables.py
    $ python etl.py
  7. When finished with the Redshift cluster, tear it down (see the tear_dwh_down.py sketch under Reference sketches below).
    $ python tear_dwh_down.py
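
Reference sketches

The sketches below are illustrative only; they describe the general shape of what each step does, not the exact contents of the scripts in this repository.

A Redshift manifest file is a small JSON document listing the S3 objects to load. The bucket name and object keys below are placeholders; step 4 writes real manifests like this to the bucket configured in the MANIFEST section.

    {
      "entries": [
        { "url": "s3://<source-bucket>/log_data/2018/11/2018-11-01-events.json", "mandatory": true },
        { "url": "s3://<source-bucket>/log_data/2018/11/2018-11-02-events.json", "mandatory": true }
      ]
    }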
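
A minimal sketch of what prepare_manifest.py might do, assuming the raw data lives in a public source bucket (the SOURCE_BUCKET value and the log_data/song_data prefixes are placeholders) and that BUCKET_NAME, EVENT_DATA_KEY and SONG_DATA_KEY come from the MANIFEST section of dwh.cfg:

    import configparser
    import json

    import boto3

    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    # Placeholder: the public bucket holding the raw song and event data.
    SOURCE_BUCKET = "<source-bucket>"

    s3 = boto3.client("s3")

    def build_manifest(prefix):
        """Collect every object under the prefix into a Redshift manifest document."""
        paginator = s3.get_paginator("list_objects_v2")
        entries = []
        for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                entries.append({"url": f"s3://{SOURCE_BUCKET}/{obj['Key']}", "mandatory": True})
        return {"entries": entries}

    # Write one manifest per data set to the bucket configured in dwh.cfg.
    manifest = config["MANIFEST"]
    for prefix, key_option in [("log_data", "EVENT_DATA_KEY"), ("song_data", "SONG_DATA_KEY")]:
        s3.put_object(
            Bucket=manifest["BUCKET_NAME"],
            Key=manifest[key_option],
            Body=json.dumps(build_manifest(prefix)).encode("utf-8"),
        )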
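
A rough sketch of the boto3 calls behind spinning up the warehouse, driven by the CLUSTER section values. The real spin_dwh_up.py likely also waits for the cluster to become available and records its endpoint and role ARN; attaching AmazonS3ReadOnlyAccess is an assumption about how Redshift is given read access to S3.

    import configparser
    import json

    import boto3

    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    cluster = config["CLUSTER"]

    iam = boto3.client("iam")
    redshift = boto3.client("redshift")

    # IAM role that Redshift can assume to read the source and manifest buckets.
    role = iam.create_role(
        RoleName=cluster["IAM_ROLE_NAME"],
        AssumeRolePolicyDocument=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "redshift.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }],
        }),
    )
    iam.attach_role_policy(
        RoleName=cluster["IAM_ROLE_NAME"],
        PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
    )

    # Create the cluster itself from the CLUSTER section values.
    redshift.create_cluster(
        ClusterType=cluster["CLUSTER_TYPE"],
        NodeType=cluster["NODE_TYPE"],
        NumberOfNodes=int(cluster["NUM_NODES"]),
        DBName=cluster["DB_NAME"],
        ClusterIdentifier=cluster["CLUSTER_IDENTIFIER"],
        MasterUsername=cluster["DB_USER"],
        MasterUserPassword=cluster["DB_PASSWORD"],
        Port=int(cluster["DB_PORT"]),
        IamRoles=[role["Role"]["Arn"]],
    )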
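
The table work itself is SQL executed over a psycopg2 connection. The snippet below is not the repository's actual DDL or queries; it only illustrates the pattern of COPY-ing raw JSON into staging tables via the manifests (note the MANIFEST option and the IAM role) and then running INSERT ... SELECT into analytics tables. The HOST and IAM_ROLE/ARN config keys, the table names, and the column names are all illustrative assumptions.

    import configparser

    import psycopg2

    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    cluster = config["CLUSTER"]

    # Assumed keys: HOST and an [IAM_ROLE] ARN recorded after the cluster was created.
    conn = psycopg2.connect(
        host=cluster["HOST"],
        dbname=cluster["DB_NAME"],
        user=cluster["DB_USER"],
        password=cluster["DB_PASSWORD"],
        port=cluster["DB_PORT"],
    )
    cur = conn.cursor()

    role_arn = config["IAM_ROLE"]["ARN"]
    events_manifest = "s3://{}/{}".format(
        config["MANIFEST"]["BUCKET_NAME"], config["MANIFEST"]["EVENT_DATA_KEY"]
    )

    # Load the raw event JSON into a staging table using the manifest prepared in step 4.
    cur.execute(f"""
        COPY staging_events
        FROM '{events_manifest}'
        IAM_ROLE '{role_arn}'
        FORMAT AS JSON 'auto'
        MANIFEST;
    """)

    # Transform staging rows into an (illustrative) analytics fact table.
    cur.execute("""
        INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                               session_id, location, user_agent)
        SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
               e.userId, e.level, s.song_id, s.artist_id,
               e.sessionId, e.location, e.userAgent
        FROM staging_events e
        JOIN staging_songs s ON e.song = s.title AND e.artist = s.artist_name
        WHERE e.page = 'NextSong';
    """)

    conn.commit()
    conn.close()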
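
Finally, tear_dwh_down.py reverses the setup. A minimal sketch, assuming the same CLUSTER section values and the AmazonS3ReadOnlyAccess policy used above:

    import configparser

    import boto3

    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    cluster = config["CLUSTER"]

    redshift = boto3.client("redshift")
    iam = boto3.client("iam")

    # Delete the cluster without keeping a final snapshot (throwaway project infrastructure).
    redshift.delete_cluster(
        ClusterIdentifier=cluster["CLUSTER_IDENTIFIER"],
        SkipFinalClusterSnapshot=True,
    )

    # Detach the managed policy before the role itself can be deleted.
    iam.detach_role_policy(
        RoleName=cluster["IAM_ROLE_NAME"],
        PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
    )
    iam.delete_role(RoleName=cluster["IAM_ROLE_NAME"])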
