samadarshad / UdacityDataC3

Redshift

Background

This project extracts data from a sample songplay dataset in an S3 bucket and transforms it into fact and dimension tables in Redshift.

Setting up Redshift

Follow the instructions in L3 Exercise 2 - IaC - Solution.ipynb to set up a Redshift cluster.

Dataset

The data is hosted at http://udacity-dend.s3.us-west-2.amazonaws.com

Running the scripts

Store the HOST and ARN from the Redshift setup in dwh.cfg
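
As a rough illustration of how the stored values might be read back, the sketch below parses a config with Python's configparser. The section names (CLUSTER, IAM_ROLE) and the sample values are assumptions for illustration only; the real dwh.cfg in this project may be laid out differently.

```python
import configparser

# Hypothetical layout: the section names and values below are assumptions,
# not copied from the project's actual dwh.cfg.
SAMPLE_CFG = """
[CLUSTER]
HOST = example-cluster.abc123.us-west-2.redshift.amazonaws.com

[IAM_ROLE]
ARN = arn:aws:iam::123456789012:role/exampleRedshiftRole
"""


def load_settings(cfg_text):
    """Parse config text and return the HOST and ARN values as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        "host": parser.get("CLUSTER", "HOST"),
        "arn": parser.get("IAM_ROLE", "ARN"),
    }


settings = load_settings(SAMPLE_CFG)
print(settings["host"])
print(settings["arn"])
```

To read the file on disk instead of a string, parser.read("dwh.cfg") could replace the read_string call.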

Run create_tables.main()

Run etl.main()

Debugging

Run 'SELECT * FROM stl_load_errors' against the cluster to inspect errors raised while loading data.
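
A narrower query is often easier to read than SELECT *. The sketch below builds one that returns the most recent load errors with a few of the columns stl_load_errors provides. The helper names are hypothetical, and cur is assumed to be any DB-API cursor (for example one from psycopg2) already connected to the cluster.

```python
def load_errors_query(limit=10):
    """Build a query for the most recent load errors."""
    limit = int(limit)  # coerce to int so only a number reaches the SQL text
    return (
        "SELECT starttime, filename, line_number, colname, err_reason "
        "FROM stl_load_errors "
        "ORDER BY starttime DESC "
        f"LIMIT {limit};"
    )


def print_load_errors(cur, limit=10):
    """Run the query on an open DB-API cursor and print each error row."""
    cur.execute(load_errors_query(limit))
    for row in cur.fetchall():
        print(row)


print(load_errors_query(5))
```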

Use a subset of the song data for faster loading and debugging, e.g. set SONG_DATA='s3://udacity-dend/song-data/A/A'.
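
To show where the narrower prefix would take effect, here is a minimal sketch of a Redshift COPY statement built from SONG_DATA. The table name staging_songs, the placeholder role ARN, and the build_copy_sql helper are assumptions for illustration; the project's own COPY statement may use different options.

```python
# Narrow S3 prefix from the tip above: only keys under song-data/A/A are loaded.
SONG_DATA = "s3://udacity-dend/song-data/A/A"

# Placeholder ARN (assumption); the real value comes from the Redshift setup.
EXAMPLE_ARN = "arn:aws:iam::123456789012:role/exampleRedshiftRole"


def build_copy_sql(table, s3_path, iam_role_arn, region="us-west-2"):
    """Build a COPY statement that loads JSON files under an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"REGION '{region}'\n"
        "FORMAT AS JSON 'auto';"
    )


# 'staging_songs' is a hypothetical table name used only for this example.
print(build_copy_sql("staging_songs", SONG_DATA, EXAMPLE_ARN))
```

Switching back to the full dataset would then only require changing the SONG_DATA value.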

Expected results

Populated staging tables:

Staging Events

Staging Songs

Populated tables:

Artists

Songs

Users

Songplays

Times
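
One quick way to confirm that the tables above are populated is to count the rows in each. The sketch below assumes table names derived from the headings above (the actual names in create_tables may differ) and a DB-API cursor cur connected to the cluster; both helpers are hypothetical.

```python
# Assumed table names, derived from the headings above; adjust to match
# the names actually created by create_tables.
TABLES = [
    "staging_events",
    "staging_songs",
    "artists",
    "songs",
    "users",
    "songplays",
    "times",
]


def count_query(table):
    """Build a row-count query for one table."""
    return f"SELECT COUNT(*) FROM {table};"


def empty_tables(cur, tables=TABLES):
    """Return the names of tables that contain no rows."""
    empty = []
    for table in tables:
        cur.execute(count_query(table))
        (count,) = cur.fetchone()
        if count == 0:
            empty.append(table)
    return empty
```

An empty list returned by empty_tables(cur) would suggest that every table received at least one row.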
