Hans - Redshift ETL

============

Amazon Redshift is a high-performance, scalable SQL data warehouse. Queries can be accelerated by AQUA (Advanced Query Accelerator), Redshift's hardware-accelerated cache layer. For SQL-based analytics it is simpler to use than Spark.

Features

Redshift support

Setup

Clone this repo:

git clone git@github.com:andreiliphd/hans-redshift-etl.git

File structure

create_tables.py - script that creates the tables.

etl.ipynb - Jupyter Notebook for ETL and analytics.

etl.py - script that runs the ETL process.

sql_queries.py - SQL queries executed against Redshift.

README.md - instructions for this project.

Usage

Follow the steps below to run the ETL process on a Redshift cluster.

Create a configuration file named dwh.cfg:

[CLUSTER]
HOST={database_host}
DB_NAME={database_name}
DB_USER={database_username}
DB_PASSWORD={database_password}
DB_PORT={database_port}

[IAM_ROLE]
ARN={arn_aws_iam}

[S3]
LOG_DATA=s3://udacity-dend/log_data
LOG_JSONPATH=s3://udacity-dend/log_json_path.json
SONG_DATA=s3://udacity-dend/song_data
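
As a rough illustration of how a script could consume this file, the sketch below parses a dwh.cfg-style text with Python's standard configparser module and assembles a libpq-style connection string. The sample values, the `load_settings` helper, and the use of a libpq-style string are assumptions made for illustration; the repository's scripts may read the file differently.

```python
import configparser

# Placeholder values for illustration only; replace with real cluster details.
SAMPLE_CFG = """
[CLUSTER]
HOST=example-cluster.abc123.us-west-2.redshift.amazonaws.com
DB_NAME=dev
DB_USER=awsuser
DB_PASSWORD=example%password
DB_PORT=5439

[IAM_ROLE]
ARN=arn:aws:iam::123456789012:role/example-role
"""


def load_settings(cfg_text):
    """Parse dwh.cfg-style text; return (connection string, IAM role ARN)."""
    # interpolation=None keeps a literal '%' in a password from being
    # treated as configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    cluster = parser["CLUSTER"]
    # Hypothetical libpq-style connection string built from the [CLUSTER] keys.
    dsn = "host={} dbname={} user={} password={} port={}".format(
        cluster["HOST"],
        cluster["DB_NAME"],
        cluster["DB_USER"],
        cluster["DB_PASSWORD"],
        cluster["DB_PORT"],
    )
    return dsn, parser["IAM_ROLE"]["ARN"]


dsn, role_arn = load_settings(SAMPLE_CFG)
print(role_arn)
```

To read the real file instead of the sample text, `parser.read("dwh.cfg")` can replace the `read_string` call.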

Create tables.

python create_tables.py

Load data into the tables.

python etl.py
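
Loading from S3 into Redshift is typically done with the COPY command. The sketch below shows how such statements could be assembled from the [S3] and [IAM_ROLE] settings in dwh.cfg. The `build_copy` helper, the table names `staging_events` and `staging_songs`, and the example role ARN are hypothetical and are not taken from etl.py.

```python
def build_copy(table, source, iam_role_arn, json_option="auto"):
    """Return a Redshift COPY statement that loads JSON files from S3.

    json_option is either 'auto' or the S3 path of a JSONPaths file.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{source}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )


# Placeholder ARN; in practice this comes from the [IAM_ROLE] section.
ROLE_ARN = "arn:aws:iam::123456789012:role/example-role"

# Log data uses the JSONPaths file named in the [S3] section.
copy_events = build_copy(
    "staging_events",  # hypothetical staging table
    "s3://udacity-dend/log_data",
    ROLE_ARN,
    json_option="s3://udacity-dend/log_json_path.json",
)

# Song data relies on automatic field-to-column matching.
copy_songs = build_copy(
    "staging_songs",  # hypothetical staging table
    "s3://udacity-dend/song_data",
    ROLE_ARN,
)

print(copy_events)
```

If the S3 bucket and the cluster are in different AWS regions, the COPY statement also needs a REGION clause.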

Explanation

Sort and distribution keys can improve query performance. Loading takes longer, but queries run significantly faster. The database is designed as a star schema, which suits analytical queries and requires fewer post-processing steps before further analysis in BI tools.
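
To make the idea concrete, here is a minimal sketch of how sort and distribution keys might be declared for one fact table and one dimension table of a star schema, written as query strings in the style of a sql_queries.py module. The table names, columns, and key choices are hypothetical and are not taken from this repository.

```python
# Hypothetical fact table: rows are distributed by song_id so joins on that
# column stay node-local, and sorted by start_time to speed up time filters.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id BIGINT IDENTITY(0, 1),
    start_time  TIMESTAMP NOT NULL,
    user_id     INTEGER NOT NULL,
    song_id     VARCHAR(32),
    level       VARCHAR(16),
    PRIMARY KEY (songplay_id)
)
DISTSTYLE KEY
DISTKEY (song_id)
SORTKEY (start_time);
"""

# Hypothetical small dimension table: DISTSTYLE ALL copies it to every node,
# so joins against the fact table need no data redistribution.
user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INTEGER NOT NULL,
    first_name VARCHAR(64),
    last_name  VARCHAR(64),
    level      VARCHAR(16),
    PRIMARY KEY (user_id)
)
DISTSTYLE ALL
SORTKEY (user_id);
"""

# A list like this lets a setup script execute each statement in order.
create_table_queries = [songplay_table_create, user_table_create]
```

DISTSTYLE ALL trades extra storage and slower loads for cheaper joins, so it tends to fit small, slowly changing dimension tables rather than large fact tables.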

License

This project is licensed under the terms of the MIT license.
