Datawarehouse Redshift

ETL pipeline for a datawarehouse (DWH) hosted on Redshift AWS

Files

AWS/ used to create aws infrastucture read more on section Creating Infrastucture
sql_queries.py: contains all sql queries
create_tables.py: Create all tables. Read more on details about the tables schema on section Schemas
etl.py: Run ETL pipeline. The pipeline read data from s3 (schema format on subsection Data in s3), and converts the data to a Star Schema format (schema format on subsection *schema format on subsection Datawarehouse Schema)
load_s3.py: Upload *.json data on folder data/to s3.
dwg.cfg: Config file

We are using aws as environment variables in this repo:

export AWS_ACCESS_KEY_ID=<YOUR_AWS_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_AWS_SECRET_ACCESS_KEY>
export AWS_DEFAULT_REGION=<YOUR_REGION>

Create IAM roles for Redshift, run file: iam.py. It will create a role for Redshift with s3 Read access

Creates Redshift cluster and add Ingress rule for TCP connection on port 5439. Run file redshift.py

Upload all dataset to s3. Run load_s3.py

This data will pass trough an ETL to be stored in Redshift

Fill out the dwh.cfg with the desired redshift confirguration, s3 and IAM role

Once you have all the infraestructure ready. Run the commands:

ETL pipeline for a DWH hosted on Redshift

Apache License 2.0

Language:Python 100.0%