SimplifyData / DEND-Project-3-Data-Warehouse-AWS

Project 3 - Data Engineering Nanodegree

Project 3: Song Play Analysis With S3 and Redshift

Project passed

Summary


Preamble

In this project we are going to use two Amazon Web Services: S3 (data storage)
and Redshift (a data warehouse with columnar storage).

Data sources are provided by two public S3 buckets. One bucket contains
info about songs and artists; the second has info about the actions done
by users (which song they are listening to, etc.). The objects contained in both buckets
are JSON files. The song bucket has all of its files under the same directory, but
the event bucket doesn't, so we need a descriptor file (also a JSON) in order to extract
the data from its folders by path. We use a descriptor file because the event folders don't share a common prefix.

The Redshift service is where the data will be ingested and transformed:
through the COPY command we access the JSON files inside
the buckets and copy their content into our staging tables.
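
As a rough illustration of that step, here is a minimal sketch of what such COPY statements can look like when kept as string variables, in the style of sql_queries.py. The bucket paths, IAM role ARN, region and table names below are placeholders, not the exact values used in this repo:

    # Events need the JSON descriptor (JSONPaths) file; songs can use 'auto'
    staging_events_copy = """
        COPY staging_events
        FROM 's3://<event-bucket>/log_data'
        IAM_ROLE '<your-iam-role-arn>'
        REGION 'us-west-2'
        FORMAT AS JSON 's3://<event-bucket>/log_json_path.json';
    """

    staging_songs_copy = """
        COPY staging_songs
        FROM 's3://<song-bucket>/song_data'
        IAM_ROLE '<your-iam-role-arn>'
        REGION 'us-west-2'
        FORMAT AS JSON 'auto';
    """

The descriptor file mentioned above is the JSON that the event COPY points to with FORMAT AS JSON; the song COPY can rely on 'auto' instead.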


Schema definition

This is the schema of the database

How to read the schema:

  • Blank bullets identify the fields that can be null
  • Black bullets identify the fields that cannot be null
  • If a field is underlined, it is a primary key

[schema diagram]

To represent this context, a star schema has been used.

The songplays table is the core of this schema; it is our fact table and
it contains foreign keys to four dimension tables:

  • start_time REFERENCES time(start_time)
  • user_id REFERENCES users(user_id)
  • song_id REFERENCES songs(song_id)
  • artist_id REFERENCES artists(artist_id)

There are also two staging tables: one for the song dataset and one for
the event dataset.
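
To make the fact table concrete, here is a hedged sketch of what its definition could look like among the CREATE statements in sql_queries.py; the column list, types and the IDENTITY surrogate key are assumptions, not a copy of the actual code:

    # Assumed column set; the real table definition may differ
    songplay_table_create = """
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id INT IDENTITY(0,1) PRIMARY KEY,
            start_time  TIMESTAMP NOT NULL REFERENCES time(start_time),
            user_id     INT       NOT NULL REFERENCES users(user_id),
            level       VARCHAR,
            song_id     VARCHAR   NOT NULL REFERENCES songs(song_id),
            artist_id   VARCHAR   NOT NULL REFERENCES artists(artist_id),
            session_id  INT,
            location    VARCHAR,
            user_agent  VARCHAR
        );
    """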


ETL process

In this project most of the ETL is done with SQL (Python is used just as a bridge); transformation and data normalization are done by query, so check out the sql_queries Python module.
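
For instance, loading a dimension table from a staging table is a single INSERT ... SELECT statement executed from Python. The sketch below assumes the staging column names mirror the event dataset fields (userId, firstName, ...), which may differ from the actual table definitions:

    # Assumed column names on both sides; only the pattern matters here
    user_table_insert = """
        INSERT INTO users (user_id, first_name, last_name, gender, level)
        SELECT DISTINCT userId, firstName, lastName, gender, level
        FROM staging_events
        WHERE userId IS NOT NULL;
    """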


How to run

Although the data sources are provided by two S3 buckets, the only thing you need to run the example is an AWS Redshift cluster up and running.

And of course Python

Notes:

  • In this example a Redshift dc2.large cluster with 4 nodes has been created, at a cost of USD 0.25/h per node (on-demand option)
  • In this example we use the IAM role authorization mechanism; the only policy attached to this IAM role is AmazonS3ReadOnlyAccess

After opening a terminal session, set your working directory to the project root folder
and run these commands to launch the demo:

This will create our tables and must be run first:
python create_tables.py

And this will execute our ETL process:
python etl.py
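
Roughly speaking, etl.py connects to the cluster and executes the COPY and INSERT statements defined in sql_queries.py. The following is only a sketch of that flow; the query list names and the [CLUSTER] config keys are assumptions, so the real module may differ:

    import configparser

    import psycopg2

    # Assumed names: lists of SQL strings exported by sql_queries.py
    from sql_queries import copy_table_queries, insert_table_queries


    def main():
        # Read the cluster connection settings from the project configuration file
        config = configparser.ConfigParser()
        config.read('dhw.cfg')

        # Assumed key names inside the [CLUSTER] section
        conn = psycopg2.connect(
            host=config['CLUSTER']['HOST'],
            dbname=config['CLUSTER']['DB_NAME'],
            user=config['CLUSTER']['DB_USER'],
            password=config['CLUSTER']['DB_PASSWORD'],
            port=config['CLUSTER']['DB_PORT'],
        )
        cur = conn.cursor()

        # Stage the raw JSON data, then transform it into the star schema
        for query in copy_table_queries + insert_table_queries:
            cur.execute(query)
            conn.commit()

        conn.close()


    if __name__ == "__main__":
        main()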


Project structure

This is the project structure; if a bullet contains a /
it means that the resource is a folder:

  • /img - A folder with the images used in this README
  • create_tables.py - This script drops the old tables (if they exist) and re-creates the new ones
  • etl.py - This script executes the queries that extract the JSON data from the S3 buckets and ingest it into Redshift
  • sql_queries.py - This file contains variables with the SQL statements in string format, partitioned into CREATE, DROP, COPY and INSERT statements
  • dhw.cfg - Configuration file that contains info about Redshift, IAM and S3 (see the sample layout below)
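
For reference, the configuration file groups those settings into a few sections. The section and key names below are an assumption about a typical layout for this kind of project, so check dhw.cfg itself before running:

    [CLUSTER]
    HOST=<redshift-cluster-endpoint>
    DB_NAME=<database-name>
    DB_USER=<database-user>
    DB_PASSWORD=<database-password>
    DB_PORT=5439

    [IAM_ROLE]
    ARN=<iam-role-arn-with-AmazonS3ReadOnlyAccess>

    [S3]
    LOG_DATA=<s3-path-to-event-data>
    LOG_JSONPATH=<s3-path-to-descriptor-file>
    SONG_DATA=<s3-path-to-song-data>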

About

Project 3 - Data Engineering Nanodegree

License: MIT License


Languages

Language: Python 100.0%