Udacity Data Engineering Nanodegree Project 1 - Data Modeling with PostgreSQL

Project Overview

The goal of this project is to create a relational database for a Music Streaming Startup called Sparkify, to help the analytics team understand its customers behaviour. Currently, the data is stored in JSON files, and needs to be transferred to a relational schema, so the analytics team can easily answer business questions.

This project is divided in three parts:

Create the working environment and install the database and libraries needed for the project;
Create the Relational Database Schema;
Define the ETL code for transforming data and loading it into tables;
Validate the data arquitecture by performing analytics queries.

1. Creating the working environment

The first part of the project consists on setting the working environment by creating a new one using Conda and installing PostgreSQL.

There are two ways to create your working environment, using a conda environment Yaml file or manually creating an environment and installing each library.

1.1. Using Conda Env Yaml file

After cloning the project repository, simply create a new environment using the Yaml file.

git clone https://github.com/$GitHubName/DataScience_DevEnv.git
conda env create --file=data_eng_project1.yaml
conda activate data_eng_project1
ipython kernel install --user --name=data_eng_project1

PostgreSQL and all Python libraries will be installed after running the command above.

1.2. Specifying instalation steps

1.2.1. Creating Conda environment

In order to control our development resources we are going to use Conda environment.

conda create -n data_eng_project1 python=3.7
conda activate data_eng_project1

After this step we can work inside a compartmentalized environment with Python 3.7 and specify the requirements need to develop the project.

1.2.2. Installing PostgreSQL

Next, we need to install PostgreSQL to run locally the database.

sudo apt-get install postgresql postgresql-contrib

With PostgreSQL installed, we can check its status.

sudo /etc/init.d/postgresql status

And start it if needed:

sudo service postgresql start

1.2.3. Installing requirements

Finally we install the Python libraries to connect to PostgreSQL Database and perform data manipulation.

pip install -r requirements.txt

1.2.4. Install Jupyter Kernel

In order to help us develop and test our code we need to install Jupyter Kernel, so that we can run the code on the created Conda environment.

ipython kernel install --user --name=data_eng_project1

2. Create the Relational Database Schema

After setting the working environment is time to start creating the Database architecture to store our data.

2.1. Creating the default database

In order to run the create_tables.py script we need to have a default database already available, so we need to create it first.

Next, we create an user an a database to use in our project.

sudo -u postgres createuser -D -A -P student
sudo -u postgres createdb -O student studentdb

In order to check if the default database was successfully created we can use the following commands:

sudo -i -u postgres

And access psql command line tool to check the settings:

psql
postgres=# \l

You should get the following information in return:

                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 studentdb | student  | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(4 rows)

2.2. Setting permissions for student user

In order to allow user student to create databases we need add the following permission:

postgres=# ALTER USER student WITH CREATEDB;

Next, we can check the user permission.

postgres=# \du

And student should appear with the Create DB attribute.

            List of roles
 Role name |                         Attributes                         | Member of 
-----------+------------------------------------------------------------+-----------
 postgres  | Superuser, Create role, Create DB, Replication, Bypass RLS | {}
 student   | Create DB                                                  | {}

After creating working environment and the default database we are ready to start implementing the scripts to organize data for Sparkify in a relational schema.

2.3. Creating the database and tables

The file create_tables.py has the instructions for creating the database and tables. To execute it:

python create_tables.py

3. Creating the ETL code for loading data into tables

With the tables available, the next step consist in extracting, transforming and loading data into it. This process is also known as ETL (acronym for Extract, Transform and Load) and is coded in the etl.py file.

create etl.py

To check if the creation was successful you can run test.ipynb Jupyter Notebook.

4. Validating the data with SQL queries

Finally, in order to check if the process of creation and loading of data into the database we can use a few queries to aggregate data. You can perform the following queries using the test.ipynb Notebook using the %sql magic command.

Get rows with artist_id and song_id not null:

%sql SELECT * FROM songplays where song_id is NOT NULL and artist_id is NOT NULL

Number of unique artists in artists table:

SELECT COUNT(DISTINCT name) FROM artists

5. Final thoughts

Throughout this project we have reviewed a few key concepts in Relational Database modeling by creating a new schema and parsing and adding data to it. We followed good practices in terms of database normalization and code formatting.

GabrielSGoncalves / udacity_data_engineering_project1