Project 2 - Data Engineering Nanodegree

Project 2: Song Play Analysis With NoSQL

Project passed


Preamble

This project is not focused on the ETL process, but on data modeling
with Cassandra and how it differs from data modeling on a relational database.

Let's start with the PRIMARY KEY. In Cassandra, primary keys work slightly differently from RDBMS ones:
the PRIMARY KEY is made up of either just the PARTITION KEY
or the PARTITION KEY plus one or more CLUSTERING COLUMNS. Each PRIMARY KEY identifies exactly one record:
if we insert data with the same PRIMARY KEY, the existing record is overwritten with the latest state.

In WHERE conditions the PARTITION KEY must be included, and you can use your CLUSTERING COLUMNS
to order your data, although they are not mandatory: by default, rows within a partition
are ordered by the clustering columns in ASC order.
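
Both behaviours can be seen with a small sketch, assuming a local unauthenticated instance and the Python cassandra-driver package; the keyspace and table names below are invented for illustration and do not come from the project:

from cassandra.cluster import Cluster

# Local, unauthenticated instance; keyspace and table names are made up for this sketch.
session = Cluster(['127.0.0.1']).connect()
session.execute("""CREATE KEYSPACE IF NOT EXISTS pk_demo
                   WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}""")
session.set_keyspace('pk_demo')

session.execute("""CREATE TABLE IF NOT EXISTS plays (
                       session_id INT,
                       item_in_session INT,
                       song_title TEXT,
                       PRIMARY KEY ((session_id), item_in_session))
                   WITH CLUSTERING ORDER BY (item_in_session DESC)""")

# Same PRIMARY KEY twice -> the second INSERT overwrites the first (upsert behaviour).
session.execute("INSERT INTO plays (session_id, item_in_session, song_title) VALUES (1, 1, 'first')")
session.execute("INSERT INTO plays (session_id, item_in_session, song_title) VALUES (1, 1, 'second')")
session.execute("INSERT INTO plays (session_id, item_in_session, song_title) VALUES (1, 2, 'third')")

# Only the PARTITION KEY is filtered on; rows come back ordered by the clustering
# column (DESC here because of CLUSTERING ORDER BY; the default would be ASC).
for row in session.execute("SELECT item_in_session, song_title FROM plays WHERE session_id = 1"):
    print(row.item_in_session, row.song_title)   # 2 third, then 1 second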

In Cassandra, denormalization is a must: you cannot apply normalization as in an RDBMS
because you cannot use JOINs. The more the data are denormalized, the faster the queries
run; Cassandra is optimized for fast writes rather than for fast reads.
To reach complete denormalization you have to follow the pattern 1 Query - 1 Table.
This leads to data duplication, but that does not matter: denormalization
by its very nature produces data duplication.

In terms of the CAP theorem, Cassandra embraces the AP guarantees.
It provides availability and partition tolerance because of its structure: data are replicated
across nodes, so if a node goes down another one can satisfy the client request. Due to the
high number of nodes, however, the data may not be up to date on every node, which is why
Cassandra offers only Eventual Consistency.


ETL process

Although this is not an ETL-focused project, we still have to care about how data are
ingested into the database and how to read the input properly.
Our ETL process consists of reading every file in /event_data, aggregating the data
into a file called event_datafile_new.csv and, after the aggregation,
parsing that file and persisting it to the database.
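
As a rough sketch of that aggregation step (the notebook contains the authoritative version; the column subset below is an assumption based on the fields used by the queries in this README):

import csv
import glob
import os

# Collect every CSV under /event_data (the directory is partitioned by date).
file_paths = glob.glob(os.path.join('event_data', '**', '*.csv'), recursive=True)

# Assumed subset of columns kept in event_datafile_new.csv.
columns = ['artist', 'firstName', 'lastName', 'length',
           'song', 'itemInSession', 'sessionId', 'userId']

with open('event_datafile_new.csv', 'w', newline='', encoding='utf8') as out:
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction='ignore')
    writer.writeheader()
    for path in file_paths:
        with open(path, newline='', encoding='utf8') as f:
            for row in csv.DictReader(f):
                if row.get('artist'):          # keep only rows that describe a song play
                    writer.writerow(row)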


How to run

First of all, you need a Cassandra instance up and running.
Here you can find the binary packages for your preferred operating system.
After downloading the package, if you do not know how to move on, just follow this Documentation.

You also have to install Python and Jupyter Notebook.

Note:
In this example we will not use any authorization mechanism

After installing the database, Python and Jupyter on your local machine,
open your terminal and type

jupyter notebook

It will start the service; after the service has started, just drag and drop the notebook into the Jupyter file browser and open it.
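
Inside the notebook, the connection to the local instance looks roughly like the sketch below, assuming the Python cassandra-driver package; the keyspace name is only an assumption for illustration, the notebook defines its own:

from cassandra.cluster import Cluster

# Connect to the local instance; no authentication, as noted above.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# The keyspace name 'sparkify' is assumed here for illustration only.
session.execute("""CREATE KEYSPACE IF NOT EXISTS sparkify
                   WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}""")
session.set_keyspace('sparkify')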


Project structure

This is the project structure; if a bullet starts with /,
the resource is a folder:

  • /event_data - The directory of CSV files partitioned by date
  • /images - Simply a folder with images that are used in Project_1B_ Project_Template notebook
  • Project_1B_ Project_Template.ipynb - It is a notebook that illustrates the project step by step
  • event_datafile_new.csv - The aggregated CSV composed of all the event_data files

CQL queries

Query 1: Give me the artist, song title and song's length in the music app history that was heard
during sessionId = 338 and itemInSession = 4

CREATE TABLE IF NOT EXISTS song_data_by_session (
        session_id INT,
        session_item_number INT,
        artist_name TEXT,
        song_title TEXT,
        song_length DOUBLE,
        PRIMARY KEY ((session_id, session_item_number)))

In this case session_id and session_item_number are enough to make a record unique for our request; together they form the COMPOUND PARTITION KEY. Our complete PRIMARY KEY is composed of session_id and session_item_number.

SELECT artist_name, song_title, song_length
 FROM song_data_by_session
 WHERE session_id = 338 AND session_item_number = 4
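
From the notebook, the query can be executed roughly as in the sketch below, assuming the cassandra-driver session from the How to run section (the keyspace name is still an assumption):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('sparkify')   # keyspace name assumed, see above

query = """SELECT artist_name, song_title, song_length
           FROM song_data_by_session
           WHERE session_id = %s AND session_item_number = %s"""

for row in session.execute(query, (338, 4)):
    print(row.artist_name, row.song_title, row.song_length)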

Query 2: Give me only the following: name of artist, song (sorted by itemInSession)
and user (first and last name) for userid = 10, sessionid = 182

CREATE TABLE IF NOT EXISTS song_user_data_by_user_and_session_data (
        user_id INT,
        session_id INT,
        session_item_number INT,
        artist_name TEXT,
        song_title TEXT,
        user_first_name TEXT,
        user_last_name TEXT,
        PRIMARY KEY ((user_id, session_id), session_item_number))

In this case user_id and session_id form the COMPOUND PARTITION KEY, which gives us a unique PRIMARY KEY for our query. For this request we have to order by session_item_number without filtering on it, so we declare session_item_number as a CLUSTERING COLUMN.
Our complete PRIMARY KEY is composed of user_id, session_id and session_item_number.

SELECT artist_name, song_title, user_first_name, user_last_name
 FROM song_user_data_by_user_and_session_data
 WHERE user_id = 10 AND session_id = 182

Query 3: Give me every user name (first and last) in my music app history who listened
to the song 'All Hands Against His Own'

CREATE TABLE IF NOT EXISTS user_data_by_song_title (
        song_title TEXT,
        user_id INT,
        user_first_name TEXT,
        user_last_name TEXT,
        PRIMARY KEY ((song_title), user_id))

In this case song_title is the PARTITION KEY and user_id is the CLUSTERING COLUMN. The request asks us to retrieve user names by song title, so we set song_title as the PARTITION KEY. However, more than one user can listen to the same song, so we may issue many INSERTs with the same key, and Cassandra overwrites data with the same key. We therefore add user_id as a CLUSTERING COLUMN to keep each record unique, even though we do not query on it. Our complete PRIMARY KEY is composed of song_title and user_id.

SELECT user_first_name, user_last_name
 FROM user_data_by_song_title
 WHERE song_title = 'All Hands Against His Own'
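
To see why user_id must be part of the key, consider the sketch below (cassandra-driver session as above; the user values are invented for illustration): two different users listening to the same song stay as two distinct rows instead of overwriting each other.

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('sparkify')   # keyspace name assumed, see above

insert = """INSERT INTO user_data_by_song_title (song_title, user_id, user_first_name, user_last_name)
            VALUES (%s, %s, %s, %s)"""

# Hypothetical users, made up for this sketch.
session.execute(insert, ('All Hands Against His Own', 1, 'Ada', 'Lovelace'))
session.execute(insert, ('All Hands Against His Own', 2, 'Alan', 'Turing'))

rows = session.execute("""SELECT user_first_name, user_last_name
                          FROM user_data_by_song_title
                          WHERE song_title = %s""", ('All Hands Against His Own',))
for row in rows:
    print(row.user_first_name, row.user_last_name)   # both rows are returned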
