This project was provided as part of Udacity's Data Engineering Nanodegree program. You can see all the Nanodegree projects here.
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analysis team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate the results, since the data reside in a directory of CSV files on user activity on the app.
They'd like a data engineer to create an Apache Cassandra database that can run queries on song play data to answer these questions, and wish to bring you on the project. Your role is to create a database for this analysis. You'll be able to test your database by running queries given to you by the analytics team from Sparkify to create the results.
The goal of the project is to practice data modeling with Apache Cassandra and to build an ETL pipeline. The pipeline creates an Apache Cassandra database, preprocesses the CSV files, and inserts the data into the database created in the previous step. The Cassandra tables are designed around the queries they have to answer. After merging the CSV files into one large CSV file, a Cassandra table is built for each of the following queries. The next figure shows the attributes needed for each query.
To optimize this query, build the song_info_by_session Cassandra table.
Table Name: song_info_by_session
Column 1: artist text
Column 2: song text
Column 3: length decimal
Column 4: sessionid int
Column 5: itemlnsession int
PRIMARY KEY(sessionid, itemlnsession)
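The table definition above can be written as CQL roughly as follows. This is a minimal sketch, not the notebook's actual code: the `SELECT` statement is an assumed query shape inferred from the primary key, `create_and_query` is a hypothetical helper, and `session` is assumed to be a cassandra-driver `Session` obtained from `Cluster(...).connect(...)` with a keyspace already selected.

```python
# CQL for the table described above; column names follow this README.
CREATE_SONG_INFO_BY_SESSION = """
CREATE TABLE IF NOT EXISTS song_info_by_session (
    sessionid int,
    itemlnsession int,
    artist text,
    song text,
    length decimal,
    PRIMARY KEY (sessionid, itemlnsession)
)
"""

# Assumed query shape: filter on the full primary key, so Cassandra can
# locate the row directly instead of scanning the table.
SELECT_SONG_INFO = (
    "SELECT artist, song, length FROM song_info_by_session "
    "WHERE sessionid = %s AND itemlnsession = %s"
)


def create_and_query(session, session_id, item_in_session):
    """Create the table if needed and return the matching rows as a list.

    `session` is assumed to be a cassandra-driver Session (hypothetical helper).
    """
    session.execute(CREATE_SONG_INFO_BY_SESSION)
    return list(session.execute(SELECT_SONG_INFO, (session_id, item_in_session)))
```

Because the `WHERE` clause uses exactly the columns of the primary key, the query does not need `ALLOW FILTERING`.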
To optimize this query, build the song_playing_history_by_user Cassandra table.
Table Name: song_playing_history_by_user
Column 1: artist text
Column 2: song text
Column 3: first_name text
Column 4: last_name text
Column 5: sessionid int
Column 6: itemlnsession int
Column 7: userid int
PRIMARY KEY((userid,sessionid), itemlnsession)
userid and sessionid form the composite partition key, and itemlnsession is the clustering key. It is used as the clustering key so that the songs in each partition are ordered by itemlnsession, in descending order.
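To make the roles of the partition key and the clustering key concrete, here is a small sketch. The DDL string is an assumption about how the table could be declared: Cassandra sorts clustering columns in ascending order by default, so a descending order has to be requested with `WITH CLUSTERING ORDER BY`. The function `group_into_partitions` is a hypothetical in-memory emulation of the resulting layout, not Cassandra itself, and the sample rows are made up.

```python
from collections import defaultdict

# Assumed DDL for the table above, with descending order requested explicitly.
CREATE_SONG_PLAYING_HISTORY_BY_USER = """
CREATE TABLE IF NOT EXISTS song_playing_history_by_user (
    userid int,
    sessionid int,
    itemlnsession int,
    artist text,
    song text,
    first_name text,
    last_name text,
    PRIMARY KEY ((userid, sessionid), itemlnsession)
) WITH CLUSTERING ORDER BY (itemlnsession DESC)
"""


def group_into_partitions(rows):
    """Emulate the layout: one partition per (userid, sessionid),
    rows inside each partition sorted by itemlnsession, descending."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row["userid"], row["sessionid"])].append(row)
    for partition_rows in partitions.values():
        partition_rows.sort(key=lambda r: r["itemlnsession"], reverse=True)
    return dict(partitions)


# Made-up rows for illustration only.
sample_rows = [
    {"userid": 1, "sessionid": 7, "itemlnsession": 0, "song": "Song A"},
    {"userid": 1, "sessionid": 7, "itemlnsession": 2, "song": "Song C"},
    {"userid": 1, "sessionid": 7, "itemlnsession": 1, "song": "Song B"},
    {"userid": 2, "sessionid": 9, "itemlnsession": 0, "song": "Song D"},
]
layout = group_into_partitions(sample_rows)
print([row["song"] for row in layout[(1, 7)]])  # prints ['Song C', 'Song B', 'Song A']
```

All rows of one user session live in the same partition, so a query that filters on both userid and sessionid reads a single partition and receives the songs already sorted.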
To optimize this query, build the who_listen_to_song Cassandra table.
Table Name: who_listen_to_song
Column 1: userid int
Column 2: first_name text
Column 3: last_name text
Column 4: song text
PRIMARY KEY(song, userid)
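One consequence of this primary key is worth spelling out: Cassandra inserts are upserts, so two inserts with the same (song, userid) pair leave a single row. The sketch below emulates that behavior with a plain dictionary. `emulate_upserts` is a hypothetical helper, the events are made up, and the `INSERT` string only shows an assumed statement in the `%s` placeholder style that cassandra-driver accepts for non-prepared statements.

```python
# Assumed insert statement for the table above.
INSERT_WHO_LISTEN_TO_SONG = (
    "INSERT INTO who_listen_to_song (song, userid, first_name, last_name) "
    "VALUES (%s, %s, %s, %s)"
)


def emulate_upserts(events):
    """Return the rows that would remain after inserting every event.

    A dict keyed by the primary key mimics upsert semantics: a later
    insert with the same (song, userid) overwrites the earlier one.
    """
    table = {}
    for event in events:
        key = (event["song"], event["userid"])
        table[key] = (event["first_name"], event["last_name"])
    return table


# Made-up events: user 1 plays the same song twice, user 2 plays it once.
events = [
    {"song": "Song A", "userid": 1, "first_name": "Ann", "last_name": "Lee"},
    {"song": "Song A", "userid": 1, "first_name": "Ann", "last_name": "Lee"},
    {"song": "Song A", "userid": 2, "first_name": "Bob", "last_name": "Ray"},
]
rows = emulate_upserts(events)
print(len(rows))  # prints 2
```

Including userid in the key keeps different users with the same name apart, while repeated plays by one user collapse into one row, which is what a "who listened to this song" query needs.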
The project works with the event_data dataset, which contains 30 files holding the history of the music streaming app.
The directory of CSV files is partitioned by date. Here are examples of filepaths to two files in the dataset:
- Prepare the environment: install Python.
- Install Apache Cassandra (see the Cassandra documentation for installation instructions).
- Run Project_1B_ Project_Template.ipynb using Jupyter Notebook or any notebook editor.
- Don't forget to close any open connections when you are done.
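Closing connections can be made automatic with a small context manager. This is a sketch under assumptions: `cassandra_session` is a hypothetical helper, and `cluster` is assumed to be a cassandra-driver `Cluster` object, whose `connect()` and `shutdown()` methods (and the session's `shutdown()`) are what the helper relies on.

```python
from contextlib import contextmanager


@contextmanager
def cassandra_session(cluster, keyspace=None):
    """Yield a session from the cluster and always shut both down afterwards."""
    session = cluster.connect(keyspace)
    try:
        yield session
    finally:
        # Runs even if the body raises, so no connection is left open.
        session.shutdown()
        cluster.shutdown()


# Possible usage, assuming a Cassandra node on localhost (not executed here):
#
#     from cassandra.cluster import Cluster
#     with cassandra_session(Cluster(["127.0.0.1"])) as session:
#         session.execute("SELECT release_version FROM system.local")
```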
This folder contains a collection of CSV files. Each file contains one day of the music streaming app's history.
This folder contains the images used in this repository for illustration.
A Python Jupyter Notebook that reads and processes the data, collects it into one file, performs some EDA, and gives detailed instructions on the ETL process and on creating and working with the Cassandra database.
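The step of collecting the daily files into one file could look roughly like the sketch below. It is not the notebook's actual code: `merge_event_files` is a hypothetical helper, the column names passed to it are assumptions about the CSV headers, and skipping rows with an empty `artist` value is an assumed preprocessing rule.

```python
import csv
import glob
import os


def merge_event_files(source_dir, output_path, columns):
    """Merge every CSV file in source_dir into one CSV that keeps only `columns`.

    Returns the number of data rows written. Keep output_path outside
    source_dir so a later run does not read the merged file back in.
    """
    file_paths = sorted(glob.glob(os.path.join(source_dir, "*.csv")))
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf8") as out_file:
        # extrasaction="ignore" drops input columns that are not in `columns`.
        writer = csv.DictWriter(out_file, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for path in file_paths:
            with open(path, newline="", encoding="utf8") as in_file:
                for row in csv.DictReader(in_file):
                    # Assumed rule: rows without an artist are not song plays.
                    if not row.get("artist"):
                        continue
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call such as `merge_event_files("event_data", "merged_events.csv", ["artist", "song", "length"])` would then produce the single file that the Cassandra inserts read from; the output file name and column list here are placeholders.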
- Jupyter Notebook
- Python
- Apache Cassandra
- text editor
- pandas
- Cassandra driver
- os
- csv
I'm Mohamed Bekheet. You can browse my other repositories on my GitHub profile, view my LinkedIn page and Kaggle profile, and contact me through mohamedbekheet33@gmail.com