askoulwassim / Data_Modeling

Data Engineering Projects: SQL, NoSQL, Data Warehousing, Data Lake & Data Pipeline


Data_Modeling

Relational DBMS with PostgreSQL for Sparkify (Music Library Startup)

Use Case of This Project

Sparkify is a music streaming app with a directory of JSON logs of user activity in the app and JSON metadata on the songs in its catalog. The objective is to create a database that lets an analytics team query the data and analyze user activity across the different songs in the app. We decided that we need a relational database, since we are looking for a simple representation of the provided data while keeping queries simple for our analytics team. We chose Python to write the database schema and ETL pipeline, using a Postgres database.

Based on our need for simple querying of the data by the analytics team, we elected to use a star schema for our fact and dimension tables. Our focus in this database is fast, aggregated insights tailored to the specific needs of the analytics team, and a star schema allows us to build an OLAP system around our data.

Database Setup

We are working with two datasets: the song dataset and the user logs dataset. The song dataset files are partitioned by the first three letters of each song's track ID. For example, here are the filepaths to two files in this dataset.

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

The user logs dataset simulates activity logs from a music streaming app based on specified configurations. The log files in the dataset are partitioned by year and month. For example, here are the filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
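The log files, in contrast, hold one JSON event per line, so they can be read as newline-delimited JSON. A minimal sketch, assuming the same pandas approach as above:

import pandas as pd

# One event per line, so lines=True again.
df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)

# Only NextSong events represent actual song plays (see the fact table below).
df = df[df["page"] == "NextSong"]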

Given these datasets and the need to optimize for queries on song play analysis, our fact table holds one record for every instance a song is played, while our dimension tables are divided into users, songs, artists, and time. The table structures are as follows:

Fact Table
1. songplays - records in log data associated with song plays, i.e. records with page NextSong

songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables
2. users - users in the app

user_id, first_name, last_name, gender, level

3. songs - songs in music database

song_id, title, artist_id, year, duration

4. artists - artists in music database

artist_id, name, location, latitude, longitude

5. time - timestamps of records in songplays broken down into specific units

start_time, hour, day, week, month, year, weekday
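To make the schema concrete, here is a sketch of how the fact table could be declared in sql_queries.py. The column types and constraints are assumptions on our part; check the file for the actual definitions:

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,  -- surrogate key, auto-incremented
    start_time  TIMESTAMP NOT NULL,  -- references time.start_time
    user_id     INT NOT NULL,        -- references users.user_id
    level       VARCHAR,             -- free or paid tier
    song_id     VARCHAR,             -- references songs.song_id
    artist_id   VARCHAR,             -- references artists.artist_id
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""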

Using The Project

  1. Unzip the data.tar file to extract the data files used in this project. The song dataset is a subset of The Million Song Dataset, and the user log dataset was generated by an event simulator.

  2. Run create_tables.py to create the database in Postgres. Please note that the host, dbname, user, and password arguments of the psycopg2.connect calls need to be changed to match your own Postgres connection. There are walkthroughs for setting up Postgres on Windows, and plenty of helpful content on doing the same for Linux or Mac.
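For reference, the connection call looks something like this (the values shown are assumed defaults, not necessarily the ones in create_tables.py):

import psycopg2

# Change host, dbname, user, and password to match your local Postgres setup.
conn = psycopg2.connect(host="127.0.0.1", dbname="studentdb", user="student", password="student")
conn.set_session(autocommit=True)
cur = conn.cursor()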

Note: You need to make sure that the Pandas and JSON libraries are installed.

  3. Run etl.py, which goes through the 71 song files and 30 user log files and imports the data into the database using the queries in sql_queries.py. A simplified sketch of this step follows below.

Check sql_queries.py for the queries used to create the database, and check etl.ipynb to see how the code behind etl.py was developed and tested.
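For a rough idea of what etl.py does, here is a simplified sketch of the song-file side of the pipeline. The insert-statement name song_table_insert and the data/song_data path are assumptions; cur and conn come from a connection like the one shown above:

import glob
import pandas as pd
from sql_queries import song_table_insert  # name assumed; see sql_queries.py

def process_song_file(cur, filepath):
    # Read one song file and insert a row into the songs dimension table.
    df = pd.read_json(filepath, lines=True)
    song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
    cur.execute(song_table_insert, song_data)

# Walk the whole song_data tree and load every file.
for path in glob.glob("data/song_data/**/*.json", recursive=True):
    process_song_file(cur, path)
    conn.commit()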

  4. Run test.ipynb in a Jupyter notebook to verify that the database was created successfully.
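As a quick sanity check outside the notebook, you can also query the finished database directly (the query here is just an example, not taken from test.ipynb):

cur.execute("SELECT * FROM songplays LIMIT 5;")
for row in cur.fetchall():
    print(row)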

Congratulations! You have walked through setting up a relational DBMS with Postgres and applied it to a fake music streaming app. You can apply for a position at Spotify now! 😁
