This project uses Spark SQL to analyze music listening data across genres, artists, albums and songs. In the final part of the project, the matplotlib library is used for visualization.
Two datasets are used:
- listenings.csv
  https://drive.google.com/file/d/14dMLzOTIf1GK-P6bA9rVEI_1WSedOdZU/view?usp=sharing
  - A data file of 1 GB
  - Contains: user_id, date, track, artist and album
- genre.csv
  https://drive.google.com/file/d/1q8VWIZFjlOP_91z0GjbCe4RpmtGVDkvz/view?usp=sharing
  - Contains: artist, genre
- A Google account is needed to access the data and run the notebook on Google Colab
- PySpark must be installed in the notebook:
  !pip install pyspark
- The 'date' column is removed from the dataset, and rows with missing (NA) values are dropped.
- Find all records of users who have listened to Rihanna
- Find the top 10 users who are fans of Rihanna
- Find the top 10 most popular tracks
- Find Rihanna's top 10 most popular tracks
- Find the top 10 most popular albums
- Inner join the two dataframes
- Find the top 10 users who are fans of pop music
- Find the top 10 most popular genres
- Find each user's favourite genre
- Find how many pop, rock, metal and hip hop singers there are, then visualize the counts with a bar chart