Lab | Not hot songs

Introduction

Now that you have scrapped the website Billboard to create a hot_songs dataset, it's time to prepare a new dataset of not_hot_songs. This dataset can contain songs of your choice, others collected from the web or any other combination. Some sources of songs can be:

Wikipedia
Subset of million songs dataset Note: this dataset takes several GB of disk space!!!
Kaggle
Your favourite songs included in the Trello board as long as they are not hot songs :)

Considerations

You want your dataset of not_hot_song to be:

As heterogeneous in terms of (genre, length,...etc) as possible to create better groups of songs.
Not too big and not too small (typically around 2-3K) songs

In a real-life scenario, you might want to have your dataset as biggest as possible and use specialized Big Data techniques like PySpark to group similar songs together. However, you are going to work on your own laptop which has limited power. Therefore, you need to limit the size of your dataset of not_hot_songs otherwise the process of grouping similar songs will take forever.

Deliverables

Your fork should contain a jupyter notebook with the code to:

Gather the songs
Remove songs already present in the hot_songs dataset

IllyaUA / lab-not-hot-songs

Lab | Not hot songs

Introduction

Considerations

Deliverables

About

Languages