IllyaUA / lab-not-hot-songs

Lab | Not hot songs

Introduction

Now that you have scraped the Billboard website to create a hot_songs dataset, it's time to prepare a new dataset of not_hot_songs. This dataset can contain songs of your choice, songs collected from the web, or any combination of the two. Some sources of songs can be:

Considerations

You want your dataset of not_hot_songs to be:

  • As heterogeneous as possible (in genre, length, etc.) to create better groups of songs.
  • Not too big and not too small (typically around 2-3K songs).

In a real-life scenario, you might want your dataset to be as big as possible and use specialized Big Data tools like PySpark to group similar songs together. However, you are going to work on your own laptop, which has limited power. Therefore, you need to limit the size of your not_hot_songs dataset; otherwise, the process of grouping similar songs will take forever.
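One way to keep the dataset within the suggested size is to draw a reproducible random sample from whatever you collected. The snippet below is a minimal sketch, not a prescribed method: it assumes the songs are held as a list of dictionaries, and the function name `cap_dataset`, the `max_songs` default of 3000, and the `seed` value are all illustrative choices.

```python
import random


def cap_dataset(songs, max_songs=3000, seed=42):
    """Return at most `max_songs` songs, picked uniformly at random.

    `songs` is assumed to be a list of dicts (one dict per song).
    A fixed seed makes the sample reproducible between notebook runs.
    """
    if len(songs) <= max_songs:
        # Already small enough: return a copy, untouched.
        return list(songs)
    rng = random.Random(seed)
    # Sample without replacement, so no song is picked twice.
    return rng.sample(songs, max_songs)


# Toy data standing in for a larger collection of songs.
collected = [{"title": f"Song {i}", "artist": f"Artist {i % 50}"} for i in range(10_000)]
not_hot_songs = cap_dataset(collected)
print(len(not_hot_songs))
```

A uniform sample keeps roughly the same genre and length mix as the collection it is drawn from, so it is worth gathering from varied sources before sampling.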

Deliverables

Your fork should contain a Jupyter notebook with the code to:

  • Gather the songs
  • Remove songs already present in the hot_songs dataset
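The second deliverable can be sketched as follows. This is only one possible approach, under stated assumptions: both datasets are lists of dictionaries with `title` and `artist` keys (your own column names may differ), and the helpers `song_key` and `remove_hot_songs` are hypothetical names introduced for illustration. Titles and artists are compared after trimming whitespace and ignoring case, so that trivial formatting differences do not hide a match.

```python
def song_key(song):
    """Build a comparison key from title and artist.

    Whitespace is trimmed and case is ignored. The "title" and
    "artist" keys are assumptions about how the data is stored.
    """
    return (song["title"].strip().casefold(), song["artist"].strip().casefold())


def remove_hot_songs(candidates, hot_songs):
    """Return the candidates that do not appear in hot_songs.

    Duplicates inside `candidates` are dropped as well, keeping
    the first occurrence of each song.
    """
    hot_keys = {song_key(song) for song in hot_songs}
    seen = set()
    result = []
    for song in candidates:
        key = song_key(song)
        if key in hot_keys or key in seen:
            continue
        seen.add(key)
        result.append(song)
    return result


# Toy data: placeholder names, not real chart entries.
hot_songs = [{"title": "Song A", "artist": "Artist X"}]
candidates = [
    {"title": " song a ", "artist": "ARTIST X"},  # same song, different formatting
    {"title": "Song B", "artist": "Artist Y"},
    {"title": "Song B", "artist": "Artist Y"},    # duplicate within the candidates
]
not_hot_songs = remove_hot_songs(candidates, hot_songs)
print(not_hot_songs)
```

Exact matching on a normalized key will still miss variants such as "feat." credits or remastered editions; whether to handle those is a judgment call for your own data.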
