Given a song, the goal is to find other songs that are similar to it.
The notion of ‘similarity’ is left to the user and depends on what they are looking for (for example the same artist, or similar attributes such as hotness or danceability).
There are some further requirements that must be taken into account:
- A user should be able to narrow the search by giving more specific information
- The response time should be short enough to feel interactive
- The data should be analyzed in a scalable way
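To make the idea of user-defined similarity concrete, here is a minimal sketch in Python. It assumes each song is a plain dictionary of numeric attributes, that the user expresses their notion of similarity as per-attribute weights, and that narrowing the search means requiring exact matches on some fields. The function name `similar_songs`, the sample songs and the attribute values are all hypothetical and are not taken from the data set.

```python
import heapq
import math


def similar_songs(query, songs, weights, filters=None, k=5):
    """Return the k songs closest to `query` under a weighted Euclidean distance.

    weights: {attribute: weight}, the user's chosen notion of similarity.
    filters: {field: required value}, optional constraints that narrow the search.
    """
    filters = filters or {}
    # Keep only songs that satisfy every filter, and never return the query itself.
    candidates = [
        s for s in songs
        if s is not query and all(s.get(f) == v for f, v in filters.items())
    ]

    def distance(song):
        return math.sqrt(
            sum(w * (song[a] - query[a]) ** 2 for a, w in weights.items())
        )

    return heapq.nsmallest(k, candidates, key=distance)


# Hypothetical sample data, for illustration only.
songs = [
    {"title": "A", "artist": "X", "hotness": 0.80, "danceability": 0.70},
    {"title": "B", "artist": "X", "hotness": 0.75, "danceability": 0.65},
    {"title": "C", "artist": "Y", "hotness": 0.20, "danceability": 0.90},
    {"title": "D", "artist": "Y", "hotness": 0.79, "danceability": 0.71},
]
query = songs[0]
weights = {"hotness": 1.0, "danceability": 1.0}

closest = similar_songs(query, songs, weights, k=2)
same_artist = similar_songs(query, songs, weights, filters={"artist": "X"}, k=2)
```

This sketch scans every candidate, so it only illustrates the first requirement. Meeting the response-time and scalability requirements on a million songs would need some form of indexing or distributed processing, which is not shown here.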
The data set used is the 'Million Song Dataset'. It contains metadata for around one million different songs and has a total size of roughly 300 GB.
It is stored in Hierarchical Data Format version 5 (HDF5), a storage format designed for multidimensional data and complex objects.
The data set does not contain any actual audio; instead it consists of song features and metadata collected and analyzed by The Echo Nest. Each HDF5 file stores the information for one song, as seen in the picture below.
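As a rough illustration of how such a per-song file can be inspected, the following Python sketch uses the third-party `h5py` library to list every dataset in an HDF5 file together with its shape and element type. It deliberately makes no assumption about the group or field names inside the file; the helper name `describe_hdf5` is hypothetical, and the path is supplied by the caller.

```python
import sys


def describe_hdf5(path):
    """Return {dataset_path: (shape, dtype_string)} for every dataset in the file."""
    # Third-party dependency (pip install h5py); imported here so that the
    # rest of the module can be loaded even where h5py is not installed.
    import h5py

    summary = {}

    def visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            summary[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return summary


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python describe_hdf5.py <path to one song file>
    for name, (shape, dtype) in describe_hdf5(sys.argv[1]).items():
        print(name, shape, dtype)
```

Running this on a single song file gives a quick overview of what is stored before deciding which fields to extract for the similarity search.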