pezon / music-mining

Datasets and analysis for recordings that have charted globally and been nominated for a Grammy.

Music Mining

This is the primary code repository of a project for MSCA 31008 Data Mining titled "How to Have a Number One Hit the Easy Way: Analysis of the music industry using data mining techniques" by Ben Ossyra, Dominique McBride, Peter Fuentes Rosa, and Peter Pezon. The accompanying final report and PowerPoint presentation are in this repository as well.

This project was inspired by "The Manual" by the KLF, a step-by-step guide to achieving a No. 1 single with no money or musical skills, and a case study of the duo's UK novelty pop No. 1 "Doctorin' the Tardis".

Installation

  1. Create virtual environment
  2. Install library dependencies
     pip install -r requirements.txt
  3. Run jupyter
     jupyter lab

Usage

Datasets of interest for analysing tracks:

  • data/tracks.pq (Parquet format; gzipped CSV also available)

Dataset of interest for analysing artists:

  • data/artist_summary.pq (Parquet format; gzipped CSV also available)

Artist summary variables are merged into tracks such that tracks have a snapshot of the artist's history at the time that the track was released. For example, artist summary has charting information from 2000-2021; when joined to tracks, if a track was released in 2017, only charting information for 2016 is considered.
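The snapshot join described above can be sketched in plain Python. This is only one reading of the rule: it assumes the snapshot covers all of an artist's charting years strictly before the track's release year. The field names (`artist_id`, `year`, `months_charted`, `release_year`) and the sample rows are hypothetical, not the repository's actual schema, and the notebooks may well do this with a dataframe merge instead.

```python
from collections import defaultdict


def snapshot_artist_history(tracks, artist_chart_years):
    """Attach to each track the artist's chart history before its release year.

    Assumes one row per artist per charting year (hypothetical layout).
    """
    history = defaultdict(list)
    for row in artist_chart_years:
        history[row["artist_id"]].append(row)

    enriched = []
    for track in tracks:
        # Keep only history the artist had accumulated before the release year.
        prior = [
            row for row in history.get(track["artist_id"], [])
            if row["year"] < track["release_year"]
        ]
        enriched.append({
            **track,
            "artist_prior_years_charted": len(prior),
            "artist_prior_months_charted": sum(r["months_charted"] for r in prior),
        })
    return enriched


# Made-up rows for illustration only.
artist_chart_years = [
    {"artist_id": "a1", "year": 2015, "months_charted": 3},
    {"artist_id": "a1", "year": 2016, "months_charted": 5},
    {"artist_id": "a1", "year": 2018, "months_charted": 2},
]
tracks = [
    {"track_id": "t1", "artist_id": "a1", "release_year": 2017},
    {"track_id": "t2", "artist_id": "a1", "release_year": 2015},
]
snapshots = snapshot_artist_history(tracks, artist_chart_years)
```

Filtering on the release year keeps later chart success (2018 in the sample rows) from leaking into features for a track released in 2017.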

Dataset

This is a compiled dataset of Spotify tracks and their audio features, with each track assigned a category for chart status, artist status, and recording award status. It is compiled from recordings nominated for a Grammy (n=~535), tracks that have entered the Chart2000 global aggregated monthly song chart (n=~3300), and a random sample (n=1000) from Spotify daily charts. The dataset is de-duplicated on track id and limited to the years 2000-2021. It is small (n=4237) but fairly balanced across the three categories.
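A minimal sketch of the combine, filter, and de-duplicate step follows. It rests on several assumptions: the year limit is applied to a hypothetical `release_year` field, the first source listed wins when a track id appears in more than one source, and the `source` tag and sample rows are made up for illustration. The actual notebooks may resolve duplicates differently.

```python
def combine_sources(sources, first_year=2000, last_year=2021):
    """Merge rows from several sources, keeping one row per track id.

    sources: list of (source_name, rows) pairs; earlier sources take
    priority on duplicate track ids (an assumption, not the repo's rule).
    """
    combined = {}
    for source_name, rows in sources:
        for row in rows:
            if not first_year <= row["release_year"] <= last_year:
                continue  # outside the 2000-2021 window
            # setdefault keeps the first row seen for each track id.
            combined.setdefault(row["track_id"], {**row, "source": source_name})
    return list(combined.values())


# Made-up rows for illustration only.
grammy_rows = [{"track_id": "t1", "release_year": 2010}]
chart_rows = [
    {"track_id": "t1", "release_year": 2010},  # duplicate of the Grammy row
    {"track_id": "t2", "release_year": 1999},  # dropped: before 2000
    {"track_id": "t3", "release_year": 2005},
]
spotify_rows = [{"track_id": "t4", "release_year": 2020}]

dataset = combine_sources([
    ("grammy", grammy_rows),
    ("chart2000", chart_rows),
    ("spotify_daily", spotify_rows),
])
```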

It’s not “big data,” but that makes for easier analysis. If we’d like to add more data, we can explore incorporating more songs from Spotify’s daily, weekly, and viral charts, or another global charts aggregator (https://tsort.info/). The limiting factor is the number of tracks nominated for a Grammy (535) and (~75). In the future, we could add data based on nominated albums, or other recordings by artists that have charted or been nominated earlier or later in their career. I think this is good for now though!

The goal of this dataset is to focus on the track audio features by creating three categories that summarize the details about the source datasets. They’re simple and non-exhaustive, and created to help merge and balance the data sources. There are opportunities to further develop these categories. So far, they are interesting and may lend themselves to natural clusters.

Data sources

Alternative data sources:

Chart2000:

  • Peak chart position
  • Average chart position
  • Median chart position
  • Number of months it has charted
  • Total revenue

MusicBrainz:

  • Grammy award nominations (category and year)
  • Musical genre
  • Number of releases
  • Recording label and publishers

Spotify:

  • Audio features
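As an illustration of the Chart2000 variables listed above, the sketch below derives peak, average, and median position and months charted from a song's monthly positions. It assumes one position per month on the chart, with 1 as the best position; the function name and sample positions are hypothetical, and total revenue is left out because it depends on a revenue column not shown here.

```python
from statistics import mean, median


def summarize_chart_run(positions):
    """Summarize one song's monthly chart positions (1 is the best position)."""
    if not positions:
        raise ValueError("positions must contain at least one month")
    return {
        "peak_position": min(positions),  # lowest number is the highest rank
        "avg_position": mean(positions),
        "median_position": median(positions),
        "months_charted": len(positions),
    }


# Made-up positions for illustration only.
summary = summarize_chart_run([5, 1, 3, 10])
```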

Collection methodology

Data collection pipeline

Generating datasets

To create datasets from scratch:

  1. Download Spotify charts dataset: https://www.kaggle.com/general/232036
  2. Download Chart2000 dataset: https://chart2000.com/data/chart2000-songmonth-0-3-0063.csv
  3. Run prepare_grammys.ipynb
  4. Run prepare_charts.ipynb (and update path to Chart2000 dataset)
  5. Run prepare_labels.ipynb (WARNING: SLOW. Only update if absolutely necessary. Recommend differential updates.)
  6. Run prepare_artists.ipynb (WARNING: SLOW. Recommend differential updates.)
  7. Run prepare_tracks.ipynb

When new tracks are added, we need to ingest additional artist and label data if it's not in our dataset. Add tracks to prepare_tracks, then run differential updates in prepare_labels (TBD) and prepare_artists.
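One way to think about a differential update is to fetch only the artists (or labels) that appear in the track list but are not yet in the existing dataset. The sketch below shows that selection step only. The helper name, the `artist_id` key, and the sample rows are hypothetical, and the slow fetch itself is left to the notebooks.

```python
def missing_ids(track_rows, known_ids, key):
    """Return ids referenced by tracks that are not in known_ids.

    Order of first appearance is preserved and each id is returned once.
    """
    seen = set(known_ids)
    missing = []
    for row in track_rows:
        value = row.get(key)
        if value is not None and value not in seen:
            seen.add(value)
            missing.append(value)
    return missing


# Made-up rows for illustration only.
new_tracks = [
    {"track_id": "t1", "artist_id": "a1"},
    {"track_id": "t2", "artist_id": "a9"},
    {"track_id": "t3", "artist_id": "a9"},
    {"track_id": "t4", "artist_id": "a7"},
]
already_ingested = {"a1", "a2"}
to_fetch = missing_ids(new_tracks, already_ingested, key="artist_id")
```

Only the ids in `to_fetch` would then need the slow lookup, rather than the full artist list.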

Limitations of dataset

The dataset is constructed only from Grammy-nominated songs, top-charting songs, and popular songs on Spotify's daily charts, so it does not represent tracks that never charted or were never nominated.

Notes about Grammy awards dataset

Collected from the MusicBrainz Grammy Awards event series and its subseries release groups (the Grammy Award series). MusicBrainz data is aggregated from Wikipedia.

Grammy nominees are found in the GRAMMY Nominees series.

Data are missing for minor award categories: in many cases, only the award winners are credited. Nominees are credited for major award categories, or for awards after 2010. Examples shown below.

Analysis

Some questions we can ask:

  • In the universe of charted songs, why did some songs become nominated?
  • Why were some songs awarded? What are factors for a recording to be nominated for an award?
  • Why did some songs stay at the top of the charts for so long, and others not?
  • What's important - musical characteristics, or the marketing and politics of the record label and publisher?
  • What are characteristics of independent artists vs major label artists?
  • What characteristics of an artist's recordings change after an artist signs to a major label?
  • Once an artist is signed, what happens - are they more likely to chart? Does their sound change? Do they continue to release recordings?
  • What are characteristics of nominated and un-nominated artists?
  • How does an artist's work change after they've been nominated?
  • What features of a song are similar across the top of the charts, and the bottom of the charts?
  • Can we predict whether an artist will be a one-hit-wonder or have a long career?
  • How do recordings cluster - by genre, top charting, major labels?
  • Are the factors involved for a song to be nominated and to be at the top of the chart the same?
  • What songs are similar to nominated or charting songs? What factors separate them? (e.g., experience, label status)
  • How much more likely is an artist to chart or be nominated if they've already won?

Hypotheses:

  • GRAMMY awards are all politics, rather than artist or track qualities.
  • Charting is all marketing budget.
  • Top charting songs all sound the same.
  • Top charting songs lead to Grammy nominations.
  • An artist who has previously won or charted is likely to chart again.
  • An artist who has previously been nominated without winning is likely to win.

Modeling

Modeling pipeline

We are creating a model for predicting whether a track will make it to the Global Top 50 charts using audio features.
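The text above does not say which model the project uses. As a stand-in, the sketch below fits a small logistic regression by batch gradient descent on two audio features. The feature values and labels are made up for illustration, `danceability` and `energy` are used only as example Spotify audio feature names, and the learning rate and epoch count are arbitrary choices rather than tuned settings.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, weights, bias):
    """Probability that a track with feature vector x reaches the chart."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit_logistic(X, y, lr=1.0, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(xi, weights, bias) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += error * xij
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Made-up [danceability, energy] rows; 1 = charted, 0 = did not chart.
X = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.7], [0.2, 0.3], [0.3, 0.1], [0.1, 0.2]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic(X, y)
p_high = predict_proba([0.9, 0.8], weights, bias)
p_low = predict_proba([0.1, 0.2], weights, bias)
```

On real data the model would be evaluated on held-out tracks rather than on the training rows used here.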

Prior work

Languages

Jupyter Notebook 99.7%, Python 0.3%