rodas-j / billboard-predictor



billboard-predictor

Predicting which songs will end up on the Billboard Hot 100. We apply a logistic regression model and compare it with a classification tree to evaluate our model's accuracy. Through the use of different models, we were also able to examine trends in our dataset and identify the optimal values for different audio features.

Background

In 2020, the music industry generated $23.1 billion in revenue, with streaming accounting for 56% of the total. Because of the industry's growing profits and the rise of streaming services, artists and record labels have looked to maximize their chances of making their songs popular. The research question we address here is: what are the most important factors that make a hit song, and, given these features, what is the probability that a song will end up in the Billboard Hot 100?

Data Collection and Cleaning

Data cleaning took the vast majority of the work. We realized that cleaning is essential for identifying and removing errors and duplicates; it improves the quality of the training data and helps increase our accuracy. We started with the datasets we found on TidyTuesday. The billboard dataset contained weekly Billboard Hot 100 records going back to the 1950s. The song features dataset included every song from the billboard dataset that is also on Spotify, along with a host of song features: danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, time_signature, spotify_track_popularity, and spotify_track_explicit. The song features dataset contained many NAs, which we confirmed were due to Spotify lacking information about those songs, so we omitted those entries. We then built a comparable song features list for songs that did not make the Billboard Hot 100, starting from an existing Spotify playlist of up to 5295 songs; in Appendix 1 we describe how we used the Spotify API to extract audio features for these songs. After binding the two datasets into one containing all of the songs, we randomly split it into training and test datasets.
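The sketch below shows roughly what this preparation pipeline looks like in Python (the repository is a Jupyter Notebook, so the actual code may differ). It assumes the 2021-09-14 TidyTuesday files and a hypothetical non_hits.csv holding the audio features pulled from the Spotify API for the non-charting playlist, with the same column names as the TidyTuesday file.

```python
# Minimal sketch of the data-preparation steps described above.
# Assumptions: the 2021-09-14 TidyTuesday week, and a pre-built "non_hits.csv"
# (hypothetical filename) with Spotify audio features for songs that never charted.
import pandas as pd
from sklearn.model_selection import train_test_split

base = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14"
billboard = pd.read_csv(f"{base}/billboard.csv")       # weekly Hot 100 entries since the 1950s
audio = pd.read_csv(f"{base}/audio_features.csv")      # Spotify audio features for those songs

features = ["danceability", "energy", "key", "loudness", "mode", "speechiness",
            "acousticness", "instrumentalness", "liveness", "valence", "tempo",
            "time_signature", "spotify_track_popularity", "spotify_track_explicit"]

# Drop songs Spotify has no information for (the source of the NAs we observed).
audio = audio.dropna(subset=features)
audio["hit"] = 1

# Songs that never charted, collected from the ~5295-song playlist (see Appendix 1).
non_hits = pd.read_csv("non_hits.csv")
non_hits["hit"] = 0

# Bind the two feature tables and split randomly into training and test sets.
songs = pd.concat([audio[features + ["hit"]], non_hits[features + ["hit"]]],
                  ignore_index=True)
train, test = train_test_split(songs, test_size=0.3, random_state=42,
                               stratify=songs["hit"])
```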

EDA

After collecting and cleaning the data, we moved on to data exploration by looking at trends in our dataset. Our first set of plots shows how music trends change over the years from 1950 to 2020, specifically for danceability. The y-axis is Spotify popularity, which according to Spotify is based on "the total number of plays the track has had and how recent those plays are" (Pecker). The value ranges from 0 to 100, with 100 being the most popular. The 2020 graph shows that more danceable songs have higher Spotify popularity, possibly because of platforms such as TikTok and Instagram: these days, danceable songs get more people listening and dancing to them, and are thus more popular.
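A comparison like this could be drawn as follows; this is only a sketch, assuming the `songs` DataFrame from the previous snippet plus a hypothetical `year` column recording each song's first chart year.

```python
# Danceability vs. Spotify popularity across time windows (illustrative sketch).
import matplotlib.pyplot as plt

panels = [1960, 1980, 2000, 2020]                      # end year of each window shown
fig, axes = plt.subplots(1, len(panels), figsize=(16, 4), sharey=True)
for ax, end in zip(axes, panels):
    window = songs[(songs["year"] > end - 20) & (songs["year"] <= end)]
    ax.scatter(window["danceability"], window["spotify_track_popularity"],
               s=5, alpha=0.4)
    ax.set_title(f"{end - 19}-{end}")
    ax.set_xlabel("Danceability")
axes[0].set_ylabel("Spotify popularity (0-100)")
plt.tight_layout()
plt.show()
```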

We were also interested in the distribution of hit songs across other features. The next graphs show the most common values of tempo, speechiness, loudness, and key among songs that ended up on Billboard, which is plotted on the y-axis. For tempo there were two peaks, at about 100 bpm (beats per minute) and 135 bpm. We also discovered that most songs that end up on Billboard have low speechiness, while most have high loudness. For the songs that made the Billboard Hot 100, we looked at the mean values of the audio features identified above, and the results were fairly reasonable: tempo was about 120.28 bpm, loudness was 8.405, speechiness was 0.0686, and the estimated overall key was 5. This suggests most songs on Billboard are fast paced, loud, and in a major key.

We also conducted variable selection using forward selection, backward elimination, and forward-backward (stepwise) selection on our training dataset. All three procedures eliminated instrumentalness and liveness and kept only the following audio features: loudness, Spotify track popularity, Spotify track explicit, valence, danceability, mode, acousticness, tempo, speechiness, key, time signature, and energy.

Using these variables, we built a logistic regression model. The estimated coefficients show a positive correlation between loudness and the outcome (the overall loudness of a track is measured in decibels (dB) and typically ranges between -60 and 0 dB). All of the p-values are less than 0.05, which implies that the selected audio features are statistically significant. We then compared the logistic regression model with a classification tree. The decision tree reveals the variables deemed most important by the classification algorithm: spotify_track_explicit and spotify_track_popularity. We can infer that most songs on the Billboard chart are popular on Spotify and are not explicit. All other observations are distributed among the other leaves, which use loudness, acousticness, valence, danceability, mode, and speechiness to predict whether a song ended up in the Billboard Hot 100 or not.
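A hedged sketch of this modelling step is shown below. scikit-learn's SequentialFeatureSelector stands in for the stepwise selection we describe (not necessarily the exact procedure used in the notebook), statsmodels supplies the logistic-regression coefficients and p-values, and a DecisionTreeClassifier plays the role of the classification tree. `train` and `features` come from the earlier data-preparation sketch.

```python
# Variable selection, logistic regression, and classification tree (sketch).
import statsmodels.api as sm
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = train[features], train["hit"]

# Forward selection; backward elimination is the same call with direction="backward".
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=12, direction="forward")
selector.fit(X_train, y_train)
selected = list(X_train.columns[selector.get_support()])

# Logistic regression on the selected features (statsmodels reports p-values).
logit = sm.Logit(y_train, sm.add_constant(X_train[selected].astype(float))).fit()
print(logit.summary())

# Classification tree on the same features for comparison.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train[selected], y_train)
```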

We trained our two models, the classification tree and the logistic regression model, on the training dataset and then created ROC curves using the test dataset. The ROC curve on the left shows that the logistic regression model is the better model for predicting the likelihood of a song ending up on the chart: it correctly predicts the songs that end up on Billboard 96% of the time.
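The comparison can be reproduced roughly as follows, assuming the fitted `logit` and `tree` models and the `test` split from the earlier sketches.

```python
# ROC comparison of the two models on the held-out test set (sketch).
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.metrics import roc_curve, roc_auc_score

X_test, y_test = test[selected].astype(float), test["hit"]

logit_probs = logit.predict(sm.add_constant(X_test))      # P(hit) from logistic regression
tree_probs = tree.predict_proba(test[selected])[:, 1]     # P(hit) from the classification tree

for name, probs in [("Logistic regression", logit_probs),
                    ("Classification tree", tree_probs)]:
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, probs):.2f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")    # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```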

Discussion

In summary, we were able to produce a ranked list of 12 audio features that are the most important predictors of what makes a song popular. Using these features, the logistic regression model is the best model for predicting the likelihood of a song ending up in the Billboard Hot 100. It is important to note, however, that the logistic regression model does not reach a 96.7% true positive rate until a 43.2% false positive rate, as the ROC curve shows. This is likely because of the limitations we discuss next.

The biggest critique of our model is that it does not use an artist's popularity as a predictor of whether a song will end up in the Billboard Hot 100. An artist's popularity is potentially an essential predictor because more famous artists often have more famous songs. For instance, if an amateur musician and a more established musician each release a song with the exact same audio features, our model will wrongly predict that both songs have an equal likelihood of ending up in the Billboard Hot 100. We could develop a measure of artist popularity by looking at how many streams an artist's songs have had in the past or how many of their songs have previously reached the Billboard Hot 100, which would help differentiate amateur musicians from more popular ones.

Secondly, as we saw in the data exploration section, trends in what makes a song popular change over time, but our models fail to account for these changes. At the moment, our model grants equal weight to old and new songs and therefore assumes that the audio features of old songs are just as good predictors of popularity as those of newer songs. To fix this, we could weigh newer songs more heavily in the dataset so that our model accounts for recent trends in music taste. Another approach would be to include more new songs than old songs in the dataset, so that the prediction is driven more heavily by recent trends.
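One way the recency weighting could be implemented is sketched below; it assumes the hypothetical `year` column mentioned earlier and uses an exponential decay with a 10-year half-life, which is an illustrative choice rather than something tuned in this project.

```python
# Weigh newer songs more heavily when fitting the logistic regression (sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

half_life = 10                                            # years until a song's weight halves
age = train["year"].max() - train["year"]                 # "year" is a hypothetical column
weights = np.power(0.5, age / half_life)

weighted_model = LogisticRegression(max_iter=1000)
weighted_model.fit(train[selected], train["hit"], sample_weight=weights)
```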
