raffg / spotify_analysis

Digging into Spotify's valence score

What's the Most Wonderful Time of the Year? Hint: It's not what The Economist says

Digging into Spotify's Valence score

Photo by Marcela Laskoski on Unsplash

In the Feb 8th, 2020 edition of The Economist, the Graphic Detail section briefly discussed an analysis of Spotify data suggesting that July is the happiest month on average (Sad songs say so much: Data from Spotify suggest that listeners are gloomiest in February). I attempted to duplicate their study and came to some different conclusions, and made some new discoveries along the way.

The Data

There are two sources of data needed for such an analysis. The first is the Top 200 most streamed songs, by day and by country, which Spotify makes available on spotifycharts.com. Because I didn't want to select each individual country and date from drop-down menus and manually download the almost-70,000 daily chart csv files, I built a scraper to do it all for me.
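To give a sense of what such a scraper has to enumerate, here is a minimal sketch that only builds the download URLs, one per country and day. The URL pattern and the two country codes are assumptions for illustration, not taken from the scraper in this repository, and the site may not serve these files in this form.

```python
from datetime import date, timedelta

# Assumed URL pattern for a daily regional chart CSV (not confirmed by the article).
BASE_URL = "https://spotifycharts.com/regional/{country}/daily/{day}/download"


def daily_chart_urls(countries, start, end):
    """Yield (country, day, url) for every country and every day in [start, end]."""
    n_days = (end - start).days + 1
    for country in countries:
        for offset in range(n_days):
            day = start + timedelta(days=offset)
            url = BASE_URL.format(country=country, day=day.isoformat())
            yield country, day, url


# Hypothetical example: two countries over three days gives six files to fetch.
example_urls = list(daily_chart_urls(["us", "nz"], date(2017, 1, 1), date(2017, 1, 3)))
```

Each URL could then be fetched and saved to disk, ideally with a pause between requests.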

The second set of necessary data is the valence score for each song in those charts. Spotify makes this data available via their developer API. To access it, you'll need to sign up for developer credentials. Thankfully, Spotify doesn't make this too difficult, so you shouldn't have any problems. I built a second scraper which goes through each unique song from the Top 200 charts and downloads its feature vector. There are several features available here, but what The Economist used was the valence score, a decimal between 0 and 1 which describes the "happiness" of the song.
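Fetching features for tens of thousands of songs one at a time would be slow, so batching the requests helps. The sketch below only prepares the requests and sends nothing. It assumes the Web API's batch audio-features endpoint and a limit of 100 track IDs per call; the helper names, the placeholder IDs, and the token value are hypothetical.

```python
from urllib.parse import urlencode

AUDIO_FEATURES_URL = "https://api.spotify.com/v1/audio-features"
MAX_IDS_PER_REQUEST = 100  # assumed batch limit for this endpoint


def batched(ids, size=MAX_IDS_PER_REQUEST):
    """Split a list of track IDs into consecutive chunks of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def audio_features_requests(track_ids, token):
    """Build one (url, headers) pair per batch, without sending anything."""
    unique_ids = list(dict.fromkeys(track_ids))  # drop duplicates, keep order
    headers = {"Authorization": f"Bearer {token}"}
    return [
        (f"{AUDIO_FEATURES_URL}?{urlencode({'ids': ','.join(chunk)})}", headers)
        for chunk in batched(unique_ids)
    ]


# Hypothetical IDs: 250 unique tracks plus one repeat become three requests.
example_requests = audio_features_requests(
    [f"id{i}" for i in range(250)] + ["id0"], token="YOUR_TOKEN"
)
```

Each response would then be parsed for the per-track feature values, including the valence score.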

This score was originally developed by a music intelligence and data platform called Echo Nest, which was acquired by Spotify in 2014. A (now dead, but available via the Wayback Machine) blog post only has this to say about the score:

We have a music expert classify some sample songs by valence, then use machine-learning to extend those rules to all of the rest of the music in the world, fine tuning as we go.

Other features available via the API include tempo, energy, key, mode, and danceability, among others, and it has been speculated that these features play a role in the valence score. At any rate, exactly how the valence score is computed is a bit of a black box, but the scores do seem to match up with the songs very well. However, as the training data is highly likely to favor popular music, I wonder if classical, jazz, or non-Western musical styles are not scored as accurately.

The Analysis

My analysis showed a very similar distribution of data compared with The Economist's analysis:

Economist histogram source: https://www.economist.com/graphic-detail/2020/02/08/data-from-spotify-suggest-that-listeners-are-gloomiest-in-february

Mood distribution

Despite the different appearances of our two charts, the shapes of the two kernel density estimates are very similar (if you're not familiar, a kernel density estimate, or KDE, is pretty much just a smoothed histogram scaled so that the area under the curve equals 1). The locations of those key songs along the valence axis also match up. You can see that, on average, Brazilians listen to "happier" music than the rest of the world, while listeners in the United States stream music which on average is less happy than the rest of the world. As we'll see, the music of Latin America typically scores very high in valence.
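For readers who want to see the "smoothed histogram" idea concretely, here is a small sketch of a Gaussian KDE written from scratch, with a numerical check that the curve encloses an area of about 1. The valence values and the bandwidth are made up for illustration. This is not the plotting code behind the charts above, which more likely relied on a library routine.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one bell curve per sample, each centred on that sample.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def integrate(f, lo, hi, steps=2000):
    """Trapezoidal rule, used here only to check the area under the curve."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, steps))
    return total * h


# Hypothetical valence scores, not taken from the real charts.
valences = [0.12, 0.35, 0.41, 0.48, 0.52, 0.55, 0.63, 0.71, 0.88]
density = gaussian_kde(valences, bandwidth=0.08)
area = integrate(density, -1.0, 2.0)  # wide range so the tails are included
```

Note that a plain Gaussian KDE lets a little probability mass spill outside the 0 to 1 range that valence can actually take.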

It's the second chart from The Economist where I see some key differences.

Economist chart source: https://www.economist.com/graphic-detail/2020/02/08/data-from-spotify-suggest-that-listeners-are-gloomiest-in-february

First, let's look at that ten-day moving average chart in the upper right corner. It finds that February displays the lowest average valence and July the highest. Here is what I found:

Mood by month

In my analysis, December has the highest average valence, followed by August, and then in third place is July. I initially found some very different average valence scores for February as well, and so investigated why our data sets could be different. The Economist used data from January 1st, 2017 (the earliest available on Spotify Charts) until January 29th, 2020 (presumably, when they performed their scraping). I had all of that data, plus almost all of February 2020 as well. Without aggregating by month, as the above plot does, and also not performing a moving average, I saw a much wider variance in the same month from year to year than I expected:

Mood across time
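The two views compared here, a month-by-month aggregate and a ten-day moving average, can be sketched as follows. This assumes the input is already one average valence value per day and uses a trailing window and unweighted means. The article does not say whether The Economist's window was trailing or centered, or whether streams were used as weights, so treat those choices as assumptions.

```python
from collections import defaultdict
from datetime import date


def monthly_mean(daily):
    """Average a {date: value} mapping by calendar month, pooling all years."""
    buckets = defaultdict(list)
    for day, value in daily.items():
        buckets[day.month].append(value)
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}


def moving_average(values, window=10):
    """Trailing moving average; positions without a full window are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out


# Hypothetical daily averages, not real chart data.
daily = {date(2017, 1, 1): 0.4, date(2018, 1, 1): 0.6, date(2017, 2, 1): 0.5}
by_month = monthly_mean(daily)
smoothed = moving_average([1, 2, 3, 4], window=2)
```

Pooling all years into one bucket per month is exactly what hides the year-to-year variance visible in the unaggregated plot.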

In any given year, I did see December as the highest. However, in 2018 (a particularly sad year?) the summer featured lower valence scores than February. Additionally, the inclusion of 2020 data, which is much higher than the previous years, acted to inflate February's average across time. Therefore, I chose to exclude 2020 data. Compare these two charts, one with 2020 included and the other without:

Mood by country

Above: including Jan/Feb 2020

Below: excluding Jan/Feb 2020

Mood by country

This doesn't change the charts a great deal. However, one key point The Economist noted in their chart was that New Zealand, in the southern hemisphere, also experiences a dip in valence in February despite its summers and winters being reversed relative to the northern hemisphere. When I include February 2020, I see the opposite of what The Economist noticed; when I exclude it, I do see the dip, though it is much less pronounced than what The Economist saw. Following The Economist's convention of including all available data, I would include February 2020, and because 2020 was so much happier than previous years, doing so supports the opposite conclusion. Where to draw the line does unfortunately seem somewhat arbitrary to me - what do we call an appropriate cutoff point? Including an equal number of months from each year seems reasonable to me though, so excluding the 2020 data is slightly less arbitrary. I'll be curious to rerun this analysis at the end of the year and see what it looks like then.
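The effect of a partial year on a pooled monthly average is easy to reproduce with toy numbers. In the sketch below, every value is invented for illustration: February is slightly below July in three full years, and one unusually high extra February is enough to flip the ranking.

```python
def month_mean(records, month, exclude_years=()):
    """Mean value for one calendar month across years, optionally skipping years."""
    values = [
        value
        for (year, m), value in records.items()
        if m == month and year not in exclude_years
    ]
    return sum(values) / len(values)


# Hypothetical monthly valence averages keyed by (year, month); not real data.
records = {
    (2017, 2): 0.48, (2017, 7): 0.49,
    (2018, 2): 0.48, (2018, 7): 0.49,
    (2019, 2): 0.48, (2019, 7): 0.49,
    (2020, 2): 0.56,  # partial year: February exists, July does not yet
}

feb_all = month_mean(records, 2)                          # about 0.50
feb_full_years = month_mean(records, 2, exclude_years={2020})  # about 0.48
july = month_mean(records, 7)                             # about 0.49
```

With the partial year included, February's pooled mean sits above July's; with only complete years, it sits below.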

My key finding, which is most certainly different from what The Economist found, is that December is the happiest month, not July.

I also sorted the countries into continents to look at wider trends. As The Economist found, Latin American countries do indeed stream much happier music than the rest of the world. Also of note: on every continent except Europe, December is the happiest month, and February is the saddest everywhere except Africa and Australia:

Mood by continent

Additionally, I looked at mood by day. As I had expected, I found Saturday to be the happiest.

Mood by day
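A day-of-week breakdown like this one comes down to grouping the daily averages by weekday. Here is a minimal sketch, again assuming one unweighted average valence per date; the values are hypothetical.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def weekday_mean(daily):
    """Average a {date: value} mapping by day of week, in Monday-to-Sunday order."""
    buckets = defaultdict(list)
    for day, value in daily.items():
        buckets[DAY_NAMES[day.weekday()]].append(value)  # weekday(): Monday is 0
    return {
        name: sum(buckets[name]) / len(buckets[name])
        for name in DAY_NAMES
        if name in buckets
    }


# Hypothetical daily averages: two Saturdays and one Monday in February 2020.
by_weekday = weekday_mean({
    date(2020, 2, 1): 0.6,
    date(2020, 2, 8): 0.8,
    date(2020, 2, 3): 0.4,
})
```

Running the same function on one country's rows at a time would give the per-country view.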

I also looked at this chart for just the US and New Zealand. I found Friday to be the saddest day in the United States and Sunday to be the happiest. Any theories why this may be? New Zealand exhibits the behavior I'd probably most expect, with Monday being the saddest and Saturday the happiest.

Mood by day Mood by day

Finally, I did spend a bit of time looking into the other features available via the Spotify API and made one last chart, danceability for each country:

Danceability by country

Here, I found Fridays to feature more danceable music than Mondays, as I was expecting. Furthermore, the ordering of the countries shifts around a lot from the valence chart. For instance, the United States is at the sadder end of the valence spectrum but at the more danceable end in this chart. I also noted that the lowest countries as ranked by danceability are all Asian countries with traditional music styles which do not fit the "standard" western 12-tone, 4-beats-to-a-bar system. I wonder if the algorithm has a tough time predicting danceability when the structure is so different than the majority of training samples.

I also noted that on Sundays in the Netherlands, the danceability score plummets! Norway and Sweden experience the same phenomenon, although to a lesser extent. Religious Sundays may be one explanation for this result, although the Netherlands, Norway, and Sweden don't seem to me to be particularly more religious than many other countries which don't exhibit this behavior.

Just for fun, I looked up the saddest song I know (tragically missing from the Top 200 charts), the 3rd symphony by Henryk Górecki (titled, very appropriately, Symphony of Sorrowful Songs). The 2nd movement features a valence score of just 0.0280, well below Adele's Make You Feel My Love at 0.0896. This score would place it second-to-last on the valence chart for all 68,000+ songs in the Top 200, just slightly above Tool's Legion Inoculant, the saddest song in the charts with a valence of 0.0262 (although, in my opinion, this composition by Tool really stretches the definition of "song" beyond reasonableness).

I also looked up Pharrell Williams' Happy, expecting to find a peak score, and was disappointed to see that it's "only" 0.9620. Compare that to the almost comically happy September by Earth, Wind & Fire, with a valence score of 0.982 (the highest in the charts).

The top 10 happiest songs in the United States' top 200 rankings:

  1. Earth, Wind & Fire - September
  2. Gene Autry - Here Comes Santa Claus (Right Down Santa Claus Lane)
  3. The Beach Boys - Little Saint Nick - 1991 Remix
  4. Logic - Indica Badu
  5. Chuck Berry - Johnny B. Goode
  6. Shawn Mendes - There's Nothing Holdin' Me Back
  7. Foster The People - Pumped Up Kicks
  8. Tom Petty - I Won't Back Down
  9. OutKast - Hey Ya!
  10. Aretha Franklin - Respect

And the top 10 saddest songs:

  1. TOOL - Legion Inoculant
  2. Joji - I'LL SEE YOU IN 40
  3. Trippie Redd - RMP
  4. Drake - Days in The East
  5. Drake - Jaded
  6. Lil Uzi Vert - Two®
  7. TOOL - Litanie contre la Peur
  8. Russ - Cherry Hill
  9. 2 Chainz - Whip (feat. Travis Scott)
  10. Rae Sremmurd - Bedtime Stories (feat. The Weeknd) - From SR3MM

So, it seems that Andy Williams is correct: December actually is the Most Wonderful Time of the Year (valence score: 0.7240).
