Quantifying Negativity in Music, Part 1: The Code

noicepollution
14 min read · May 20, 2021

The first in a series of articles about a music data science project involving APIs and sentiment analysis.

Introduction

A long time ago, on a particularly gloomy night, I was listening to the whole of Wish You Were Here by Pink Floyd. Throughout the album, which I’d say is pretty sad and depressing in its own right, I started thinking about its contents and how it explores so many negative emotions. This led to a lot of thought about the concept of negativity in music, both in how a song sounds and in the themes its lyrics explore.

Pink Floyd

The more I let this train of thought run, the more I wondered about comparisons. So many artists, so many albums, so many songs, over so much time. I have no musical background or skill whatsoever. The biggest achievement I have in terms of music is that I once took a test to see if I’m tone-deaf, and it turns out I’m not.

Since I had no musical skill or understanding, I had only one thing to rely on: programming. And this led to the question: Is there a way to calculate the negativity of an artist and their work using code?

Spotify Web API

Fueled by unemployment, overwhelming boredom, and this unshakeable thought in my head, I set to work. Within no time, I came across Spotify Web API.

God bless them.

Spotify has a very handy Web API that you can use to fetch data from their databases. Of particular interest is the AudioFeaturesObject, which contains a bunch of very interesting keys, each referring to a feature of the track. For example, the key danceability scores the track on a range of 0 to 1 based on how suitable it is for dancing. Similarly, they have other metrics, most of them self-explanatory, like instrumentalness, energy, acousticness, etc.
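To give a feel for the shape of the object, here is roughly what an AudioFeaturesObject looks like. The keys are real ones from the API; the values below are invented for illustration:

```python
# Illustrative AudioFeaturesObject; the keys are real API fields,
# but the values here are made up, not real Spotify data.
audio_features = {
    "danceability": 0.52,      # how suitable the track is for dancing (0-1)
    "energy": 0.43,            # perceptual intensity and activity (0-1)
    "acousticness": 0.71,      # confidence the track is acoustic (0-1)
    "instrumentalness": 0.88,  # likelihood the track has no vocals (0-1)
    "valence": 0.09,           # musical positiveness of the track (0-1)
    "tempo": 68.0,             # estimated tempo in beats per minute
    "duration_ms": 230000,     # track length in milliseconds
}

# Each 0-to-1 metric can be read straight out of the object:
print(audio_features["valence"])
```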

However, the one that suited my requirement was something called valence.

What, according to Spotify, is valence?

Valence is a key that measures how positive a track sounds on a scale of 0.0 to 1.0: tracks with low valence sound more negative, while tracks with high valence sound more positive.

Perfect. This is everything I need to get started with.

However, here was the first obstacle: I did not know how to use APIs. I had never worked with one before, and now I needed to learn. In my quest to find a solution, I came across CodingEntrepreneurs’ channel on YouTube. Their explanation of using APIs is simple and easy to follow, and it helped me quite a bit. I followed their Spotify API code and used the basics of it for my purposes.
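For context, the part I reused boils down to Spotify’s client-credentials flow: base64-encode client_id:client_secret, POST it to the token endpoint, and use the returned access token on every later request. A rough sketch (the function names here are mine, not from the tutorial):

```python
import base64
import requests

TOKEN_URL = "https://accounts.spotify.com/api/token"

def build_auth_request(client_id, client_secret):
    """Build the headers and body for Spotify's client-credentials token request."""
    creds = f"{client_id}:{client_secret}".encode()
    headers = {"Authorization": "Basic " + base64.b64encode(creds).decode()}
    data = {"grant_type": "client_credentials"}
    return headers, data

def get_access_token(client_id, client_secret):
    """POST the credentials and return the bearer token for later requests."""
    headers, data = build_auth_request(client_id, client_secret)
    response = requests.post(TOKEN_URL, headers=headers, data=data)
    response.raise_for_status()
    return response.json()["access_token"]
```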

Once I figured out how to connect to Spotify, it was time for me to begin extracting data. The basic requirements I had at the beginning were as follows:

  • Fetch an album’s data
  • Fetch the list of tracks for an album
  • Fetch the audio features information for a given list of tracks

Fetching data from Spotify is done by running a search query through the provided API. Every query returns its result in JSON (JavaScript Object Notation) format, which represents the result as key-value pairs. For example, there could be a key called “album_name” whose value is the name of the album, like “The Wall”. Returning results as JSON breaks down the information about the entity being searched for in an orderly way, which makes the data easier to parse and process.

And so I sat down, analyzed the JSON that the search query returns for each type of search, and wrote a bunch of functions that do the aforementioned things. The code isn’t exactly pretty but hey, what the hell, it works and I needed something that worked.
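The general shape of those functions was roughly as follows. This is a sketch rather than my exact code: the helper names are illustrative, though the endpoint paths are the real Spotify Web API ones, and the small parsing helper shows the kind of key-value extraction described above:

```python
import requests

BASE_URL = "https://api.spotify.com/v1"

def spotify_get(endpoint, access_token, params=None):
    """Authenticated GET against the Spotify Web API, returning parsed JSON."""
    headers = {"Authorization": f"Bearer {access_token}"}
    response = requests.get(f"{BASE_URL}/{endpoint}", headers=headers, params=params)
    response.raise_for_status()
    return response.json()

def track_names(tracks_json):
    """Pull (id, name) pairs out of an album-tracks JSON payload."""
    return [(item["id"], item["name"]) for item in tracks_json["items"]]

def get_album(album_id, token):
    """Fetch an album's data."""
    return spotify_get(f"albums/{album_id}", token)

def get_album_tracks(album_id, token):
    """Fetch the list of tracks for an album."""
    return track_names(spotify_get(f"albums/{album_id}/tracks", token))

def get_audio_features(track_ids, token):
    """Fetch audio features for a list of tracks (up to 100 ids per call)."""
    data = spotify_get("audio-features", token, params={"ids": ",".join(track_ids)})
    return data["audio_features"]
```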

Lyrics from Genius API and Sentiment Analysis

A song is primarily composed of two things: music and lyrics. Spotify Web API was a one-stop solution for music by scoring each track on multiple metrics. But what about the lyrics?

I was gearing for another round of dealing with APIs when I saw that Genius offered an API for developers. However, I didn’t have to take that route because I found the simpler way out: the lyricsgenius package. This package did exactly what I needed it to do: given a track name and an artist, scrape the track lyrics from Genius.

Beautiful. Import the package, create a lyricsgenius object, use it to search the song on the Genius API, fetch the lyrics, and store the lyrics. Bam. Easy-peasy. So I write a class that deals with the different operations related to the lyrics, write the methods, and that should do the trick.

import lyricsgenius
from nltk.corpus import stopwords

# (methods of the lyrics-handling class)
def __init__(self, genius_id, *args, **kwargs):
    """Initialize the Genius client and the English stopword set"""
    super().__init__(*args, **kwargs)
    self.genius_id = genius_id
    self.genius_obj = lyricsgenius.Genius(self.genius_id)
    self.stop_words = set(stopwords.words('english'))

def get_song_lyric(self, song_title, artist_title):
    """Fetch a track's lyrics from Genius, or None if the song isn't found"""
    song = self.genius_obj.search_song(song_title, artist_title)
    if song is None:
        return None
    return song.lyrics

Actually, not exactly easy-peasy. There’s more work to be done here.

You see, lyrics from Genius come annotated: they tell you which part of the song is the chorus, the number of the upcoming verse, who performs each verse, chorus, or refrain, whether the next section is just an instrumental or a guitar section, and other things like that. None of this is part of the actual song lyrics; it is information added by Genius, and for the most part it is enclosed within square brackets ([]). And so, everything in square brackets needed to be removed.

The more I looked at different song lyrics, the more apparent the need for standardization became. I needed to generate tokens if I wanted to move on to the analysis part. Here, the word “token” refers to a single word in the lyrics; a sentence, or a line, is made up of a bunch of tokens. Standardizing before generating the tokens was imperative: it establishes a common format for all the tokens, making it simpler to deal with uncommon issues, like random uppercase letters in the middle of a token, without much hassle.

And so, I identified a bunch of things to be done.

  1. Lowercase the lyrics: Case-sensitivity is a no-no. Making everything lowercase will make it easier for us to identify unique words.
  2. Newline characters: Replace the newline characters with a space. The formatting of the song does not matter as long as the words can be separated, and in our case, we’re using a single space as the delimiter.
  3. Special characters: Next, replace some specific special characters, like the typographic apostrophe (’), with the plain apostrophe (').
  4. Square brackets: Time to remove the content that comes inside square brackets. Create a regex (regular expression) pattern that looks for an open square bracket and its corresponding closing bracket, and removes the brackets along with the information present within them.
  5. Consecutive whitespaces: If there is more than one line break in a row, the words on either side will be separated by more than one whitespace character. To correct this, we’ll replace any occurrence of one or more whitespace characters with a single space. Just to be safe, you know?
  6. Stopwords: Now, here comes the important part. Stopwords are the most commonly used words in a language. For example, words like “a”, “an”, “the”, “them”, “you”, “him”, etc. all count as stopwords. We will remove them from our standardized lyrics string and store the result as a list of tokens, none of which are stopwords.
import re
import string

# (another method of the lyrics-handling class)
def generate_tokens(self, lyrics):
    """Standardize the lyrics and break them into tokens"""
    # Standardizing the lyrics
    lyrics_updated = re.sub(r'\n', ' ', lyrics.lower())
    lyrics_updated = re.sub(r'’', "'", lyrics_updated)
    lyrics_updated = re.sub(r"\[(\w|\s|\:|\,|\'|\"|\.|\;|\(|\)|\?|\&|\{|\}|\-)*\]", "", lyrics_updated)
    lyrics_updated = re.sub(r'\s+', ' ', lyrics_updated)
    lyrics_updated = lyrics_updated.strip()
    # Removing stopwords, then stripping punctuation from each remaining token
    exclude = set(string.punctuation)
    filtered_words = [w for w in lyrics_updated.split() if w not in self.stop_words]
    final_tokens = ["".join(ch for ch in word if ch not in exclude) for word in filtered_words]
    # Drop tokens that were pure punctuation
    return [token for token in final_tokens if token]

Now we have a standardized set of words with the stopwords removed. What next?

Well, we still haven’t gotten around to doing what we originally intended to do with the lyrics: using it to quantify negativity or negative emotions. How do we accomplish this?

Simple. NRCLex.

NRCLex has a massive corpus of words, each tagged with one or more of the ten emotional affects being measured: five positive emotions (trust, surprise, positive, anticipation, joy) and five negative emotions (sadness, fear, negative, disgust, anger).

Since only negative emotions are taken into consideration here, we’ll consider only those words that have at least one negative emotion tagged to them. Once that is done, the next step is to calculate the ratio of negative words to the total number of tokens. This value will always be between 0 and 1, which will be useful to us later.

from nrclex import NRCLex

# (another method of the lyrics-handling class)
def nrclex_negative_lyric_ratio(self, lyric_tokens):
    """Function to sentiment analyze the lyrics based on negative emotions"""
    if not lyric_tokens:
        return 0.0  # guard against instrumental tracks with no tokens
    ctr = 0
    for token in lyric_tokens:
        emotion = NRCLex(token)
        emotion_ratings = [e for e in emotion.top_emotions if e[1] > 0.0]
        # Check if the token has sentiment associated with it
        if len(emotion_ratings) > 0:
            emotions_list = ['sadness', 'fear', 'anger', 'negative', 'disgust']
            # Check if at least one of the sentiments is negative
            if set(emotions_list).intersection(e[0] for e in emotion_ratings):
                ctr += 1
    # Calculate ratio of negative-emotion tokens to total number of tokens
    neg_percent = ctr / len(lyric_tokens)
    return neg_percent

Wikipedia and BeautifulSoup

So, this whole thing started off as measuring an artist’s sadness, right? And these artists have a bunch of albums released under their name. Well, I quickly realized that it would be too monotonous to go to each artist’s Wikipedia page, find the list of their studio albums, and manually put that into the code to fetch the albums’ data. This was too much; I needed another way to make the whole process easier.

BeautifulSoup to the rescue. It is a package that makes it possible to extract data from HTML and XML pages. Instead of going to the Wikipedia page and copy-pasting the names of the albums, I wrote some code that takes an artist’s Wikipedia link, fetches their discography, and makes a list out of it. It looks for the element with the ID “Discography” on the page and extracts the entries under the unordered list that follows, which is the artist’s list of albums. To maintain standardization, the code leaves out the year that appears at the end of each list entry.

So, I created a class called WikipediaScraper and wrote a method that extracts the discography of each artist whose Wikipedia link is listed in the file fl_nm, which is passed to the class constructor.

import requests
from bs4 import BeautifulSoup

class WikipediaScraper:
    def __init__(self, fl_nm):
        super().__init__()
        self.list_links = [i.strip() for i in open(fl_nm).readlines()]

    def discography_extractor(self):
        """Function to scrape and store artists' discographies from their Wikipedia pages"""
        complete_dict = {}
        artist_ctr, alb_ctr = 0, 0
        for band in self.list_links:
            response = requests.get(url=band)
            soup = BeautifulSoup(response.content, 'html.parser')
            band_name = soup.find(id="firstHeading")
            title = soup.find(id='Discography')
            albums = []
            discog_list = title.find_next('ul')
            for i in discog_list:
                if len(i) > 1:
                    # Album titles are italicized, so the <i> tag holds the
                    # title without the trailing year
                    nxt = i.find_next('i')
                    if nxt.text not in albums and 'url' not in nxt.text and 'Retrieved ' not in nxt.text:
                        albums.append(nxt.text)
            band_nm = band_name.text.replace(" (band)", "").replace(" (singer)", "")
            print(f"{band_nm}: {len(albums)} studio albums found.")
            artist_ctr += 1
            alb_ctr += len(albums)
            complete_dict[band_nm] = albums
        print(f"\n{artist_ctr} artists data inserted. {alb_ctr} studio albums found.")
        return complete_dict

Measurement of Negativity

Let’s take stock of what we’ve got in our hands. Data from Spotify is here, data from Genius is here, and I now have a way to get the list of albums for a given artist from their Wikipedia page. Now, the next step is to simply find a way to measure the negativity of the artist.

As we saw earlier, the audio feature named valence is our starting point: it indicates how negative a song sounds, and the lower the valence, the more negative the sound. However, we’ll create a new column called sonic_valence by simply subtracting valence from 1. This gives us a column where a higher value means a more negative song, establishing a standard across our metrics: the higher the value, the higher the negativity. Now, if we were to use only sonic_valence to find the most negative song, we’d get a result. Let’s look at the top 10 most negative songs based on sonic_valence for Pink Floyd.

Pink Floyd’s top 10 negative songs ordered by sonic_valence (most negative at the top)

So as far as the sadness/negativity based on the sound of the tracks goes, Quicksilver from More tops the list.
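The transformation itself is trivial. A minimal sketch on made-up tracks (the names and valence values are invented, and plain dicts stand in for the actual dataset):

```python
# Hypothetical tracks; the valence values are invented for illustration
tracks = [
    {"name": "Track A", "valence": 0.80},
    {"name": "Track B", "valence": 0.12},
    {"name": "Track C", "valence": 0.05},
]

# sonic_valence = 1 - valence: a higher value now means a more
# negative sounding track
for track in tracks:
    track["sonic_valence"] = 1 - track["valence"]

# Rank with the most negative sounding track first
ranked = sorted(tracks, key=lambda t: t["sonic_valence"], reverse=True)
print([t["name"] for t in ranked])
```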

However, lyrics have to be considered as well. Let’s see the top 10 songs based on the ratio of negative words for the same artist.

Pink Floyd’s top 10 negative songs ordered by negative_words_ratio (most negative at the top)

Well, now that’s interesting. The list here is completely different from the earlier one. The Great Gig in the Sky tops the list by a large margin, while Quicksilver (the most negative sounding song) is nowhere to be seen in this list.

Clearly, there’s a difference. Songs that sound the most negative do not necessarily have the most negative lyrics. So, how can this be dealt with? I could simply take the average of the two (sonic_valence and negative_words_ratio) and find out the most negative song. But is that really the answer we are looking for?

Think of it this way. Consider two four-minute songs: the first about something very depressing, like the death of a loved one, and the second with the word “sad” repeated 10 times as its only lyric. In this case, the negative words ratio for the second song will be 1.0, while for the first song it will be less than 1.0. However, the first song has far more lyrics than the second, which is not considered when we calculate just the ratio of negative words to total words. We need a better way of combining sonic_valence and negative_words_ratio, one that weighs the lyrics and their importance relative to the track before folding the lyrical negativity into the valence.

As I was researching ways to do this, I came across something called lyrical density on Myles Harrison’s blog. He created a measure called lyrical density, which is simply the number of words in the lyrics divided by the length of the track; that is, the number of words per second. As he mentioned, instrumental tracks have a lyrical density of 0, while other songs have a greater lyrical density. This is a great way to weigh the importance of lyrics to the song, because it measures how the lyrics are distributed over the track.

Formula to calculate lyrical density
Formula to calculate the weighed lyrical negativity
# (another method of the lyrics-handling class)
def get_lyrical_density(self, tokens, track_length):
    """Function to calculate number of tokens over song length"""
    token_count = len(tokens)
    track_len_seconds = track_length / 1000  # Spotify reports duration in ms
    return token_count / track_len_seconds
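Applied to the thought experiment from earlier (both songs are hypothetical, with invented word counts): a four-minute track whose only lyric is “sad” ten times gets a far lower density than a wordier track of the same length, so its perfect negative-words ratio ends up counting for much less.

```python
# Standalone version of the get_lyrical_density method above
def lyrical_density(tokens, track_length_ms):
    """Tokens per second: word count divided by track length in seconds."""
    return len(tokens) / (track_length_ms / 1000)

sparse = ["sad"] * 10    # the hypothetical "sad"-times-ten song
dense = ["word"] * 120   # a hypothetical wordier song of the same length

sparse_density = lyrical_density(sparse, 240000)  # 10 / 240 seconds
dense_density = lyrical_density(dense, 240000)    # 120 / 240 seconds
```

Here the wordier song’s lyrics carry twelve times the weight of the ten-word song’s, even though the latter has the higher negative-words ratio.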

Now that I have the final weighed sad-lyrics ratio, I can calculate the weighed_negativity metric with the following formula:

Formula to calculate overall negativity

Now that both the music and the lyrics have been accounted for, we can find out the saddest songs, albums, and, with a large enough database, the saddest artist in the database.

Final Result: Pink Floyd’s most negative song

So, with the mathematics and the code all set, it is time to find out the top 10 most negative Pink Floyd songs. And these are the results:

Pink Floyd’s top 10 most negative songs (most negative at the top)

With a weighed negativity of 0.5226, we have a clear winner in Speak To Me, from one of the band’s most acclaimed albums: The Dark Side of the Moon. To be fair, it shouldn’t have been a surprise: Speak To Me took the second spot in the list based on sonic_valence and the eighth spot in the list based on negative_words_ratio.

Incredibly, the song Side 1, Pt. 1: Things Left Unsaid, which placed third and second in the sonic_valence and negative_words_ratio lists respectively, fell to ninth in the final list. One would assume it would have taken the prize, but that was not to be: even though the track had a higher percentage of negative words, its lyrical density (0.04498) was much lower than that of most other tracks. To put it in context, Speak To Me had a lyrical density of 0.398, almost 9 times that of Side 1, Pt. 1: Things Left Unsaid. This, I think, is a perfect example of how weighing the ratio of negative words keeps the impact of the lyrics in check and maintains a balance between the lyrics and the sound of the track in the final calculation.

Conclusion

There are a couple of things that I would like to point out first before I get into my own thoughts about the project and the code writing process.

Firstly, this analysis is not foolproof at all. The dataset is not big: there are only 163 data points here. Of course, a much, much larger dataset would have been preferable, but that is too much to expect from any single artist, to be fair.

Having said that, this project gives me hope that numbers can be used to describe art. At the end of the day, music is a form of art that is highly subjective, and each piece can mean different things to different people. However, the work I did and the results I obtained have given me reason to believe that there is merit in trying to describe certain aspects of a piece of music through numbers.

Well, now is the time for my own thoughts about this project. This is the first time I’ve taken on a project of this size. One thing I’ve learnt the hard way is that gathering data is a massively difficult task. None of the code is foolproof. The lyricsgenius package sometimes returns the wrong results, at which point you have to manually replace the lyrics. That’s not to say that the package itself is faulty; it is really good and very accurate at fetching the lyrics of a song. But that only makes the incorrect results harder to find. Finding and fixing them had to be done manually, which took a painstakingly long time.

This applies to the Spotify API too. Sometimes it did not fetch the metrics of a few songs; when that happened, the placeholder value NaN (Not a Number) filled in the gaps. This meant that I had to look for NaN values in the dataset and re-fetch the data for those songs until no NaN values remained.

Finally, I have to mention the impression that the amount of work it took to build the database left on me. Data collection is a task I’ve often taken for granted, because I’ve always used datasets that are publicly available for download. The process of gathering and validating data is laborious and strenuous, but then again, what important task isn’t?

And now that I have functional code, I will use it to gather more data about different artists and come back in the next article to discuss the answers to a few questions that have popped up in my head over the last few months.

EDIT: I very recently found out that there’s another blog about the same topic, with a similar approach, where the analysis was done on Radiohead’s work. I am slightly in awe of how well put together their project is, because it extends to mapping out Radiohead’s career using a trendline with interactive visualizations. Check it out, it is definitely worth a read.
