“Of emotions, of love, of breakup, of love and hate and death and dying, mama, apple pie, and the whole thing. It covers a lot of territory, country music does”
— Johnny Cash
I have always admired natural language processing (NLP) research, but I haven't done much work with it on my own. I've played around with Stanford's GloVe word vectors and done some work measuring sentiment in tweets. After seeing the two posts mentioned in the header, I thought this would be an excellent way to apply something interesting to something I like! I present a mashup of natural language processing and country music.
If interested, code necessary to follow along / replicate the analysis is here.
In addition to being one of my first NLP projects, this was also my first time scraping data. I found Cowboy Lyrics, which had all the lyrics I could want. I didn't gather every lyric on the site; instead, I iterated through all the artists listed on the homepage in the "Last Site Updates" section. For each artist I iterated through all their albums, getting the lyrics for every song on each album.
To do this I used BeautifulSoup, an excellent HTML parser for Python. I used the get() function to pull in the raw HTML, but found that the site wasn't laid out in a way that made it easy to extract just the lyrics. After figuring out which tags I needed to look inside, I applied some regular expressions to remove interior tags, unnecessary whitespace characters like "\n", and any extra garbage that wasn't actually country music lyrics.
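The extraction step can be sketched roughly like this; the HTML fragment, tag name, and class below are hypothetical stand-ins for the actual Cowboy Lyrics markup, and the fetching itself (e.g. via requests) is omitted:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for a fetched lyrics page; the
# real markup on Cowboy Lyrics uses different tags and classes.
html = """
<div class="lyrics">
  Well I love my truck<br/>
  and I love my dog<br/>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
block = soup.find("div", class_="lyrics")

# get_text() drops interior tags like <br/>; the regex collapses the
# leftover "\n" runs and extra whitespace into single spaces.
lyrics = re.sub(r"\s+", " ", block.get_text(separator=" ")).strip()
```

The same find-then-clean pattern repeats once per song page.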
At the end of the day, I was left with about ten thousand songs to look at.
As I performed the analysis, I kept learning things about my data that I hadn't noticed at first. Notably, the "lyrics" for some songs were just the string "(INSTRUMENTAL)". Clearly those had to go. There were also plenty of instances of the word "Chorus", marking the actual chorus of the song. I couldn't come up with a clever way to handle this at the time, so I simply removed instances of the word. As a country music listener, I know the word does legitimately appear in country songs (singers might reference a chorus, as in a group of singers), but not often enough to impact the analysis.
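A minimal sketch of that scrubbing step, using made-up lyrics in place of the scraped data:

```python
import re

# Hypothetical raw lyrics standing in for the scraped data; the real
# list held about ten thousand songs.
scraped = [
    "(INSTRUMENTAL)",
    "I love you honey Chorus I love you honey",
    "Down the dirt road we go",
]

# Drop instrumental placeholders entirely, then strip the "Chorus"
# section markers from whatever remains.
processed = [
    re.sub(r"\bChorus\b", "", song).strip()
    for song in scraped
    if song.strip() != "(INSTRUMENTAL)"
]
```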
After some data scrubbing I used the NLTK package to tokenize words. I then did a simple frequency check to identify the most common words in our new corpus.
countryWords = tokenizerPM.tokenize_words(
    " ".join(str(song).decode("ascii", "ignore") for song in processedLyrics)
)
countryFreqDist = FreqDist(countryWords)
countryFreqDist.most_common(10)
Gives the output
[(u'the', 82204), (u',', 80243), (u'i', 74904), (u'you', 64421), (u'and', 57585), (u'a', 55476), (u'to', 46219), (u'me', 31988), (u'in', 30962), (u'my', 28171)]
Nothing surprising here! These are all just common English words and a comma. To mitigate this, I used the "Stop Words" data from NLTK to scrub out these short words that really only act as the operators of the English language. Hopefully this will leave me with more informative findings.

for stopword in STOP_WORDS:
    if stopword in countryFreqDist:
        del countryFreqDist[stopword]
for punct in tokenizerPM.CHARACTERS_TO_SPLIT:
    if punct in countryFreqDist:
        del countryFreqDist[punct]
countryFreqDist.most_common(10)
Gives the output
[(u'love', 14283), (u'got', 9901), (u'oh', 8341), (u'one', 8081), (u'time', 7676), (u'go', 7608), (u'back', 7340), (u'never', 7048), (u'yeah', 6982), (u'baby', 6968)]
That’s more like it! Those are some country words. Immediately I can start rattling off songs: “Love in the First Degree“, “Give it All We Got Tonight”, “I Go Back”, “Yeah”, “When She Says Baby”, and the list goes on. With the word-frequency object handy, I wanted to plot a word-cloud, just like Iain in the Heavy Metal post. This is a diagram that nests words together, sized by frequency.
But the finding that “love” is the most frequent word in country music (the most frequent in my sample, which is not exhaustive) is only so significant. Love is a fairly common word to use in general. What if we wanted to know what words in country music are the most country? In other words, what words are used in country music at a higher rate than in English?
To answer this, I score each word by taking the log of its occurrence count in the Country corpus divided by its occurrence count in the Brown corpus:

countryness(w) = log( count_Country(w) / count_Brown(w) )

The Brown corpus is a collection of roughly one million words covering all sorts of English text. The thought is that a word has a higher level of "country-ness" if it occurs with a higher frequency in the Country corpus, count_Country(w), than in the Brown corpus, count_Brown(w).
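The scoring can be sketched with toy corpora; the real counts would come from the tokenized lyrics and from NLTK's Brown corpus:

```python
import math
from collections import Counter

# Toy token lists: countryWords stands in for the tokenized lyrics,
# brownWords for NLTK's Brown corpus.
countryWords = ["love", "love", "love", "truck", "the", "the"]
brownWords = ["the", "the", "the", "the", "love", "report", "policy"]

country_counts = Counter(countryWords)
brown_counts = Counter(brownWords)

# log(count in Country / count in Brown); a word only gets a score if
# it appears in both corpora (the post additionally filters out words
# seen four times or fewer, which is skipped here).
countryness = {
    w: math.log(country_counts[w] / brown_counts[w])
    for w in country_counts
    if brown_counts[w] > 0
}
```

In this toy data "love" dominates the lyrics relative to general English, so it scores high, while "the" scores below zero.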
A word cloud of the most country words looks a little different than the cloud for the most frequent. It looks country indeed!
See the two tables below: first the least country words, then the most country words. Note that I applied a filter here: only words that occurred more than four times show up in these lists.
I can’t say that there is much surprising about this list. Multi-syllable words aren’t really big in country music. They aren’t as versatile as words like “gonna”, “ain’t” and “goin'”. With more syllables, a word seems more mechanical and structured, and those types of words don’t fit in the emotion-filled world of country music.
Looking at the country-ness of words in each song, we can determine that Brad Paisley’s The Cigar Song scores as one of the least country songs. I have noticed that Brad Paisley seems to draw from a different vocabulary than most other artists. This song in particular uses “detective”, “insurance”, “policy”, “separate”, which all fall towards the bottom end of the country-ness scale. Love’s Gonna Make It Alright by George Strait is ranked as one of the “most” country songs.
In addition, I took a look at how the parts of speech used in the Country corpus compare with those in the Brown corpus.
Let's take a look at three artists now: Kenny Chesney, Dwight Yoakam, and Alabama. We can see their word clouds below.
They’re all very similar. That makes sense! While we can expect each artist to use different words at different rates, they all speak English. If we want to see what words are most unique to each artist, we can use the log likelihood ratio. We’ll start by creating a corpus that combines these three artists to set expectations of how often certain words occur in general. We can then compare these expectations with how often certain words occur for each artist. Using the formula below, we can see that words that are specific, unique, or deviating from the combined corpus will get higher values than words that occur at similar rates among the three artists as a whole (like “love”).
Specifically, for each word w we compare its frequency within the artist's lyrics, f_w, with the expected frequency of that word derived from the combined corpus of the three artists, E[f_w]. We scale this by the frequency of the word, and add the same term for the "anti-word" w̄, which represents all other words:

score(w) = f_w · log( f_w / E[f_w] ) + f_w̄ · log( f_w̄ / E[f_w̄] )

When plotting a word cloud based on this number for each word, we get the following images.
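A toy version of this log-likelihood scoring, with made-up counts in place of the real artist and combined corpora:

```python
import math
from collections import Counter

# Toy word counts for one artist and for the combined three-artist
# corpus (every artist word must also appear in the combined corpus).
artist = Counter({"love": 30, "christmas": 20, "truck": 10})
combined = Counter({"love": 300, "christmas": 25, "truck": 100})

artist_total = sum(artist.values())
combined_total = sum(combined.values())

def llr(word):
    # Observed frequency of the word, and of all other words, for this artist
    f_w = artist[word] / artist_total
    f_not = 1.0 - f_w
    # Expected frequencies derived from the combined corpus
    e_w = combined[word] / combined_total
    e_not = 1.0 - e_w
    return f_w * math.log(f_w / e_w) + f_not * math.log(f_not / e_not)

scores = {w: llr(w) for w in artist}
```

With these toy numbers, "christmas" is heavily over-represented for the artist relative to the combined corpus, so it outscores the universally common "love", mirroring how Alabama's Christmas songs stand out against the other two artists.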
These images now reflect what makes each artist unique when compared to the three of them as a whole. Seems like Alabama had more Christmas songs compared to Kenny and Dwight.
The log likelihood ratio is one way of determining unique and important words, but another simple way is a metric called TF-IDF. Term frequency-inverse document frequency is exactly what it sounds like: we take the frequency with which a word appears in one "document" (here, a song) and compare it to how frequently the word occurs everywhere else. Words with a higher TF-IDF occur more in a given song than they do in general, indicating they might be more important to that song. Using this handy function, we can combine the TF-IDF values into a vector, a numeric representation of a song, and combine those vectors to get numeric representations of artists. After all, an artist is defined by the music he or she creates! We can do some interesting things once we have numeric representations of artists and songs.
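A bare-bones TF-IDF computation, shown here with tiny made-up "songs" rather than the actual function I used:

```python
import math
from collections import Counter

# Three tiny "songs"; in the real analysis each document is a song's lyrics.
songs = [
    "love love truck dirt road".split(),
    "love beer girl beer".split(),
    "love christmas snow christmas".split(),
]

n_docs = len(songs)

# Document frequency: in how many songs does each word appear?
df = Counter(w for song in songs for w in set(song))

def tfidf(song):
    counts = Counter(song)
    total = len(song)
    # tf = relative frequency in this song; idf = log(N / df)
    return {w: (c / total) * math.log(n_docs / df[w]) for w, c in counts.items()}

vectors = [tfidf(song) for song in songs]
```

Because "love" appears in every song, its IDF, and hence its TF-IDF, is zero everywhere, while a song-specific word like "christmas" scores high in the one song that uses it.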
First, we can look at an artist and determine, by using cosine similarity, what songs most represent a band, what other artists are most similar to a band, and what words define/are most important to a band. See the table below for some artists. In this case, “nearby” vectors can be considered more similar. Based on the data I had, here are some findings about artists and songs.
| Band | Nearby Bands | Nearby Songs | Important Words |
| --- | --- | --- | --- |
| Cole Swindell | Cole Swindell, Luke Bryan, Blake Shelton, Toby Keith | The Back Roads & The Back Row, Let Me See Ya Girl, Brought To You By Beer | wanna, beer, girl, gonna, country |
| Alan Jackson | Alan Jackson, George Strait, Dolly Parton, Reba McEntire | The Christmas Guest, Who Says You Can't Have It All, I'd Love You All Over Again | i've, i'll, jesus, can't, i'd |
| Canaan Smith | Canaan Smith, Florida Georgia Line, Jason Aldean, Toby Keith | One Of Those, Bronco, Love You Like That | hell ride, wanna love, american, love hurts, mad |
| Garth Brooks | Garth Brooks, George Strait, Dolly Parton, Reba McEntire | That Girl Is A Cowboy, That Ol' Wind, The Christmas Guest | she's, 'cause, he's, i'd, ain't |
| Eric Church | Eric Church, Toby Keith, Tim McGraw, Trace Adkins | A Lot Of Boot Left To Fill, Cold One, Hungover & Hard Up | an', ain't, damn, hell, i'd |
Using the same vectors, we can create a hierarchical clustering diagram, where the most similar bands are next to each other, and connect to less similar bands as you look higher on the chart. Looking at the chart, we could infer that Dolly Parton and Reba McEntire are very similar, while Bryson Wheeler and Kelsea Ballerini are very different!
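A toy version of the single-linkage agglomerative clustering behind such a diagram (the artist distances here are invented, and in practice a library like SciPy builds the dendrogram):

```python
# Made-up pairwise cosine distances between three artist vectors.
names = ["Dolly Parton", "Reba McEntire", "Eric Church"]
dist = {
    ("Dolly Parton", "Reba McEntire"): 0.1,
    ("Dolly Parton", "Eric Church"): 0.8,
    ("Reba McEntire", "Eric Church"): 0.7,
}

def d(a, b):
    # Distances are symmetric; look up the pair in either order.
    return dist.get((a, b)) or dist.get((b, a))

# Greedy single-linkage: repeatedly merge the closest pair of clusters.
clusters = [[n] for n in names]
merges = []
while len(clusters) > 1:
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: min(d(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]),
    )
    merges.append((clusters[i], clusters[j]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
```

The merge order is the dendrogram read bottom-up: the closest artists join first, and later merges sit higher on the chart.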
Keep in mind, these conclusions are based only on the data I was able to pull. I encourage someone to follow up with a more complete dataset! We may see more interesting trends develop.