Does Country Music Drink More Than Other Genres? Investigating 5 years of lyrics to find out

Mark MacArdle
Towards Data Science
Oct 11, 2018

From Adam Wilson on Unsplash

Thanks to Spotify’s Hot Country and Country Nights playlists, I recently got into an unusual phase of listening to country music. Rock is normally more my genre so this was a first for me. After my immediate realisation of how catchy country can be, something that really struck me was that there seemed to be so many references to alcohol and drinking!

Take these inspired lines from Chris Stapleton below. Do you think he knows that non-alcoholic things can be used in comparisons too?

You’re as smooth as Tennessee whiskey
You’re as sweet as strawberry wine
You’re as warm as a glass of brandy

Tennessee Whiskey — Chris Stapleton

Or how about these lines from Brothers Osborne? Is it really not their fault?

Blame the whiskey on the beer
Blame the beer on the whiskey
Blame the mornin’ on the night

It Ain’t My Fault — Brothers Osborne

When I think of country music, beer and whiskey are part of its image in my head, but not any more so than for rock or rap/hip-hop music. I wondered if I was hearing more mentions of alcohol just because I was new to country or if there really was a measurable difference in mentions between genres.

I thought that all I would need to do to find out is:

  • Get lists of the popular songs in different music genres from the Billboard website, which has genre-specific charts for rock, pop, country, etc.
  • Get their lyrics
  • Get a list of alcohol and drinking related words
  • Count how many of the songs have mentions of those words

This process may have taken me a little longer than those four short bullet points would indicate, but below is a walk through of the process, analysis and findings for you to (hopefully) enjoy!

Data Gathering and Cleaning

Scraping Five Years of Charts

The first step was to create a dataset of songs for each genre. I decided to use the year-end charts on the Billboard website as they’re normally 100 entries long, so they give a lot of data and I can be sure they represent what people really listen to in those genres. These are American charts but, as America tends to set the trend worldwide (whether we like it or not) and has the biggest country music audience, I feel they’re the best data source for this project.

The Billboard website has charts available for Rock, Country, Pop, R&B/Hip-Hop, Dance/Electronic, Christian and I also decided to include the non-genre specific overall Hot 100 chart for reference. I didn’t include the Latin or International charts as the non-English songs would skew later analysis results.

The links above are to the 2017 year end charts but they all go back to at least 2013. I decided to grab all the past five years so I could investigate trends over time.

I’m mainly using the “Hot” charts for each genre, except for Pop, which doesn’t have a “Hot” chart. The Hot charts factor in radio plays, physical sales and streams, while the Pop chart goes on radio play only. In all cases, though, they should give a good representation of what people are hearing and listening to in each genre.

Some songs may appear in more than one chart or year. I’m not removing any duplicates as I don’t want to change the content of any charts.

I used Python’s Beautiful Soup library to extract the song and artist name for each chart entry from each web page’s HTML. This is the same HTML you see when you press Ctrl+Shift+I on a web page in Chrome.

Example of song information in html from Billboard’s website
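
To make the extraction step concrete, here is a minimal sketch. It uses the standard library’s html.parser instead of Beautiful Soup so that it runs without any installs, and both the HTML snippet and the class names chart-title and chart-artist are made up for illustration. Billboard’s real markup is different.

```python
from html.parser import HTMLParser

# Hypothetical markup: these class names are illustrative, not Billboard's.
SAMPLE_HTML = """
<article class="chart-row">
  <div class="chart-title">Tennessee Whiskey</div>
  <div class="chart-artist">Chris Stapleton</div>
</article>
<article class="chart-row">
  <div class="chart-title">It Ain't My Fault</div>
  <div class="chart-artist">Brothers Osborne</div>
</article>
"""


class ChartParser(HTMLParser):
    """Collects (title, artist) pairs from the hypothetical markup above."""

    FIELDS = ("chart-title", "chart-artist")

    def __init__(self):
        super().__init__()
        self.entries = []    # finished (title, artist) tuples
        self._field = None   # field whose text we expect next, if any
        self._current = {}   # fields gathered for the entry in progress

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.FIELDS:
            self._field = css_class

    def handle_data(self, data):
        text = data.strip()
        if self._field is None or not text:
            return
        self._current[self._field] = text
        self._field = None
        # Once both fields are present, the chart entry is complete.
        if len(self._current) == len(self.FIELDS):
            self.entries.append(
                (self._current["chart-title"], self._current["chart-artist"])
            )
            self._current = {}


parser = ChartParser()
parser.feed(SAMPLE_HTML)
parser.close()
entries = parser.entries
```

With Beautiful Soup the same idea usually comes down to a couple of find_all calls on the parsed page, which is why it is the more convenient choice for real scraping.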

I found some issues when doing this. For example, the 2015 R&B/Hip-Hop chart only has 25 entries where the other years normally have 100, and the 2016 Hot 100 only has 99 entries because #87 is missing. I have no idea what caused these issues, but I did factor in that chart lengths may all be different when doing my analysis later on.

Getting Lyrics for 2,840 Songs

The chart scraping found 3,019 chart entries and I managed to get the lyrics for 2,840 of these using Genius.com’s API. You just need to register then it’s free to use. In my code I used the LyricsGenius Python package which made Genius.com’s API very easy to work with.

The issues I had at this stage were in matching the song and artist names Billboard used with those used by Genius. For instance, there were a lot of issues when a song was by multiple artists. Billboard had a lot of ways of combining artist names, like “Featuring”, “x” (as in Kygo x Selena Gomez), “With” etc., but Genius was far more picky, so I had to try different combinations.
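
As a rough illustration of trying different combinations, the sketch below generates a few candidate artist strings from a Billboard-style credit and tries each in turn. The exact combinations I ended up needing aren’t reproduced here: the candidate rules, the function names and the lookup callable (standing in for whatever function queries Genius) are all assumptions made for the example.

```python
import re

# Joining words mentioned above; the real charts use more variants ("etc.").
SEPARATORS = re.compile(r"\s+(?:Featuring|With|x)\s+")


def artist_candidates(billboard_artist):
    """Return artist strings to try, most specific first, without duplicates."""
    parts = [p.strip() for p in SEPARATORS.split(billboard_artist) if p.strip()]
    candidates = [billboard_artist]
    if len(parts) > 1:
        candidates.append(parts[0])            # lead artist only
        candidates.append(" & ".join(parts))   # all artists, joined differently
    return list(dict.fromkeys(candidates))     # de-duplicate, keep order


def find_lyrics(title, billboard_artist, lookup):
    """Try each candidate with `lookup(title, artist)` until one succeeds.

    `lookup` is a hypothetical caller-supplied function that returns the
    lyrics as a string, or None when nothing matches.
    """
    for artist in artist_candidates(billboard_artist):
        lyrics = lookup(title, artist)
        if lyrics is not None:
            return lyrics
    return None
```

Passing the lookup in as a parameter keeps the matching logic separate from the API client, so it can be checked with a small fake dictionary before spending real API calls.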

Eventually the time taken to try to find these issues was no longer worth it. 2,840 is 94% of the total entries, so I decided to move on at that point. The number of songs with lyrics found per chart is below.

Rock: 483
Country: 490
Dance/Electronic: 442
Pop: 240
Hot 100: 476
R&B/Hip-Hop: 379
Christian: 322

The fact that different amounts were found doesn’t affect later results as the analysis compares percentages.

Lyrics Cleaning

To avoid multiple tenses, plurals or variations of a word causing missed or incorrect counts, I used lemmatisation to group words to their root form. For example, the verbs “walked”, “walks” and “walking” would all be grouped to “walk”.

For this to work all the words needed their part-of-speech tagged. These tags could be verb, adjective, adverb or noun/other. Normally you pass sentences into the part-of-speech tagger but in this case, due to the lack of punctuation in songs, I split the lyrics into lines and passed them.

For both these tasks I used the Python NLTK library and it was very successful. Up to 6 words were grouped into a common source word. For example “go,” “going,” “gone,” “goes,” “gon” and “went” were all grouped to just “go”.
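
One detail worth spelling out: NLTK’s default tagger returns Penn Treebank tags such as VBD or NNS, while its WordNet lemmatiser expects a single-letter part-of-speech code. A minimal sketch of the mapping between the two is below. It assumes the four categories described above were derived roughly this way, and the function name is my own for this example.

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the letter code WordNet lemmatisation uses.

    'a' = adjective, 'v' = verb, 'r' = adverb, 'n' = noun (used here as the
    fallback for everything else, i.e. the "noun/other" bucket).
    """
    if treebank_tag.startswith("JJ"):
        return "a"
    if treebank_tag.startswith("VB"):
        return "v"
    if treebank_tag.startswith("RB"):
        return "r"
    return "n"


# Tags a tagger might produce for "she walked slowly home"
tags = ["PRP", "VBD", "RB", "NN"]
wordnet_codes = [to_wordnet_pos(t) for t in tags]
```

The resulting code is what gets passed as the pos argument when lemmatising each word, which is how “walked” tagged as a verb ends up as “walk”.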

Analysis

Creating a Drinking and Alcohol Words List

I looked for a pre-existing list of alcohol related keywords but unfortunately didn’t find any. Some studies I found had used keyword lists but didn’t share them. So I made my own using the highly scientific method of thinking up all the keywords I could think of and googling synonyms to try find more.

I excluded “drink” and “shot” from this list as they’re not specific enough to drinking alcohol. I first tried with them but they caused some high error rates. It was particularly bad for Christian songs, where out of 11 songs identified, 8 turned out to be false positives due to these two words.

The final list I came up with was:

drunk*, drank*, alcohol, alcoholic, hangover, hungover, liquor, cocktail, booze, boozy, bottle, beer, cider, ale, tequila, vodka, wine, gin, whiskey, scotch, rum, bourbon, champagne, mojito, martini, daiquiri, jager, jagermeister, budweiser, miller, coors, heineken, bacardi, smirnoff, moet, hennessy, bar, pint, firewater, hootch, moonshine, spirits, swig, tipple

“Drunk” and “drank” won’t count occurrences where they are verb forms of “drink”, as those are lemmatised to “drink”. They will count any other use though, e.g. as an adjective in “I’m so drunk” or as a noun in “I could bring the drank”.

Measuring Drinking and Alcohol Mentions

The measure I’m using is percentage of songs that mention a drinking or alcohol related word at least once.
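
A minimal sketch of that measure is below. The keyword set is only a small subset of the list above, the toy lyrics are invented, and the function names are my own for this example. Note that it matches whole words, so “bar” doesn’t fire on “barn” and “gin” doesn’t fire on “begin”.

```python
import re

# Small subset of the keyword list above, for illustration only.
ALCOHOL_WORDS = {"beer", "whiskey", "wine", "tequila", "bar", "gin", "ale"}


def mentions_alcohol(lyrics, keywords=ALCOHOL_WORDS):
    """True if at least one keyword appears as a whole word in the lyrics."""
    tokens = set(re.findall(r"[a-z]+", lyrics.lower()))
    return not tokens.isdisjoint(keywords)


def percent_mentioning(songs, keywords=ALCOHOL_WORDS):
    """Percentage of songs (a list of lyric strings) with at least one mention."""
    if not songs:
        return 0.0
    hits = sum(mentions_alcohol(lyrics, keywords) for lyrics in songs)
    return 100.0 * hits / len(songs)


# Invented toy lyrics, not taken from the dataset.
toy_chart = [
    "blame the whiskey on the beer",
    "we begin again in the old barn",
    "driving down a dirt road",
    "sunshine on a sale day",
]
toy_percentage = percent_mentioning(toy_chart)
```

Because the result is a percentage of each chart’s own songs, charts of different lengths can still be compared with each other.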

Without further ado, the percentages of songs mentioning drinking in each Billboard chart are:

Wow! Country music songs do seem to mention alcohol far more often. 40% of songs referencing alcohol in some way did seem high to me, so I manually checked the 2017 country songs and found only one false positive (caused by “bottle” in Yours by Russell Dickerson), which I feel is an acceptable error rate. The high result makes more sense if you keep in mind that the year-end charts were used for this analysis. So this doesn’t necessarily say that 40% of all country music songs mention drinking, just that 40% of the big hits from the last five years do.

Hypothesis Testing if Difference is Significant

Time to get the answer to my original question: do country music songs reference alcohol and drinking more than other genres?

As the chart shows, there is clearly a measured difference between country and the other genres. However, if you measure two different groups you would expect the results to be a little different just because of random variation. I want to confirm that the difference is large enough to be statistically significant, which is another way of saying unlikely to be caused by random variation.

I’m dropping the Hot 100 and Christian chart results for this test, as the Hot 100 isn’t genre specific and the Christian chart is an outlier in how little it mentions alcohol and, at least to me, is not a mainstream genre.

I’m going to test the statistical significance with a Chi-Squared test for independence as the data is categorical. It’s categorical because the songs either do or don’t reference alcohol; there are no in-between values. This is coincidentally the same test I used in my previous post on measuring loss aversion in penalties.

The test outputs a “p-value”: the probability of seeing a difference at least as large as the one measured if there were really no difference between the groups. If that is below my chosen significance level, the null hypothesis can be rejected and the measured difference is unlikely to be caused by random variation alone. I’m choosing a significance level of 0.05, which means accepting a 5% chance of declaring a difference when none actually exists. The null hypothesis is that there’s no difference in the proportion of songs mentioning alcohol between country music and the other genres.

The result was:

P-value = 2.71698301e-34

Conclusion: Difference is Significant

The e-34 means there are 33 zeros after the decimal point before the 271… even starts. That’s a tiny result for the p-value, way below the 0.05 requirement! The null hypothesis can be rejected and it can be concluded that country music songs are more likely to mention alcohol than other genres.

Looking at the percentage of songs mentioning alcohol by year further drives home this difference as it can be seen that country music is ahead every single year.
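
For anyone curious about the mechanics of the Chi-Squared test, here is a simplified sketch for the smallest case: a 2x2 table of one group against another, without a continuity correction. The counts are placeholders, not the data from this analysis, and the function name is my own. For a larger table covering several genres you would normally reach for a library routine such as scipy.stats.chi2_contingency instead.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table.

    table = [[a, b], [c, d]] with rows as groups and columns as
    (mentions alcohol, does not mention alcohol).
    Returns (statistic, p_value). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-squared tail probability
    # equals erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Placeholder counts, NOT the article's data:
# row 1 = one genre, row 2 = all other genres combined.
example_table = [[40, 60], [15, 85]]
statistic, p_value = chi_squared_2x2(example_table)
significant = p_value < 0.05  # compare against the chosen significance level
```

The logic is the same at any table size: work out how far the observed counts sit from what you would expect if genre made no difference, then convert that distance into a p-value.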

Fun Facts

What are the different genres drinking?

Drink types like scotch and bourbon with very low mentions were excluded from this chart.

The song with the most drinking keyword mentions overall was Psy and Snoop Dogg’s hit Hangover, with an incredible 154 mentions. Nearly all of those mentions are from “hangover” being repeated over and over in the chorus.

Which country song had the most drinking keyword mentions?

It was a three-way tie with 14 mentions each for Dierks Bentley’s Drunk On A Plane, Brett Eldredge’s Drunk On Your Love and Chris Stapleton’s Tennessee Whiskey. Tennessee Whiskey also managed the feat of appearing in the 2015, 2016 and 2017 charts.
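
Counting total mentions per song, rather than just whether a song has any, only needs a word tally. The sketch below is one way to do it. The function name, the two-word keyword set and the toy lyrics are invented for the example.

```python
import re
from collections import Counter


def keyword_mentions(lyrics, keywords):
    """Total number of times any keyword appears as a whole word."""
    counts = Counter(re.findall(r"[a-z]+", lyrics.lower()))
    return sum(counts[word] for word in keywords)


# Invented toy lyrics keyed by a made-up song title.
toy_songs = {
    "Song A": "hangover hangover I got a hangover",
    "Song B": "one beer by the river",
}
keywords = {"hangover", "beer"}
most_mentions = max(toy_songs, key=lambda t: keyword_mentions(toy_songs[t], keywords))
```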

Is there a trend in alcohol mentions?

Yes, alcohol mentions have risen 5 percentage points over the past 5 years.

The datasets of the charts and lyrics are available as CSV files, along with the Python notebooks that generated them, on my GitHub here. The bar charts were made in a Tableau workbook that can be downloaded here.
