Introduction

Some time ago, I wanted to analyse the lyrics of the Billboard Hot 100 over multiple years but never really knew what I wanted to look at. Recently, I scraped lyric data from a few websites to compile a complete set of lyrics from 1951 to 2015, and here I will look at the average number of unique words through the years (keeping in mind that the earliest years only had around 30 songs). This has also let me play around with a couple of text mining packages, tm and SnowballC, although I did not really use the latter for this.

Most of the work for this was scraping and formatting the lyrics into a consistent format, but that will not be shown here. The Billboard Hot 100 artist and song names were gathered from Wikipedia for the years 1951-2015 and the lyrics from various sources.

A Look at the Data

First, import the libraries to be used:

library(tm)
library(knitr)
library(ggplot2)
library(wordcloud)

Setup

What we want to do is read each lyric and store it as a PlainTextDocument object to be used with tm. From that, the data can be cleaned and extracted for analysis.

# The lyrics were read in earlier as part of the top100 data frame
doc.vec <- VectorSource(top100$lyrics)
doc.corp <- Corpus(doc.vec)

The first step is to set everything to lowercase:

doc.corp <- tm_map(doc.corp, content_transformer(tolower))

One issue that may be encountered is that common words such as "you", "a", and "the" might need to be removed. A list of common English words is provided by stopwords("en"), and they can be removed with tm_map just like the transformation above:

#doc.corp <- tm_map(doc.corp, removeWords, stopwords("en"))

Now, remove all punctuation:

doc.corp <- tm_map(doc.corp, removePunctuation)

Removing all common words may not be what we want, since words like "you" and "me" can help define a song. Note that because punctuation has been removed from the corpus, the words passed to removeWords must also have their punctuation removed (for example "im" rather than "i'm"). With this, some common words can be removed:

doc.corp <- tm_map(doc.corp, removeWords, c("to","too", "i", "its","ill", "im", "a", "an", "and", "at", "are","on", "be", "the", "that", "for"))

The next step is to strip extra whitespace:

doc.corp <- tm_map(doc.corp, stripWhitespace)

Now create a Document Term Matrix, which will be used for the analysis:

dtm <- DocumentTermMatrix(doc.corp)

This now provides us with a basis to examine the lyrics.
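As a quick sanity check (not part of the original analysis), the matrix can be inspected and scanned for very frequent terms; the 5000 cutoff below is just an arbitrary example:

# Peek at the counts for the first few songs and terms
inspect(dtm[1:5, 1:10])

# List every term that appears at least 5000 times in total
findFreqTerms(dtm, lowfreq=5000)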

It is VERY IMPORTANT to note that the minimum word length is set to 3 by default. To change it, pass a wordLengths control to DocumentTermMatrix, where the lower bound can be set to any minimum length.
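For example (not applied to the results below; dtm.all is just an illustrative name), keeping words of any length would look like:

# Keep words of length 1 and up instead of the default minimum of 3
dtm.all <- DocumentTermMatrix(doc.corp, control=list(wordLengths=c(1,Inf)))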

There is much more that could have been done, such as stemming using the SnowballC package to remove affixes like "ing" and "ies" from words to get at a root word. We could also remove all of the common words provided by the stopwords("en") function.
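As a rough sketch of what the stemming step might look like (not applied in this analysis; doc.corp.stemmed is just an illustrative name), tm exposes SnowballC's stemmer through stemDocument:

library(SnowballC)

# Reduce each word to its stem, e.g. "singing" -> "sing", "songs" -> "song"
doc.corp.stemmed <- tm_map(doc.corp, stemDocument)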

Total Words

Using the Document Term Matrix, the total number of words in each song is simply the sum of its row of the matrix:

top100$totWords <- rowSums(as.matrix(dtm))

ggplot(top100, aes(x=year, y=totWords)) + geom_boxplot() + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

The issue with this is that certain songs may repeat the chorus multiple times while adding no new unique words.

Total Unique Words

# Convert counts to a 0/1 indicator so each word counts once per song
dtm.m <- as.matrix(dtm)
dtm.m[dtm.m > 0] <- 1
top100$totUWords <- rowSums(dtm.m)

ggplot(top100, aes(x=year, y=totUWords)) + geom_boxplot() + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

Let's take a look at which songs had over 300 unique words:

kable(top100[top100$totUWords > 300, c(1,2,5,10,11)], align='c', row.names=F)
| Artist | Title | year | Total Words | Total Unique Words |
|:------:|:-----:|:----:|:-----------:|:------------------:|
| Bad Meets Evil | Lighters | 2011 | 563 | 317 |
| Bill Hayes | The Ballad of Davy Crockett | 1955 | 557 | 314 |
| Dem Franchize Boyz | I Think They Like Me | 2006 | 716 | 325 |
| Drake | Forever | 2009 | 727 | 355 |
| Drake | Forever | 2010 | 727 | 355 |
| Eminem | Sing for the Moment | 2003 | 588 | 316 |
| Eminem | The Real Slim Shady | 2000 | 697 | 308 |
| Ludacris | Pimpin All Over the World | 2005 | 651 | 316 |
| Luniz | I Got 5 on It | 1995 | 476 | 306 |
| Master P | Make Em Say Uhh | 1998 | 656 | 324 |
| Missy Elliott | Gossip Folks | 2003 | 516 | 321 |
| Mo Thugs | Ghetto Cowboy | 1999 | 542 | 312 |
| Puff Daddy | Been Around the World | 1998 | 798 | 377 |
| Puff Daddy | Victory | 1998 | 613 | 365 |
| The Game | How We Do | 2005 | 495 | 313 |
| The Notorious BIG | One More Chance | 1995 | 553 | 304 |

It seems that rap music takes up most of these spots.

Top 25 Most Frequent Words

To take a look at the 25 most frequent words:

kable(as.data.frame(tail(sort(colSums(as.matrix(dtm))), n=25)), align='c', col.names=c("Frequency"))
| Word | Frequency |
|:----:|:---------:|
| out | 6290 |
| she | 6491 |
| one | 6521 |
| down | 6883 |
| want | 7094 |
| youre | 7837 |
| yeah | 8323 |
| this | 8567 |
| now | 8635 |
| what | 9273 |
| get | 9332 |
| can | 9337 |
| got | 9435 |
| when | 9788 |
| but | 10362 |
| with | 10601 |
| just | 11150 |
| baby | 11856 |
| like | 12258 |
| know | 13019 |
| all | 13946 |
| dont | 14010 |
| your | 17967 |
| love | 19155 |
| you | 77734 |

Since frequent terms like "you", "your", "youre", "when", "with", and other common words are high up on the list, they should be removed at a later point and the data re-analysed. As well, the minimum word length should be set to 1 so that comparisons like "he" vs "she" can be made. I may look at this at a later point.
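As a rough sketch of how such a comparison might look (not run here; dtm.min1 is just an illustrative name), a matrix built with short words kept and without the full stop word list would retain the pronouns:

# Rebuild the matrix keeping short words so that "he" and "she" survive
dtm.min1 <- DocumentTermMatrix(doc.corp, control=list(wordLengths=c(1,Inf)))

# Compare total mentions of "he" and "she" across all songs
dtm.min1.m <- as.matrix(dtm.min1)
colSums(dtm.min1.m[, c("he", "she")])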

Update (2016-03-08) - Using stopwords("en") and Word Length 1 to Infinity

I wanted to give an update using stopwords("en") in full on this data set, with word lengths from 1 to infinity. Since the process is the same as above, I have not given much of an explanation; the code and results follow.

doc.vec <- VectorSource(top100$lyrics)
doc.corp <- Corpus(doc.vec)
# Transform to lowercase
doc.corp <- tm_map(doc.corp, content_transformer(tolower))
# Remove all stop words
doc.corp <- tm_map(doc.corp, removeWords, stopwords("en"))
# Remove punctuation
doc.corp <- tm_map(doc.corp, removePunctuation)
# Remove extra white space
doc.corp <- tm_map(doc.corp, stripWhitespace)

# turn into document matrix
dtm.update <- DocumentTermMatrix(doc.corp, control=list(wordLengths=c(1,Inf)))

# Convert to a matrix and add total word counts
dtm.update <- as.matrix(dtm.update)
top100$totWords.update <- rowSums(dtm.update)

ggplot(top100, aes(x=year, y=totWords.update)) + geom_boxplot() + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

# Top 25 words
kable(as.data.frame(tail(sort(colSums(as.matrix(dtm.update))), n=25)), align='c', col.names=c("Frequency"))
| Word | Frequency |
|:----:|:---------:|
| gonna | 5358 |
| way | 5445 |
| never | 5670 |
| let | 5859 |
| girl | 5901 |
| say | 5916 |
| cause | 6012 |
| see | 6087 |
| make | 6155 |
| time | 6186 |
| come | 6189 |
| one | 6545 |
| want | 7115 |
| go | 7730 |
| yeah | 8332 |
| now | 8636 |
| can | 9337 |
| get | 9341 |
| got | 9439 |
| just | 11165 |
| baby | 11863 |
| like | 12263 |
| oh | 12996 |
| know | 13030 |
| love | 19163 |

# Create word cloud
wordcloud(names(dtm.update[1,]), colSums(as.matrix(dtm.update)), min.freq=4000)

# Create total unique words
dtm.U <- as.matrix(dtm.update)
dtm.U[dtm.U > 0] <- 1
top100$totUWords.update <- rowSums(dtm.U)

ggplot(top100, aes(x=year, y=totUWords.update)) + geom_boxplot() + theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

# Top 25 words by the number of songs they appear in
kable(as.data.frame(tail(sort(colSums(as.matrix(dtm.U))), n=25)), align='c', col.names=c("Frequency"))
| Word | Frequency |
|:----:|:---------:|
| right | 1784 |
| take | 1814 |
| let | 1824 |
| yeah | 1881 |
| come | 1908 |
| want | 1932 |
| way | 1977 |
| never | 2038 |
| make | 2043 |
| say | 2160 |
| one | 2208 |
| baby | 2229 |
| cause | 2233 |
| go | 2270 |
| get | 2280 |
| time | 2348 |
| see | 2362 |
| oh | 2503 |
| got | 2514 |
| now | 2708 |
| can | 2769 |
| like | 2910 |
| just | 3335 |
| love | 3449 |
| know | 3498 |

# Create word cloud
wordcloud(names(dtm.U[1,]), colSums(as.matrix(dtm.U)), min.freq=1500)