Some time ago, I wanted to analyse the lyrics in the Billboard Hot 100 over multiple years but never really knew what I actually wanted to look at. Recently, I scraped lyric data from a few websites to compile a complete set of lyrics from 1951 to 2015, and I will look at the average number of unique words through the years, although the early years have only about 30 songs each. This has also let me play around with a few text mining packages, tm and SnowballC, although the latter was not really used here.
Most of the work for this was scraping the lyrics and getting them into a consistent format, but that will not be shown here. The Billboard Hot 100 artist and song names for 1951-2015 were gathered from Wikipedia, and the lyrics came from various sources.
A Look at the Data
First, import the libraries to be used:
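The import step probably looked something like the lines below. This is a sketch that assumes both packages are already installed; only tm is needed for the steps in this post, and SnowballC matters only if stemming is added.

```r
# Text mining framework: corpora, transformations, term matrices
library(tm)
# Snowball stemmers, used by tm's stemDocument()
library(SnowballC)
```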
What we want to do is read each lyric and store it as a PlainTextDocument object to be used with tm. From that, the data can be cleaned and extracted for analysis.
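As a rough sketch of that step, the snippet below builds a corpus from a character vector. The `lyrics` vector is a made-up stand-in for the scraped data, and `VCorpus(VectorSource(...))` is just one way to get there (the original code may have constructed the PlainTextDocument objects directly). By default, each element of such a corpus is a PlainTextDocument.

```r
library(tm)

# Hypothetical stand-in for the scraped lyrics: one string per song
lyrics <- c(
  "Hello, hello! You and me, dancing in the rain.",
  "I love you, I love you, and you love me."
)

# Build a corpus with one document per song
doc.corp <- VCorpus(VectorSource(lyrics))

# Each element is stored as a PlainTextDocument
class(doc.corp[[1]])
```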
First step is to set everything to lowercase:
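A minimal sketch of the lowercase step, using a one-line toy lyric in place of the real corpus. Base R's `tolower` is wrapped in `content_transformer()` so that tm keeps the result as a document rather than a bare string.

```r
library(tm)

# Toy corpus standing in for the real lyrics
doc.corp <- VCorpus(VectorSource("Hello, HELLO! You and Me."))

# Lowercase every document in the corpus
doc.corp <- tm_map(doc.corp, content_transformer(tolower))

content(doc.corp[[1]])  # "hello, hello! you and me."
```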
One issue is that common words such as “you”, “a”, and “the” may need to be removed. A list of common English words is available through stopwords("en"), and they can be removed with the same kind of call as before:
Now, remove all punctuation:
Removing all common words may not be what we want, since words like “you” and “me” can define a song. Because punctuation has already been stripped from the lyrics, the words to be removed must also be written without punctuation (for example, “youre” rather than “you're”). With this, some common words can be removed:
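To get a feel for what the full stop word list does, here is a small sketch on a toy lyric (the lyric is made up for illustration). It checks that the three words mentioned above are on the list and then removes every listed word with `removeWords`.

```r
library(tm)

# The English stop word list that ships with tm
common <- stopwords("en")
head(common)

# The words mentioned above are all on the list
c("you", "a", "the") %in% common

# Removing the full list from a toy lyric leaves very little behind
doc.corp <- VCorpus(VectorSource("you and me in the rain"))
doc.corp <- tm_map(doc.corp, removeWords, common)

# Split what is left into words
remaining <- strsplit(trimws(content(doc.corp[[1]])), "\\s+")[[1]]
remaining
```

On a lyric like this one, almost every word is a stop word, which is why removing the whole list can be too aggressive for song lyrics.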
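A short sketch of the punctuation step on a toy lyric, using tm's `removePunctuation` with its default settings. Note that apostrophes are dropped too, so “you're” becomes “youre”.

```r
library(tm)

# Toy corpus, already lowercased
doc.corp <- VCorpus(VectorSource("hello, hello! you're here."))

# Strip punctuation characters from every document
doc.corp <- tm_map(doc.corp, removePunctuation)

content(doc.corp[[1]])  # "hello hello youre here"
```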
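The selective removal might look like the sketch below. The `drop.words` vector is a hypothetical short list chosen for illustration, not the list actually used for this analysis. Its entries are written without punctuation so that they match the cleaned lyrics.

```r
library(tm)

# Toy corpus, already lowercased and with punctuation removed
doc.corp <- VCorpus(VectorSource("youre the one and i know it"))

# Hypothetical hand-picked list of words to drop
drop.words <- c("the", "and", "it")

# Remove only those words; "youre" and "i" are kept
doc.corp <- tm_map(doc.corp, removeWords, drop.words)

# Split what is left into words (removal leaves extra spaces behind)
kept <- strsplit(trimws(content(doc.corp[[1]])), "\\s+")[[1]]
kept
```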
Next part is to strip extra white space:
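Removing words leaves gaps behind, so a whitespace pass tidies things up. A minimal sketch on a toy lyric, using tm's `stripWhitespace`, which collapses each run of whitespace into a single space:

```r
library(tm)

# Toy corpus with the kind of gaps left by word removal
doc.corp <- VCorpus(VectorSource("youre  one   i know"))

# Collapse runs of whitespace into single spaces
doc.corp <- tm_map(doc.corp, stripWhitespace)

content(doc.corp[[1]])  # "youre one i know"
```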
Now to create a Document Term Matrix which will be used for analysis.
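A small sketch of this step with two toy lyrics standing in for the real corpus. Each row of the resulting matrix is a song and each column is a term, with the cell holding the number of times that term appears in that song.

```r
library(tm)

# Two toy songs standing in for the cleaned corpus
doc.corp <- VCorpus(VectorSource(c("love love me do", "you love me")))

# Rows are documents (songs), columns are terms
dtm <- DocumentTermMatrix(doc.corp)

inspect(dtm)

# With default settings only terms of 3+ characters are kept
Terms(dtm)
```

With the defaults, “me” and “do” do not show up as terms at all.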
This now provides us with a basis for examining the lyrics.
It is VERY IMPORTANT to note that the minimum word length is set to 3 by default. To change it, the call would be DocumentTermMatrix(doc.corp, control=list(wordLengths=c(1,Inf)) ), where 1 can be replaced with any minimum length.
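The effect of that default is easy to see on a toy lyric. This sketch builds the matrix twice, once with the defaults and once with `wordLengths = c(1, Inf)`, and compares the terms that survive.

```r
library(tm)

# Toy corpus with several short words
doc.corp <- VCorpus(VectorSource("he said she said i do"))

# Default: words shorter than 3 characters are dropped
dtm.default <- DocumentTermMatrix(doc.corp)

# Keep words of any length
dtm.all <- DocumentTermMatrix(doc.corp,
                              control = list(wordLengths = c(1, Inf)))

Terms(dtm.default)  # only "said" and "she" survive
Terms(dtm.all)      # "he", "i" and "do" are kept as well
```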
There is much more that could have been done, such as stemming with the SnowballC package to strip suffixes like “ing” and “ies” from words to get at a root word. Removing all of the common words provided by the stopwords("en") function is another option.
Using the Document Term Matrix, each row of the matrix corresponds to one song, so summing across a row gives that song's total word count, and counting its non-zero entries gives its number of unique words.
The issue with this is that a song may repeat its chorus many times, which raises its total word count without adding any unique words.
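For reference, a stemming pass could be sketched as follows. This was not part of the analysis in this post; it just shows tm's `stemDocument`, which relies on SnowballC, on a few toy words.

```r
library(tm)
library(SnowballC)

# Toy corpus of inflected words
doc.corp <- VCorpus(VectorSource("dancing singing loved"))

# Reduce each word to its stem
doc.corp <- tm_map(doc.corp, stemDocument)

# Split the stemmed text into words
stems <- strsplit(content(doc.corp[[1]]), " ")[[1]]
stems
```

Stems are not always dictionary words, which is worth keeping in mind before counting "unique words" on stemmed lyrics.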
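A minimal sketch of how total and unique word counts can be read off the matrix, using two toy songs. The variable names `total.words` and `unique.words` are my own choices for illustration. Both songs have four words in total, but the one that repeats itself more has fewer unique words.

```r
library(tm)

# Two toy songs: the first repeats one word a lot
doc.corp <- VCorpus(VectorSource(c("love love love you",
                                   "hello world hello moon")))
dtm <- DocumentTermMatrix(doc.corp)

# Dense matrix: one row per song, one column per term
m <- as.matrix(dtm)

# Total words: sum of all counts in a row
total.words <- rowSums(m)

# Unique words: number of terms with a non-zero count in a row
unique.words <- rowSums(m > 0)

total.words
unique.words
```

Converting to a dense matrix is fine for a few thousand songs, but it would get expensive on a much larger corpus.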
Total Unique Words
Let's take a look at which songs had over 300 unique words:
|Artist|Title|Year|Total Words|Total Unique Words|
|---|---|---|---|---|
|Bad Meets Evil|Lighters|2011|563|317|
|Bill Hayes|The Ballad of Davy Crockett|1955|557|314|
|Dem Franchize Boyz|I Think They Like Me|2006|716|325|
|Eminem|Sing for the Moment|2003|588|316|
|Eminem|The Real Slim Shady|2000|697|308|
|Ludacris|Pimpin All Over the World|2005|651|316|
|Luniz|I Got 5 on It|1995|476|306|
|Master P|Make Em Say Uhh|1998|656|324|
|Missy Elliott|Gossip Folks|2003|516|321|
|Mo Thugs|Ghetto Cowboy|1999|542|312|
|Puff Daddy|Been Around the World|1998|798|377|
|The Game|How We Do|2005|495|313|
|The Notorious BIG|One More Chance|1995|553|304|
Rap music takes up nearly all of these spots; the one clear exception is Bill Hayes's “The Ballad of Davy Crockett” from 1955.
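A filter like the one behind this list can be sketched as below. Everything here is toy data: the `songs` data frame, its column names, and the lyrics are invented for illustration, and the threshold is set to 3 instead of 300 so that the toy example returns something.

```r
library(tm)

# Hypothetical song metadata, in the same order as the lyrics
songs <- data.frame(
  artist = c("Artist A", "Artist B"),
  title  = c("Song A", "Song B"),
  year   = c(2000, 2001),
  stringsAsFactors = FALSE
)
lyrics <- c("love love love you",
            "hello world hello moon shine bright")

# Build the matrix and attach per-song counts to the metadata
dtm <- DocumentTermMatrix(VCorpus(VectorSource(lyrics)))
m <- as.matrix(dtm)
songs$total.words  <- unname(rowSums(m))
songs$unique.words <- unname(rowSums(m > 0))

# Keep only songs above the threshold (300 in the post, 3 here)
threshold <- 3
wordy <- songs[songs$unique.words > threshold, ]
wordy
```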
Top 25 Most Frequent Words
To take a look at the top 25 most frequent words:
Since common terms like you, your, youre, when, and with are high up on the list, they should be removed and the data re-analysed at a later point. The minimum word length should also be set to 1, which would allow comparisons like ‘he’ vs ‘she’. I may look at this at a later point.
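One way to get such a ranking, sketched on toy data: sum each column of the matrix to get a corpus-wide count per term, sort in decreasing order, and take the first 25. With only three distinct terms in the toy corpus, `head()` simply returns all of them.

```r
library(tm)

# Toy corpus standing in for the cleaned lyrics
doc.corp <- VCorpus(VectorSource(c("love love love you",
                                   "you and love")))
dtm <- DocumentTermMatrix(doc.corp)

# Corpus-wide frequency of each term, most frequent first
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Top 25 terms (or fewer if the corpus has fewer terms)
top.terms <- head(freq, 25)
top.terms
```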
Update (2016-03-08) - Using stopwords("en") and Word Length 1 to Infinity
I wanted to give an update using stopwords("en") in full on this data set and word lengths from 1 to infinity. Since the process is the same as above, I have not given much of