During season 3 episode 2, titled War Is the H-Word, a database of Bender’s top 10 most frequently used words is shown. Some words, like ‘pimpmobile’, are probably not uttered that often by Bender and were likely chosen as a setup for the episode. This got me thinking: what are the most frequently used words by Bender on air?
From the episode the top 10 words are:
Episode Transcripts Data
The data will be pulled from the Infosphere website. Unfortunately, it does not have transcripts for all seasons, but we will make do with what is available. To get the HTML data, the httr package will be used. SelectorGadget can still be used to gather the proper XPath.
First, we load the libraries, set up the base site address, and create the certification file.
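A minimal sketch of that setup is below; the exact packages and certificate path are assumptions (here the CA bundle that ships with RCurl is reused), so the original post's setup may differ.

```r
library(httr)   # HTTP requests
library(XML)    # HTML parsing and XPath queries

# Base address for the Infosphere wiki
base_site <- "https://theinfosphere.org"

# Point curl at a CA bundle so GET() can verify the site's SSL
# certificate; the RCurl package ships one we can reuse.
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
```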
Episode Transcript Links
Each episode is listed by season in a table with a link to the respective episode transcript. Using SelectorGadget, the XPath for the title is: //*[contains(concat(" ", @class, " "), concat(" ", "oLeft", " "))]//a. This gives us both the href and the title, but we are only concerned with the href part. By adding /@href to the XPath, only the links are returned. These will be converted to a character vector and combined with the base site address to form the full transcript URLs.
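A self-contained sketch of that extraction; the listing-page URL here is a placeholder, not necessarily the page the original post used.

```r
library(httr)
library(XML)

# The listing-page URL is an assumption; substitute the real one.
page <- GET("https://theinfosphere.org/Episode_Listing",
            config(cainfo = system.file("CurlSSL", "cacert.pem",
                                        package = "RCurl")))
doc <- htmlParse(content(page, as = "text"), asText = TRUE)

# Appending /@href to the XPath returns only the link targets
xpath <- paste0('//*[contains(concat(" ", @class, " "),',
                ' concat(" ", "oLeft", " "))]//a/@href')
links <- as.character(xpathSApply(doc, xpath))
```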
Now there is the issue of the direct-to-DVD movies. Each movie was also shown in four parts on Comedy Central, so if all transcripts are included we will be double counting them. The four movie entries (Bender’s Big Score, The Beast with a Billion Backs, Bender’s Game, and Into the Wild Green Yonder) will be removed, leaving the four broadcast parts of each.
This is not the best way to do this, but it will work for our purposes. A better approach would have been to scrape the table information first, remove the direct-to-DVD film table, and then collect the links from the remaining season tables.
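One hypothetical way to drop the movie links, assuming the page titles appear in the hrefs with underscores as on most wikis (links is the character vector of hrefs gathered in the previous step):

```r
# Movie page titles; the underscore naming is an assumption about the hrefs
movies <- c("Bender's_Big_Score", "The_Beast_with_a_Billion_Backs",
            "Bender's_Game", "Into_the_Wild_Green_Yonder")

# Keep only the links that do not match any of the four movie titles
links <- links[!grepl(paste(movies, collapse = "|"), links)]
```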
Next step will be to parse each episode for Bender’s dialog.
Getting the transcripts for each episode is a bit more involved. Although the transcripts follow a consistent style, extracting just Bender’s lines takes a little more work. Using SelectorGadget on the site, all dialog can be obtained with the “//p” XPath.
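For a single episode, that looks roughly like the following; the transcript URL shown is a placeholder.

```r
library(httr)
library(XML)

# Placeholder transcript URL for one episode
ep  <- GET("https://theinfosphere.org/Transcript:Space_Pilot_3000",
           config(cainfo = system.file("CurlSSL", "cacert.pem",
                                       package = "RCurl")))
doc <- htmlParse(content(ep, as = "text"), asText = TRUE)

# Every paragraph of dialog on the page, as a character vector
dialog <- xpathSApply(doc, "//p", xmlValue)
```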
We will need to clean up the text. The processing steps are:
Read in the transcripts
Filter only for Bender (“Bender: “ as per Rules/Styles)
Create a substring from after “Bender:” to the end of the string
Remove anything between square brackets as they are a description
Remove extra white space
Lower case all letters.
Remove common words like: “to”, “too”, “i”, “its”, “ill”, “im”, “a”, “an”, “and”, “at”, “are”, “on”, “be”, “the”, “that”, “for”. The stopwords function in the tm package could be used instead, but it would also remove words we may want to keep.
Putting this all together for each episode:
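The steps above can be sketched as a single function applied to one episode’s “//p” dialog vector; the function and variable names here are mine, not the original post’s, and punctuation is stripped as well (implied by the apostrophe-free stop words like “ill” and “im”).

```r
# Stop words listed above
stop_words <- c("to", "too", "i", "its", "ill", "im", "a", "an",
                "and", "at", "are", "on", "be", "the", "that", "for")

clean_bender <- function(dialog) {
  bender <- dialog[grepl("^Bender: ", dialog)]          # only Bender's lines
  bender <- sub("^Bender: ", "", bender)                # text after the tag
  bender <- gsub("\\[[^]]*\\]", "", bender)             # drop [descriptions]
  bender <- gsub("[[:punct:]]", "", bender)             # so "I'll" becomes "ill"
  bender <- tolower(trimws(gsub("\\s+", " ", bender)))  # squeeze spaces, lower case
  words  <- unlist(strsplit(bender, " "))
  words[nzchar(words) & !words %in% stop_words]         # drop common/empty words
}
```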
Bender’s Top 10 Most Frequently Used Words
This can be cleaned up even further by adjusting the list of removed words if needed. As well, the results rely on the data on the Infosphere being correct and complete.
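Tallying the final counts is then a one-liner; episode_words below stands for a hypothetical list holding each episode’s cleaned word vector.

```r
# episode_words: assumed list of per-episode cleaned word vectors
all_words <- unlist(episode_words)

# Count every word and keep the ten most frequent
top10 <- head(sort(table(all_words), decreasing = TRUE), 10)
```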