This is a follow-up to a previous post about graphing X-Files episode ratings. In this post, I will generalize the method and clean up the code so that the IMDB ratings data can be subset to almost any show in the IMDB ratings file. The ratings file can be found at IMDB Interfaces.
Opening the ratings file in a text editor, we can see it looks like this:
The fields are separated by a mix of tabs and spaces, so the lines will have to be cleaned up before the data frame is created.
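As a minimal sketch of that clean-up, the snippet below splits each line on any run of whitespace and keeps the rest of the line as the title. The sample lines, the column order (distribution, votes, rating, title) and the "(#season.episode)" tag are assumptions for illustration, not copied from the real file.

```r
# Hypothetical sample lines: the column layout and the values are
# assumptions for illustration only.
raw <- c(
  "      0000001222   12345   9.3  \"The Wire\" (2002) {The Target (#1.1)}",
  "      0000012211\t  11000   8.9  \"The Wire\" (2002) {The Detail (#1.2)}"
)

# Three whitespace-separated fields (tabs or spaces), then the title as
# the remainder of the line.
pattern <- paste0(
  "^[[:space:]]*([^[:space:]]+)[[:space:]]+([0-9]+)",
  "[[:space:]]+([0-9.]+)[[:space:]]+(.*)$"
)
fields <- regmatches(raw, regexec(pattern, raw))
parsed <- do.call(rbind, lapply(fields, function(x) x[-1]))

ratings <- data.frame(
  votes  = as.integer(parsed[, 2]),
  rating = as.numeric(parsed[, 3]),
  title  = parsed[, 4],
  stringsAsFactors = FALSE
)

# Pull season and episode out of the assumed "(#season.episode)" tag.
tag <- regmatches(
  ratings$title,
  regexec("\\(#([0-9]+)\\.([0-9]+)\\)", ratings$title)
)
ratings$season  <- as.integer(sapply(tag, `[`, 2))
ratings$episode <- as.integer(sapply(tag, `[`, 3))

ratings
```

Matching on `[[:space:]]+` rather than a single delimiter is what lets tabs and spaces be treated the same way.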
Unfortunately, with this implementation the show title must be a very close match. Case is ignored, but a show like Star Trek: The Next Generation will not be found if the ‘:’ is missing after Trek. Although there is an error catch for searches that find nothing, searching should be improved in a future version.
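To illustrate the behaviour described here, this is a rough sketch of a case-insensitive search with an error catch. The helper name `find_show` and the `titles` vector are hypothetical, and this is not necessarily how the original search was written.

```r
# Hypothetical helper: `titles` stands in for the title column of the
# ratings data.
find_show <- function(titles, show) {
  # Lower-case both sides so the match ignores case; fixed = TRUE treats
  # the show name as plain text, so punctuation must match exactly.
  hits <- grepl(tolower(show), tolower(titles), fixed = TRUE)
  if (!any(hits)) {
    stop("No titles matching '", show, "' were found; check punctuation")
  }
  titles[hits]
}

titles <- c("The Wire", "Hot Off the Wire", "Star Trek: The Next Generation")

find_show(titles, "the wire")

# The ':' is missing, so nothing matches and the error is caught here.
result <- tryCatch(
  find_show(titles, "Star Trek The Next Generation"),
  error = function(e) conditionMessage(e)
)
result
```

Because this is a plain substring match, it can also pick up other shows whose names contain the search text, which is why the result needs checking.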
Checking the data
The first thing to check is whether the search grabbed any extra shows:
Hot Off the Wire
The Wire: The Chronicles
From this, we can easily extract our data:
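A small sketch of that extraction, assuming the search result is a data frame with a `show` column. The data frame, its column names and the ratings are made up for illustration.

```r
# Hypothetical search result; the ratings are invented.
found <- data.frame(
  show   = c("The Wire", "Hot Off the Wire", "The Wire",
             "The Wire: The Chronicles"),
  rating = c(9.1, 6.5, 8.8, 7.0),
  stringsAsFactors = FALSE
)

# List the distinct shows the search picked up.
unique(found$show)

# Keep only the rows whose show name matches exactly.
wire <- found[found$show == "The Wire", ]
wire
```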
We should also check that all seasons are present. This is trickier with the given data set: since we do not know how many seasons there are, we will assume that the greatest season number is the total number of seasons. This simplifies the check, but if the last season is missing we will not be able to tell from this data set, and other sources would be needed to verify it (check IMDB or the show's Wikipedia entry).
The ‘%in%’ operator checks, for each value of the first vector, whether it matches any value in the second vector. For example, c(1, 2, 3) %in% c(2, 5) returns FALSE TRUE FALSE. Using this, we can check the seasons.
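A minimal sketch of the season check, using a made-up `seasons` vector in place of the season column of the real data:

```r
# Hypothetical season column, one entry per episode.
seasons <- c(1, 1, 2, 2, 3, 4, 5, 5)

# Assume the largest season number is the total number of seasons.
expected <- seq_len(max(seasons))

# Count the expected seasons that never appear in the data.
missing_seasons <- sum(!(expected %in% seasons))
missing_seasons
```

Note the order of the operands: the expected seasons go on the left, so the result has one TRUE or FALSE per expected season.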
Since the number of seasons not found is 0, we can assume all the seasons are present in the data set.
Again, we can use this same idea to check the episode numbering. As before, we will make the assumption that the total number of episodes in a season is the same as the maximum episode number for the respective season. To double check, we would need to use another source of data (IMDB, Wikipedia, etc.).
For this the plyr package will be used to make it much simpler.
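As a dependency-free sketch of the same per-season check, the snippet below uses base R's `split` and `sapply` as a stand-in for plyr. The `eps` data frame and its column names are hypothetical.

```r
# Hypothetical episode data: three episodes in season 1, two in season 2.
eps <- data.frame(
  season  = c(1, 1, 1, 2, 2),
  episode = c(1, 2, 3, 1, 2)
)

# Count episode numbers between 1 and the maximum that are absent.
count_missing <- function(ep) sum(!(seq_len(max(ep)) %in% ep))

# Apply the check separately to each season's episode numbers.
missing_by_season <- sapply(split(eps$episode, eps$season), count_missing)
missing_by_season
```

Any season with a non-zero count has a gap in its episode numbering, for example `count_missing(c(1, 2, 4))` flags the missing episode 3.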
From this and the assumptions made, we can say we have all the seasons and episodes for the show.
Let's generate the graphs as in the X-Files post. One thing to note is that, to use the colour brewer scale, the colour variable should be a factor.
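Here is a rough sketch of such a plot, assuming ggplot2 is the plotting package and using a made-up `eps` data frame. The column names and the "Set1" palette are illustrative choices, not the original ones.

```r
# Hypothetical episode data with a running episode index.
eps <- data.frame(
  index  = 1:6,
  season = c(1, 1, 2, 2, 3, 3),
  rating = c(8.1, 8.4, 8.9, 9.0, 8.7, 9.2)
)

# The brewer palettes are discrete, so convert season to a factor first.
eps$season <- factor(eps$season)

# Only build the plot if ggplot2 is installed (assumed plotting package).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(eps, aes(x = index, y = rating, colour = season)) +
    geom_point() +
    scale_colour_brewer(palette = "Set1")
  # Call print(p) to draw the plot.
}
```

Leaving `season` numeric would make ggplot2 treat it as continuous, which the discrete brewer scale rejects.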
Hopefully this is a much clearer version than the previous post, and maybe you learned something about R along the way.