The Truth Is In There – an X-Files episode analysis with R (Part 4)


Part 4: Character occurrence

Section Chief Blevins: Agent Mulder. What are his thoughts?

Scully: Agent Mulder believes we are not alone.

This is part 4 of a natural language processing analysis of X-Files episode summaries. Parts 1 and 2 dealt with obtaining and cleaning the data; part 3 showed an analysis of prominent words by episode type. In this post, we will look at characters.

As in every TV show, The X-Files features recurring characters – some taking leading roles, others as supporting characters. There are good guys, bad guys, and ambivalent ones. Let’s look at which of these characters get mentioned most.

# get DTM of all terms including cast (strip the metadata)
termvector <- colnames(xfmerged[ , -(1:10)])

# create vector of character names
# make sure all character names are in the list of terms
xfcharacters <- intersect(xfcharterms, termvector)

# get rid of non-names using the remcast list
xfcharacters <- setdiff(xfcharacters, remcast)

# reduce the DTM to character terms only
xfcharacters <- select_(xfmerged, .dots = xfcharacters)

# set counts to Boolean (0/1) – all remaining columns are term counts
xfcharacters[xfcharacters > 1] <- 1

# construct data frame of terms and Boolean counts
xfcharacters <- data.frame(ct = colSums(xfcharacters), name = names(xfcharacters))

# sort descending by count
xfcharacters <- arrange(xfcharacters, desc(ct))

head(xfcharacters, 20)
##     ct       name
## 1  199     sculli
## 2  179     mulder
## 3   73    skinner
## 4   40    doggett
## 5   24     krycek
## 6   23        rey
## 7   20      elder
## 8   20      kersh
## 9   20   samantha
## 10  11    spender
## 11  10 covarrubia
## 12  10     frohik
## 13  10     marita
## 14  10        mrs
## 15   8       byer
## 16   8       lang
## 17   8     melvin
## 18   7     albert
## 19   7     fowley
## 20   7      teena

Looks plausible, with our lead characters right at the top. We will now take a top-down approach and construct a very limited list of main characters to see how they appear over the various seasons. Here’s the list:

xfnames <- c(
        # X-Files agents
        "mulder",
        "sculli", # written this way for stemming reasons
        "doggett",
        "rey", # agent reyes, stemmed
        
        # the bad guys
        "smoke", # the cigarette smoking man
        "krycek",
        "elder",
        "rohrer",
        
        # informants
        "covarrubia",
        "throat", # deep throat
        "kritschgau",
        # "X", deep throat's successor, can't be distinguished in our texts
        
        # the FBI
        "skinner",
        "fowley",
        "kersh",
        "spender",
        
        # the Lone Gunmen
        "byer",
        "frohik",
        "lang",
        "gunmen"
)

We create a DTM from just these names, reconstruct the season number from the production code, and summarize the occurrence of each character by season:

# get DTM 
xftimeline <- select_(xfmerged, .dots = c("ProdCode", xfnames))

# construct summary by season
xftimeline <- xftimeline %>% 
        mutate(Season = as.numeric(substr(ProdCode,1,1))) %>%  # get season number
        select(-ProdCode) %>% # Production code no longer needed
        group_by(Season) %>%  # group and summarise by season
        summarise_each(funs(sum)) %>% 
        t() %>%  # rotate the data frame (cols to rows)
        as.data.frame() # matrix to df

xftimeline <- xftimeline[-1 ,] # drop the Season row created by t()
colnames(xftimeline) <- 1:ncol(xftimeline) # use season numbers as col names

# get row character names as own column
xftimeline$term <- rownames(xftimeline)

str(xftimeline)
## 'data.frame':    19 obs. of  10 variables:
##  $ 1   : num  270 191 0 0 6 0 4 0 0 25 ...
##  $ 2   : num  318 200 0 0 19 24 6 0 0 4 ...
##  $ 3   : num  307 205 0 0 30 17 2 0 0 5 ...
##  $ 4   : num  275 179 0 0 41 11 5 0 9 3 ...
##  $ 5   : num  324 219 0 0 31 12 6 0 2 2 ...
##  $ 6   : num  330 239 0 0 21 9 6 0 2 1 ...
##  $ 7   : num  235 179 0 0 35 7 0 0 1 5 ...
##  $ 8   : num  122 208 264 30 0 25 1 18 0 2 ...
##  $ 9   : num  53 130 187 138 5 1 0 30 1 3 ...
##  $ term: chr  "mulder" "sculli" "doggett" "rey" ...

Since we’ll use ggplot for creating graphs, we will use a long format for our data:

# create long format 
library(tidyr)
xftimeline <- gather(xftimeline, "season", "count", 1:9)

str(xftimeline)
## 'data.frame':    171 obs. of  3 variables:
##  $ term  : chr  "mulder" "sculli" "doggett" "rey" ...
##  $ season: Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ count : num  270 191 0 0 6 0 4 0 0 25 ...

To plot multiple graphs, we’ll use the following function:

Plot_Timeline <- function(df, titletext) {
        # function to plot a line graph of character
        # appearance by season
        # 
        # Args:
        #  df: data frame with cols season, term (name) and count (nr of appearances)
        #  titletext: the plot title
        #
        # Return:
        #  ggplot2 object (line graph)
        
        library(ggplot2)
        library(ggthemes)
        
        g <- ggplot(data = df, aes(x = season, y = count, group = term)) +
                geom_line(aes(color = term)) +
                ggtitle(titletext) +
                theme_few() +
                scale_colour_few()

        g
}

Here are the plots – warning: Contains plot spoilers!

Plot_Timeline(filter(xftimeline, term %in% c("mulder", "sculli", "doggett", "rey")), "X-Files Agents")
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:NLP':
## 
##     annotate

[Plot: X-Files Agents – character mentions by season]

While Agent Scully remains relatively constant over the various seasons (despite her character’s absence in some seasons), we clearly see Agent Mulder’s replacement by Agent Doggett in the last few seasons. Agent Reyes’ occurrence is more of an afterthought (which, in my humble opinion, is deserved).

Plot_Timeline(filter(xftimeline, term %in% c("smoke", "krycek", "elder", "rohrer")), "The Syndicate")

[Plot: The Syndicate – character mentions by season]

The Cigarette Smoking Man shows an increasingly strong presence until season 7 and then drops off, presumed dead. At this point, “super soldier” Knowle Rohrer takes over. Agent Krycek has two peaks, while the obscure First Elder stays in the background throughout all seasons.

Plot_Timeline(filter(xftimeline, term %in% c("throat", "covarrubia", "kritschgau")), "Informants")

[Plot: Informants – character mentions by season]

Main informant Deep Throat plays a key role in the first season, gets killed, and re-appears in a couple of flashback scenes later. His replacement, “X”, can’t be analyzed at all, thanks to his one-letter name (blame it on the 90s). Marita Covarrubias and Michael Kritschgau are important for some seasons, but are otherwise rare occurrences.

Plot_Timeline(filter(xftimeline, term %in% c("skinner", "fowley", "kersh", "spender")), "FBI Agents")

[Plot: FBI Agents – character mentions by season]

Agent Skinner dominates this picture. During the sixth and seventh seasons we see a dip in his mentions, while other FBI characters briefly gain more presence.

Plot_Timeline(filter(xftimeline, term %in% c("frohik", "byer", "lang", "gunmen")), "Lone Gunmen")

[Plot: Lone Gunmen – character mentions by season]

While the Lone Gunmen as a group steadily gain importance over the seasons, some seasons feature individual Gunmen more prominently (Langly in seasons 2 and 6, Byers in seasons 5 and 6). Towards the end, the group has a very strong presence, even though individual members are rarely mentioned.

Conclusion and Leftovers

Scully: Mulder, what did you find out there?
Mulder: Scully, I can’t tell you.
Scully: That doesn’t make sense.
Mulder: You’ve got to trust me, Scully. I know things. It’s better you don’t.

This concludes our exploration of the X-Files episode summary data – I hope you have enjoyed it as much as I have. As usual, most of the time was spent on data gathering and cleaning, which is especially true in the case of text-based data (don’t ask me how long it took to construct that remcast list).

Of course, much more could be done with the data, and I hope you grab some of it for your own analyses from my GitHub repo. The original intention of the exercise was to try out some unsupervised machine learning and network analysis techniques (K-means clustering, Latent Semantic Analysis, topic models), which unfortunately didn’t prove fruitful.

Finally, this analysis doesn’t have to be constrained to X-Files episode summaries: A lot of the code from part 1 and 2 can be re-used for other series – maybe even be turned into a package.

The Truth Is In There – an X-Files episode analysis with R (Part 3)


Part 3: Monster vs. Mythology word use

Mulder: We’ve both lost so much… but I believe that what we’re looking for is in the X-Files. I’m more certain than ever that the truth is in there.

Scully: I’ve heard the truth, Mulder. Now what I want are the answers.

This is part 3 of a natural language processing analysis of X-Files episode summaries. In parts 1 and 2 I explained how to get the data from Wikipedia and pre-process it for analysis. Now we can start working with the data.

Our data is stored in a neat document term matrix (DTM), along with meta-data. That means every row of our data frame contains one of the 201 X-Files episodes. The first 10 columns contain information about the episode, while the other 6,000+ columns hold an alphabetized list of words and how often each of them appears in the episode.
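
To make that layout concrete, here is a quick peek (a minimal sketch – the column names used below are ones we know exist from the str() outputs elsewhere in this series):

dim(xfmerged) # 201 rows (episodes) and 6162 columns (10 metadata + 6152 terms)
xfmerged[1:3, c("ProdCode", "Title", "abandon", "abduct")]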

Let’s look at what the most prominent words are in the collection of episode summaries. Remember, our corpus cleanup got rid of English stop words (“the”, “and”, etc.) as well as the names of actors (i.e. “Duchovny” won’t appear in the list – but “Mulder” will).

The total count of the words could be misleading though – episode summary length and the way summaries are written skew the numbers. To “normalize” these effects, we will concentrate on whether a word appears in an episode at all. For this, we need to transform the word counts in our DTM to Boolean values – TRUE if the count is 1 or greater, FALSE if the count is zero. We define a simple function numToBool:

numToBool <- function(x) {
        # function to set counts to boolean values
        #
        # Args:
        #  x: an integer
        #
        # Return:
        #  a Boolean value (>=1 TRUE, <1 FALSE) 
        
        ifelse(is.numeric(x), x > 0, NA)
}
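
A quick sanity check on a made-up vector – note that numToBool expects a single value (ifelse keys on the length-1 test), so we apply it element-wise, just as the sapply() call below does:

# element-wise check on a made-up vector
sapply(c(0, 1, 3), numToBool)
## [1] FALSE  TRUE  TRUE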

We apply this function to all of the DTM cells.

xfmergedTF <- xfmerged

# change word counts to Boolean values
xfmergedTF[ , -c(1:10)] <- as.data.frame(lapply(
        xfmergedTF[ , -c(1:10)],FUN = function(x) {
                sapply(x, FUN=numToBool)
                }
        ))

str(xfmergedTF[ , -c(1:10)], list.len = 5)
## 'data.frame':    201 obs. of  6152 variables:
##  $ aaa            : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ aaronson       : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ aback          : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ abandon        : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ abduct         : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##   [list output truncated]

Now we get the sums of all DTM columns (i.e. the number of episodes each word appears in) and divide them by the number of episodes to get relative values.

nxf <- nrow(xfmergedTF)

# get total Boolean counts (all episodes)
xfcountAll <- xfmergedTF[ , -c(1:10)] %>% 
        colSums() %>% 
        sort(decreasing = TRUE) %>% 
        `/`(nxf)  #divide by number of episodes

Let’s look at the first 50 terms:

# show terms with largest counts
head(xfcountAll, 50)
##    sculli      find    mulder     agent      kill      tell     later 
## 0.9900498 0.9353234 0.8905473 0.8457711 0.7810945 0.7810945 0.7462687 
##  meanwhil    believ     arriv       man       one      leav      meet 
## 0.7313433 0.7213930 0.6616915 0.6517413 0.6368159 0.6169154 0.6119403 
##       see      bodi  investig    reveal     howev       two       fbi 
## 0.6069652 0.5970149 0.5970149 0.5920398 0.5820896 0.5721393 0.5572139 
##    discov    return      back      take     claim     visit       die 
## 0.5522388 0.5373134 0.5323383 0.5273632 0.5174129 0.5024876 0.4825871 
##       tri      also      head      home       car      case    realiz 
## 0.4776119 0.4726368 0.4577114 0.4527363 0.4427861 0.4427861 0.4427861 
##      dead     found    appear    attack     insid      show      work 
## 0.4378109 0.4378109 0.4328358 0.4278607 0.4278607 0.4278607 0.4228856 
##   attempt    murder     escap      name      room     anoth     death 
## 0.4179104 0.4179104 0.4129353 0.4129353 0.4129353 0.4079602 0.4079602 
##       ask 
## 0.4029851

Pretty grim terminology, isn’t it? Unsurprisingly, we can see our two main characters’ names taking top spots in our list, along with procedural terms such as “case”, “fbi” and “investigate”. Most of the terms seem to be verbs, the most prominent one being “kill”. Words like “body”, “die”, “dead” and “murder” point to a rather dangerous world the protagonists live in.

What we want to find out now is how the terminology of the “Monster of the Week” and the “Mythology Arc” episodes differs. For this, we will construct a slopegraph of top terms with strong differences. The first step is to separate the two datasets:

# get Boolean counts for Monster and Mythology episodes
xfcountTFMonster <- filter(xfmergedTF, Mythology == FALSE)
xfcountTFMyth <- filter(xfmergedTF, Mythology == TRUE)

Our graph will be based on the most prevalent terms from both of these lists. We therefore need to sort, rank and reconnect them. Since we need to repeat the sorting/ranking for both sets, it makes sense to construct a function. prepCountList() will prepare a ranked list of our terms, sorted by frequency.

prepCountList <- function(df, n) {
        # function to create a ranked list of relative Boolean counts
        #
        # Args:
        #  df: data frame based on the data format in xfmerged
        #  n:  count of episodes
        #
        # Return:
        #  df: a data frame of relative occurrence (part),
        #      term and rank
        
        df <- df[ , -c(1:10)] %>%               # get rid of metadata
                colSums() %>%                   # get overall term counts
                sort(decreasing = TRUE) %>%     # sort high to low
                `/`(n) %>%                      # divide by episode count
                as.data.frame()                 # make data frame
        
        df <- mutate(df, term = rownames(df))   # get row names as variable
        df$rank <- 1:nrow(df)                   # add ranks
        
        colnames(df) <- c("part", "term", "rank") # rename col names
        
        df
}

Now we’ll apply the function to the two data frames:

# get number of episodes
nmyth <- nrow(xfcountTFMyth)
nmon <- nrow(xfcountTFMonster)

nmyth
## [1] 60
nmon
## [1] 141
# create ranked lists for Monster and Mythology episodes
xfcountTFMonster <- prepCountList(xfcountTFMonster, nmon)
xfcountTFMyth <- prepCountList(xfcountTFMyth, nmyth)

head(xfcountTFMonster)
##        part   term rank
## 1 0.9858156 sculli    1
## 2 0.9432624   find    2
## 3 0.8723404  agent    3
## 4 0.8439716 mulder    4
## 5 0.7801418   kill    5
## 6 0.7446809  later    6
head(xfcountTFMyth)
##        part     term rank
## 1 1.0000000   mulder    1
## 2 1.0000000   sculli    2
## 3 0.9166667     find    3
## 4 0.8666667 meanwhil    4
## 5 0.8666667     tell    5
## 6 0.7833333    agent    6

We join the two sets based on the term, so we’ll have both ranked lists in one data frame. We take the top 35 terms from both lists and throw out all terms where the difference in episode occurrence is 10 percentage points or less. What remains is a list of the terms with the biggest difference between “Monster” and “Mythology” episodes.

# join the two ranked lists
xfcountmerged <- full_join(xfcountTFMyth, xfcountTFMonster, by = "term")

str(xfcountmerged)
## 'data.frame':    6152 obs. of  5 variables:
##  $ part.x: num  1 1 0.917 0.867 0.867 ...
##  $ term  : chr  "mulder" "sculli" "find" "meanwhil" ...
##  $ rank.x: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ part.y: num  0.844 0.986 0.943 0.674 0.745 ...
##  $ rank.y: int  4 1 2 9 7 3 5 14 184 8 ...
# create data for slopegraph (top 35 terms for both episode types)
# exclude differences of 10 percentage points or less
xfcountmerged <- xfcountmerged %>% 
        filter(rank.x <= 35 | rank.y <= 35) %>% 
        select(myth = part.x, monster = part.y, term) %>%
        filter(abs(myth-monster) > .1) %>% 
        mutate(myth = round(myth*100,0), monster = round(monster*100,0))

To construct the slopegraph, we’ll use a slightly adapted version of Nathan Yau’s plotting function. For space reasons, the code is sourced from a separate file in my GitHub repo.

# create slopegraph
source("slopegraph.r")

with(xfcountmerged, slopegraph(myth, monster, term))

[Slopegraph: term occurrence (% of episodes) in Mythology vs. Monster episodes]

## [1] "Plot generated: "

The “Mythology” episodes have more aliens and abductions in them; Agent Mulder is mentioned in all of these episodes, and Agent Skinner takes an important role in them too. Protagonists seem to move around more and meet more people (“visit”, “arrive”, “leave”, “return”), and more information is exchanged between people (“tell”, “reveal”, “inform”).

“Monster” episodes on the other hand feature more bodies, attacks and murders, and the situations are described as “cases”. Names and homes are more important.

The differences in the other terms (“meanwhile”, “tri/try”) are difficult to interpret and probably have to do with narrative structures in the episode summaries.

Finally, some cleanup:

# cleanup
rm(xfcountmerged, xfcountTFMonster, xfcountTFMyth, xfmergedTF,
   nmon, nmyth, nxf, xfcountAll, numToBool, prepCountList, slopegraph)

This concludes part 3 of our analysis. In part 4, we will look at characters and their occurrence over the seasons. Stay tuned!

The Truth Is In There – an X-Files episode analysis with R (Part 2)


Part 2: Preparing for text analysis

Cigarette Smoking Man: But I’ve come today not to ask, but to offer. To offer you the truths that you so desperately sought.

This is part 2 of a natural language processing analysis of X-Files episode summaries. In part 1 I explained how to get the data from Wikipedia. The next step is to get the data ready for analysis.

One problem that we’re facing in the episode summaries is that they contain names – lots of names. Some of them are useful (e.g. the main character names). Others, such as cast names, don’t have any relevance to our analysis. Let’s get those names and create stopword lists out of them, i.e. lists of terms that should be excluded from analysis.

The best source for cast and character information is the X-Files Full Credits page on IMDB, where it is stored in the third table. We scrape the table and store the cast and character names (columns V2 and V4) in two separate data frames called xcast and xchar.

casturl <- "http://www.imdb.com/title/tt0106179/fullcredits/"

# get third table (full cast)
xcastchar <- content(GET(casturl))
xcastchar <- readHTMLTable(xcastchar, which = 3)

# get cast and character names in separate tables
xcast <- data.frame(as.character(xcastchar$V2), stringsAsFactors = FALSE)
xchar <- data.frame(as.character(xcastchar$V4), stringsAsFactors = FALSE)

str(xcast)
## 'data.frame':    1932 obs. of  1 variable:
##  $ as.character.xcastchar.V2.: chr  "Gillian Anderson" "David Duchovny" "Mitch Pileggi" "Robert Patrick" ...
str(xchar)
## 'data.frame':    1932 obs. of  1 variable:
##  $ as.character.xcastchar.V4.: chr  " Dana Scully\n         / ...  \n                  (201 episodes, 1993-2002)" " Fox Mulder\n         / ...  \n                  (179 episodes, 1993-2002)" " Walter Skinner\n         / ...  \n                  (81 episodes, 1994-2002)" " John Doggett\n                  (40 episodes, 2000-2002)" ...

The cast list seems clean, but the character list is a little messy. What we really want is two lists of names that we can use as stopwords – i.e. two vectors containing words we can subtract from our episode summary collection.

To do this, we need to treat the cast and character lists as a “text corpus” for use in the tm package. tm is a framework of packages for natural language processing. Our two lists need to be cleaned, which means removing extra whitespace, transforming to lower case, and removing punctuation and numbers. We will also stem the lists – which seems strange for names, but since we will be stemming the episode summaries later on, names can only be removed if they are stemmed too.
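
To see why the stemmed names will look odd later on, here is a minimal sketch using the SnowballC package (the stemmer behind tm’s stemDocument); the example names are ones whose stems show up in our term lists:

library(SnowballC)

# character names get stemmed just like regular words
wordStem(c("scully", "reyes", "covarrubias"), language = "english")
## [1] "sculli"     "rey"        "covarrubia"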

The following function takes care of all that cleaning, creates a document term matrix (DTM) and returns a term list, i.e. a list of all the words remaining after the cleaning.

library(tm)

getcastcorpus <- function(df) {
        # Creates a cleaned, stemmed term list from a data frame of text
        #
        # Args:
        #  df: data frame containing the corpus texts
        #
        # Return:
        #  ffq: a character vector of all terms in the cleaned corpus
        
        corp <- Corpus(DataframeSource(df), readerControl = list(language = "en"))
        
        corp <- tm_map(corp, stripWhitespace) #white spaces
        corp <- tm_map(corp, content_transformer(tolower))  #lower case
        corp <- tm_map(corp, removePunctuation, preserve_intra_word_dashes = FALSE) #regular punctuation
        corp <- tm_map(corp, removeNumbers) # numbers
        corp <- tm_map(corp, stemDocument) # stemming
        
        dtm <- DocumentTermMatrix(corp)
        ffq <- findFreqTerms(dtm)
        
        ffq
}

We now apply the function to our two name lists and receive the names as vectors:

#get cast and character names as separate vectors
xfcastterms <- getcastcorpus(xcast)
xfcharterms <- getcastcorpus(xchar)

str(xfcastterms) 
##  chr [1:2314] "aaron" "aartsen" "abbott" "abdoulay" ...
str(xfcharterms)
##  chr [1:1736] "aaron" "aaronson" "abbott" "abducte" "abl" ...

Almost perfect – one problem remains: Some of the names are also regular English words. For example, a character might be described as “Gas Station Attendant”. We want to keep those words in the analysis, so we need to exclude them from the stopword lists.

For this, I created a list called remcast.csv which includes all those words we don’t want to exclude:

# manual list of non-names found in character description (e.g. "airplane pilot")
remcast <- read.csv("remcast.csv", stringsAsFactors = FALSE)
remcast <- remcast$x
 
head(remcast)
## [1] "abducte"   "abduct"    "abandon"   "addict"    "african"   "afterglow"

Now we subtract the remcast list from both the cast and character vectors. We also make sure “Scully” (stemmed to “sculli”) is taken out of the cast stopword list – it is both a cast and a character name, and we don’t want it removed from the summaries.

# remove all non-name terms from cast and character vectors
xfcastterms <- setdiff(xfcastterms, remcast)
xfcharterms <- setdiff(xfcharterms, remcast)

# remove "scully" from cast vector (it's both a cast an character name)
xfcastterms <- setdiff(xfcastterms, "sculli")

# cleanup
rm(casturl, xcast, xchar, xcastchar, getcastcorpus)

Our stopword lists are ready. On to the real data – our episode summaries. Again, we will use the tm package to create a corpus of summaries, clean it and tokenize it. Since we want to keep our metadata (e.g. our episode titles), we will use a little trick to attach the episode code for later identification (check out the Stack Overflow question mentioned in the code for more info).

# Corpus construction and text cleanup
# http://stackoverflow.com/questions/19850638/tm-reading-in-data-frame-and-keep-texts-id

# create Corpus
m <- list(id = "VarTitle", content = "content")
myReader <- readTabular(mapping = m)
xfcorpus <- Corpus(DataframeSource(xf2), readerControl = list(reader = myReader, language = "en"))

# manually attach the episode code for later identification
for (i in 1:length(xfcorpus)) {
        attr(xfcorpus[[i]], "ID") <- xf2$VarTitle[i]
}

inspect(xfcorpus[1])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  2
## Content:  chars: 4282

Our episode summary corpus is ready for cleaning. Again, we remove white spaces, transform to lower case, remove punctuation and numbers and stem the words. Additionally, we remove non-ASCII symbols, standard English stop words (“the”, “and”, etc.) and our cast stopwords – but not the character stop words.

xfcorpus <- tm_map(xfcorpus, stripWhitespace) #white spaces
xfcorpus <- tm_map(xfcorpus, content_transformer(tolower))  #lower case
xfcorpus <- tm_map(xfcorpus, removeWords, stopwords("english")) #stop words
xfcorpus <- tm_map(xfcorpus, removePunctuation, preserve_intra_word_dashes = FALSE) #regular punctuation
xfcorpus <- tm_map(xfcorpus, content_transformer(function(row) iconv(row, "latin1", "ASCII", sub=""))) # non-ascii chars
xfcorpus <- tm_map(xfcorpus, removeNumbers) # numbers
xfcorpus <- tm_map(xfcorpus, stemDocument) # stemming
xfcorpus <- tm_map(xfcorpus, removeWords, xfcastterms) #remove names from cast list  

We create a DTM out of our corpus…

# Create Document Term Matrix as data frame
xfDTMfull <- DocumentTermMatrix(xfcorpus, control = list(wordLengths = c(3,15)))
xfDTM <- data.frame(as.matrix(xfDTMfull), stringsAsFactors = TRUE)
 
str(xfDTM, list.len = 5) 
## 'data.frame':    201 obs. of  6152 variables:
##  $ aaa            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ aaronson       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ aback          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abandon        : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ abduct         : num  1 0 0 0 0 0 0 0 0 2 ...
##   [list output truncated]

… and re-attach it to our meta-data:

# Merge DTM with episode data => full data set including character names
xfmerged <- merge(xf2, xfDTM, by.x = "VarTitle", by.y = "row.names" )

str(xfmerged, list.len = 15)
## 'data.frame':    201 obs. of  6162 variables:
##  $ VarTitle       : chr  "1X01_DeepThroat" "1X02_Squeeze" "1X03_Conduit" "1X04_TheJerseyDevil" ...
##  $ ProdCode       : chr  "1X01" "1X02" "1X03" "1X04" ...
##  $ Title          : chr  "Deep Throat" "Squeeze" "Conduit" "The Jersey Devil" ...
##  $ epURL          : chr  "http://en.wikipedia.org/wiki/Deep_Throat_(The_X-Files_episode)" "http://en.wikipedia.org/wiki/Squeeze_(The_X-Files)" "http://en.wikipedia.org/wiki/Conduit_(The_X-Files)" "http://en.wikipedia.org/wiki/The_Jersey_Devil_(The_X-Files)" ...
##  $ Director       : chr  "Daniel Sackheim" "Harry Longstreet" "Daniel Sackheim" "Joe Napolitano" ...
##  $ Writer         : chr  "Chris Carter" "Glen Morgan" "Alex Gansa" "Chris Carter" ...
##  $ AirDate        : chr  "September 17, 1993" "September 24, 1993" "October 1, 1993" "October 8, 1993" ...
##  $ Viewers        : num  11.1 11.1 9.2 10.4 8.8 9.5 10 10.7 8.8 10.4 ...
##  $ Mythology      : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
##  $ content.x      : chr  "In southwestern Idaho, near Ellens Air Force Base, military police raid the home of Colonel Robert Budahas, who has stolen a mi"| __truncated__ "In Baltimore, businessman George Usher arrives at his office building. He is watched from a storm drain by someone who then sne"| __truncated__ "At a campground at Lake Okobogee [sic] National Park in Sioux City, Iowa, Darlene Morris witnesses a flash of light outside of "| __truncated__ "In New Jersey in 1947, a man is attacked while fixing a flat tire on the road near the woods. His corpse is later found with hi"| __truncated__ ...
##  $ aaa            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ aaronson       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ aback          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abandon        : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ abduct         : num  0 0 0 0 0 0 0 0 2 1 ...
##   [list output truncated]

Some cleanup:

# cleanup
rm(i, m, myReader)

And we’re done with preprocessing our data! Everything we need is neatly stored in a DTM including episode meta-data.
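
If you want to pick things up in part 3 without re-running the scraping and preprocessing, you could persist the merged data frame at this point (a minimal sketch – the file name is arbitrary):

# save the merged DTM + metadata for later sessions
saveRDS(xfmerged, "xfmerged.rds")
# restore it later with: xfmerged <- readRDS("xfmerged.rds")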

In part 3, we’re going to explore the dataset and look for differences between the “monster of the week” and “mythology arc” episodes.

The Truth Is In There – an X-Files episode analysis with R (Part 1)


The great thing about data science is that you can find data in almost every subject and try out new methods independent of the content. I wanted to explore some natural language processing packages and techniques and was looking for interesting textual data beyond Twitter dumps and R mailing list posts. My choice was the complete list of episode summaries from The X-Files collected on Wikipedia. Not because I am a hardcore fan of the series, but because it ran for a long time (9 seasons with over 200 episodes) and features a rather narrow range of topics. Also, its episodes can be divided into two categories: “Monster of the week” episodes, dealing with all kinds of scary things Mulder and Scully encounter in their cases, and “Mythology Arc” episodes, uncovering The Truth behind the worldwide conspiracy of alien invasions and abductions. This means we can look at how features of the episode summaries relate to those two categories. All in all, this should make for some entertaining data exploration while we wait for the series reboot coming in 2016.

Part 1: Getting the data

Scully: Am I to understand that you want me to debunk the X-Files Project, sir?

Section Chief Blevins: Agent Scully, we trust you’ll make the proper scientific analysis.

The show and all its episodes are well documented on Wikipedia in an almost standard format. There is an overview of all episodes of a season in table format, each of them linking to a detailed description page per episode. I chose Hadley Wickham’s rvest package for the scraping part.

Take a look at the episode tables on Wikipedia: We will be needing tables 2-6 and 8-11 on this page (we leave out table 7 because it refers to the first X-Files movie). While the data is in a nice table format, its contents are not yet tidy (PDF): For one, the most important information – the link to the episode page – is “hidden” within the link of the episode title. Also, the information whether an episode is a “Mythology” episode is hidden in a “double dagger” footnote behind the episode title.

To extract all three data points from the “Title” column, we will need the following function, getHrefs:

library(XML)
library(httr)
library(rvest)

getHrefs <- function(node, encoding) {  
        # The function evaluates every child of an HTML table:
        # If it contains an anchor a, it returns its href attribute
        # (preceded by the protocol and the base url) plus the link text.
        # If not, it just returns the child's text.
        # via lukeA on http://stackoverflow.com/questions/31924546/rvest-table-scraping-including-links
        # The function also evaluates if a cell contains a footnote and marks it with a superscript 1 (¹)
        # For X-Files, a double dagger footnotes a "mythology arc" episode
        #
        # Args:
        #  node: an html node
        #  encoding: text encoding
        #
        # Return:
        #  if the node is a link:
        #   the link text, a "¹" sign if it contains an image, and the link URL
        # 
        #  if the node is not a link:
        #   the node text
        
        x <- xmlChildren(node)$a
        y <- ifelse(is.null(xmlChildren(node)$img), "", "¹")
        
        
        if (!is.null(x)) 
                
                paste0(xmlValue(x),
                       y,
                       "|",
                       "http://",
                       parseURI(url)$server, # url is resolved from the calling environment (set below)
                       xmlGetAttr(x, "href"))
        
        else xmlValue(xmlChildren(node)$text) 
}

This function gets packed into a getTable function to scrape tables:

getTable <- function(url, tabnum) {
        # function to get the table from HTML and apply getHrefs 
        # 
        # Args:
        #  url: The URL
        #  tabnum: the number of the table in the overall HTML code
        #
        # Return:
        #  a table
        
        doc <- content(GET(url))
        tab <- readHTMLTable(doc, which = tabnum, elFun = getHrefs)
        
        tab
}

Finally, we apply the getTable function to scrape all the tables we need and collapse them into one data frame.

url <- "http://en.wikipedia.org/wiki/List_of_The_X-Files_episodes"

xf <- getTable(url, c(2:6, 8:11)) # tables 2-6, 8-11 (table 7 relates to the first movie)

# consolidate column names
cnames <- c("NrInSeries","NrInSeason","Title","Director","Writer","AirDate","ProdCode","Viewers")
xf <- lapply(xf, setNames, cnames)

# collapse table list into one data frame
xf <- do.call(rbind, xf)

str(xf)
## 'data.frame':    201 obs. of  8 variables:
##  $ NrInSeries: Factor w/ 201 levels "1","10","11",..: 1 12 18 19 20 21 22 23 24 2 ...
##  $ NrInSeason: Factor w/ 25 levels "1","10","11",..: 1 12 18 19 20 21 22 23 24 2 ...
##  $ Title     : Factor w/ 201 levels "Beyond the Sea|http://en.wikipedia.org/wiki/Beyond_the_Sea_(The_X-Files)",..: 15 5 20 3 22 17 11 12 19 8 ...
##  $ Director  : Factor w/ 62 levels "Daniel Sackheim",..: 16 2 6 1 9 12 7 4 19 11 ...
##  $ Writer    : Factor w/ 66 levels "Alex Gansa & Howard Gordon",..: 5 3 7 2 3 8 1 8 3 9 ...
##  $ AirDate   : Factor w/ 201 levels "April 15, 1994",..: 22 23 24 18 21 19 20 17 15 16 ...
##  $ ProdCode  : Factor w/ 201 levels "1X01","1X02",..: 24 1 2 3 4 5 6 7 8 9 ...
##  $ Viewers   : Factor w/ 171 levels "","10.0","10.4",..: 9 6 6 16 3 15 17 2 4 15 ...

Lots of over-factorized data – time for some cleanup:

library(dplyr)
library(tidyr)

xf2 <- tbl_df(xf)

xf2 <- xf2 %>% 
        #separate Title, Director, Writer into text and URL columns
        separate(Title, into = c("Title","epURL"), sep = "\\|", extra = "merge") %>% 
        separate(Director, into = c("Director","DirURL"), sep = "\\|", extra = "merge") %>% 
        separate(Writer, into = c("Writer","WritURL"), sep = "\\|", extra = "merge") %>% 
        #differentiate between Monster and Mythology episodes
        mutate(Mythology = grepl("¹", Title)) %>% 
        mutate(Title = sub("¹", "", Title)) %>% 
        #rearrange and drop unnecessary columns
        select(ProdCode, Title, epURL, Director, Writer, AirDate, Viewers, Mythology) %>% 
        # get rid of factors
        mutate(ProdCode = as.character(ProdCode), 
               AirDate = as.character(AirDate),
               Viewers = as.numeric(as.character(Viewers))
        ) %>% 
        # add title without spaces and punctuation for use in variable names
        mutate(VarTitle = paste(ProdCode, gsub("[^[:alnum:]]", "", Title, perl = TRUE), sep = "_"))

str(xf2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    201 obs. of  9 variables:
##  $ ProdCode : chr  "1X79" "1X01" "1X02" "1X03" ...
##  $ Title    : chr  "Pilot" "Deep Throat" "Squeeze" "Conduit" ...
##  $ epURL    : chr  "http://en.wikipedia.org/wiki/Pilot_(The_X-Files)" "http://en.wikipedia.org/wiki/Deep_Throat_(The_X-Files_episode)" "http://en.wikipedia.org/wiki/Squeeze_(The_X-Files)" "http://en.wikipedia.org/wiki/Conduit_(The_X-Files)" ...
##  $ Director : chr  "Robert Mandel" "Daniel Sackheim" "Harry Longstreet" "Daniel Sackheim" ...
##  $ Writer   : chr  "Chris Carter" "Chris Carter" "Glen Morgan" "Alex Gansa" ...
##  $ AirDate  : chr  "September 10, 1993" "September 17, 1993" "September 24, 1993" "October 1, 1993" ...
##  $ Viewers  : num  12 11.1 11.1 9.2 10.4 8.8 9.5 10 10.7 8.8 ...
##  $ Mythology: logi  TRUE TRUE FALSE FALSE FALSE FALSE ...
##  $ VarTitle : chr  "1X79_Pilot" "1X01_DeepThroat" "1X02_Squeeze" "1X03_Conduit" ...

Nice – everything tidied up. We have even got the “Mythology” tag as a Boolean value in its own column, which will come in handy later. But wait… isn’t there something missing? Right – this is only metadata! What we still need are the episode summaries themselves.

The summaries are stored as text on separate pages, which can be reached via the links in our epURL column. Check out the summary for the pilot episode: What we need can be found in the paragraphs after the heading “Plot”. Since we are dealing with 201 episodes, we will need another function to get them:

getPlot <- function(url, t = "Plot", h = "h2") {
        # function to extract the plot from an episode page
        # xpath code from http://stackoverflow.com/questions/18167279/trying-to-get-all-p-tag-text-between-two-h2-tags
        # t refers to span ID, h refers to heading tag level
        #
        # Args:
        #  url: the URL of a Wikipedia page containing an episode summary
        #  t:   a heading name (defaults to "Plot")
        #  h:   the heading tag level (defaults to "h2")
        #
        # Return:
        #  the text of the paragraphs after the specified headings
        #  until the next heading, as character string
        
                
        xp = paste0("//p[preceding-sibling::", h, "[1][span='", t, "']]")
        
        eplot <- read_html(url) %>% 
                # get the nodes following the h2 "Plot" until the next h2
                html_nodes(xpath = xp) %>% 
                # strip tags
                html_text() %>%
                # concatenate vector of texts into one string
                paste(collapse = "")
        
        eplot
}

We will apply this function to all of the entries in our epURL column and store the results in a new column content. Depending on your internet connection, this can take a few minutes.

# get plot summaries
xf2$content <- unlist(lapply(xf2$epURL, getPlot))
str(xf2$content)
##  chr [1:201] "In Bellefleur, Oregon, teenager Karen Swenson is seen fleeing through the forest. When she falls, a dark figure approaches, and"| __truncated__ ...

Unfortunately, not all episode pages are formatted the same way. One of them, episode 7ABX04, has its plot summary under an h3 (not h2) heading. We will need to import it separately.

# get missing plot summary from episode 7ABX04
# (plot is nested in "Synopsis" under h3 tag)
xf2$content[xf2$ProdCode == "7ABX04"] <- getPlot(
        "https://en.wikipedia.org/wiki/The_Sixth_Extinction_II:_Amor_Fati",
        t = "Plot", h = "h3") 

The episode summaries are complete, but unfortunately contain a lot of punctuation typos. In particular, many spaces are missing after periods, commas, etc., which will bite us in the tokenization step if we don’t take care of it. So let’s apply a little regex magic to our content column:

# replace missing spaces after punctuation in episode summaries
xf2 <- mutate(xf2, content = gsub(pattern = "([.,!?;:])([^ ])", rep = "\\1 \\2", content))
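
A quick illustration of the pattern on a made-up string:

# insert a space after punctuation wherever one is missing
gsub("([.,!?;:])([^ ])", "\\1 \\2", "He ran.Fast,very fast!Then he stopped.")
## [1] "He ran. Fast, very fast! Then he stopped."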

Looks like we have what we need in our table. Let’s get rid of the intermediate data:

# clean up variables and functions
rm(cnames, url, getHrefs, getPlot, getTable, xf)

In part 2, I will show you how to prepare the content data for natural language processing using the tm package.

Use machine-learning to find a family-friendly restaurant


What makes a restaurant child-friendly, and how can I use data to predict it? That’s the question I spent the last two months on. As part of my capstone submission to Johns Hopkins University’s Data Science Specialization, I used R to build a machine-learning prediction algorithm which distinguishes child-friendly restaurants and food-related businesses from child-unfriendly ones using Yelp data. I was able to get a solid accuracy of 89% (9 out of 10 predictions are correct) with an excellent sensitivity of 97% (97 out of 100 truly child-friendly venues are correctly identified). That’s enough for a recommendation engine.
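
As a quick reminder of what those two numbers mean, here is a minimal sketch with a made-up confusion matrix (the counts are purely illustrative, not the actual capstone results):

# made-up confusion matrix counts for a child-friendliness classifier
tp <- 97  # truly child-friendly, predicted child-friendly
fn <- 3   # truly child-friendly, predicted child-unfriendly
tn <- 81  # truly child-unfriendly, predicted child-unfriendly
fp <- 19  # truly child-unfriendly, predicted child-friendly

(tp + tn) / (tp + tn + fp + fn) # accuracy: share of correct predictions (0.89)
tp / (tp + fn)                  # sensitivity: share of child-friendly venues found (0.97)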

In a nutshell, the algorithm says that if you want to enjoy a family outing, pick a place that offers take-out or catering, is inexpensive, has a casual dress code, is non-smoking, is known for great lunches, and is frequented by groups. On the other hand, if you are out with children, stay away from bars and other places that serve alcohol, venues known for great dinners and late-night entertainment, “New American”-style cuisine, expensive, trendy and dressy venues, and places that offer street parking and outdoor seating.

Check it out and let me know what you think in the comments!

Drop that jargon: It’s time for new HR metaphors


“All the world’s a stage,
And all the men and women merely players:
They have their exits and their entrances;
And one man in his time plays many parts.”
– William Shakespeare

Shakespeare has an invitation for us: “Imagine that we were all actors”, he asks. With his analogy between life and stage play, he chooses a lens through which the reader can look at the world. This makes it easy to understand what he means; it also allows us to take the analogy further and to ask interesting questions: Are our possessions just props? Can we go “off stage”? Who is our audience?

The Power of Metaphors

Metaphors and analogies are an example of how language influences thinking in a subtle yet powerful way. On the one hand, they provide an easily accessible toolbox of mental models which ease thinking and communication. On the other hand, those very mental models rely on unspoken assumptions. Shakespeare’s analogy above suggests that there are two layers of reality (stage/audience), and that people play only one role at a time. Those assumptions remain unchallenged if we choose to use the metaphors.

Business contexts are not immune to language’s influence, as the abundance of jargon shows. Technical terms, abbreviations and buzzwords are an integral part of any discipline. Some of them cross boundaries and infect other areas. Young, conceptual disciplines such as HR are especially prone to borrowing jargon, as their language is still much more in flux than that of established areas such as chemistry. But if language influences thinking, wouldn’t the jargon we use influence (or bias) our decisions?

Talent Management: An Engineering Domain

Let’s take the example of Talent Management, where practitioners have become used to expressions such as “talent pipelines”, “platforms”, “lifecycles” or “recruiting”. All those terms are derived from other disciplines. In a quick and dirty text analysis of the most recent 25 articles from Harvard Business Review’s “Talent Management” category, I found that 6 out of 10 jargon terms come from the engineering/physics area (e.g. “process”, “build”, “potential”, “system”), followed by military terms (“engage”, “recruit”, “strategy”). There are some, but only a few, terms from other areas.

When using these metaphors, we rely on assumptions from those very disciplines – be it a mechanistic engineering view that a “system” can be “built”, or the strategic military considerations that a “war for talent” can be “won”. But do we really want to accept those assumptions?

Where are the Other Metaphors?

Creativity techniques emphasize the importance of outside influence, stimulating “out of the box” thinking. How about using a different vocabulary for Talent Management? A change of language would introduce different mental models, challenge assumptions and help us find new approaches. Chemistry could help us find talent oxidation, free radicals and leadership crystallization; biology would introduce cross-pollination of skills, symbiotic development and talent spores; the arts might lead us to leadership genres, talent rituals and the right balance between skill expression and technique.

In the end, we might find out that the engineering language is still the best of all of those jargons – but borrowing a different toolbox for a project, workshop or strategy meeting might help you think differently about that well-worn hammer you’ve been using all those years.

What jargons and metaphors would you use for talent management? Let me know in the comments!

(picture by Dierk Schaefer, flickr.com/photos/dierkschaefer, used under CC BY 2.0)