STEP 1: Text preprocessing

Spell checking

This section is based on Rasmus Bååth’s Research Blog1. We load a list of words in natural language, sorted by frequency of appearance. This is a rather quick-and-dirty approach to spell checking.

load(file="wordListSpelling.Rdata")

We create a function to correct misspelled words. You can try out this function in your R console, using the examples below.

correct <- function(word) { 
  # How dissimilar is this word from all words in the wordlist?
  edit_dist <- adist(word, wordlist)
  
  # Is there a word that is reasonably similar (edit distance <= 2)?
  # If yes, select the first match (wordlist is sorted from most to least common)
  # If no, fall back to the original word
  c(wordlist[edit_dist <= min(edit_dist, 2)], word)[1]
}
correct("speling")
[1] "spelling"
correct("goodd")
[1] "good"
correct("corect")
[1] "correct"
correct("corrct")
[1] "correct"

We convert the data back to a normal character vector. If we looked at a document using str, we would see that it is a list with the actual text stored in the element “content”. Since this extraction will often be needed, we make the following function:

corpus2text <- function(corpus) {
  content <- character(length(corpus))
  if (any(class(corpus) %in% c("VCorpus", "Corpus","SimpleCorpus"))) {
    for (i in seq_along(corpus)) content[i] <- corpus[[i]]$content
  }
  content
}
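Each document in a tm corpus is a list whose text sits in the element content, which is exactly what the loop above extracts. A self-contained illustration of that step, mimicking the corpus structure with a plain list (the data here is made up for illustration):

```r
# Mimic the corpus structure: each "document" is a list that stores
# its text in the element `content`
fake_corpus <- list(
  list(content = "great product"),
  list(content = "terrible service")
)

# Extract the text of every document into a character vector
content <- vapply(fake_corpus, function(d) d$content, character(1))
content
# a character vector: "great product", "terrible service"
```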

reviews <- corpus2text(reviews)

Next, we pre-allocate a vector where we will store the spell-checked reviews.

reviews_spell_checked <- character(length(reviews))

The following loop takes a while:

p_load(tictoc)
tic()
for (i in 1:length(reviews)){
  #Instead of applying correct() to each word, we first make the words unique
  #The str_split() function splits the string into words
  count <- table(str_split(reviews[i],' ')[[1]])
  words <- names(count)
  
  #Then we apply our correct function to each unique word
  words <- map_chr(words,correct)
  
  #Next we reconstruct the original vector, now spell-checked. We do this
  #because we are going to exploit the tm package, which will count the
  #words for us.
  words <- rep(words,count)
  
  #Concatenate back to a string
  reviews_spell_checked[i] <- paste(words, collapse=" ")
  
  #Print progress to the screen
  if((i %% max(floor(length(reviews)/10),1))==0) 
    cat(round((i/length(reviews))*100),"%\n")
}
toc()
53.18 sec elapsed
unique_word_count(reviews_spell_checked)
[1] 250

The word count is further reduced, but is all this worth the long processing time?
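The unique-then-rebuild trick in the loop above can be sketched in isolation. Note that word order may change, because table() sorts the unique words alphabetically; this is harmless here, since the document-term matrix only counts words. The sketch uses base strsplit instead of stringr's str_split:

```r
s <- "good good speling"

# Count each unique word once (table sorts the names alphabetically)
count <- table(strsplit(s, " ")[[1]])
words <- names(count)                 # the unique words: "good" "speling"

# ... here each unique word would be passed through correct() ...

# Rebuild the full vector with the original multiplicities
rebuilt <- rep(words, count)
paste(rebuilt, collapse = " ")
# [1] "good good speling"
```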

Tokenization

When using the tm package, you don’t need to perform tokenization. The tm package does this internally when creating the Corpus.

Remove stopwords (Term filtering)

Since we want to use the tm package, we first make a corpus.

reviews_spell_checked <- Corpus(VectorSource(reviews_spell_checked))

forremoval <- stopwords('english')
head(forremoval)
[1] "i"      "me"     "my"     "myself" "we"     "our"
reviews_spell_checked <- tm_map(reviews_spell_checked, removeWords, c(forremoval))

unique_word_count(reviews_spell_checked)
[1] 197
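Conceptually, what removeWords does for each document amounts to dropping the tokens that appear in the stopword list. A minimal base-R sketch, using a tiny hand-written list as a stand-in for stopwords('english'):

```r
stops <- c("i", "the", "a")           # tiny stand-in for stopwords('english')

# Tokenize one document and keep only the non-stopword tokens
tokens <- strsplit("i like the product", " ")[[1]]
filtered <- paste(tokens[!tokens %in% stops], collapse = " ")
filtered
# [1] "like product"
```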

We can also delete rare words. This is often done before creating the dtm, but when using the tm package it is better to do it after creating the dtm, with other packages (e.g., tidytext).

Stemming

Finally, let’s stem the reviews. We will use the Porter stemmer, since lemmatization is not supported in the tm package. A good package for both stemming and lemmatization is textstem. Lemmatization will be covered in lecture 5, when sentiment analysis is discussed.

reviews_spell_checked <- tm_map(reviews_spell_checked, stemDocument)

unique_word_count(reviews_spell_checked)
[1] 178

Exercise 1

Write the function correct and check whether the words "cours" and "profesor" are correctly spelled. If not, what is the correct spelling according to the function? Save your result as spelling_word1 and spelling_word2.

Exercise 2

Perform text preprocessing on the subset of the productreviews dataset, given as a corpus below, and store it as productreviews_preprocessed. Make sure to perform the following preprocessing steps:

- Remove stopwords,

- Stem the documents.

To download the productreviews dataset click here2.

To download the wordListSpelling dataset click here3.


Assume that: