This section is based on Rasmus Bååth’s Research Blog [1]. We load a list of words, sorted by how frequently they appear in natural language. The approach is a quick-and-dirty form of spell checking.
load(file="wordListSpelling.Rdata")
We create a function to correct misspelled words. You can try out this function in your R console, using the examples below.
correct <- function(word) {
  # How dissimilar is this word from all words in the wordlist?
  edit_dist <- adist(word, wordlist)
  # Is there a word that is reasonably similar (at most 2 edits away)?
  # If yes, which ones? Select the first result (because wordlist is sorted from most common to least common)
  # If no, fall back to the original word, which is appended as the last candidate
  c(wordlist[edit_dist <= min(edit_dist, 2)], word)[1]
}
correct("speling")
[1] "spelling"
correct("goodd")
[1] "good"
correct("corect")
[1] "correct"
correct("corrct")
[1] "correct"
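To see why the function behaves this way, it can help to inspect the edit distances directly. The sketch below is only an illustration: `toy_wordlist` and `correct_with()` are hypothetical names, not part of the course material, and the tiny word list stands in for the real frequency-sorted `wordlist`.

```r
# Hypothetical mini word list, ordered from most to least frequent
toy_wordlist <- c("the", "good", "spelling", "correct", "spewing")

# Same logic as correct(), but with the word list passed in explicitly
correct_with <- function(word, wordlist) {
  edit_dist <- adist(word, wordlist)
  c(wordlist[edit_dist <= min(edit_dist, 2)], word)[1]
}

# adist() returns a matrix: one row per input word, one column per candidate
d <- adist("speling", toy_wordlist)
colnames(d) <- toy_wordlist
print(d)

# Both "spelling" and "spewing" are one edit away from "speling"
candidates <- toy_wordlist[d <= min(d, 2)]
print(candidates)

# The tie is broken by position, so the more frequent word wins
result <- correct_with("speling", toy_wordlist)
print(result)

# A string with no candidate within 2 edits is returned unchanged
unknown <- correct_with("xyzzyq", toy_wordlist)
print(unknown)
```

The tie between the two candidates shows why the ordering of the word list matters: the function has no notion of context, so it simply prefers the more common word.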
We convert the data back to a normal character vector. If we looked at a document using str, we would see that it is a list with the actual text stored in the element “content”. Since this conversion will be used often, we make the following function:
corpus2text <- function(corpus) {
  # Pre-allocate one string per document
  content <- character(length(corpus))
  if (inherits(corpus, c("VCorpus", "Corpus", "SimpleCorpus"))) {
    # Loop over the corpus that was passed in, not over a global object
    for (i in seq_along(corpus)) content[i] <- corpus[[i]]$content
  }
  content
}
reviews <- corpus2text(reviews)
Next, we pre-allocate a vector where we will store the spell-checked reviews.
reviews_spell_checked <- character(length(reviews))
The following loop takes a while:
p_load(tictoc)
tic()
for (i in seq_along(reviews)) {
  # Instead of applying correct() to each word, we first make them unique
  # The str_split() function splits the string into words
  count <- table(str_split(reviews[i], ' ')[[1]])
  words <- names(count)
  # Then we apply our correct function to each unique word
  words <- map_chr(words, correct)
  # Next we reconstruct the original vector, but now spell-checked.
  # table() sorts the words alphabetically, so the word order is lost. That is fine here,
  # because we are going to exploit the tm package, which will count the words for us.
  words <- rep(words, count)
  # Concatenate back to a string
  reviews_spell_checked[i] <- paste(words, collapse = " ")
  # Print progress to the screen
  if ((i %% max(floor(length(reviews) / 10), 1)) == 0)
    cat(round((i / length(reviews)) * 100), "%\n")
}
toc()
53.18 sec elapsed
unique_word_count(reviews_spell_checked)
[1] 250
The word count is further reduced, but is all this worth the long processing time?
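One possible way to reduce the processing time, assuming most of it is spent in repeated calls to adist(), is to correct every distinct word in the whole collection only once and then look the corrections up. The sketch below uses base R only; `toy_reviews`, `toy_wordlist`, and `correct_with()` are hypothetical stand-ins for the real data and for correct(), and no timing claim is made for the real reviews.

```r
# Toy stand-ins (hypothetical): three short reviews and a tiny frequency-sorted word list
toy_reviews <- c("goodd phone goodd price", "corect speling", "good phone")
toy_wordlist <- c("good", "phone", "price", "correct", "spelling")

# Same logic as correct(), but with the word list passed in explicitly
correct_with <- function(word, wordlist) {
  edit_dist <- adist(word, wordlist)
  c(wordlist[edit_dist <= min(edit_dist, 2)], word)[1]
}

# Split every review once
tokens <- strsplit(toy_reviews, " ", fixed = TRUE)

# Correct each distinct word across ALL reviews only once;
# vapply() names the result by the input words, giving a lookup table
vocabulary <- unique(unlist(tokens))
lookup <- vapply(vocabulary, correct_with, character(1), wordlist = toy_wordlist)

# Map every token through the lookup table; word order within a review is kept
toy_checked <- vapply(
  tokens,
  function(w) paste(unname(lookup[w]), collapse = " "),
  character(1)
)
print(toy_checked)
```

The gain depends on how much vocabulary the reviews share: the more often the same words recur across reviews, the fewer calls to the correction function are needed.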
When using the tm package, you don’t need to perform tokenization. The tm package does this internally when creating the Corpus.
Since we want to use the tm package, we first make a corpus.
reviews_spell_checked <- Corpus(VectorSource(reviews_spell_checked))
forremoval <- stopwords('english')
head(forremoval)
[1] "i" "me" "my" "myself" "we" "our"
reviews_spell_checked <- tm_map(reviews_spell_checked, removeWords, c(forremoval))
unique_word_count(reviews_spell_checked)
[1] 197
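The effect of stopword removal can be reproduced on plain strings in base R. This is only an illustrative sketch and is not claimed to be how tm implements removeWords; `toy_stopwords` and `toy_text` are hypothetical.

```r
# Hypothetical stop list and texts
toy_stopwords <- c("i", "me", "my", "the", "is")
toy_text <- c("i like my phone", "the price is fine")

# Build one pattern that matches any stop word as a whole word only,
# so the "i" inside "like" or "price" is left alone
pattern <- paste0("\\b(", paste(toy_stopwords, collapse = "|"), ")\\b")
removed <- gsub(pattern, "", toy_text, perl = TRUE)

# Deleting words leaves extra spaces behind, so squeeze and trim them
cleaned <- trimws(gsub("\\s+", " ", removed))
print(cleaned)
```

Within a tm pipeline, the leftover whitespace can be cleaned up with tm_map(..., stripWhitespace).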
We can also delete rare words. However, when using the tm package it is better to do this after creating the dtm. With other packages (e.g., tidytext), this is often done before making the dtm.
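Removing rare terms once a document-term matrix exists can be sketched in base R as follows. The toy matrix, the names `toy_docs` and `toy_dtm`, and the threshold `min_docs` are all hypothetical; with a real tm dtm, the function removeSparseTerms() serves a similar purpose.

```r
# Hypothetical toy documents
toy_docs <- c("good phone good price", "good price", "bad phone", "good battery")
tokens <- strsplit(toy_docs, " ", fixed = TRUE)

# Toy document-term matrix: rows are documents, columns are terms
vocabulary <- sort(unique(unlist(tokens)))
toy_dtm <- t(vapply(
  tokens,
  function(w) tabulate(match(w, vocabulary), nbins = length(vocabulary)),
  integer(length(vocabulary))
))
colnames(toy_dtm) <- vocabulary

# Document frequency: in how many documents does each term occur?
doc_freq <- colSums(toy_dtm > 0)

# Keep only terms that occur in at least min_docs documents (assumed threshold)
min_docs <- 2
toy_dtm_trimmed <- toy_dtm[, doc_freq >= min_docs, drop = FALSE]
print(colnames(toy_dtm_trimmed))
```

Filtering on document frequency rather than total count means a word repeated many times in a single review is still treated as rare.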
Finally, let’s stem the reviews. We will use the Porter stemmer, since lemmatization is not supported in the tm package. A good package for both stemming and lemmatization is textstem. Lemmatization will be covered in lecture 5, when sentiment analysis is discussed.
reviews_spell_checked <- tm_map(reviews_spell_checked, stemDocument)
unique_word_count(reviews_spell_checked)
[1] 178
Write the function correct and check whether the words "cours" and "profesor" are correctly spelled. If not, what is the correct spelling according to the function? Save your results as spelling_word1 and spelling_word2.
Perform text preprocessing on the subset of productreviews, given as corpus below, and store it as productreviews_preprocessed.
Make sure to perform the following preprocessing steps:
- Remove stopwords,
- Stemming.
To download the productreviews dataset, click here [2].
To download the wordListSpelling dataset, click here [3].
Assume that the productreviews and wordlist datasets are given.