In this exercise, we will perform text preprocessing using the tm
package.
Text preprocessing is a crucial step in many data science projects,
especially those involving natural language processing.
It involves transforming raw text data into a format
that can be easily understood and analyzed by text mining algorithms.
To do so, you need to install the tm (i.e., text mining) package. This package expects the SnowballC and slam packages to be installed. Good documentation about the tm package can be found here.
p_load(SnowballC, slam, tm)
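Note that p_load() is not part of base R; it comes from the pacman package. A minimal, self-contained setup might look as follows (the install.packages() guard is an addition for readers who do not have pacman yet, not part of the original exercise):

```r
# p_load() is provided by the pacman package; install it first if needed
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(SnowballC, slam, tm)  # installs (if missing) and loads all three
```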
In the text preprocessing step, everything is transformed to lower case, and punctuation, numbers, and superfluous whitespace are removed.
First, we load the productreviews
data as a corpus object.
reviews <- Corpus(VectorSource(reviews))
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 5
This tells us that we have 5 documents (reviews in this case). A SimpleCorpus is a type of VCorpus, which stands for Volatile Corpus, meaning that it is stored in RAM instead of on disk. SimpleCorpus is an optimized, faster, but more limited version of VCorpus. Note that a corpus refers to a collection of documents; it is the main structure for managing documents in the tm package. VectorSource means we are reading in a character vector. The output also tells us that there is no metadata on the corpus level or document level.
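To make the corpus mechanics concrete, here is a small sketch with a made-up two-review vector (the docs object is hypothetical, standing in for the real productreviews data):

```r
library(tm)

# two made-up reviews, standing in for the real productreviews vector
docs <- c("Great product, works well!",
          "Arrived late and the box was damaged.")
corp <- Corpus(VectorSource(docs))
print(corp)      # reports a SimpleCorpus with 2 documents
inspect(corp[1]) # shows the metadata summary and content of the first document
```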
Let’s look at the first document:
reviews[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 385
We see that there are 7 metadata fields. These can be accessed using str(reviews[[1]]). We also see the number of characters (385).
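Besides str(), the tm package exposes metadata and content through accessor functions. A quick sketch (the "id" field name follows tm's standard per-document metadata scheme):

```r
meta(reviews[[1]])                 # all 7 metadata fields of the first document
meta(reviews[[1]], "id")           # a single metadata field
as.character(reviews[[1]])         # the raw text content
nchar(as.character(reviews[[1]]))  # the character count reported above
```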
In what follows, we will try to reduce the number of unique (and meaningless) words. We measure the number of unique words for a document in the corpus as follows:
unique_word_count <- function(data){
  # requires purrr (map) and stringr (str_split), e.g. p_load(purrr, stringr)
  content <- character(length(data))
  if (inherits(data, c("VCorpus", "Corpus", "SimpleCorpus"))) {
    # extract the raw text from each document in the corpus
    for (i in seq_along(data)) content[i] <- data[[i]]$content
  } else {
    content <- data
  }
  # split each document on spaces and count the distinct tokens
  uniquewords <- unique(unlist(map(
    str_split(as.character(content), " "),
    unique)))
  length(uniquewords)
}
unique_word_count(reviews)
[1] 295
Now we can apply the preprocessing functions from the tm package and use unique_word_count to track their effect.
Case conversion transforms all values to lower case. The tm_map function is similar to the apply function in that it applies a function to each element (each document in this case). The function content_transformer allows us to adapt the str_to_lower function to work with the documents. Note that the code will also work with the base R alternative tolower.
reviews <- tm_map(reviews, content_transformer(str_to_lower))
unique_word_count(reviews)
[1] 283
reviews <- tm_map(reviews, removePunctuation)
unique_word_count(reviews)
[1] 256
reviews <- tm_map(reviews, removeNumbers)
unique_word_count(reviews)
[1] 254
reviews <- tm_map(reviews, stripWhitespace)
If you want to look at the result (e.g., the first document),
you could use as.character(reviews[[1]])
.
In the example above, several text preprocessing steps were briefly introduced. However, it is also possible to combine all the different steps into one integrated code block. To do so, the pipe operator %>% is used, which is made available by the dplyr package (it originates from magrittr).
productreviews_preprocessed <- Corpus(VectorSource(productreviews)) %>%
tm_map(content_transformer(str_to_lower)) %>%
tm_map(removePunctuation) %>%
tm_map(removeNumbers) %>%
tm_map(stripWhitespace)
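As a sanity check of the combined pipeline, the same chain can be run on a small made-up vector (toy is hypothetical, and the exact console output may differ slightly across tm versions):

```r
toy <- c("Item 1 ARRIVED fast!!", "Item 2 arrived SLOW...")
toy_clean <- Corpus(VectorSource(toy)) %>%
  tm_map(content_transformer(str_to_lower)) %>%  # lower case
  tm_map(removePunctuation) %>%                  # drop punctuation
  tm_map(removeNumbers) %>%                      # drop digits
  tm_map(stripWhitespace)                        # collapse repeated spaces
as.character(toy_clean[[1]])  # e.g. "item arrived fast"
```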
Perform text preprocessing on the subset of the product reviews, given
as corpus below, and store it as productreviews_preprocessed
.
Make sure to perform the following preprocessing steps:
- Normalization,
- Remove punctuation,
- Remove numbers,
- Strip whitespaces.
To download the productreviews dataset, click here.
Assume that: