In this exercise, we will perform text preprocessing using the tm
package.
Text preprocessing is a crucial step in many data science projects,
especially those involving natural language processing.
It involves transforming raw text data into a format
that can be easily understood and analyzed by text mining algorithms.
To do so, you need to install the tm (i.e., text mining) package. This package expects the SnowballC and slam packages to be installed. Good documentation about the tm package can be found here.
p_load(SnowballC, slam, tm)
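Note that p_load() is not part of base R; it comes from the pacman package. A minimal, self-contained setup might look as follows (the install.packages() guard is an addition for readers who do not have pacman yet, not part of the original exercise):

```r
# p_load() is provided by the pacman package; install it first if needed
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(SnowballC, slam, tm)  # installs (if missing) and loads all three
```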
In the text preprocessing step, everything is transformed to lower case, and punctuation, numbers, and superfluous whitespace are removed.
First, we load the productreviews
data as a corpus object.
reviews <- Corpus(VectorSource(reviews))
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 5
This tells us that we have 5 documents (reviews in this case). A SimpleCorpus is a type of VCorpus, which stands for Volatile Corpus, meaning that it is stored in RAM instead of on disk. SimpleCorpus is an optimized, faster, but more limited version of VCorpus. Note that a corpus refers to a collection of documents; it is the main structure for managing documents in the tm package. VectorSource means we are reading in a character vector. The output also tells us that there is no metadata on the corpus level or document level.
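To make the corpus mechanics concrete, here is a small sketch with a made-up two-review vector (the docs object is hypothetical, standing in for the real productreviews data):

```r
library(tm)

# two made-up reviews, standing in for the real productreviews vector
docs <- c("Great product, works well!",
          "Arrived late and the box was damaged.")
corp <- Corpus(VectorSource(docs))
print(corp)      # reports a SimpleCorpus with 2 documents
inspect(corp[1]) # shows the metadata summary and content of the first document
```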
Let’s look at the first document:
reviews[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 385
We see that there are 7 metadata fields. These can be accessed using str(reviews[[1]]). We also see the number of characters (385).
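Besides str(), the tm package exposes metadata and content through accessor functions. A quick sketch (the "id" field name follows tm's standard per-document metadata scheme):

```r
meta(reviews[[1]])                 # all 7 metadata fields of the first document
meta(reviews[[1]], "id")           # a single metadata field
as.character(reviews[[1]])         # the raw text content
nchar(as.character(reviews[[1]]))  # the character count reported above
```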
In what follows, we will try to reduce the number of unique (and meaningless) words. We measure the number of unique words for a document in the corpus as follows:
unique_word_count <- function(data){
  # requires purrr (map) and stringr (str_split), e.g. p_load(purrr, stringr)
  content <- character(length(data))
  if (inherits(data, c("VCorpus", "Corpus", "SimpleCorpus"))) {
    # extract the raw text from each document in the corpus
    for (i in seq_along(data)) content[i] <- data[[i]]$content
  } else {
    content <- data
  }
  # split each document on spaces and count the distinct tokens
  uniquewords <- unique(unlist(map(
    str_split(as.character(content), " "),
    unique)))
  length(uniquewords)
}
unique_word_count(reviews)
[1] 295
Now we can apply the preprocessing functions from the tm package and use unique_word_count to track their effect.
Case conversion transforms all values to lower case. The tm_map function is similar to the apply function in that it applies a function to each element (each document in this case). The function content_transformer allows us to adapt the str_to_lower function to work with the documents. Note that the code will also work with the base R alternative tolower.
reviews <- tm_map(reviews, content_transformer(str_to_lower))
unique_word_count(reviews)
[1] 283
reviews <- tm_map(reviews, removePunctuation)
unique_word_count(reviews)
[1] 256
reviews <- tm_map(reviews, removeNumbers)
unique_word_count(reviews)
[1] 254
reviews <- tm_map(reviews, stripWhitespace)
If you want to look at the result (e.g., the first document),
you could use as.character(reviews[[1]])
.
In the example above, several text preprocessing steps were briefly introduced. However, it is also possible to combine all the different steps into one integrated code block. To do so, the pipe operator %>% is used, which is made available by the dplyr package (it originates from magrittr).
productreviews_preprocessed <- Corpus(VectorSource(productreviews)) %>%
tm_map(content_transformer(str_to_lower)) %>%
tm_map(removePunctuation) %>%
tm_map(removeNumbers) %>%
tm_map(stripWhitespace)
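As a sanity check of the combined pipeline, the same chain can be run on a small made-up vector (toy is hypothetical, and the exact console output may differ slightly across tm versions):

```r
toy <- c("Item 1 ARRIVED fast!!", "Item 2 arrived SLOW...")
toy_clean <- Corpus(VectorSource(toy)) %>%
  tm_map(content_transformer(str_to_lower)) %>%  # lower case
  tm_map(removePunctuation) %>%                  # drop punctuation
  tm_map(removeNumbers) %>%                      # drop digits
  tm_map(stripWhitespace)                        # collapse repeated spaces
as.character(toy_clean[[1]])  # e.g. "item arrived fast"
```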
Perform text preprocessing on the subset of the product reviews, given
as corpus below, and store it as productreviews_preprocessed
.
Make sure to perform the following preprocessing steps:
- Normalization,
- Remove punctuation,
- Remove numbers,
- Strip whitespaces.
To download the productreviews dataset, click here.
Assume that: