Introduction to tidytext

All the text mining steps that we performed with the tm package in the previous exercises can also be carried out with the tidytext package. In the following exercises we will again work with the productreviews dataset, this time using tidytext. Because the data are the same as in the tm exercises, we refer to exercise 2.1 for an exploration of the data and start immediately with preprocessing. In the text preprocessing step, everything is converted to lower case, punctuation and stopwords are removed, and a spelling check is performed. Let’s start by reading in the productreviews dataset.

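# p_load() comes from the pacman package; readr, dplyr, stringr and purrr are used by the code below
p_load(readr, dplyr, stringr, purrr, tidytext)
# Read one review per line into a single column named X1
reviews <- read_delim("productreviews.csv", delim = "\n", col_names = FALSE)
reviews_text <- reviews %>% pull(X1)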

If you want to know more about the tidytext package, please check the following links:

Converting to and from non-tidy formats¹ for a chapter about the link between tm and tidytext,
The tidy text format² for a general introduction to tidytext,
Analyzing word and document frequency: tf-idf³ for tf-idf specific tasks.

STEP 1: Text preprocessing

We will try to reduce the number of unique (and often meaningless) words by applying several preprocessing steps. As in exercise 2.2, we measure the number of unique words across all documents with the following function, which accepts either a tm corpus or a plain character vector.

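unique_word_count <- function(data){
  content <- character(length(data))
  # tm corpora: extract the text of each document
  if (any(class(data) %in% c("VCorpus", "Corpus","SimpleCorpus"))) {
    for (i in seq_along(data)) content[i] <- data[[i]]$content
  } else {
    # plain character vectors are used as-is
    content <- data
  }
  # split on single spaces and count the distinct tokens over all documents
  uniquewords <- unique(unlist(map(str_split(as.character(content)," "),unique)))
  length(uniquewords)
}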

unique_word_count(reviews_text)
[1] 2145

Normalization

In the first preprocessing step, case conversion will transform all values to lower case.

text_clean <- reviews_text %>% str_to_lower()
unique_word_count(text_clean)
[1] 1999

Remove punctuation

text_clean <- text_clean %>% str_replace_all("[[:punct:]]","")
unique_word_count(text_clean)
[1] 1541
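
To see what these first two steps do to an individual review, the following minimal sketch applies them to two made-up sentences (the toy strings are an assumption for illustration and are not taken from the productreviews dataset). It also shows a side effect worth keeping in mind: because punctuation is replaced by an empty string, hyphenated words are glued together.

```r
library(stringr)

# Hypothetical toy reviews, not part of the productreviews dataset
toy <- c("Great phone!!! Works well.", "Well-made, but pricey...")

# Step 1: case conversion
toy_lower <- str_to_lower(toy)

# Step 2: delete every punctuation character
toy_nopunct <- str_replace_all(toy_lower, "[[:punct:]]", "")

toy_nopunct
# "great phone works well" "wellmade but pricey"
```

If merged words such as "wellmade" are undesirable, one option is to replace punctuation by a space instead of an empty string and squish the whitespace afterwards.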

Remove numbers

text_clean <- text_clean %>% str_replace_all("[[:digit:]]", "")
unique_word_count(text_clean)
[1] 1515

Remove whitespace

text_clean <- text_clean %>% str_squish()
unique_word_count(text_clean)
[1] 1515
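
The whitespace step matters mainly because unique_word_count() splits on single spaces. As a small sketch on a made-up string (an assumption for illustration), repeated spaces produce empty tokens when splitting, and str_squish() removes them by trimming the ends and collapsing internal runs of whitespace.

```r
library(stringr)

# Hypothetical string with leading, trailing and repeated spaces
toy <- "  great  phone works well "

# Splitting on a single space yields empty strings where spaces repeat
tokens_raw <- str_split(toy, " ")[[1]]

# str_squish() trims both ends and collapses internal whitespace
toy_squished <- str_squish(toy)
tokens_squished <- str_split(toy_squished, " ")[[1]]

toy_squished
# "great phone works well"
```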

Integrated text preprocessing

In the example above, several text preprocessing steps were briefly introduced. It is also possible to combine all of these steps into one integrated code block using the pipe operator %>%, which is made available by the dplyr package (it originates from the magrittr package).

text_clean <- reviews_text %>%
  str_to_lower() %>%
  str_replace_all("[[:punct:]]", "") %>%
  str_replace_all("[[:digit:]]", "") %>%
  str_squish()
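
The steps so far rely only on stringr. As a hedged sketch of how a cleaned character vector could then be brought into the tidy one-token-per-row format, the example below uses unnest_tokens() from tidytext on a made-up vector that stands in for text_clean; the names toy_clean, toy_tidy and review_id are hypothetical.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stand-in for text_clean
toy_clean <- c("great phone works well", "wellmade but pricey")

# One row per review, then one row per word
toy_tidy <- tibble(review_id = seq_along(toy_clean), text = toy_clean) %>%
  unnest_tokens(output = word, input = text)

# Number of distinct words over all toy reviews
n_distinct(toy_tidy$word)
```

Note that unnest_tokens() converts to lower case and strips punctuation by default, so on raw text it already covers part of the preprocessing shown above.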

Exercise

Perform text preprocessing on the subset of the product reviews, given as a corpus below, and store the result as product_reviews_preprocessed. Make sure to perform the following preprocessing steps:

- Normalization,

- Remove punctuation,

- Remove numbers,

- Remove whitespace.

To download the productreviews dataset click here⁴.

Assume that:

The productreviews dataset is given.
The tidytext, stringr, tm, and dplyr packages are loaded.