All the text mining steps that we performed with the tm package in the previous exercises
can also be done with the tidytext package. In the following exercises we will again work
with the productreviews dataset, but now using the tidytext package.
Since we use the same data as in the exercises on the tm package, we refer to
exercise 2.1 for an exploration of the data. In the next exercise we will
start immediately with the preprocessing of the data. In the text preprocessing step
everything is transformed to lower case, punctuation and stopwords are removed, and a
spelling check is done. Let’s start by reading in the productreviews dataset.
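# p_load() comes from the pacman package; the tidyverse supplies read_delim(), pull(), %>% and the str_*() helpers
p_load(tidyverse, tidytext)
# Read the file line by line: every line is one review, stored in column X1
reviews <- read_delim("productreviews.csv", delim = "\n", col_names = FALSE)
reviews_text <- reviews %>% pull(X1)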
If you want to know more about the tidytext package, please check the following links:
We will try to reduce the number of unique (and meaningless) words by performing different preprocessing steps. We will measure the number of unique words across all documents in the corpus, as we did in exercise 2.2, by using the following function.
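unique_word_count <- function(data){
  content <- character(length(data))
  # tm corpora store the text in the $content slot of each document
  if (any(class(data) %in% c("VCorpus", "Corpus","SimpleCorpus"))) {
    for (i in seq_along(data)) content[i] <- data[[i]]$content
  } else {
    # a plain character vector can be used as is
    content <- data
  }
  # split every document on single spaces and count the distinct tokens
  uniquewords <- unique(unlist(map(str_split(as.character(content)," "),unique)))
  length(uniquewords)
}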
unique_word_count(reviews_text)
[1] 2145
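One reason the raw count can be high is that splitting on a single space treats tokens that differ only in case or attached punctuation as different words. The following minimal sketch illustrates this on a made-up vector; toy_reviews and the other toy_* names are hypothetical and not part of the productreviews dataset.

```r
library(stringr)

# Hypothetical toy data, not taken from the productreviews dataset
toy_reviews <- c("Great phone. Great price!", "great phone, bad battery")

# Split every review on single spaces and pool the tokens
toy_tokens <- unlist(str_split(toy_reviews, " "))

# "Great" / "great" and "phone." / "phone," are counted as separate words
toy_unique <- unique(toy_tokens)
print(toy_unique)

# Lower-casing and stripping punctuation merges those variants
toy_clean <- str_replace_all(str_to_lower(toy_reviews), "[[:punct:]]", "")
toy_unique_clean <- unique(unlist(str_split(toy_clean, " ")))
print(toy_unique_clean)
```

In this toy example the raw split yields seven distinct tokens, while the cleaned text contains only five distinct words.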
In the first preprocessing step, case conversion will transform all values to lower case.
text_clean <- reviews_text %>% str_to_lower()
unique_word_count(text_clean)
[1] 1999
In the second step, all punctuation is removed.
text_clean <- text_clean %>% str_replace_all("[[:punct:]]","")
unique_word_count(text_clean)
[1] 1541
Next, all digits are removed.
text_clean <- text_clean %>% str_replace_all("[[:digit:]]", "")
unique_word_count(text_clean)
[1] 1515
Finally, str_squish() removes leading and trailing whitespace and collapses repeated spaces into a single space.
text_clean <- text_clean %>% str_squish()
unique_word_count(text_clean)
[1] 1515
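Removing punctuation and digits can leave two spaces next to each other, and splitting on a single space then produces empty tokens. The sketch below shows this on a made-up string (toy is hypothetical, not taken from the dataset) and how str_squish() cleans it up.

```r
library(stringr)

# Hypothetical example string, not taken from the productreviews dataset
toy <- "Battery lasts 2 days - great!"

# Removing "-" and "!" leaves a double space before "great"
no_punct <- str_replace_all(toy, "[[:punct:]]", "")

# Removing "2" leaves a second double space after "lasts"
no_digits <- str_replace_all(no_punct, "[[:digit:]]", "")

# Splitting on a single space now yields empty strings as tokens
tokens_before <- str_split(no_digits, " ")[[1]]
print(tokens_before)

# str_squish() collapses the repeated spaces, so the empty tokens disappear
squished <- str_squish(no_digits)
tokens_after <- str_split(squished, " ")[[1]]
print(tokens_after)
```

Whether the unique word count changes after this step depends on the data: all empty tokens together count as only one "word".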
In the example above, the text preprocessing steps were introduced one at a time. However, it
is also possible to combine all steps into one integrated code block by chaining them with the
pipe operator %>%, which originates from the magrittr package and is made available by dplyr.
text_clean <- reviews_text %>%
  str_to_lower() %>%
  str_replace_all("[[:punct:]]", "") %>%
  str_replace_all("[[:digit:]]", "") %>%
  str_squish()
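The pipeline above relies on stringr functions. As a rough sketch of how similar cleaning could be obtained with tidytext itself, unnest_tokens() splits the text into one word per row, converts it to lower case and strips punctuation by default. The toy_df data and its column names are hypothetical, and strip_numeric = TRUE is an argument that is passed on to the underlying word tokenizer.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical toy reviews, one document per row
toy_df <- tibble(
  doc_id = 1:2,
  text   = c("Great phone, 5 stars!", "great PHONE... bad battery")
)

# unnest_tokens() lower-cases and strips punctuation by default;
# strip_numeric = TRUE additionally drops tokens that are numbers
toy_words <- toy_df %>%
  unnest_tokens(word, text, strip_numeric = TRUE)

# Number of distinct words across all documents
n_unique <- n_distinct(toy_words$word)
print(toy_words)
print(n_unique)
```

Note that the two approaches are not guaranteed to give identical results on real data, since the tokenizer splits on word boundaries rather than on single spaces.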
Perform text preprocessing on the subset of the product reviews, given
as a corpus below, and store the result as product_reviews_preprocessed.
Make sure to perform the following preprocessing steps:
- Normalization,
- Remove punctuation,
- Remove numbers,
- Remove extra whitespace.
To download the productreviews dataset, click here.
Assume that the productreviews dataset is given.
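To verify a result such as product_reviews_preprocessed against the four requirements, a small self-check can help. The sketch below assumes the preprocessed text is available as a character vector; check_preprocessed is a hypothetical helper and not part of any package.

```r
library(stringr)

# Hypothetical helper: one TRUE/FALSE per requirement, TRUE when all strings comply
check_preprocessed <- function(x) {
  c(
    lower_case     = all(x == str_to_lower(x)),
    no_punctuation = !any(str_detect(x, "[[:punct:]]")),
    no_digits      = !any(str_detect(x, "[[:digit:]]")),
    no_extra_space = all(x == str_squish(x))
  )
}

# A clean toy vector passes every check
ok_result <- check_preprocessed(c("great phone", "bad battery"))
print(ok_result)

# An uncleaned toy vector fails every check
bad_result <- check_preprocessed(c("Great phone!", "bad  battery 2"))
print(bad_result)
```

If the result is stored as a tm corpus instead, the text would first have to be extracted into a character vector before calling the helper.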