STEP 2: Creating a dtm

Creating a tibble

In the previous exercise we performed multiple preprocessing steps on the data, resulting in the text_clean vector. This is a typical character vector that we want to analyze. In order to turn it into a tidy text dataset, we first need to put it into a data frame. We do this as follows:

library(tibble)  # provides tibble(); harmless if already attached, e.g. via the tidyverse
text_df <- tibble(doc = seq_along(text_clean), text = text_clean)
text_df
# A tibble: 86 x 2
     doc text                                                                                                                          
   <int> <chr>                                                                                                                         
 1     1 "two monthlong trips abroad this is the best it take a little while to get used to the smaller keyboard but once you do it wo~
 2     2 "this is nearly as heavy as my laptop and i was hoping to find something lighter for travel but it works well anyway"         
 3     3 "wonderfully thin light and durable the keyboard works extremely well for me my only wish about this is that the angle was no~
 4     4 "this keyboardcase cover is absolutely fabulous it works so well and it so convenient and stylish im the envy of all of my fr~
 5     5 "great case easy to use thin and turns my ipad into a macbook"                                                                
 6     6 "the cover is cool the keyboard is a little tight but such an improvement over the screen keyboard the screen sits securely a~
 7     7 "the keyboard i received was clearly used or refurbished the box had already been open and torn there were smudge marks and d~
 8     8 "i purchased this keyboard to pair with the nd generation nexus its very well thought out and i love that the case becomes a ~
 9     9 " its easy to set up it looks a little bit thick it cant protect the whole ipad mini"                                         
10    10 "this item is excellent easy to use works perfectly theres no doubt about it if you need this buy it you wont regret it"      
# ... with 76 more rows

As you know, a tibble is a modern class of data frame within R that has a convenient print method, will not convert strings to factors, and does not use row names. Tibbles are great for use with tidy tools. Note, however, that this data frame isn’t yet compatible with tidy text analysis. We can’t filter out words or count which ones occur most frequently, because each row holds a whole document made up of many words. We need to convert it to a one-token-per-document-per-row format.
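
To make the starting point concrete, here is a minimal sketch on made-up data. The vector toy_text is a hypothetical stand-in for text_clean (it is not part of the course data), and toy_df is an illustrative name.

```r
library(tibble)

# Hypothetical stand-in for text_clean: three short, already cleaned reviews
toy_text <- c("great case easy to use",
              "the keyboard is a little tight",
              "easy to set up")

# seq_along() numbers the documents 1, 2, 3 and is also safe for an empty vector
toy_df <- tibble(doc = seq_along(toy_text), text = toy_text)

# Each row still holds a whole document, not a single word
nrow(toy_df)        # 3
class(toy_df$text)  # "character": strings are not converted to factors
```

Because every row is still a complete review, word-level operations such as counting or filtering are not yet possible on toy_df.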

Tokenization

In this step, we will split documents into terms. Within our tidy text framework, we need to both break the text into individual tokens (this process is called tokenization) and transform it to a tidy data structure. To do this, we use tidytext’s unnest_tokens function.

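library(dplyr)     # provides %>%, anti_join() and count()
library(tidytext)  # provides unnest_tokens(), stop_words and cast_dtm()
freq <- text_df %>% unnest_tokens(word, text)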
freq
# A tibble: 8,514 x 2
     doc word     
   <int> <chr>    
 1     1 two      
 2     1 monthlong
 3     1 trips    
 4     1 abroad   
 5     1 this     
 6     1 is       
 7     1 the      
 8     1 best     
 9     1 it       
10     1 take     
# ... with 8,504 more rows
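
The following sketch shows what unnest_tokens does on a tiny, made-up tibble (toy_df and toy_tokens are illustrative names, not objects from the course). It assumes the default settings, under which the function splits on words, lowercases them, and strips punctuation.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Two hypothetical documents with capitals and punctuation left in
toy_df <- tibble(doc = 1:2,
                 text = c("Great case, easy to use!",
                          "The keyboard is a little tight."))

# One row per word; the doc column is kept, the text column is replaced by word
toy_tokens <- toy_df %>% unnest_tokens(word, text)

nrow(toy_tokens)   # 11 tokens in total (5 from document 1, 6 from document 2)
toy_tokens$word
```

The result has the one-token-per-document-per-row shape, so each word can now be filtered or counted on its own.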

Remove stopwords

Next, we remove stopwords with an anti_join against tidytext’s stop_words data frame and count how often each remaining word occurs in each document.

freq <- freq %>% anti_join(stop_words, by = "word") %>% count(doc, word, name = "freq", sort = TRUE)
freq
# A tibble: 2,307 x 3
     doc word      freq
   <int> <chr>    <int>
 1    18 ipad        14
 2    38 keyboard    11
 3    80 keyboard    10
 4    18 keyboard     9
 5    46 ipad         9
 6    46 issue        9
 7     8 keyboard     8
 8     4 ipad         7
 9     4 position     7
10    16 mini         7
# ... with 2,297 more rows
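
As a small illustration of this step, the sketch below applies the same anti_join and count to a hand-made token table. The objects toy_tokens and toy_freq are hypothetical. Passing by = "word" makes the join column explicit and avoids the informational join message.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Hypothetical tokens: document 1 mentions "keyboard" twice
toy_tokens <- tibble(
  doc  = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  word = c("the", "keyboard", "is", "a", "keyboard", "cover",
           "the", "ipad", "is", "a", "tablet")
)

toy_freq <- toy_tokens %>%
  anti_join(stop_words, by = "word") %>%        # drops "the", "is" and "a"
  count(doc, word, name = "freq", sort = TRUE)  # one row per doc-word pair

# "keyboard" occurs twice in document 1, so it sorts to the top
toy_freq
```

Only the content words survive, and each document-word pair now carries its count in the freq column.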

Creating a document by term matrix

Once you have the word count table, creating the dtm is fairly straightforward. Note that cast_dtm returns a DocumentTermMatrix object from the tm package, as seen before.

dtm <- freq %>% cast_dtm(doc, word, freq)
dtm
<<DocumentTermMatrix (documents: 84, terms: 1147)>>
Non-/sparse entries: 2307/94041
Sparsity           : 98%
Maximal term length: 17
Weighting          : term frequency (tf)

To also see a sample of the entries, use the inspect function from the tm package.

inspect(dtm)
<<DocumentTermMatrix (documents: 84, terms: 1147)>>
Non-/sparse entries: 2307/94041
Sparsity           : 98%
Maximal term length: 17
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs cover dont easy ipad keyboard keys nice stand tablet type
  16     3    1    0    0        3    4    1     2      0    0
  18     0    1    0   14        9    2    0     0      0    2
  23     0    1    0    0        5    4    1     1      3    0
  24     0    5    0    6        0    0    1     0      3    0
  38     0    0    1    0       11    0    2     3      7    1
  39     1    0    0    1        2    0    0     2      0    0
  4      1    0    1    7        5    0    0     0      0    3
  46     1    1    0    9        5    2    3     0      0    0
  50     0    0    1    0        5    0    0     1      0    0
  83     4    1    3    7        6    1    0     1      0    1

Exercise

Create a document by term matrix of the productreviews dataset and store it as dtm.

To download the productreviews dataset click here [1].

To download the product_reviews_preprocessed dataset click here [2].


Assume that: