STEPS 2, 3, and 4: Create the document-by-term matrix and apply term and document weighting

When creating the dtm, you can immediately choose to make a tf-idf matrix by setting the weighting function in the control list. The control list also specifies that only words with a minimum length of two characters are included.

(dtm_reviews <- DocumentTermMatrix(
  reviews_spell_checked, 
  control = list(wordLengths = c(2, Inf), weighting = weightTfIdf)
))
<<DocumentTermMatrix (documents: 5, terms: 169)>>
Non-/sparse entries: 195/650
Sparsity           : 77%
Maximal term length: 13
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
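To make the weighting concrete, here is a small sketch of the normalized tf-idf weight that weightTfIdf() applies, as described in the tm documentation: the term frequency is normalized by the document length, and the inverse document frequency uses a base-2 logarithm. The function name and the example numbers below are illustrative, not from the reviews data.

```r
# Sketch of tm's normalized tf-idf weight (per the tm documentation):
#   tf(t, d)  = count of t in d / total term count of d
#   idf(t)    = log2(N / number of documents containing t)
#   weight    = tf * idf
tf_idf <- function(count, doc_len, n_docs, doc_freq) {
  (count / doc_len) * log2(n_docs / doc_freq)
}

# e.g. a term appearing twice in a 50-term review, in 2 of 5 documents:
tf_idf(2, 50, 5, 2)  # ≈ 0.0529
```

A term that appears in every document gets idf = log2(5/5) = 0, so it is weighted down to zero regardless of how often it occurs.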

We see that the dtm is created, and the output also reports its sparsity. Next, we remove terms that are too sparse. To do so, we need to decide how much sparsity we will allow. Let's try different values for the sparse parameter.

dtm_reviews_dense <- removeSparseTerms(dtm_reviews, sparse = 0.3)
inspect(dtm_reviews_dense)
<<DocumentTermMatrix (documents: 5, terms: 3)>>
Non-/sparse entries: 12/3
Sparsity           : 20%
Maximal term length: 4
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
Sample             :
    Terms
Docs         pad         use        work
   1 0.008942447 0.017884894 0.008942447
   2 0.000000000 0.000000000 0.029266190
   3 0.006849534 0.013699068 0.006849534
   4 0.014445491 0.004127283 0.004127283
   5 0.040241012 0.040241012 0.000000000
dtm_reviews_dense <- removeSparseTerms(dtm_reviews, sparse = 0.6)
inspect(dtm_reviews_dense)
<<DocumentTermMatrix (documents: 5, terms: 5)>>
Non-/sparse entries: 18/7
Sparsity           : 28%
Maximal term length: 8
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
Sample             :
    Terms
Docs   keyboard         pad         use        well        work
   1 0.02047127 0.008942447 0.017884894 0.000000000 0.008942447
   2 0.00000000 0.000000000 0.000000000 0.066996872 0.029266190
   3 0.06272048 0.006849534 0.013699068 0.015680119 0.006849534
   4 0.02362069 0.014445491 0.004127283 0.004724138 0.004127283
   5 0.00000000 0.040241012 0.040241012 0.000000000 0.000000000
dtm_reviews_dense <- removeSparseTerms(dtm_reviews, sparse = 0.9)
inspect(dtm_reviews_dense)
<<DocumentTermMatrix (documents: 5, terms: 169)>>
Non-/sparse entries: 195/650
Sparsity           : 77%
Maximal term length: 13
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
Sample             :
    Terms
Docs    anyway      find    great      hope    laptop   lighter matchbook      near    travel     turn
   1 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000  0.000000 0.0000000 0.0000000 0.000000
   2 0.2110844 0.2110844 0.000000 0.2110844 0.2110844 0.2110844  0.000000 0.2110844 0.2110844 0.000000
   3 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000  0.000000 0.0000000 0.0000000 0.000000
   4 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000  0.000000 0.0000000 0.0000000 0.000000
   5 0.0000000 0.0000000 0.290241 0.0000000 0.0000000 0.0000000  0.290241 0.0000000 0.0000000 0.290241

There is actually no way to know in advance how dense the dtm should be. A reasonable approach is therefore not to remove too many sparse terms, and to rely instead on the dimension reduction that comes in the next step.
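To compare candidate thresholds more systematically than inspecting each matrix by hand, one can loop over several values of the sparse parameter and count the surviving terms. This is a small sketch that assumes the dtm_reviews matrix created above is still in the workspace:

```r
# Sketch: tabulate how many terms survive for several sparse thresholds.
# Assumes dtm_reviews was created with DocumentTermMatrix() as above.
sparse_values <- c(0.3, 0.45, 0.6, 0.75, 0.9)
n_terms <- sapply(sparse_values, function(s) {
  ncol(removeSparseTerms(dtm_reviews, sparse = s))
})
data.frame(sparse = sparse_values, terms_kept = n_terms)
```

Larger sparse values are more permissive: a term is kept as long as its share of zero entries does not exceed the threshold, so the number of retained terms can only grow as sparse increases.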

Exercise

Create the dtm_reviews_dense variable, requiring that at least 45% of the documents contain a word for that word to stay in the dtm.

To download the productreviews dataset, click here.


Assume that: