When creating the dtm, you can immediately make it a tf-idf matrix by setting the weighting function in the control list. The control list here also specifies that only words of at least 2 characters are included.
(dtm_reviews <- DocumentTermMatrix(
  reviews_spell_checked,
  control = list(wordLengths = c(2, Inf), weighting = weightTfIdf)
))
<<DocumentTermMatrix (documents: 5, terms: 169)>>
Non-/sparse entries: 195/650
Sparsity : 77%
Maximal term length: 13
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
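To make the weighting concrete, here is a small base-R sketch of how tm's normalized tf-idf is computed: each term count is divided by the document's total word count (the normalized term frequency), then multiplied by log2 of the number of documents over the term's document frequency. The counts below are made up for illustration; they are not the actual review data.

```r
# Toy term-count matrix: 3 documents, 4 terms (hypothetical counts)
counts <- matrix(c(2, 0, 1, 1,
                   0, 3, 0, 1,
                   1, 1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Docs = 1:3,
                                 Terms = c("pad", "use", "work", "well")))

tf    <- counts / rowSums(counts)   # normalized term frequency
df    <- colSums(counts > 0)        # document frequency per term
idf   <- log2(nrow(counts) / df)    # inverse document frequency
tfidf <- sweep(tf, 2, idf, `*`)     # tf-idf weight per cell
round(tfidf, 4)
```

Note how a term that occurs in every document gets idf = log2(1) = 0 and is therefore weighted down to zero everywhere, which is exactly why tf-idf highlights discriminating terms.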
We see that the dtm is created, and the output also reports its sparsity. Next, we will remove terms that are too sparse. To do so, we need to decide how much sparsity we will allow, so let's try different values for the sparse parameter.
dtm_reviews_dense <- removeSparseTerms(dtm_reviews, sparse = 0.3)
inspect(dtm_reviews_dense)
<<DocumentTermMatrix (documents: 5, terms: 3)>>
Non-/sparse entries: 12/3
Sparsity : 20%
Maximal term length: 4
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Sample :
Terms
Docs pad use work
1 0.008942447 0.017884894 0.008942447
2 0.000000000 0.000000000 0.029266190
3 0.006849534 0.013699068 0.006849534
4 0.014445491 0.004127283 0.004127283
5 0.040241012 0.040241012 0.000000000
dtm_reviews_dense <- removeSparseTerms(dtm_reviews, sparse = 0.6)
inspect(dtm_reviews_dense)
<<DocumentTermMatrix (documents: 5, terms: 5)>>
Non-/sparse entries: 18/7
Sparsity : 28%
Maximal term length: 8
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Sample :
Terms
Docs keyboard pad use well work
1 0.02047127 0.008942447 0.017884894 0.000000000 0.008942447
2 0.00000000 0.000000000 0.000000000 0.066996872 0.029266190
3 0.06272048 0.006849534 0.013699068 0.015680119 0.006849534
4 0.02362069 0.014445491 0.004127283 0.004724138 0.004127283
5 0.00000000 0.040241012 0.040241012 0.000000000 0.000000000
dtm_reviews_dense <- removeSparseTerms(dtm_reviews, sparse = 0.9)
inspect(dtm_reviews_dense)
<<DocumentTermMatrix (documents: 5, terms: 169)>>
Non-/sparse entries: 195/650
Sparsity : 77%
Maximal term length: 13
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
Sample :
Terms
Docs anyway find great hope laptop lighter matchbook near travel turn
1 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.000000
2 0.2110844 0.2110844 0.000000 0.2110844 0.2110844 0.2110844 0.000000 0.2110844 0.2110844 0.000000
3 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.000000
4 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.000000
5 0.0000000 0.0000000 0.290241 0.0000000 0.0000000 0.0000000 0.290241 0.0000000 0.0000000 0.290241
There is actually no way to know in advance how dense the dtm should be. A reasonable approach is therefore not to remove too many sparse terms and to rely instead on the dimension reduction that follows.
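Assuming the tm package is available, a quick way to explore this trade-off is to count how many terms survive each threshold. The sketch below uses a made-up five-document corpus rather than the actual reviews:

```r
library(tm)  # assumed to be installed

# Hypothetical mini-corpus standing in for the spell-checked reviews
docs <- VCorpus(VectorSource(c("good keyboard pad",
                               "keyboard works well",
                               "pad works",
                               "travel keyboard",
                               "great keyboard pad pad")))
dtm <- DocumentTermMatrix(docs)

# removeSparseTerms keeps a term only when the share of documents
# missing it is at most `sparse`; higher values keep more terms
for (s in c(0.3, 0.6, 0.9)) {
  kept <- ncol(removeSparseTerms(dtm, sparse = s))
  cat("sparse =", s, "->", kept, "terms kept\n")
}
```

The number of surviving terms can only grow as sparse increases, so scanning a few values like this gives a feel for where the vocabulary size settles.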
Create the dtm_reviews_dense variable, given that at least 45% of the documents must contain a word for that word to stay in the dtm.
To download the productreviews dataset, click here.
Assume that the dtm_reviews variable is given.
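As a hint on translating the requirement: removeSparseTerms keeps a term only if it appears in at least (1 - sparse) of the documents, so a required coverage of 45% corresponds to sparse = 1 - 0.45 = 0.55. A possible sketch, assuming the given dtm_reviews:

```r
# A word must appear in at least 45% of documents, i.e. it may be
# missing from at most 55% of them, so sparse = 1 - 0.45 = 0.55
dtm_reviews_dense <- removeSparseTerms(dtm_reviews, sparse = 0.55)
```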