Weighted Document by Term Matrix

In this exercise, we will continue from the previous exercise on Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF). We will now focus on creating a weighted Document by Term Matrix (DTM) where the weight of a term is inversely related to the number of documents in which the term occurred.

Transforming IDF into a Matrix

First, we transform the calculated vector of IDF into a matrix. This allows us to multiply it elementwise with the DTM.

(idf_mat <- matrix(rep(idf, 4), nrow = nrow(dtm), ncol = ncol(dtm), byrow = TRUE))
     [,1]     [,2] [,3] [,4] [,5]
[1,]    2 1.333333    4    2    1
[2,]    2 1.333333    4    2    1
[3,]    2 1.333333    4    2    1
[4,]    2 1.333333    4    2    1
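
The role of byrow = TRUE can be checked with a small sketch. It assumes the idf values shown in the output above and four documents; the names n_docs and idf_wrong are introduced here only for illustration.

```r
# idf values copied from the printed output above (assumption: 4 documents, 5 terms)
idf <- c(2, 4/3, 4, 2, 1)
n_docs <- 4

# byrow = TRUE: every row is a copy of idf, so column j holds idf[j]
idf_mat <- matrix(rep(idf, n_docs), nrow = n_docs, ncol = length(idf), byrow = TRUE)

# Default column-wise fill: the values run down the columns
# and no longer line up with the terms
idf_wrong <- matrix(rep(idf, n_docs), nrow = n_docs, ncol = length(idf))

# Each row of idf_mat matches idf; the column-wise version does not
all(apply(idf_mat, 1, function(r) isTRUE(all.equal(r, idf))))
identical(idf_mat, idf_wrong)
```

The first check returns TRUE and the second FALSE, which is why the row-wise fill is needed before the elementwise multiplication.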

Calculating Weighted DTM

Next, we calculate dtm_weighted by multiplying the DTM elementwise with the IDF matrix. Because column j of idf_mat holds the idf of term j, every count is scaled by a weight that is inversely related to the number of documents in which the term occurred.

(dtm_weighted <- dtm * idf_mat)
      Term 1   Term 2 Term 3 Term 4 Term 5
Doc 1      0 2.666667      8     12      8
Doc 2      2 2.666667      0      0      7
Doc 3      6 2.666667      0     12      8
Doc 4      0 0.000000      0      0      7
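
As a cross-check, the same result can be obtained without building idf_mat. The sketch below is self-contained and rests on an assumption: the raw counts in dtm are not shown in this section, so they are inferred here by dividing the printed dtm_weighted by the printed idf values. The name dtm_swept is illustrative.

```r
# Raw counts inferred from the printed output (assumption: dtm_weighted / idf)
dtm <- matrix(c(0, 2, 2, 6, 8,
                1, 2, 0, 0, 7,
                3, 2, 0, 6, 8,
                0, 0, 0, 0, 7),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste("Doc", 1:4), paste("Term", 1:5)))

df  <- colSums(dtm > 0)   # number of documents containing each term
idf <- nrow(dtm) / df     # raw idf: n / df_t

# Approach from the text: expand idf to a matrix, then multiply elementwise
idf_mat <- matrix(idf, nrow = nrow(dtm), ncol = ncol(dtm), byrow = TRUE)
dtm_weighted <- dtm * idf_mat

# Alternative: sweep() multiplies column j by idf[j] directly
dtm_swept <- sweep(dtm, 2, idf, `*`)

all.equal(dtm_weighted, dtm_swept)
```

Both approaches give the same matrix; sweep() simply avoids the intermediate idf_mat.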

Reducing Impact of Document Length and Extreme Frequencies

To reduce the impact of document length, we can apply the logarithm to the tf values: tf_td = log(1 + tf_td) = log1p(tf_td). We can also dampen the raw idf by taking its logarithm: idf_t = log(n/df_t) + 1. The 1 is added for the case where a term appears in all documents (n = df_t): the raw idf is then 1 and log(1) = 0, so without the added 1 the term would receive zero weight. Notice that log(idf) + 1 grows much more slowly than the raw idf, so very rare terms no longer receive disproportionately large weights.
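
A few hand-picked values make the effect of both transformations concrete. The numbers below are illustrative only and are not taken from the exercise data; the names idf_raw and idf_log are introduced for this sketch.

```r
# Term frequencies spanning three orders of magnitude (illustrative values)
tf <- c(0, 1, 10, 100)
tf_log <- log1p(tf)   # 0 stays 0; large counts are strongly compressed

# A corpus of n = 100 documents and terms found in 1, 10 and 100 of them
n  <- 100
df <- c(1, 10, 100)

idf_raw <- n / df            # 100, 10, 1
idf_log <- 1 + log(n / df)   # about 5.61, 3.30, 1

print(round(tf_log, 2))
print(round(idf_log, 2))
```

A term that occurs in every document keeps a weight of 1 instead of dropping to 0, while the rarest term is weighted about 5.6 times rather than 100 times as heavily.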

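# Note: this function reuses the name idf, so it replaces the idf vector
# from above if both are run in the same session.
idf <- function(doc = 100, t = 1, type = c('raw', 'log')) {
  type <- match.arg(type)  # selects 'raw' when type is not supplied
  if (type == 'raw') return(doc / t)
  if (type == 'log') return(1 + log(doc / t))
}

# Plot both weightings side by side for a corpus of 100 documents
par(mfrow = c(1, 2))
plot(x = 1:100, y = idf(doc = 100, t = 1:100, type = 'raw'),
     type = 'l', col = 'blue',
     ylab = 'idf(t)',
     xlab = 'Number of documents containing t',
     main = "Raw IDF")

plot(x = 1:100, y = idf(doc = 100, t = 1:100, type = 'log'),
     type = 'l', col = 'red',
     ylab = '1+log(idf(t))',
     xlab = 'Number of documents containing t',
     main = "Log(IDF)")
par(mfrow = c(1, 1))  # restore the single-panel layout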

Figure: raw idf (left) and log(idf) (right).

Exercise

Calculate the final weighted DTM by applying both logarithms (log1p to the term frequencies and 1 + log(n/df_t) to the idf), and store it as dtm_weighted.