
Weighted Document by Term Matrix

In this exercise, we will continue from the previous exercise on Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF). We will now focus on creating a weighted Document by Term Matrix (DTM) where the weight of a term is inversely related to the number of documents in which the term occurred.

Transforming IDF into a Matrix

First, we transform the calculated vector of IDF into a matrix. This allows us to multiply it elementwise with the DTM.

(idf_mat <- matrix(rep(idf, 4), nrow = nrow(dtm), ncol = ncol(dtm), byrow = TRUE))
[,1] [,2] [,3] [,4] [,5]
[1,] 2 1.333333 4 2 1
[2,] 2 1.333333 4 2 1
[3,] 2 1.333333 4 2 1
[4,] 2 1.333333 4 2 1
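
To try this step on its own, the following is a minimal sketch that rebuilds a small dtm and derives idf from it. The counts in dtm are an assumption: they were reconstructed by dividing the weighted values printed in this exercise by the IDF values, and may differ from the objects created in the previous exercise.

```r
# Assumed example counts (4 documents x 5 terms), reconstructed from the
# output printed in this exercise; replace with your own dtm.
dtm <- matrix(c(0, 2, 2, 6, 8,
                1, 2, 0, 0, 7,
                3, 2, 0, 6, 8,
                0, 0, 0, 0, 7),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste("Doc", 1:4), paste("Term", 1:5)))

n   <- nrow(dtm)          # number of documents
df  <- colSums(dtm > 0)   # number of documents containing each term
idf <- n / df             # raw inverse document frequency per term

# Repeat the idf vector once per document and fill the matrix row by row,
# so that every row holds the same per-term idf values.
idf_mat <- matrix(rep(idf, n), nrow = nrow(dtm), ncol = ncol(dtm), byrow = TRUE)
idf_mat
```

Filling by row matters here: with byrow = FALSE the idf values would run down the columns and no longer line up with the terms.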

Calculating Weighted DTM

Next, we calculate dtm_weighted. In this matrix, the weight of a term is inversely related to the number of documents in which the term occurred. This is done by multiplying the DTM with the IDF matrix elementwise.

(dtm_weighted <- dtm * idf_mat)
Term 1 Term 2 Term 3 Term 4 Term 5
Doc 1 0 2.666667 8 12 8
Doc 2 2 2.666667 0 0 7
Doc 3 6 2.666667 0 12 8
Doc 4 0 0.000000 0 0 7
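
The same elementwise weighting can also be written without building idf_mat explicitly, for example with base R's sweep(). The sketch below compares both routes. The dtm counts are again an assumption, reconstructed from the output printed above, and the names weighted_a and weighted_b are introduced here only for illustration.

```r
# Assumed example counts, reconstructed from the printed output above.
dtm <- matrix(c(0, 2, 2, 6, 8,
                1, 2, 0, 0, 7,
                3, 2, 0, 6, 8,
                0, 0, 0, 0, 7),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste("Doc", 1:4), paste("Term", 1:5)))

idf <- nrow(dtm) / colSums(dtm > 0)   # raw idf per term

# Route 1: explicit idf matrix, multiplied elementwise
idf_mat    <- matrix(idf, nrow = nrow(dtm), ncol = ncol(dtm), byrow = TRUE)
weighted_a <- dtm * idf_mat

# Route 2: multiply each column (MARGIN = 2) by its idf value
weighted_b <- sweep(dtm, MARGIN = 2, STATS = idf, FUN = "*")

all.equal(weighted_a, weighted_b)
```

Both routes give the same weights. The explicit matrix makes the operation easier to inspect, while sweep() avoids creating the intermediate object.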

Reducing Impact of Document Length and Extreme Frequencies

To reduce the impact of differing document lengths, we can apply the logarithm to the tf values: tf_td = log(1 + tf_td) = log1p(tf_td). We can also dampen the raw idf by taking its logarithm: idf_t = log(n/df_t) + 1. We add 1 to cover the case where a term appears in all documents (n = df_t): the raw idf is then 1 and log(1) = 0, so without the added 1 the term's weight would drop to zero. Notice that log(idf) + 1 gives a more nuanced weight to extreme frequencies.
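
As a quick numeric check of both transformations, the sketch below uses made-up values. Everything in it is an assumption for illustration: n = 4 documents, document frequencies from 1 to 4, a few raw term counts, and the names idf_raw, idf_log and tf_log.

```r
n  <- 4      # assumed number of documents
df <- 1:4    # assumed document frequencies

idf_raw <- n / df            # 4, 2, 1.333..., 1
idf_log <- log(n / df) + 1   # natural log; equals 1 when df == n

tf     <- c(0, 1, 10, 100)   # assumed raw term counts
tf_log <- log1p(tf)          # same as log(1 + tf); a count of 0 stays 0

round(cbind(df, idf_raw, idf_log), 3)
round(cbind(tf, tf_log), 3)
```

The term that occurs in every document keeps a weight of 1 instead of 0, and a count of 100 no longer weighs 100 times as much as a count of 1.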

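# Note: if run in the same session, this function replaces the idf vector used above.
idf <- function(doc = 100, t = 1, type = c('raw', 'log')) {
  type <- match.arg(type)                      # use 'raw' when type is not supplied
  if (type == 'raw') return(doc / t)           # raw idf: n / df_t
  if (type == 'log') return(1 + log(doc / t))  # log idf: 1 + log(n / df_t)
}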

par(mfrow = c(1,2))
plot(x = 1:100, y = idf(doc = 100, t=1:100, type = 'raw'), 
     type = 'l', col = 'blue', 
     ylab = 'idf(t)', 
     xlab = 'Number of documents containing t',
     main = "Raw IDF")

plot(x = 1:100, y = idf(doc = 100, t=1:100, type = 'log'), 
     type = 'l', col = 'red', 
     ylab = '1+log(idf(t))', 
     xlab = 'Number of documents containing t',
     main = "Log(IDF)")
par(mfrow = c(1,1))

Figure: raw idf (left, blue) and 1 + log(idf) (right, red) as a function of the number of documents containing t.

Exercise

Calculate the final weighted DTM by applying both logarithms (log1p to the term frequencies and 1 + log to the idf), and store it as dtm_weighted.