Weighted Document by Term Matrix

In this exercise, we will continue from the previous exercise on Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF). We will now focus on creating a weighted Document by Term Matrix (DTM) where the weight of a term is inversely related to the number of documents in which the term occurred.

Transforming IDF into a Matrix

First, we transform the calculated vector of IDF into a matrix. This allows us to multiply it elementwise with the DTM.

(idf_mat <- matrix(rep(idf, nrow(dtm)), nrow = nrow(dtm), ncol = ncol(dtm), byrow = TRUE))

	[,1]	[,2]	[,3]	[,4]	[,5]
[1,]	2	1.333333	4	2	1
[2,]	2	1.333333	4	2	1
[3,]	2	1.333333	4	2	1
[4,]	2	1.333333	4	2	1

Calculating Weighted DTM

Next, we calculate dtm_weighted. In this matrix, the weight of a term is inversely related to the number of documents in which the term occurred. This is done by multiplying the DTM with the IDF matrix elementwise.

(dtm_weighted <- dtm * idf_mat)

	Term 1	Term 2	Term 3	Term 4	Term 5
Doc 1	0	2.666667	8	12	8
Doc 2	2	2.666667	0	0	7
Doc 3	6	2.666667	0	12	8
Doc 4	0	0.000000	0	0	7
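
As a check on the elementwise multiplication, here is a minimal self-contained sketch. The counts in dtm_toy are not taken from the previous exercise; they are inferred by dividing the printed dtm_weighted by the printed idf values, and the names dtm_toy, df_toy, idf_toy, weighted_a and weighted_b are hypothetical. It also shows sweep() as one possible alternative that avoids building the full IDF matrix.

```r
# Counts inferred from the printed output above (an assumption, not the original dtm)
dtm_toy <- matrix(c(0, 2, 2, 6, 8,
                    1, 2, 0, 0, 7,
                    3, 2, 0, 6, 8,
                    0, 0, 0, 0, 7),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste("Doc", 1:4), paste("Term", 1:5)))

df_toy  <- colSums(dtm_toy > 0)      # number of documents containing each term
idf_toy <- nrow(dtm_toy) / df_toy    # raw idf = n / df_t

# Approach from the text: repeat the idf vector on every row, then multiply elementwise
idf_mat_toy <- matrix(idf_toy, nrow = nrow(dtm_toy), ncol = ncol(dtm_toy), byrow = TRUE)
weighted_a  <- dtm_toy * idf_mat_toy

# Alternative: scale each column by its idf without building the full matrix
weighted_b <- sweep(dtm_toy, MARGIN = 2, STATS = idf_toy, FUN = "*")

# Both approaches should agree
all.equal(weighted_a, weighted_b, check.attributes = FALSE)
```

Building the explicit matrix keeps the multiplication easy to read, while sweep() may be preferable when the DTM is large.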

Reducing Impact of Document Length and Extreme Frequencies

To reduce the impact of differing document lengths, we can apply the logarithm to the tf values: tf_td = log(1 + tf_td), which R computes as log1p(tf_td). We can also dampen the raw idf by taking its logarithm: idf_t = log(n/df_t) + 1. We add 1 to handle terms that appear in all documents (n = df_t): there the raw idf is 1 and log(1) = 0, so without the added 1 the term would receive a weight of zero. Notice that log(idf) + 1 grows much more slowly than the raw idf, so it gives a more nuanced weight to extreme frequencies.
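
The two transformations can be tried on a few numbers first. This is a small sketch, assuming n = 4 documents and the document frequencies implied by the idf values printed earlier; the names df_example, idf_raw, idf_log, tf_example and tf_log are hypothetical and chosen so that they do not clash with objects from the exercise.

```r
n          <- 4                   # number of documents
df_example <- c(2, 3, 1, 2, 4)    # document frequencies implied by the printed idf values

idf_raw <- n / df_example             # raw idf: 2, 1.333, 4, 2, 1
idf_log <- log(n / df_example) + 1    # log-scaled idf, shifted by 1

# The last term occurs in every document: its raw idf is 1, log(1) = 0,
# and the added 1 keeps its weight at exactly 1 instead of 0
idf_log[5]

tf_example <- c(0, 1, 7, 70)      # illustrative raw term frequencies (made up)
tf_log     <- log1p(tf_example)   # same as log(1 + tf); a count of 0 stays 0

round(idf_log, 3)
round(tf_log, 3)
```

A tenfold jump in raw frequency (7 to 70) roughly doubles the log-scaled value rather than multiplying it by ten, which is the dampening effect described above.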

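# Caution: this reuses the name idf, so in the same session it replaces the idf vector from above
idf <- function(doc = 100, t = 1, type = c('raw', 'log')) {
  type <- match.arg(type)  # falls back to 'raw' when type is not supplied
  if (type == 'raw') return(doc / t)          # raw idf: n / df_t
  if (type == 'log') return(1 + log(doc / t)) # log-scaled idf: log(n / df_t) + 1
}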

par(mfrow = c(1,2))
plot(x = 1:100, y = idf(doc = 100, t=1:100, type = 'raw'), 
     type = 'l', col = 'blue', 
     ylab = 'idf(t)', 
     xlab = 'Number of documents containing t',
     main = "Raw IDF")

plot(x = 1:100, y = idf(doc = 100, t=1:100, type = 'log'), 
     type = 'l', col = 'red', 
     ylab = '1+log(idf(t))', 
     xlab = 'Number of documents containing t', 
     main = "Log(IDF)")
par(mfrow = c(1,1))

Figure: raw idf (left) and log(idf) (right)

Exercise

Calculate the final weighted DTM by applying the logarithms to both the tf and the idf values, and store it as dtm_weighted.