In this exercise, we will continue from the previous exercise on Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF). We will now focus on creating a weighted Document by Term Matrix (DTM) where the weight of a term is inversely related to the number of documents in which the term occurred.
First, we transform the calculated vector of IDF into a matrix. This allows us to multiply it elementwise with the DTM.
(idf_mat <- matrix(rep(idf, 4), nrow = nrow(dtm), ncol = ncol(dtm), byrow = TRUE))
| | [,1] | [,2] | [,3] | [,4] | [,5] |
|---|---|---|---|---|---|
| [1,] | 2 | 1.333333 | 4 | 2 | 1 |
| [2,] | 2 | 1.333333 | 4 | 2 | 1 |
| [3,] | 2 | 1.333333 | 4 | 2 | 1 |
| [4,] | 2 | 1.333333 | 4 | 2 | 1 |
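The role of `byrow = TRUE` is easy to miss, so here is a small sketch of what it does. The IDF values are copied from the table above; the names `idf_vec`, `by_row` and `by_col` are made up for this illustration and are not part of the exercise.

```r
# IDF values as shown in the table above (4 documents, 5 terms)
idf_vec <- c(2, 4/3, 4, 2, 1)

# byrow = TRUE: the vector is laid out along each row,
# so column j holds idf_vec[j] in every row
by_row <- matrix(idf_vec, nrow = 4, ncol = 5, byrow = TRUE)

# Default (byrow = FALSE): values are filled down the columns,
# which breaks the alignment between terms and columns
by_col <- matrix(idf_vec, nrow = 4, ncol = 5)

by_row[, 3]  # 4 4 4 4
by_col[, 3]  # no longer constant
```

Because `matrix()` recycles its input, passing the vector once is enough here; repeating it with `rep()` first gives the same matrix.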
Next, we calculate dtm_weighted. In this matrix, the weight of a term is inversely related to the number of documents in which the term occurred. This is done by multiplying the DTM with the IDF matrix elementwise.
(dtm_weighted <- dtm * idf_mat)
| | Term 1 | Term 2 | Term 3 | Term 4 | Term 5 |
|---|---|---|---|---|---|
| Doc 1 | 0 | 2.666667 | 8 | 12 | 8 |
| Doc 2 | 2 | 2.666667 | 0 | 0 | 7 |
| Doc 3 | 6 | 2.666667 | 0 | 12 | 8 |
| Doc 4 | 0 | 0.000000 | 0 | 0 | 7 |
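As a cross-check, the same weighting can be sketched without the intermediate matrix by using `sweep()`. The `dtm` below is reconstructed by dividing the weighted table by the IDF values, so treat it as an assumption; the real `dtm` comes from the previous exercise. The names `idf_vec`, `weighted_a` and `weighted_b` are illustrative only.

```r
# Term counts reconstructed from the tables above (dtm_weighted / idf).
# Assumption: the actual dtm is defined in the previous exercise.
dtm <- matrix(c(0, 2, 2, 6, 8,
                1, 2, 0, 0, 7,
                3, 2, 0, 6, 8,
                0, 0, 0, 0, 7),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste("Doc", 1:4), paste("Term", 1:5)))

# Raw IDF: number of documents divided by each term's document frequency
idf_vec <- nrow(dtm) / colSums(dtm > 0)

# Approach from the text: expand the IDF into a matrix, multiply elementwise
idf_mat <- matrix(idf_vec, nrow = nrow(dtm), ncol = ncol(dtm), byrow = TRUE)
weighted_a <- dtm * idf_mat

# Alternative: sweep() multiplies every column j by idf_vec[j] directly
weighted_b <- sweep(dtm, MARGIN = 2, STATS = idf_vec, FUN = "*")

weighted_a["Doc 1", ]  # 0, 2.666667, 8, 12, 8
```

Both routes should give the same matrix; `sweep()` simply avoids building `idf_mat` by hand.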
To reduce the impact of differing document lengths, we can apply the logarithm to the tf values: tf_td = log(1 + tf_td) = log1p(tf_td). We can also dampen the raw idf by taking its logarithm: idf_t = log(n/df_t) + 1. The 1 is added for the case where a term appears in all documents (n = df_t): the raw idf n/df_t is then 1, and since log(1) = 0 the term would otherwise receive a weight of zero. Notice that log(n/df_t) + 1 grows much more slowly than the raw idf, so terms with extreme document frequencies receive a more nuanced weight.
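To make the two formulas concrete, here is a small worked sketch using the four-document example. The document frequencies are the ones implied by the IDF table above (n divided by the raw idf); the variable names and the sample counts in `tf` are chosen only for this illustration.

```r
n  <- 4                      # number of documents
df <- c(2, 3, 1, 2, 4)       # document frequencies implied by the IDF table

idf_raw <- n / df            # 2, 1.333333, 4, 2, 1
idf_log <- 1 + log(n / df)   # 1.693147, 1.287682, 2.386294, 1.693147, 1

# The term found in every document (df = 4) keeps a weight of 1 instead of 0,
# and the rarest term drops from 4 times that weight to about 2.4 times.

tf <- c(0, 1, 2, 8)          # a few example raw counts
log1p(tf)                    # same as log(1 + tf); a count of 0 stays 0
```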
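# Note: this function reuses the name idf and so replaces the IDF vector used above.
idf <- function(doc = 100, t = 1, type = c('raw', 'log')) {
  type <- match.arg(type)  # defaults to 'raw' and rejects unknown types
  if (type == 'raw') return(doc / t)
  if (type == 'log') return(1 + log(doc / t))
}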
par(mfrow = c(1,2))
plot(x = 1:100, y = idf(doc = 100, t = 1:100, type = 'raw'),
     type = 'l', col = 'blue',
     ylab = 'idf(t)',
     xlab = 'Number of documents containing t',
     main = "Raw IDF")
plot(x = 1:100, y = idf(doc = 100, t = 1:100, type = 'log'),
     type = 'l', col = 'red',
     ylab = '1+log(idf(t))',
     xlab = 'Number of documents containing t',
     main = "Log(IDF)")
par(mfrow = c(1,1))
Calculate the final weighted DTM by applying the logarithms, and store it as dtm_weighted.
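One possible way to approach this task is sketched below; it is not necessarily the intended solution. The `dtm` is again reconstructed from the tables above and is therefore an assumption, and `n`, `df`, `tf_log` and `idf_log` are names introduced just for this sketch.

```r
# Term counts reconstructed from the tables above.
# Assumption: the actual dtm is defined in the previous exercise.
dtm <- matrix(c(0, 2, 2, 6, 8,
                1, 2, 0, 0, 7,
                3, 2, 0, 6, 8,
                0, 0, 0, 0, 7),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste("Doc", 1:4), paste("Term", 1:5)))

n  <- nrow(dtm)             # number of documents
df <- colSums(dtm > 0)      # number of documents containing each term

tf_log  <- log1p(dtm)       # log(1 + tf); zero counts stay zero
idf_log <- 1 + log(n / df)  # log-scaled idf; equals 1 for terms in every document

# Multiply every column j of the log-scaled counts by idf_log[j]
dtm_weighted <- sweep(tf_log, MARGIN = 2, STATS = idf_log, FUN = "*")

round(dtm_weighted, 3)
```

In this sketch, Term 5 occurs in all four documents, so its column is just `log1p()` of the raw counts.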