Term Weighting: TF and TF-IDF

In this exercise, we explore term weighting, focusing on Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF). Both are fundamental in text mining and natural language processing, where they quantify how important a term is in a document.

Document by Term Matrix

Consider the following document-by-term matrix (dtm) from the lecture slides. Each cell holds the number of times a term (word) occurs in a document.

dtm <- matrix(c(0,1,3,0,2,2,2,0,2,0,0,0,6,0,6,0,8,7,8,7), ncol=5)
colnames(dtm) <- c("Term 1","Term 2","Term 3","Term 4","Term 5")
rownames(dtm) <- c("Doc 1","Doc 2","Doc 3","Doc 4")
dtm

	Term 1	Term 2	Term 3	Term 4	Term 5
Doc 1	0	2	2	6	8
Doc 2	1	2	0	0	7
Doc 3	3	2	0	6	8
Doc 4	0	0	0	0	7

Term Frequency (TF)

The values in the dtm are called weights. At this point they are simply raw frequencies: the weight of term t in document d (w_td) equals the term frequency of term t in document d (tf_td), that is, the number of times the term occurs in that document.
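
As a small illustration, the raw term frequencies can be read straight from the matrix. The sketch below rebuilds the dtm from above so that it runs on its own; the names tf_example and tf_doc1 are only illustrative and do not come from the lecture.

```r
# Rebuild the document-by-term matrix from above (filled column by column)
dtm <- matrix(c(0,1,3,0,2,2,2,0,2,0,0,0,6,0,6,0,8,7,8,7), ncol = 5)
colnames(dtm) <- paste("Term", 1:5)
rownames(dtm) <- paste("Doc", 1:4)

# tf of Term 1 in Doc 3: the raw count stored in that cell
tf_example <- dtm["Doc 3", "Term 1"]
tf_example  # 3

# All term frequencies of a single document (one row of the dtm)
tf_doc1 <- dtm["Doc 1", ]
tf_doc1
```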

Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

The TF-IDF weight is the product of two factors. The first is the Term Frequency (TF), here the raw count from the dtm. The second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.

w_td = tf_td * idf_t, with:

tf_td the term frequency of term t in document d,
idf_t the inverse document frequency of term t.

The latter equals log(n/df_t), with:

n the total number of documents (4 in the example above),
df_t the number of documents where term t was present.
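
To make the formula concrete, here is a worked sketch for a single cell: Term 1 in Doc 3. It assumes the natural logarithm (R's default for log()), following the description of the IDF as a logarithm; the base is not stated in the text, and the log() call should be dropped if the plain ratio n/df_t is intended. The variable names are illustrative.

```r
n    <- 4  # total number of documents in the example
tf   <- 3  # Term 1 occurs 3 times in Doc 3
df_t <- 2  # Term 1 is present in 2 of the 4 documents (Doc 2 and Doc 3)

# Inverse document frequency; natural log is an assumption, see the note above
idf_t <- log(n / df_t)

# TF-IDF weight of Term 1 in Doc 3: w_td = tf_td * idf_t
w <- tf * idf_t
w  # 3 * log(2), about 2.08
```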

Linking back to the example, we notice that:

the first term appears in 2 documents,
the second term appears in 3 documents,
the third term appears in 1 document,
the fourth term appears in 2 documents,
the fifth term appears in 4 documents.
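
These counts can be checked in R. The sketch below is one possible way to do it, treating a term as present whenever its count is greater than zero; the name doc_freq is chosen here for illustration (it avoids masking R's built-in df() function).

```r
# Rebuild the document-by-term matrix from above (filled column by column)
dtm <- matrix(c(0,1,3,0,2,2,2,0,2,0,0,0,6,0,6,0,8,7,8,7), ncol = 5)
colnames(dtm) <- paste("Term", 1:5)
rownames(dtm) <- paste("Doc", 1:4)

# dtm > 0 is TRUE where a term occurs at least once;
# summing each column counts the documents that contain the term
doc_freq <- colSums(dtm > 0)
doc_freq
```

The result is a named vector with one document frequency per term, which is the df_t needed for the idf.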

From these counts, we expect the third term to get the highest idf and the fifth term the lowest: the third term is the rarest (it appears in the fewest documents) and the fifth term is the most common (it appears in every document). Let’s check this.

Exercise

Calculate the inverse document frequencies (idf) of the five terms in the dtm given above and store the result as idf.