In this exercise, we will explore how to model donation behavior using
Singular Value Decomposition (SVD) and logistic regression.
We will use the Donation.Rdata
dataset for this purpose. You can find the related paper
here1.
load("Donation.RData")
if (!require("pacman")) install.packages("pacman"); require("pacman")
p_load(glmnet, tidyverse, data.table, AUC, lift, ggplot2)
We also set a seed to ensure consistent outputs across multiple runs
set.seed(seed = 68)
To avoid running out of RAM memory and to decrease computation time, we select the top 1000 pages for each matrix.
select_top_n <- function(data, n = 1000) {
data %>%
map_df(function(x) sum(x)) %>%
pivot_longer(cols = everything(), names_to = 'column', values_to = 'freq') %>%
arrange(-freq) %>%
slice_head(n = n) %>%
pull(column)
}
top_pages <- select_top_n(pagesTRAIN)
top_groups <- select_top_n(groupsTRAIN)
top_categories <- select_top_n(categoriesTRAIN)
Depending on the capacity of your machine, you can select different top pages/groups/categories.
Now to SVD. SVD is used as a dimensionality reduction technique to reduce the number of features.
Remember: A = UDV ^ T
.
U
: The user-to-concept matrix. It presents how user i scores on dimension j.
U will be used as the input of the classification model.V
: The page-to-concept matrix. It Presents how page i scores on dimension j.
V can be used to interpret the dimensions resulting from SVD.D
: The concept strength matrix.The dimensions are reduced by keeping the first k singular values.
The singular values are sorted in decreasing order of explained variance in the original data.
k
is set equal to 50.
k <- 50
SVD is applied to the pagesTRAIN dataframe, which is a binary matrix indicating whether user i likes Facebook pages j. Let’s have a look at the first 10 top_pages for a random sample of 10 users.
pagesTRAIN %>%
select(all_of(top_pages)) %>%
select(1:10) %>%
slice_sample(n = 10)
Plan International Belgium HLN.be Belgian Red Devils E\\u00e9n Nieuwsblad.be GuiHome vous d\\u00e9tend Tasty
5291 1 0 0 0 0 0 1
3976 1 0 0 0 0 1 0
2109.2 1 1 0 0 0 0 0
1518.1 1 1 0 1 0 0 0
6695.2 1 0 0 0 0 0 0
5288 1 0 0 0 0 0 0
5706.6 1 0 0 0 0 0 0
6747.1 1 0 0 0 1 0 0
2597 1 0 0 0 0 0 0
1498.2 1 0 0 0 0 0 0
De Standaard Lidl Belgium De Morgen
5291 0 1 0
3976 0 1 0
2109.2 0 0 0
1518.1 0 0 0
6695.2 0 0 0
5288 0 0 0
5706.6 0 0 0
6747.1 0 0 0
2597 0 0 0
1498.2 0 0 0
Note that matrices U and V have to be computed by applying SVD to the TRAIN (!) set to make sure there is no information leakage between the train and test data.
svd_pages <- svd(pagesTRAIN %>% select(all_of(top_pages)))
Only the first k
dimensions in matrix U are selected. These are the new features.
svd_pagesTRAIN <- svd_pages$u[, c(1:k)]
The features in the test set should also be transformed. This is done by matrix multiplying the test matrix with matrix V.
svd_pagesTEST <- as.matrix(pagesTEST %>% select(all_of(top_pages))) %*%
svd_pages$v[, c(1:k)] %*%
solve(diag(svd_pages$d[c(1:k)]))
Same computation for the categories of Facebook pages. Note that SVD is applied to pages, categories and groups separately. This is done to be able to distinguish between their respective value/importances in the classification model.
svd_categories <- svd(categoriesTRAIN %>% select(all_of(top_categories)))
svd_categoriesTRAIN <- svd_categories$u[, c(1:k)]
svd_categoriesTEST <- as.matrix(categoriesTEST %>% select(all_of(top_categories))) %*%
svd_categories$v[, c(1:k)] %*%
solve(diag(svd_categories$d[c(1:k)]))
svd_groups <- svd(groupsTRAIN %>% select(all_of(top_groups)))
svd_groupsTRAIN <- svd_groups$u[, c(1:k)]
svd_groupsTEST <- as.matrix(groupsTEST %>% select(all_of(top_groups))) %*%
svd_groups$v[, c(1:k)] %*%
solve(diag(svd_groups$d[c(1:k)]))
The data for this exercise contains a train and test basetable of photo features for cats and dogs.
Apply SVD and store the result in svd
.
Reduce the dimensions by keeping the first 100 singular values.
Store the results in svd_train
and svd_test
.
To download the train basetable for the cats and dogs click: here2
To download the test basetable for the cats and dogs click: here3
Assume that: