Singular Value Decomposition for Donation Behavior Analysis

In this exercise, we will explore how to model donation behavior using Singular Value Decomposition (SVD) and logistic regression. We will use the Donation.Rdata dataset for this purpose. You can find the related paper here1.

Loading Required Packages and Data

load("Donation.RData")
if (!require("pacman")) install.packages("pacman"); require("pacman")
p_load(glmnet, tidyverse, data.table, AUC, lift, ggplot2)

We also set a seed to ensure consistent outputs across multiple runs

set.seed(seed = 68)

Selecting Top Pages, Groups, and Categories

To avoid running out of RAM memory and to decrease computation time, we select the top 1000 pages for each matrix.

select_top_n <- function(data, n = 1000) {
  data %>%
    map_df(function(x) sum(x)) %>%
    pivot_longer(cols = everything(), names_to = 'column', values_to = 'freq') %>%
    arrange(-freq) %>%
    slice_head(n = n) %>%
    pull(column)
}

top_pages <- select_top_n(pagesTRAIN)
top_groups <- select_top_n(groupsTRAIN)
top_categories <- select_top_n(categoriesTRAIN)

Depending on the capacity of your machine, you can select different top pages/groups/categories.

Applying Singular Value Decomposition (SVD)

Now to SVD. SVD is used as a dimensionality reduction technique to reduce the number of features. Remember: A = UDV ^ T.

The dimensions are reduced by keeping the first k singular values. The singular values are sorted in decreasing order of explained variance in the original data. k is set equal to 50.

k <- 50

SVD is applied to the pagesTRAIN dataframe, which is a binary matrix indicating whether user i likes Facebook pages j. Let’s have a look at the first 10 top_pages for a random sample of 10 users.

pagesTRAIN %>%
  select(all_of(top_pages)) %>%
  select(1:10) %>%
  slice_sample(n = 10)
           Plan International   Belgium HLN.be   Belgian Red Devils   E\\u00e9n   Nieuwsblad.be   GuiHome vous d\\u00e9tend   Tasty
5291       1                    0                0                    0           0               0                           1
3976       1                    0                0                    0           0               1                           0
2109.2     1                    1                0                    0           0               0                           0
1518.1     1                    1                0                    1           0               0                           0
6695.2     1                    0                0                    0           0               0                           0
5288       1                    0                0                    0           0               0                           0
5706.6     1                    0                0                    0           0               0                           0
6747.1     1                    0                0                    0           1               0                           0
2597       1                    0                0                    0           0               0                           0
1498.2     1                    0                0                    0           0               0                           0
           De Standaard   Lidl Belgium   De Morgen
5291       0              1              0
3976       0              1              0
2109.2     0              0              0
1518.1     0              0              0
6695.2     0              0              0
5288       0              0              0
5706.6     0              0              0
6747.1     0              0              0
2597       0              0              0
1498.2     0              0              0

Note that matrices U and V have to be computed by applying SVD to the TRAIN (!) set to make sure there is no information leakage between the train and test data.

svd_pages <- svd(pagesTRAIN %>% select(all_of(top_pages)))

Only the first k dimensions in matrix U are selected. These are the new features.

svd_pagesTRAIN <- svd_pages$u[, c(1:k)]

The features in the test set should also be transformed. This is done by matrix multiplying the test matrix with matrix V.

svd_pagesTEST <- as.matrix(pagesTEST %>% select(all_of(top_pages))) %*%
  svd_pages$v[, c(1:k)] %*%
  solve(diag(svd_pages$d[c(1:k)]))

Same computation for the categories of Facebook pages. Note that SVD is applied to pages, categories and groups separately. This is done to be able to distinguish between their respective value/importances in the classification model.

svd_categories <- svd(categoriesTRAIN %>% select(all_of(top_categories)))
svd_categoriesTRAIN <- svd_categories$u[, c(1:k)]
svd_categoriesTEST <- as.matrix(categoriesTEST %>% select(all_of(top_categories))) %*%
  svd_categories$v[, c(1:k)] %*%
  solve(diag(svd_categories$d[c(1:k)]))

svd_groups <- svd(groupsTRAIN %>% select(all_of(top_groups)))
svd_groupsTRAIN <- svd_groups$u[, c(1:k)]
svd_groupsTEST <- as.matrix(groupsTEST %>% select(all_of(top_groups))) %*%
  svd_groups$v[, c(1:k)] %*%
  solve(diag(svd_groups$d[c(1:k)]))

Exercise

The data for this exercise contains a train and test basetable of photo features for cats and dogs. Apply SVD and store the result in svd. Reduce the dimensions by keeping the first 100 singular values. Store the results in svd_train and svd_test.

To download the train basetable for the cats and dogs click: here2

To download the test basetable for the cats and dogs click: here3


Assume that: