Singular Value Decomposition

SVD reduces the sparse matrices that were created in the previous exercise to a selected number of dimensions. Note that we use the approximation implemented in the package irlba, since the standard svd function does not scale well to very large datasets.

p_load(irlba)

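We set k, the number of singular vectors to keep, to a high number (20 in our case) through the arguments nu (left singular vectors) and nv (right singular vectors).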

trainer <- irlba(t(sm_train), nu=20, nv=20)
str(trainer)
List of 5
 $ d    : num [1:20] 43.42 16.66 15.96 10.64 9.99 ...
 $ u    : num [1:794, 1:20] 0.000338 0.017823 0.000067 0.001432 0.000067 ...
 $ v    : num [1:57, 1:20] 2.91e-03 6.11e-02 4.05e-05 4.95e-02 1.29e-03 ...
 $ iter : int 5
 $ mprod: int 86

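We are interested in V, the matrix of right singular vectors. It has one row for each column of t(sm_train), that is, for each row of sm_train, and one column for each of the 20 retained dimensions.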

str(trainer$v)
num [1:57, 1:20] 2.91e-03 6.11e-02 4.05e-05 4.95e-02 1.29e-03 ...
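
As a reading aid, the factorization behind these objects can be written out as follows. The symbols A, U_k, D_k and V_k are notation introduced here and do not appear in the code: A stands for t(sm_train), k = 20, and the shapes are taken from the str output above.

```latex
A \approx U_k D_k V_k^{\top}, \qquad
A \in \mathbb{R}^{794 \times 57},\quad
U_k \in \mathbb{R}^{794 \times 20},\quad
D_k = \operatorname{diag}(d_1, \dots, d_{20}),\quad
V_k \in \mathbb{R}^{57 \times 20}

% The columns of U_k are orthonormal, so U_k^{\top} U_k = I_k and therefore
A^{\top} U_k D_k^{-1} \approx V_k D_k U_k^{\top} U_k D_k^{-1} = V_k
```

In words: multiplying the rows of sm_train by U_k and then by the inverse of D_k gives back V_k, which is why V can be read as the reduced representation of those rows.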

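To place the test data in the same reduced space, we multiply the test matrix by U and by the inverse of the diagonal matrix of singular values:

tester <- as.data.frame(as.matrix(sm_test) %*% trainer$u %*% solve(diag(trainer$d)))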
head(tester)
           V1           V2           V3          V4          V5           V6           V7           V8          V9         V10
1 0.000000000  0.000000000  0.000000000  0.00000000  0.00000000  0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
2 0.004638223 -0.001838058  0.001808587  0.01134719 -0.04593843 -0.040340958 -0.002985707  0.032481679 -0.04327483  0.03081427
3 0.032073328  0.014390644 -0.001614435 -0.03877729 -0.01980233 -0.006939413  0.029395594 -0.009106234 -0.05035071 -0.05923115
4 0.102062718 -0.059287985  0.098865964  0.01234581 -0.06688715  0.008638357 -0.021466135  0.062304890  0.03959509  0.04814645
5 0.131434554 -0.138102331 -0.154235650  0.16613758 -0.06415589  0.073331011 -0.039856198  0.063213181 -0.31122543 -0.04374198
6 0.022558175 -0.050621175 -0.055580420  0.07558074 -0.12697553 -0.112014066  0.031387910  0.027835233 -0.14114773  0.05139434
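
The projection step can be checked on a small toy example. The sketch below is not part of the original exercise: the matrices dtm_train and dtm_test, the helper project, and the use of base svd in place of irlba are all assumptions made for illustration. It shows that applying the projection to the training rows reproduces V, so applying it to the test rows puts them in the same space.

```r
# Toy data (made up for illustration): rows are documents, columns are terms.
set.seed(1)
dtm_train <- matrix(rpois(30, lambda = 1), nrow = 6, ncol = 5)
dtm_test  <- matrix(rpois(10, lambda = 1), nrow = 2, ncol = 5)

k <- 2

# Base svd() stands in for irlba() on this tiny matrix.
# As in the text, the decomposition is computed on the transposed training matrix.
s <- svd(t(dtm_train), nu = k, nv = k)

# svd() returns all singular values, so keep only the first k.
d_k <- s$d[1:k]

# Hypothetical helper: rows of dtm are mapped to k dimensions
# by multiplying with U_k and the inverse of D_k.
project <- function(dtm) dtm %*% s$u %*% diag(1 / d_k, nrow = k)

train_coords <- project(dtm_train)
test_coords  <- project(dtm_test)

# Compare the projected training rows with V_k (up to floating-point error).
print(all.equal(train_coords, s$v))

# One row per test document, one column per retained dimension.
print(dim(test_coords))
```

Using diag(1 / d_k, nrow = k) rather than solve(diag(d_k)) avoids a matrix inversion and also behaves correctly when k is 1, where diag(d_k) would be interpreted as an identity-matrix size instead of a diagonal entry.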

Exercise

Compute the trainer and the tester variables for the oxfam data, with k equal to 2.

To download the SentimentReal dataset click here¹.

To download the oxfam dataset click here².

Assume that:

The variables sm_train_oxfam and sm_test_oxfam, which were calculated in the previous exercise, are given.
The irlba package is loaded.