
Singular Value Decomposition

SVD reduces the sparse matrices created in the previous exercise to a selected number k of dimensions. We use the approximation implemented in the irlba package, because the standard svd function gets stuck on very large datasets.

p_load(irlba)

We set k, the number of singular vectors to keep, to a high number (20 in our case) and pass it as both nu and nv.

trainer <- irlba(t(sm_train), nu=20, nv=20)
str(trainer)
List of 5
 $ d    : num [1:20] 43.42 16.66 15.96 10.64 9.99 ...
 $ u    : num [1:794, 1:20] 0.000338 0.017823 0.000067 0.001432 0.000067 ...
 $ v    : num [1:57, 1:20] 2.91e-03 6.11e-02 4.05e-05 4.95e-02 1.29e-03 ...
 $ iter : int 5
 $ mprod: int 86

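We are interested in V, the matrix of right singular vectors. Each of its 57 rows corresponds to one column of t(sm_train), described by 20 values.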

str(trainer$v)
num [1:57, 1:20] 2.91e-03 6.11e-02 4.05e-05 4.95e-02 1.29e-03 ...
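
In matrix terms, the decomposition can be sketched as follows. The symbols are notation introduced here for illustration, not part of the original code: A stands for t(sm_train), which has 794 rows and 57 columns according to the str output above, and k = 20. For an exact SVD the columns of U are orthonormal, which gives the second line; irlba computes an approximation, so in practice the identity holds only up to its numerical tolerance.

```latex
\begin{align*}
A &\approx U_k D_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{794 \times 20},\quad
D_k = \operatorname{diag}(d_1, \dots, d_{20}),\quad
V_k \in \mathbb{R}^{57 \times 20} \\
V_k &= A^{\top} U_k D_k^{-1}
\end{align*}
```

Since the transpose of A is sm_train itself, the second line says that V can be recovered by multiplying sm_train by U and by the inverse of diag(d).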

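The test data are then projected into the same reduced space, using the u and d obtained from the training data:

tester <- as.data.frame(as.matrix(sm_test) %*% trainer$u %*% solve(diag(trainer$d)))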
head(tester)
           V1           V2           V3          V4          V5           V6           V7           V8          V9         V10
1 0.000000000  0.000000000  0.000000000  0.00000000  0.00000000  0.000000000  0.000000000  0.000000000  0.00000000  0.00000000
2 0.004638223 -0.001838058  0.001808587  0.01134719 -0.04593843 -0.040340958 -0.002985707  0.032481679 -0.04327483  0.03081427
3 0.032073328  0.014390644 -0.001614435 -0.03877729 -0.01980233 -0.006939413  0.029395594 -0.009106234 -0.05035071 -0.05923115
4 0.102062718 -0.059287985  0.098865964  0.01234581 -0.06688715  0.008638357 -0.021466135  0.062304890  0.03959509  0.04814645
5 0.131434554 -0.138102331 -0.154235650  0.16613758 -0.06415589  0.073331011 -0.039856198  0.063213181 -0.31122543 -0.04374198
6 0.022558175 -0.050621175 -0.055580420  0.07558074 -0.12697553 -0.112014066  0.031387910  0.027835233 -0.14114773  0.05139434
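
The projection used for tester can be checked on a toy matrix: folding the training data themselves into the reduced space should reproduce V. The sketch below is a minimal illustration, not the course's code. It uses base R's svd instead of irlba so that it runs without extra packages, and the matrix A, the value k = 2, and the names s, d_k and projected are made up for the example.

```r
# Toy stand-in for t(sm_train): 6 rows (terms) x 4 columns (documents).
# These numbers are random and purely illustrative.
set.seed(1)
A <- matrix(rpois(24, lambda = 1), nrow = 6, ncol = 4)

k <- 2
s <- svd(A, nu = k, nv = k)  # base R SVD; s$d holds all singular values
d_k <- s$d[1:k]              # keep only the first k singular values

# Same formula as for tester, applied to the training data t(A).
# diag(1 / d_k) is the inverse of diag(d_k), equivalent to solve(diag(d_k)).
projected <- t(A) %*% s$u %*% diag(1 / d_k)

# Compare with V; the two should agree up to floating-point error.
print(all.equal(projected, s$v))
```

Because the same u and d are applied to sm_test, the columns of tester live in the same space as the columns of trainer$v, so the two can be compared or fed to the same model.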

Exercise

Compute the trainer and the tester variables for the oxfam data, with a k equal to 2.

To download the SentimentReal dataset click here.

To download the oxfam dataset click here.

Assume that:

The variables sm_train_oxfam and sm_test_oxfam, which were calculated in the previous exercise, are given.
The irlba package is loaded.