SVD reduces the sparse matrices created in the previous exercise to a selected number of dimensions, k. Note that we use the approximate, truncated SVD from the irlba package, since the standard svd() gets stuck on very large datasets.
# p_load() comes from the pacman package; it installs irlba if needed and loads it
pacman::p_load(irlba)
We set k, the number of singular values and vectors to compute, to a fairly high number (20 in our case).
trainer <- irlba(t(sm_train), nu=20, nv=20)
str(trainer)
List of 5
$ d : num [1:20] 43.42 16.66 15.96 10.64 9.99 ...
$ u : num [1:794, 1:20] 0.000338 0.017823 0.000067 0.001432 0.000067 ...
$ v : num [1:57, 1:20] 2.91e-03 6.11e-02 4.05e-05 4.95e-02 1.29e-03 ...
$ iter : int 5
$ mprod: int 86
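To make the roles of d, u and v in this output more concrete, here is a minimal sketch on a made-up toy matrix. It assumes base R's svd() as a stand-in for irlba() so that it runs without extra packages; the matrix A and all of its values are hypothetical and only mimic the shape of t(sm_train) (terms in rows, documents in columns).

```r
# Toy stand-in for t(sm_train): 6 terms (rows) x 4 documents (columns).
# All values are made up for illustration.
A <- matrix(c(1, 0, 2, 0,
              0, 1, 0, 1,
              3, 0, 1, 0,
              0, 2, 0, 1,
              1, 1, 0, 0,
              0, 0, 1, 2),
            nrow = 6, byrow = TRUE)

k <- 2  # number of singular triplets to keep

# Base-R svd() is swapped in for irlba() here. It returns all singular
# values, so d is trimmed to the first k by hand.
dec <- svd(A, nu = k, nv = k)
d <- dec$d[1:k]
u <- dec$u  # 6 x k: one row per term
v <- dec$v  # 4 x k: one row per document

# Rank-k approximation of A: U %*% D %*% t(V)
A_k <- u %*% diag(d) %*% t(v)

dim(u)
dim(v)

# Frobenius norm of the approximation error
sqrt(sum((A - A_k)^2))
```

The same reading applies to the str(trainer) output above: u has one row per row of the input matrix (794), v has one row per column (57), and d holds the 20 singular values in decreasing order.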
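We are interested in V, the matrix of right singular vectors. Because we decomposed the transposed matrix, V has one row for each row of sm_train (57 here) and one column for each of the 20 dimensions.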
str(trainer$v)
num [1:57, 1:20] 2.91e-03 6.11e-02 4.05e-05 4.95e-02 1.29e-03 ...
# Project the test matrix into the space learned from the training data: X_test %*% U %*% D^-1
tester <- as.data.frame(as.matrix(sm_test) %*% trainer$u %*% solve(diag(trainer$d)))
head(tester)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000 0.000000000 0.000000000 0.000000000 0.00000000 0.00000000
2 0.004638223 -0.001838058 0.001808587 0.01134719 -0.04593843 -0.040340958 -0.002985707 0.032481679 -0.04327483 0.03081427
3 0.032073328 0.014390644 -0.001614435 -0.03877729 -0.01980233 -0.006939413 0.029395594 -0.009106234 -0.05035071 -0.05923115
4 0.102062718 -0.059287985 0.098865964 0.01234581 -0.06688715 0.008638357 -0.021466135 0.062304890 0.03959509 0.04814645
5 0.131434554 -0.138102331 -0.154235650 0.16613758 -0.06415589 0.073331011 -0.039856198 0.063213181 -0.31122543 -0.04374198
6 0.022558175 -0.050621175 -0.055580420 0.07558074 -0.12697553 -0.112014066 0.031387910 0.027835233 -0.14114773 0.05139434
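The projection used for tester can be sanity-checked on a small example: applying the same formula to the training rows should give back V. The sketch below assumes base R's svd() in place of irlba() and uses made-up matrices; X_train, X_test and project() are hypothetical names that do not appear in the exercise.

```r
# Toy stand-ins for sm_train and sm_test: documents in rows, 6 terms in columns.
# All values are made up for illustration.
X_train <- matrix(c(1, 0, 3, 0, 1, 0,
                    0, 1, 0, 2, 1, 0,
                    2, 0, 1, 0, 0, 1,
                    0, 1, 0, 1, 0, 2),
                  nrow = 4, byrow = TRUE)
X_test <- matrix(c(1, 1, 0, 0, 2, 0,
                   0, 0, 0, 0, 0, 0),
                 nrow = 2, byrow = TRUE)

k <- 2

# Decompose the transposed training matrix, as in the text.
# svd() is swapped in for irlba(); d is trimmed to the first k values.
dec <- svd(t(X_train), nu = k, nv = k)
d <- dec$d[1:k]

# Hypothetical helper: project rows of X into the k-dimensional space
# via X %*% U %*% D^-1 (diag(1 / d) is the inverse of diag(d)).
project <- function(X) X %*% dec$u %*% diag(1 / d)

train_proj <- project(X_train)
test_proj  <- project(X_test)

# Projecting the training rows reproduces V up to rounding error.
all.equal(train_proj, dec$v)

# The all-zero test row is mapped to all zeros.
test_proj
```

This is why the features in trainer$v and in tester live in the same space. A test row that shares no terms with the training vocabulary is projected to zeros, which would be consistent with row 1 of the output above.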
Compute the trainer and the tester variables for the oxfam data, with k equal to 2.
To download the SentimentReal dataset click here1.
To download the oxfam dataset click here2.
Assume that sm_train_oxfam and sm_test_oxfam, which were calculated in the previous exercise, are given.