While it would be possible to carry out this lab using the prcomp() function, here we use the svd() function in order to illustrate its use.
We now omit 20 entries in the \(50 \times 4\) data matrix at random. We do so by first selecting 20 rows (states) at random, and then selecting one of the four entries in each row at random. This ensures that every row has at least three observed values.
nomit <- 20
set.seed(15)
ina <- sample(seq(50), nomit)
inb <- sample(1:4, nomit, replace = TRUE)
Xna <- X
index.na <- cbind(ina, inb)
Xna[index.na] <- NA
Here, ina contains 20 integers from 1 to 50; this represents the states that are selected to contain missing values. And inb contains 20 integers from 1 to 4, representing the features that contain the missing values for each of the selected states. To perform the final indexing, we create index.na, a two-column matrix whose columns are ina and inb. We have indexed a matrix with a matrix of indices!
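To see how this kind of indexing works, consider a small stand-alone illustration (the matrix m and index matrix idx below are hypothetical, not part of the lab): each row of the index matrix gives the row and column of one element to be accessed.
m <- matrix(1:12, nrow = 3)
idx <- cbind(c(1, 3), c(2, 4))  # (row, column) pairs: (1, 2) and (3, 4)
m[idx]                          # extracts m[1, 2] and m[3, 4]
m[idx] <- NA                    # assigns NA to those same two positions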
We now write some code to implement Algorithm 12.1 from the book. We first write a function that takes in a matrix and returns an approximation to that matrix using the svd() function. This will be needed in Step 2 of Algorithm 12.1. As mentioned earlier, we could do this using the prcomp() function, but instead we use the svd() function for illustration.
fit.svd <- function(X, M = 1) {
  svdob <- svd(X)
  with(svdob,
       u[, 1:M, drop = FALSE] %*%
         (d[1:M] * t(v[, 1:M, drop = FALSE]))
  )
}
Here, we did not bother to explicitly call the return() function to return a value from fit.svd(); however, the computed quantity is automatically returned by R. We use the with() function to make it a little easier to index the elements of svdob. As an alternative to using with(), we could have written
svdob$u[, 1:M, drop = FALSE] %*% (svdob$d[1:M] * t(svdob$v[, 1:M, drop = FALSE]))
inside the fit.svd() function.
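As a quick sanity check on fit.svd() (assuming X is the scaled data matrix constructed earlier in the lab), we can verify that the output has the same dimensions as X and that keeping all four components reproduces X up to rounding error; with a smaller M, the result is the best rank-M least-squares approximation of X.
Xapp2 <- fit.svd(X, M = 2)         # rank-2 approximation of X
dim(Xapp2)                         # same dimensions as X, i.e. 50 x 4
max(abs(fit.svd(X, M = 4) - X))    # essentially zero: M = 4 recovers X exactly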
To conduct Step 1 of the algorithm, we initialize Xhat (this is \(\tilde X\) in Algorithm 12.1) by replacing the missing values with the column means of the non-missing entries.
Xhat <- Xna
xbar <- colMeans(Xna, na.rm = TRUE)
Xhat[index.na] <- xbar[inb]
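The last assignment works because xbar[inb] returns, for each missing entry, the mean of the column in which that entry lies. A quick check (a small addition, not part of the original code) confirms that every missing value has been replaced by its column mean:
head(cbind(row = ina, col = inb, imputed = xbar[inb]))  # column mean used for each missing entry
all(Xhat[index.na] == xbar[inb])                        # TRUE: every NA filled with its column mean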
Before we begin Step 2, we set ourselves up to measure the progress of our iterations:
thresh <- 1e-7
rel_err <- 1
iter <- 0
ismiss <- is.na(Xna)
mssold <- mean((scale(Xna, xbar, FALSE)[!ismiss])^2)
mss0 <- mean(Xna[!ismiss]^2)
Here ismiss is a new logical matrix with the same dimensions as Xna; a given element equals TRUE if the corresponding matrix element is missing. This is useful because it allows us to access both the missing and non-missing entries. We store the mean of the squared non-missing elements in mss0. We store the mean squared error of the non-missing elements of the old version of Xhat in mssold. We plan to store the mean squared error of the non-missing elements of the current version of Xhat in mss, and will then iterate Step 2 of Algorithm 12.1 until the relative error, defined as (mssold - mss) / mss0, falls below thresh = 1e-7.
We are now ready to implement Step 2 of Algorithm 12.1 in a loop. In each iteration, do the following:

1. Approximate Xhat using fit.svd(), setting the M parameter to 1. Store the result in Xapp.
2. Use Xapp to update the estimates for the elements in Xhat that are missing in Xna. Remember that the missing indices are stored in ismiss.
3. First, compute the mean squared error mss between Xna and Xapp for the non-missing indices. Second, calculate the relative error rel_err by taking the difference of mssold and mss and dividing this by mss0. Third, replace the value of mssold with the new value mss.
4. Increment the iteration counter iter.

These steps are repeated until rel_err falls below thresh; one possible implementation is sketched below.
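The following while loop is one way to carry out these steps (a sketch; the progress message printed at the end of each iteration is an optional addition):
while (rel_err > thresh) {
  iter <- iter + 1
  # Step 2(a): compute a rank-1 approximation to the current Xhat
  Xapp <- fit.svd(Xhat, M = 1)
  # Step 2(b): update only the entries of Xhat that are missing in Xna
  Xhat[ismiss] <- Xapp[ismiss]
  # Step 2(c): measure the fit on the observed (non-missing) entries
  mss <- mean(((Xna - Xapp)[!ismiss])^2)
  rel_err <- (mssold - mss) / mss0
  mssold <- mss
  cat("Iter:", iter, "MSS:", mss, "Rel. Err:", rel_err, "\n")
}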