The keras package comes with a number of example datasets, including the MNIST digit data. Our first step is to load the MNIST data; the dataset_mnist() function is provided for this purpose.
library(keras)  # assumes the keras package and its Python backend are installed
mnist <- dataset_mnist()
x_train <- mnist$train$x
g_train <- mnist$train$y
x_test <- mnist$test$x
g_test <- mnist$test$y
dim(x_train)
[1] 60000 28 28
dim(x_test)
[1] 10000 28 28
There are 60,000 images in the training data and 10,000 in the test data. The images are \(28 \times 28\), and stored as a three-dimensional array, so we need to reshape them into a matrix. Also, we need to “one-hot” encode the class label. Luckily keras has a lot of built-in functions that do this for us.
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
y_train <- to_categorical(g_train, 10)
y_test <- to_categorical(g_test, 10)
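As a quick check that the reshaping and one-hot encoding behaved as expected, we can look at the new dimensions: each image is now a row of 784 pixel values, and each label a row of 10 indicator values.
dim(x_train)
[1] 60000 784
dim(y_train)
[1] 60000 10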
Neural networks are somewhat sensitive to the scale of the inputs. For example, ridge and lasso regularization are affected by scaling. Here the inputs are eight-bit grayscale values between 0 and 255, so we rescale to the unit interval.
x_train <- x_train / 255
x_test <- x_test / 255
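A quick way to confirm the rescaling is to inspect the range of the pixel values, which should now be the unit interval rather than 0 to 255.
range(x_train)
[1] 0 1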
Now we are ready to fit our neural network.
modelnn <- keras_model_sequential()
modelnn %>%
  layer_dense(units = 256, activation = "relu",
              input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")
The first layer goes from \(28 \times 28 = 784\) input units to a hidden layer of 256 units, which uses the ReLU activation function. This is specified by a call to layer_dense(), which takes as input a modelnn object and returns a modified modelnn object. This is then piped through layer_dropout() to perform dropout regularization. The second hidden layer comes next, with 128 hidden units, followed by another dropout layer. The final layer is the output layer, with activation "softmax" for the 10-class classification problem; it defines the map from the second hidden layer to class probabilities.
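Concretely, if \(Z_0, Z_1, \ldots, Z_9\) denote the ten linear combinations computed at the output layer, the softmax activation converts them into class probabilities via
\[
\Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{\ell=0}^{9} e^{Z_\ell}}, \qquad m = 0, 1, \ldots, 9,
\]
so the ten outputs are nonnegative and sum to one.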
Finally, we use summary()
to summarize the model, and to make
sure we got it all right.
summary(modelnn)
Model: "sequential_1"
______________________________________________________________
Layer (type)                Output Shape             Param #
==============================================================
dense_4 (Dense)             (None, 256)              200960
______________________________________________________________
dropout_2 (Dropout)         (None, 256)              0
______________________________________________________________
dense_3 (Dense)             (None, 128)              32896
______________________________________________________________
dropout_1 (Dropout)         (None, 128)              0
______________________________________________________________
dense_2 (Dense)             (None, 10)               1290
==============================================================
Total params: 235,146
Trainable params: 235,146
Non-trainable params: 0
______________________________________________________________
The parameter count for each layer includes a bias term. For example, the first hidden layer involves \((784 + 1) \times 256 = 200,960\) parameters, and summing the counts for the three dense layers gives the total of 235,146.
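The remaining entries in the Param # column can be reproduced the same way; the short calculation below is plain arithmetic, nothing keras-specific.
(784 + 1) * 256   # first hidden layer: 200960
(256 + 1) * 128   # second hidden layer: 32896
(128 + 1) * 10    # output layer: 1290
200960 + 32896 + 1290   # total: 235146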
Notice that the layer names such as dropout_1 and dense_2 have subscripts. These may appear somewhat random; in fact, if you fit the same model again, they will change. They are of no consequence: they vary because the model-specification code is run in Python, and these subscripts are incremented every time keras_model_sequential() is called.
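To actually fit the model we still need to choose a loss function and an optimizer, and then run the training loop. A minimal sketch of those two steps is shown below; the particular optimizer, number of epochs, batch size, and validation split are illustrative choices, not values prescribed above.
modelnn %>% compile(loss = "categorical_crossentropy",
                    optimizer = optimizer_rmsprop(),
                    metrics = c("accuracy"))
history <- modelnn %>%
  fit(x_train, y_train, epochs = 30, batch_size = 128,
      validation_split = 0.2)
The categorical cross-entropy loss matches the one-hot encoded labels created with to_categorical(), and the returned history object records the training and validation metrics for each epoch.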
As an exercise, try modifying modelnn to use hidden layers of 64 and 32 hidden units, adding a ReLU activation function to each hidden layer, and refit the model.