The keras package comes with a number of example datasets, including the MNIST digit data. Our first step is to load the MNIST data. The dataset_mnist() function is provided for this purpose.

mnist <- dataset_mnist()
x_train <- mnist$train$x
g_train <- mnist$train$y
x_test <- mnist$test$x
g_test <- mnist$test$y
dim(x_train)
[1] 60000    28    28
dim(x_test)
[1] 10000    28    28
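
As a quick optional check (not part of the lab itself), we can look at one of the training images. Each image is a \(28 \times 28\) matrix of grayscale values, and g_train holds the corresponding digit labels; the plotting idiom below is just one way to render a digit with base R graphics.

# Display the first training digit; the rows are flipped and the matrix
# transposed so that image() draws the digit upright.
image(t(apply(x_train[1, , ], 2, rev)),
      col = gray(seq(0, 1, length.out = 256)), axes = FALSE,
      main = paste("Label:", g_train[1]))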

There are 60,000 images in the training data and 10,000 in the test data. The images are \(28 \times 28\) and are stored as a three-dimensional array, so we need to reshape them into a matrix. We also need to “one-hot” encode the class labels. Luckily keras has built-in functions that do both of these for us.

x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
y_train <- to_categorical(g_train, 10)
y_test <- to_categorical(g_test, 10)
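
It can be worth verifying what these two steps have done (the checks below are optional and not part of the lab): after array_reshape() each image is a row of 784 pixel values, and after to_categorical() each label is a 10-column indicator vector.

dim(x_train)    # now 60000 x 784
dim(y_train)    # now 60000 x 10
g_train[1:3]    # the first three digit labels
y_train[1:3, ]  # their one-hot encodings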

Neural networks are somewhat sensitive to the scale of the inputs, much as ridge and lasso regularization are affected by scaling. Here the inputs are eight-bit grayscale values between 0 and 255, so we rescale to the unit interval.

x_train <- x_train / 255
x_test <- x_test / 255
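
A quick optional check confirms that the rescaled pixel values now lie in the unit interval:

range(x_train)  # minimum 0, maximum 1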

Now we are ready to fit our neural network.

modelnn <- keras_model_sequential()
modelnn %>%
  layer_dense(units = 256, activation = "relu",
              input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")

The first layer goes from \(28 \times 28 = 784\) input units to a hidden layer of 256 units, which uses the ReLU activation function. This is specified by a call to layer_dense(), which takes as input a modelnn object, and returns a modified modelnn object. This is then piped through layer_dropout() to perform dropout regularization. The second hidden layer comes next, with 128 hidden units, followed by another dropout layer. The final layer is the output layer, with activation "softmax",

\[f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{\ell = 0}^{9} e^{Z_\ell}}\]

for the 10-class classification problem, which defines the map from the second hidden layer to class probabilities. Finally, we use summary() to summarize the model, and to make sure we got it all right.
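
To make the softmax map concrete, here is a minimal sketch in plain R; the softmax() helper and the made-up activation vector z are ours for illustration and are not part of keras, which applies this transformation internally.

# Softmax: turns 10 raw activations into class probabilities.
softmax <- function(z) exp(z) / sum(exp(z))

z <- c(2.0, 0.5, -1.0, 0.1, 0.0, 0.3, -0.5, 1.2, 0.7, -2.0)  # made-up activations
round(softmax(z), 3)  # 10 probabilities
sum(softmax(z))       # they sum to 1

In practice a numerically stable implementation subtracts max(z) before exponentiating; keras takes care of such details for us.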

> summary(modelnn)
Model: "sequential_1"
_____________________________________________________________________
Layer (type)                   Output Shape               Param #    
=====================================================================
dense_4 (Dense)                (None, 256)                200960     
_____________________________________________________________________
dropout_2 (Dropout)            (None, 256)                0          
_____________________________________________________________________
dense_3 (Dense)                (None, 128)                32896      
_____________________________________________________________________
dropout_1 (Dropout)            (None, 128)                0          
_____________________________________________________________________
dense_2 (Dense)                (None, 10)                 1290       
=====================================================================
Total params: 235,146
Trainable params: 235,146
Non-trainable params: 0
_____________________________________________________________________

The parameter count for each layer includes a bias term, and the model has 235,146 parameters in total. For example, the first hidden layer involves \((784 + 1) \times 256 = 200,960\) parameters.
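
These counts are easy to verify directly in R; the arithmetic below simply mirrors the summary() output.

# (inputs + 1) * units parameters per dense layer; the +1 is the bias.
(784 + 1) * 256   # first hidden layer: 200960
(256 + 1) * 128   # second hidden layer: 32896
(128 + 1) * 10    # output layer: 1290
(784 + 1) * 256 + (256 + 1) * 128 + (128 + 1) * 10  # total: 235146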

Notice that the layer names, such as dropout_1 and dense_2, have numerical suffixes. These may appear somewhat random; in fact, if you fit the same model again, they will change. They are of no consequence: they vary because the model specification code is run in Python, and the suffixes are incremented every time keras_model_sequential() is called.

Questions