Lecture 22


April 7, 2021

1 Lecture 22
1.1 ID 5059
Tom Kelsey - April 2020
Chapter 11 – Training Deep Neural Networks
This notebook contains all the sample code and solutions to the exercises in chapter 11.

2 Setup
First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a
function to save the figures. We also check that Python 3.5 or later is installed (although Python
2.x may work, it is deprecated, so we strongly recommend you use Python 3 instead), as well as
Scikit-Learn ≥0.20 and TensorFlow ≥2.0.
[1]: # Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

%load_ext tensorboard
# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "deep"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

INFO:tensorflow:Enabling eager execution


INFO:tensorflow:Enabling v2 tensorshape
INFO:tensorflow:Enabling resource variables
INFO:tensorflow:Enabling tensor equality
INFO:tensorflow:Enabling control flow v2

3 Vanishing/Exploding Gradients Problem


[2]: def logit(z):
    return 1 / (1 + np.exp(-z))

def derivative(f, z, eps=0.000001):
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.linspace(-5, 5, 200)
plt.figure(figsize=(11,4))

plt.subplot(121)
plt.plot(z, logit(z), "g--", linewidth=2, label="Sigmoid")
plt.grid(True)
plt.legend(loc="center right", fontsize=14)
plt.title("Activation function", fontsize=14)
plt.axis([-5, 5, -1.2, 1.2])

plt.subplot(122)
plt.plot(z, derivative(logit, z), "r--", linewidth=2, label="Sigmoid")
plt.grid(True)
plt.title("Derivative", fontsize=14)
plt.axis([-5, 5, -0.2, 1.2])

save_fig("activation_function_plot")
plt.show()

Saving figure activation_function_plot

• The derivative values are close to zero for large inputs to the sigmoid activation function
– the inputs are weighted sums from potentially many nodes, so can have large absolute value
• For a shallow (i.e. few layers) network this is not a big problem
• For deep networks the gradients can be so small that learning effectively stops
• For back propagation the gradients from each layer are multiplied
– as a consequence of the chain rule
• Multiplying small numbers by small numbers soon gives very small numbers (see the numeric sketch after this list)
• This is the vanishing gradient problem
• We consider three common approaches to resolving this:
– make sure that the initial weights have a suitable size
– change the activation function(s)
– apply batch normalisation to ensure that the weighted-sum inputs are small…
– …so that the gradients are always relatively large
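As a quick numeric illustration of the chain-rule argument above (a sketch added here, not part of the original notebook): if each layer sees a fairly large weighted sum, each local sigmoid derivative is small, and the product of these derivatives collapses towards zero within a few tens of layers:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# Hypothetical chain of 30 saturated units: each sees a weighted sum of
# magnitude between 2 and 4, so each local derivative is roughly 0.02-0.1.
np.random.seed(0)
z_per_layer = np.random.uniform(2, 4, size=30)
grad = 1.0
for depth, z in enumerate(z_per_layer, start=1):
    grad *= sigmoid_derivative(z)  # chain rule: multiply the local derivatives
    if depth % 10 == 0:
        print("after {} layers the gradient factor is {:.3e}".format(depth, grad))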

[3]: z = np.linspace(-5, 5, 200)

plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([-5, 5], [1, 1], 'k--')
plt.plot([0, 0], [-0.2, 1.2], 'k-')
plt.plot([-5, 5], [-3/4, 7/4], 'g--')
plt.plot(z, logit(z), "b-", linewidth=2)
props = dict(facecolor='black', shrink=0.1)
plt.annotate('Saturating', xytext=(3.5, 0.7), xy=(5, 1), arrowprops=props, fontsize=14, ha="center")
plt.annotate('Saturating', xytext=(-3.5, 0.3), xy=(-5, 0), arrowprops=props, fontsize=14, ha="center")
plt.annotate('Linear', xytext=(2, 0.2), xy=(0, 0.5), arrowprops=props, fontsize=14, ha="center")

plt.grid(True)
plt.title("Sigmoid activation function", fontsize=14)
plt.axis([-5, 5, -0.2, 1.2])

save_fig("sigmoid_saturation_plot")
plt.show()

Saving figure sigmoid_saturation_plot

3.1 Xavier and He Initialization
• Xavier initialization works better for layers with sigmoid/tanh activation
• The idea is to select weights from a Gaussian random distribution
– instead of uniform random from [−1, 1]
• He initialization works better for layers with ReLU activation
– a small modification of the Gaussian approach (compared in the sketch after this list)
• There are many other initialisation options
• Which one (if any) is best depends on the data and the NN design
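As a rough comparison of the two schemes (a sketch, not the lecture's code; Keras's GlorotNormal and HeNormal draw from a truncated normal, but the scale is the point here): for a Dense layer with fan_in inputs and fan_out outputs, Glorot (Xavier) uses standard deviation sqrt(2 / (fan_in + fan_out)), while He uses the slightly larger sqrt(2 / fan_in), which compensates for ReLU zeroing half of its inputs:

import numpy as np

fan_in, fan_out = 784, 300   # e.g. the first Dense layer used later in this notebook

glorot_std = np.sqrt(2.0 / (fan_in + fan_out))   # Xavier / Glorot normal scale
he_std = np.sqrt(2.0 / fan_in)                   # He normal scale

rng = np.random.default_rng(42)
W_glorot = rng.normal(0.0, glorot_std, size=(fan_in, fan_out))
W_he = rng.normal(0.0, he_std, size=(fan_in, fan_out))

print("Glorot std {:.4f} (empirical {:.4f})".format(glorot_std, W_glorot.std()))
print("He     std {:.4f} (empirical {:.4f})".format(he_std, W_he.std()))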

[4]: [name for name in dir(keras.initializers) if not name.startswith("_")]

[4]: ['Constant',
'GlorotNormal',
'GlorotUniform',
'HeNormal',
'HeUniform',
'Identity',
'Initializer',
'LecunNormal',
'LecunUniform',
'Ones',
'Orthogonal',
'RandomNormal',
'RandomUniform',
'TruncatedNormal',
'VarianceScaling',
'Zeros',
'constant',
'deserialize',
'get',
'glorot_normal',
'glorot_uniform',
'he_normal',
'he_uniform',
'identity',
'lecun_normal',
'lecun_uniform',
'ones',
'orthogonal',
'random_normal',
'random_uniform',
'serialize',
'truncated_normal',
'variance_scaling',
'zeros']

[5]: keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

[5]: <tensorflow.python.keras.layers.core.Dense at 0x1896b8f70>

[6]: init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
keras.layers.Dense(10, activation="relu", kernel_initializer=init)

[6]: <tensorflow.python.keras.layers.core.Dense at 0x105868640>

3.2 Nonsaturating Activation Functions

• "Saturating" means the output is restricted to a small range, so the derivative is close to zero for large inputs

3.2.1 Leaky ReLU

[7]: def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha*z, z)

[8]: plt.plot(z, leaky_relu(z, 0.05), "b-", linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([0, 0], [-0.5, 4.2], 'k-')
plt.grid(True)
props = dict(facecolor='black', shrink=0.1)
plt.annotate('Leak', xytext=(-3.5, 0.5), xy=(-5, -0.2), arrowprops=props, fontsize=14, ha="center")
plt.title("Leaky ReLU activation function", fontsize=14)
plt.axis([-5, 5, -0.5, 4.2])

save_fig("leaky_relu_plot")
plt.show()

Saving figure leaky_relu_plot

[9]: [m for m in dir(keras.activations) if not m.startswith("_")]

[9]: ['deserialize',
'elu',
'exponential',
'gelu',
'get',
'hard_sigmoid',
'linear',
'relu',
'selu',
'serialize',
'sigmoid',
'softmax',
'softplus',
'softsign',
'swish',
'tanh']

[10]: [m for m in dir(keras.layers) if "relu" in m.lower()]

[10]: ['LeakyReLU', 'PReLU', 'ReLU', 'ThresholdedReLU']

Let’s train a neural network on Fashion MNIST using the Leaky ReLU:

[11]: (X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

[12]: tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28]),
keras.layers.Dense(300, kernel_initializer="he_normal"),
keras.layers.LeakyReLU(),
keras.layers.Dense(100, kernel_initializer="he_normal"),
keras.layers.LeakyReLU(),
keras.layers.Dense(10, activation="softmax")
])

[13]: model.compile(loss="sparse_categorical_crossentropy",
optimizer=keras.optimizers.SGD(learning_rate=1e-3),
metrics=["accuracy"])

[14]: %%time
history = model.fit(X_train, y_train, epochs=10,
validation_data=(X_valid, y_valid))

Epoch 1/10
1719/1719 [==============================] - 2s 1ms/step - loss: 1.6314 -
accuracy: 0.5054 - val_loss: 0.8886 - val_accuracy: 0.7160
Epoch 2/10
1719/1719 [==============================] - 2s 968us/step - loss: 0.8416 -
accuracy: 0.7246 - val_loss: 0.7130 - val_accuracy: 0.7656
Epoch 3/10
1719/1719 [==============================] - 2s 959us/step - loss: 0.7053 -
accuracy: 0.7638 - val_loss: 0.6427 - val_accuracy: 0.7898
Epoch 4/10
1719/1719 [==============================] - 2s 955us/step - loss: 0.6325 -
accuracy: 0.7908 - val_loss: 0.5900 - val_accuracy: 0.8066
Epoch 5/10
1719/1719 [==============================] - 2s 961us/step - loss: 0.5992 -
accuracy: 0.8020 - val_loss: 0.5582 - val_accuracy: 0.8202
Epoch 6/10
1719/1719 [==============================] - 2s 959us/step - loss: 0.5624 -
accuracy: 0.8142 - val_loss: 0.5350 - val_accuracy: 0.8236
Epoch 7/10
1719/1719 [==============================] - 2s 965us/step - loss: 0.5379 -
accuracy: 0.8218 - val_loss: 0.5157 - val_accuracy: 0.8300
Epoch 8/10
1719/1719 [==============================] - 2s 959us/step - loss: 0.5152 -
accuracy: 0.8296 - val_loss: 0.5079 - val_accuracy: 0.8284
Epoch 9/10
1719/1719 [==============================] - 2s 961us/step - loss: 0.5100 -
accuracy: 0.8268 - val_loss: 0.4895 - val_accuracy: 0.8388
Epoch 10/10
1719/1719 [==============================] - 2s 965us/step - loss: 0.4918 -
accuracy: 0.8339 - val_loss: 0.4817 - val_accuracy: 0.8396
CPU times: user 40.9 s, sys: 9.87 s, total: 50.8 s
Wall time: 17.5 s
Now let’s try PReLU:
[16]: tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28]),
keras.layers.Dense(300, kernel_initializer="he_normal"),
keras.layers.PReLU(),
keras.layers.Dense(100, kernel_initializer="he_normal"),
keras.layers.PReLU(),
keras.layers.Dense(10, activation="softmax")
])

[17]: model.compile(loss="sparse_categorical_crossentropy",
optimizer=keras.optimizers.SGD(learning_rate=1e-3),
metrics=["accuracy"])

[18]: %%time
history = model.fit(X_train, y_train, epochs=10,
validation_data=(X_valid, y_valid))

Epoch 1/10
1719/1719 [==============================] - 2s 1ms/step - loss: 1.6969 -
accuracy: 0.4974 - val_loss: 0.9255 - val_accuracy: 0.7186
Epoch 2/10
1719/1719 [==============================] - 2s 1ms/step - loss: 0.8706 -
accuracy: 0.7247 - val_loss: 0.7305 - val_accuracy: 0.7630
Epoch 3/10
1719/1719 [==============================] - 2s 1ms/step - loss: 0.7211 -
accuracy: 0.7621 - val_loss: 0.6564 - val_accuracy: 0.7882
Epoch 4/10
1719/1719 [==============================] - 2s 1ms/step - loss: 0.6447 -
accuracy: 0.7879 - val_loss: 0.6003 - val_accuracy: 0.8048
Epoch 5/10
1719/1719 [==============================] - 2s 1ms/step - loss: 0.6077 -
accuracy: 0.8004 - val_loss: 0.5656 - val_accuracy: 0.8182
Epoch 6/10
1719/1719 [==============================] - 2s 1ms/step - loss: 0.5692 -
accuracy: 0.8118 - val_loss: 0.5406 - val_accuracy: 0.8236
Epoch 7/10
1719/1719 [==============================] - 2s 1ms/step - loss: 0.5427 -
accuracy: 0.8193 - val_loss: 0.5195 - val_accuracy: 0.8310
Epoch 8/10
1719/1719 [==============================] - 2s 1ms/step - loss: 0.5193 -
accuracy: 0.8283 - val_loss: 0.5113 - val_accuracy: 0.8320
Epoch 9/10
1719/1719 [==============================] - 2s 1ms/step - loss: 0.5128 -
accuracy: 0.8273 - val_loss: 0.4916 - val_accuracy: 0.8376
Epoch 10/10
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4940 -
accuracy: 0.8314 - val_loss: 0.4826 - val_accuracy: 0.8398
CPU times: user 46.6 s, sys: 11.3 s, total: 57.9 s
Wall time: 19.6 s

3.2.2 ELU
[19]: def elu(z, alpha=1):
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)

[20]: plt.plot(z, elu(z), "b-", linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([-5, 5], [-1, -1], 'k--')
plt.plot([0, 0], [-2.2, 3.2], 'k-')
plt.grid(True)
plt.title(r"ELU activation function ($\alpha=1$)", fontsize=14)
plt.axis([-5, 5, -2.2, 3.2])

save_fig("elu_plot")
plt.show()

Saving figure elu_plot

Implementing ELU in TensorFlow is trivial: just specify the activation function when building each layer:
[21]: keras.layers.Dense(10, activation="elu")

[21]: <tensorflow.python.keras.layers.core.Dense at 0x10583cb80>

3.2.3 SELU
This activation function was proposed in this great paper by Günter Klambauer, Thomas Unterthiner and Andreas Mayr, published in June 2017. During training, a neural network composed
exclusively of a stack of dense layers using the SELU activation function and LeCun initialization
will self-normalize: the output of each layer will tend to preserve the same mean and variance
during training, which solves the vanishing/exploding gradients problem. As a result, this activa-
tion function outperforms the other activation functions very significantly for such neural nets, so
you should really try it out. Unfortunately, the self-normalizing property of the SELU activation
function is easily broken: you cannot use ℓ1 or ℓ2 regularization, regular dropout, max-norm, skip
connections or other non-sequential topologies (so recurrent neural networks won’t self-normalize).
However, in practice it works quite well with sequential CNNs. If you break self-normalization,
SELU will not necessarily outperform other activation functions.
[22]: from scipy.special import erfc

# alpha and scale to self normalize with mean 0 and standard deviation 1
# (see equation 14 in the paper):
alpha_0_1 = -np.sqrt(2 / np.pi) / (erfc(1/np.sqrt(2)) * np.exp(1/2) - 1)
scale_0_1 = (1 - erfc(1 / np.sqrt(2)) * np.sqrt(np.e)) * np.sqrt(2 * np.pi) * (
    2 * erfc(np.sqrt(2))*np.e**2 + np.pi*erfc(1/np.sqrt(2))**2*np.e
    - 2*(2+np.pi)*erfc(1/np.sqrt(2))*np.sqrt(np.e) + np.pi + 2)**(-1/2)

[23]: def selu(z, scale=scale_0_1, alpha=alpha_0_1):
    return scale * elu(z, alpha)

[24]: plt.plot(z, selu(z), "b-", linewidth=2)
plt.plot([-5, 5], [0, 0], 'k-')
plt.plot([-5, 5], [-1.758, -1.758], 'k--')
plt.plot([0, 0], [-2.2, 3.2], 'k-')
plt.grid(True)
plt.title("SELU activation function", fontsize=14)
plt.axis([-5, 5, -2.2, 3.2])

save_fig("selu_plot")
plt.show()

Saving figure selu_plot

By default, the SELU hyperparameters (scale and alpha) are tuned in such a way that the mean
output of each neuron remains close to 0, and the standard deviation remains close to 1 (assuming
the inputs are standardized with mean 0 and standard deviation 1 too). Using this activation
function, even a 1,000 layer deep neural network preserves roughly mean 0 and standard deviation
1 across all layers, avoiding the exploding/vanishing gradients problem:

[25]: np.random.seed(42)
Z = np.random.normal(size=(500, 100))  # standardized inputs
for layer in range(1000):
    W = np.random.normal(size=(100, 100), scale=np.sqrt(1 / 100))  # LeCun initialization
    Z = selu(np.dot(Z, W))
    means = np.mean(Z, axis=0).mean()
    stds = np.std(Z, axis=0).mean()
    if layer % 100 == 0:
        print("Layer {}: mean {:.2f}, std deviation {:.2f}".format(layer, means, stds))

Layer 0: mean -0.00, std deviation 1.00
Layer 100: mean 0.02, std deviation 0.96
Layer 200: mean 0.01, std deviation 0.90
Layer 300: mean -0.02, std deviation 0.92
Layer 400: mean 0.05, std deviation 0.89
Layer 500: mean 0.01, std deviation 0.93
Layer 600: mean 0.02, std deviation 0.92
Layer 700: mean -0.02, std deviation 0.90
Layer 800: mean 0.05, std deviation 0.83
Layer 900: mean 0.02, std deviation 1.00
Using SELU is easy:
[26]: keras.layers.Dense(10, activation="selu",
kernel_initializer="lecun_normal")

[26]: <tensorflow.python.keras.layers.core.Dense at 0x1895f0670>

Let’s create a neural net for Fashion MNIST with 100 hidden layers, using the SELU activation
function:
[27]: np.random.seed(42)
tf.random.set_seed(42)

[28]: model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="selu",
                             kernel_initializer="lecun_normal"))
for layer in range(99):
    model.add(keras.layers.Dense(100, activation="selu",
                                 kernel_initializer="lecun_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

[29]: model.compile(loss="sparse_categorical_crossentropy",
optimizer=keras.optimizers.SGD(learning_rate=1e-3),
metrics=["accuracy"])

Now let’s train it. Do not forget to scale the inputs to mean 0 and standard deviation 1:
[30]: pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_valid_scaled = (X_valid - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

[31]: %%time
history = model.fit(X_train_scaled, y_train, epochs=5,
validation_data=(X_valid_scaled, y_valid))

Epoch 1/5
1719/1719 [==============================] - 13s 7ms/step - loss: 1.5628 -
accuracy: 0.4012 - val_loss: 1.0999 - val_accuracy: 0.5816
Epoch 2/5
1719/1719 [==============================] - 11s 6ms/step - loss: 0.8384 -
accuracy: 0.6754 - val_loss: 0.6646 - val_accuracy: 0.7500
Epoch 3/5
1719/1719 [==============================] - 11s 6ms/step - loss: 0.7810 -
accuracy: 0.7119 - val_loss: 0.7026 - val_accuracy: 0.7446
Epoch 4/5
1719/1719 [==============================] - 11s 6ms/step - loss: 0.6288 -
accuracy: 0.7726 - val_loss: 0.5619 - val_accuracy: 0.7974
Epoch 5/5
1719/1719 [==============================] - 11s 6ms/step - loss: 0.5652 -
accuracy: 0.7892 - val_loss: 0.5099 - val_accuracy: 0.8190
CPU times: user 1min 55s, sys: 15.4 s, total: 2min 10s
Wall time: 56.7 s
Now look at what happens if we try to use the ReLU activation function instead:
[32]: np.random.seed(42)
tf.random.set_seed(42)

[34]: model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"))
for layer in range(99):
    model.add(keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

[35]: model.compile(loss="sparse_categorical_crossentropy",
optimizer=keras.optimizers.SGD(learning_rate=1e-3),
metrics=["accuracy"])

[36]: %%time
history = model.fit(X_train_scaled, y_train, epochs=5,
validation_data=(X_valid_scaled, y_valid))

Epoch 1/5
1719/1719 [==============================] - 11s 6ms/step - loss: 2.1142 -
accuracy: 0.1927 - val_loss: 1.2794 - val_accuracy: 0.4636
Epoch 2/5
1719/1719 [==============================] - 9s 5ms/step - loss: 1.3122 -
accuracy: 0.4585 - val_loss: 0.8995 - val_accuracy: 0.6168
Epoch 3/5
1719/1719 [==============================] - 9s 5ms/step - loss: 1.0267 -
accuracy: 0.5727 - val_loss: 0.8842 - val_accuracy: 0.6278
Epoch 4/5
1719/1719 [==============================] - 9s 5ms/step - loss: 1.0195 -
accuracy: 0.5909 - val_loss: 0.9205 - val_accuracy: 0.6226
Epoch 5/5
1719/1719 [==============================] - 9s 5ms/step - loss: 0.8486 -
accuracy: 0.6575 - val_loss: 1.4146 - val_accuracy: 0.3760
CPU times: user 1min 46s, sys: 15.3 s, total: 2min 2s
Wall time: 48.7 s
Not great at all: we suffered from the vanishing/exploding gradients problem.

4 Batch Normalization
[37]: model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28]),
keras.layers.BatchNormalization(),
keras.layers.Dense(300, activation="relu"),
keras.layers.BatchNormalization(),
keras.layers.Dense(100, activation="relu"),
keras.layers.BatchNormalization(),
keras.layers.Dense(10, activation="softmax")
])

[38]: model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten_6 (Flatten) (None, 784) 0

15
_________________________________________________________________
batch_normalization (BatchNo (None, 784) 3136
_________________________________________________________________
dense_316 (Dense) (None, 300) 235500
_________________________________________________________________
batch_normalization_1 (Batch (None, 300) 1200
_________________________________________________________________
dense_317 (Dense) (None, 100) 30100
_________________________________________________________________
batch_normalization_2 (Batch (None, 100) 400
_________________________________________________________________
dense_318 (Dense) (None, 10) 1010
=================================================================
Total params: 271,346
Trainable params: 268,978
Non-trainable params: 2,368
_________________________________________________________________

[39]: bn1 = model.layers[1]
[(var.name, var.trainable) for var in bn1.variables]

[39]: [('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

[40]: bn1.updates

/usr/local/lib/python3.9/site-
packages/tensorflow/python/keras/engine/base_layer.py:1331: UserWarning:
`layer.updates` will be removed in a future version. This property should not be
used in TensorFlow 2.0, as `updates` are applied automatically.
warnings.warn('`layer.updates` will be removed in a future version. '

[40]: []

[41]: model.compile(loss="sparse_categorical_crossentropy",
optimizer=keras.optimizers.SGD(learning_rate=1e-3),
metrics=["accuracy"])

[42]: %%time
history = model.fit(X_train, y_train, epochs=10,
validation_data=(X_valid, y_valid))

Epoch 1/10
1719/1719 [==============================] - 3s 2ms/step - loss: 1.2126 -
accuracy: 0.6047 - val_loss: 0.5674 - val_accuracy: 0.8066
Epoch 2/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.5990 -
accuracy: 0.7935 - val_loss: 0.4864 - val_accuracy: 0.8376
Epoch 3/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.5274 -
accuracy: 0.8183 - val_loss: 0.4479 - val_accuracy: 0.8458
Epoch 4/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.4860 -
accuracy: 0.8304 - val_loss: 0.4258 - val_accuracy: 0.8560
Epoch 5/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.4661 -
accuracy: 0.8392 - val_loss: 0.4105 - val_accuracy: 0.8626
Epoch 6/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.4360 -
accuracy: 0.8474 - val_loss: 0.3985 - val_accuracy: 0.8656
Epoch 7/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.4218 -
accuracy: 0.8535 - val_loss: 0.3887 - val_accuracy: 0.8658
Epoch 8/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.4085 -
accuracy: 0.8570 - val_loss: 0.3813 - val_accuracy: 0.8702
Epoch 9/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.4045 -
accuracy: 0.8581 - val_loss: 0.3750 - val_accuracy: 0.8728
Epoch 10/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.3894 -
accuracy: 0.8607 - val_loss: 0.3683 - val_accuracy: 0.8734
CPU times: user 1min 15s, sys: 15.2 s, total: 1min 30s
Wall time: 28.8 s
Sometimes applying BN before the activation function works better (there's a debate on this topic).
Moreover, the layer before a BatchNormalization layer does not need to have bias terms, since the
BatchNormalization layer adds its own offset (beta) parameter; keeping both would be a waste of
parameters, so you can set use_bias=False when creating those layers:
[43]: model = keras.models.Sequential([
keras.layers.Flatten(input_shape=[28, 28]),
keras.layers.BatchNormalization(),
keras.layers.Dense(300, use_bias=False),
keras.layers.BatchNormalization(),
keras.layers.Activation("relu"),
keras.layers.Dense(100, use_bias=False),
keras.layers.BatchNormalization(),
keras.layers.Activation("relu"),
keras.layers.Dense(10, activation="softmax")
])

[44]: model.compile(loss="sparse_categorical_crossentropy",
                    optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                    metrics=["accuracy"])

[45]: %%time
history = model.fit(X_train, y_train, epochs=10,
validation_data=(X_valid, y_valid))

Epoch 1/10
1719/1719 [==============================] - 3s 2ms/step - loss: 1.4225 -
accuracy: 0.5609 - val_loss: 0.6812 - val_accuracy: 0.7878
Epoch 2/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.7177 -
accuracy: 0.7732 - val_loss: 0.5585 - val_accuracy: 0.8230
Epoch 3/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.6161 -
accuracy: 0.7953 - val_loss: 0.5021 - val_accuracy: 0.8358
Epoch 4/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.5588 -
accuracy: 0.8129 - val_loss: 0.4672 - val_accuracy: 0.8440
Epoch 5/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.5307 -
accuracy: 0.8199 - val_loss: 0.4439 - val_accuracy: 0.8530
Epoch 6/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.4971 -
accuracy: 0.8287 - val_loss: 0.4264 - val_accuracy: 0.8596
Epoch 7/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.4773 -
accuracy: 0.8367 - val_loss: 0.4135 - val_accuracy: 0.8600
Epoch 8/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.4597 -
accuracy: 0.8422 - val_loss: 0.4034 - val_accuracy: 0.8636
Epoch 9/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.4525 -
accuracy: 0.8420 - val_loss: 0.3945 - val_accuracy: 0.8670
Epoch 10/10
1719/1719 [==============================] - 3s 2ms/step - loss: 0.4334 -
accuracy: 0.8486 - val_loss: 0.3866 - val_accuracy: 0.8678
CPU times: user 1min 14s, sys: 15 s, total: 1min 29s
Wall time: 28.5 s

4.1 Gradient Clipping

• Gradients can become too large as well as too small
• This is the exploding gradient problem
– often caused by trying to avoid a vanishing gradient!
• A simple approach is to check gradient values against a threshold value and clip them
(i.e. set them to that threshold value) if the threshold is exceeded
• Another approach (that we don't look at here) is weight normalisation
– use ℓ1 or ℓ2 norm penalisation of weight values

All Keras optimizers accept clipnorm or clipvalue arguments:
[46]: optimizer = keras.optimizers.SGD(clipvalue=1.0)

[47]: optimizer = keras.optimizers.SGD(clipnorm=1.0)
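To see what the two arguments do, here is a minimal NumPy sketch (an illustration only, not Keras's internal code): clipvalue caps each gradient component individually, which can change the gradient's direction, while clipnorm rescales the whole vector so its ℓ2 norm does not exceed the threshold, preserving the direction:

import numpy as np

grad = np.array([0.6, 4.0])        # a hypothetical gradient with one large component

# clipvalue=1.0: clip each component into [-1, 1] (direction can change)
clipped_by_value = np.clip(grad, -1.0, 1.0)

# clipnorm=1.0: rescale the whole vector if its L2 norm exceeds 1 (direction preserved)
norm = np.linalg.norm(grad)
clipped_by_norm = grad if norm <= 1.0 else grad * (1.0 / norm)

print(clipped_by_value)   # [0.6 1. ]
print(clipped_by_norm)    # [0.1483... 0.9889...]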

4.2 Reusing Pretrained Layers


4.2.1 Reusing a Keras model
Let's split the fashion MNIST training set in two:
• X_train_A: all images of all items except for sandals and shirts (classes 5 and 6).
• X_train_B: a much smaller training set of just the first 200 images of sandals or shirts.
The validation set and the test set are also split this way, but without restricting the number of
images.
We will train a model on set A (classification task with 8 classes), and try to reuse it to tackle
set B (binary classification). We hope to transfer a little bit of knowledge from task A to task B,
since classes in set A (sneakers, ankle boots, coats, t-shirts, etc.) are somewhat similar to classes in
set B (sandals and shirts). However, since we are using Dense layers, only patterns that occur at
the same location can be reused (in contrast, convolutional layers will transfer much better, since
learned patterns can be detected anywhere on the image, as we will see in the CNN chapter).

[48]: def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6)  # sandals or shirts
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2  # class indices 7, 8, 9 should be moved to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32)  # binary classification task: is it a shirt (class 6)?
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]

[49]: X_train_A.shape

[49]: (43986, 28, 28)

[50]: X_train_B.shape

[50]: (200, 28, 28)

[51]: y_train_A[:30]

[51]: array([4, 0, 5, 7, 7, 7, 4, 4, 3, 4, 0, 1, 6, 3, 4, 3, 2, 6, 5, 3, 4, 5,
1, 3, 4, 2, 0, 6, 7, 1], dtype=uint8)

[52]: y_train_B[:30]

[52]: array([1., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0., 0., 0., 0.,
0., 0., 1., 1., 0., 0., 1., 1., 0., 1., 1., 1., 1.], dtype=float32)

[53]: tf.random.set_seed(42)
np.random.seed(42)

[54]: model_A = keras.models.Sequential()
model_A.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(keras.layers.Dense(n_hidden, activation="selu"))
model_A.add(keras.layers.Dense(8, activation="softmax"))

[55]: model_A.compile(loss="sparse_categorical_crossentropy",
optimizer=keras.optimizers.SGD(learning_rate=1e-3),
metrics=["accuracy"])

[56]: %%time
history = model_A.fit(X_train_A, y_train_A, epochs=20,
validation_data=(X_valid_A, y_valid_A))

Epoch 1/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.9249 -
accuracy: 0.6994 - val_loss: 0.3896 - val_accuracy: 0.8662
Epoch 2/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.3651 -
accuracy: 0.8745 - val_loss: 0.3289 - val_accuracy: 0.8827
Epoch 3/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.3182 -
accuracy: 0.8897 - val_loss: 0.3014 - val_accuracy: 0.8986
Epoch 4/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.3049 -
accuracy: 0.8957 - val_loss: 0.2895 - val_accuracy: 0.9021
Epoch 5/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2804 -
accuracy: 0.9028 - val_loss: 0.2776 - val_accuracy: 0.9066
Epoch 6/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2701 -
accuracy: 0.9079 - val_loss: 0.2733 - val_accuracy: 0.9066
Epoch 7/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2626 -
accuracy: 0.9093 - val_loss: 0.2716 - val_accuracy: 0.9086
Epoch 8/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2609 -
accuracy: 0.9120 - val_loss: 0.2588 - val_accuracy: 0.9138
Epoch 9/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2558 -
accuracy: 0.9109 - val_loss: 0.2563 - val_accuracy: 0.9145
Epoch 10/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2512 -
accuracy: 0.9140 - val_loss: 0.2543 - val_accuracy: 0.9160
Epoch 11/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2431 -
accuracy: 0.9168 - val_loss: 0.2496 - val_accuracy: 0.9153
Epoch 12/20
1375/1375 [==============================] - 1s 1ms/step - loss: 0.2422 -
accuracy: 0.9169 - val_loss: 0.2512 - val_accuracy: 0.9128
Epoch 13/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2360 -
accuracy: 0.9182 - val_loss: 0.2444 - val_accuracy: 0.9155
Epoch 14/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2266 -
accuracy: 0.9233 - val_loss: 0.2415 - val_accuracy: 0.9175
Epoch 15/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2225 -
accuracy: 0.9239 - val_loss: 0.2446 - val_accuracy: 0.9190
Epoch 16/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2262 -
accuracy: 0.9214 - val_loss: 0.2385 - val_accuracy: 0.9195
Epoch 17/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2191 -
accuracy: 0.9250 - val_loss: 0.2412 - val_accuracy: 0.9175
Epoch 18/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2171 -
accuracy: 0.9251 - val_loss: 0.2430 - val_accuracy: 0.9153
Epoch 19/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2180 -
accuracy: 0.9247 - val_loss: 0.2329 - val_accuracy: 0.9195
Epoch 20/20
1375/1375 [==============================] - 2s 1ms/step - loss: 0.2112 -
accuracy: 0.9271 - val_loss: 0.2333 - val_accuracy: 0.9208
CPU times: user 1min 11s, sys: 16.7 s, total: 1min 28s
Wall time: 30.7 s

[57]: model_A.save("my_model_A.h5")

[58]: model_B = keras.models.Sequential()
model_B.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_B.add(keras.layers.Dense(n_hidden, activation="selu"))
model_B.add(keras.layers.Dense(1, activation="sigmoid"))

[59]: model_B.compile(loss="binary_crossentropy",
optimizer=keras.optimizers.SGD(learning_rate=1e-3),
metrics=["accuracy"])

[60]: %%time
history = model_B.fit(X_train_B, y_train_B, epochs=20,
validation_data=(X_valid_B, y_valid_B))

Epoch 1/20
7/7 [==============================] - 0s 22ms/step - loss: 1.0360 - accuracy:
0.4975 - val_loss: 0.6314 - val_accuracy: 0.6004
Epoch 2/20
7/7 [==============================] - 0s 7ms/step - loss: 0.5883 - accuracy:
0.6971 - val_loss: 0.4784 - val_accuracy: 0.8529
Epoch 3/20
7/7 [==============================] - 0s 7ms/step - loss: 0.4380 - accuracy:
0.8854 - val_loss: 0.4102 - val_accuracy: 0.8945
Epoch 4/20
7/7 [==============================] - 0s 7ms/step - loss: 0.4021 - accuracy:
0.8712 - val_loss: 0.3647 - val_accuracy: 0.9178
Epoch 5/20
7/7 [==============================] - 0s 7ms/step - loss: 0.3361 - accuracy:
0.9348 - val_loss: 0.3300 - val_accuracy: 0.9320
Epoch 6/20
7/7 [==============================] - 0s 7ms/step - loss: 0.3113 - accuracy:
0.9233 - val_loss: 0.3019 - val_accuracy: 0.9402
Epoch 7/20
7/7 [==============================] - 0s 7ms/step - loss: 0.2817 - accuracy:
0.9299 - val_loss: 0.2804 - val_accuracy: 0.9422
Epoch 8/20
7/7 [==============================] - 0s 7ms/step - loss: 0.2632 - accuracy:
0.9379 - val_loss: 0.2606 - val_accuracy: 0.9473
Epoch 9/20
7/7 [==============================] - 0s 7ms/step - loss: 0.2373 - accuracy:
0.9481 - val_loss: 0.2428 - val_accuracy: 0.9523
Epoch 10/20
7/7 [==============================] - 0s 7ms/step - loss: 0.2229 - accuracy:
0.9657 - val_loss: 0.2281 - val_accuracy: 0.9544
Epoch 11/20
7/7 [==============================] - 0s 7ms/step - loss: 0.2155 - accuracy:
0.9590 - val_loss: 0.2150 - val_accuracy: 0.9584
Epoch 12/20
7/7 [==============================] - 0s 7ms/step - loss: 0.1834 - accuracy:
0.9738 - val_loss: 0.2036 - val_accuracy: 0.9584
Epoch 13/20
7/7 [==============================] - 0s 7ms/step - loss: 0.1671 - accuracy:
0.9828 - val_loss: 0.1931 - val_accuracy: 0.9615
Epoch 14/20
7/7 [==============================] - 0s 7ms/step - loss: 0.1527 - accuracy:
0.9915 - val_loss: 0.1838 - val_accuracy: 0.9635
Epoch 15/20
7/7 [==============================] - 0s 7ms/step - loss: 0.1595 - accuracy:
0.9904 - val_loss: 0.1746 - val_accuracy: 0.9686
Epoch 16/20
7/7 [==============================] - 0s 7ms/step - loss: 0.1473 - accuracy:
0.9937 - val_loss: 0.1674 - val_accuracy: 0.9686
Epoch 17/20
7/7 [==============================] - 0s 7ms/step - loss: 0.1412 - accuracy:
0.9944 - val_loss: 0.1604 - val_accuracy: 0.9706
Epoch 18/20
7/7 [==============================] - 0s 7ms/step - loss: 0.1242 - accuracy:
0.9931 - val_loss: 0.1539 - val_accuracy: 0.9706
Epoch 19/20
7/7 [==============================] - 0s 7ms/step - loss: 0.1224 - accuracy:
0.9931 - val_loss: 0.1482 - val_accuracy: 0.9716
Epoch 20/20
7/7 [==============================] - 0s 7ms/step - loss: 0.1096 - accuracy:
0.9912 - val_loss: 0.1431 - val_accuracy: 0.9716
CPU times: user 1.87 s, sys: 261 ms, total: 2.14 s
Wall time: 1.27 s

[61]: model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten_7 (Flatten) (None, 784) 0
_________________________________________________________________
batch_normalization_3 (Batch (None, 784) 3136
_________________________________________________________________
dense_319 (Dense) (None, 300) 235200
_________________________________________________________________
batch_normalization_4 (Batch (None, 300) 1200
_________________________________________________________________
activation (Activation) (None, 300) 0
_________________________________________________________________
dense_320 (Dense) (None, 100) 30000
_________________________________________________________________
batch_normalization_5 (Batch (None, 100) 400
_________________________________________________________________
activation_1 (Activation) (None, 100) 0
_________________________________________________________________

23
dense_321 (Dense) (None, 10) 1010
=================================================================
Total params: 270,946
Trainable params: 268,578
Non-trainable params: 2,368
_________________________________________________________________

[62]: model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

[63]: model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

[64]: for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])

[65]: %%time
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(lr=1e-3),
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))

Epoch 1/4
7/7 [==============================] - 0s 22ms/step - loss: 0.6172 - accuracy:
0.6233 - val_loss: 0.5858 - val_accuracy: 0.6308
Epoch 2/4
7/7 [==============================] - 0s 7ms/step - loss: 0.5562 - accuracy:
0.6557 - val_loss: 0.5481 - val_accuracy: 0.6775
Epoch 3/4
7/7 [==============================] - 0s 7ms/step - loss: 0.4906 - accuracy:
0.7531 - val_loss: 0.5158 - val_accuracy: 0.7089
Epoch 4/4
7/7 [==============================] - 0s 6ms/step - loss: 0.4907 - accuracy:
0.7355 - val_loss: 0.4869 - val_accuracy: 0.7333
Epoch 1/16
/usr/local/lib/python3.9/site-
packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:374: UserWarning:
The `lr` argument is deprecated, use `learning_rate` instead.
warnings.warn(
7/7 [==============================] - 0s 22ms/step - loss: 0.4391 - accuracy:
0.7774 - val_loss: 0.3471 - val_accuracy: 0.8641
Epoch 2/16
7/7 [==============================] - 0s 7ms/step - loss: 0.2979 - accuracy:
0.9143 - val_loss: 0.2611 - val_accuracy: 0.9290
Epoch 3/16
7/7 [==============================] - 0s 7ms/step - loss: 0.2039 - accuracy:
0.9669 - val_loss: 0.2115 - val_accuracy: 0.9544
Epoch 4/16
7/7 [==============================] - 0s 7ms/step - loss: 0.1754 - accuracy:
0.9789 - val_loss: 0.1795 - val_accuracy: 0.9696
Epoch 5/16
7/7 [==============================] - 0s 7ms/step - loss: 0.1348 - accuracy:
0.9809 - val_loss: 0.1566 - val_accuracy: 0.9757
Epoch 6/16
7/7 [==============================] - 0s 7ms/step - loss: 0.1173 - accuracy:
0.9973 - val_loss: 0.1397 - val_accuracy: 0.9797
Epoch 7/16
7/7 [==============================] - 0s 7ms/step - loss: 0.1137 - accuracy:
0.9931 - val_loss: 0.1270 - val_accuracy: 0.9848
Epoch 8/16
7/7 [==============================] - 0s 7ms/step - loss: 0.1000 - accuracy:
0.9931 - val_loss: 0.1167 - val_accuracy: 0.9858
Epoch 9/16
7/7 [==============================] - 0s 7ms/step - loss: 0.0835 - accuracy:
1.0000 - val_loss: 0.1069 - val_accuracy: 0.9888
Epoch 10/16
7/7 [==============================] - 0s 7ms/step - loss: 0.0777 - accuracy:
1.0000 - val_loss: 0.1003 - val_accuracy: 0.9899
Epoch 11/16
7/7 [==============================] - 0s 7ms/step - loss: 0.0691 - accuracy:
1.0000 - val_loss: 0.0942 - val_accuracy: 0.9899
Epoch 12/16
7/7 [==============================] - 0s 7ms/step - loss: 0.0719 - accuracy:
1.0000 - val_loss: 0.0891 - val_accuracy: 0.9899
Epoch 13/16
7/7 [==============================] - 0s 7ms/step - loss: 0.0566 - accuracy:
1.0000 - val_loss: 0.0841 - val_accuracy: 0.9899
Epoch 14/16
7/7 [==============================] - 0s 7ms/step - loss: 0.0494 - accuracy:
1.0000 - val_loss: 0.0805 - val_accuracy: 0.9899
Epoch 15/16
7/7 [==============================] - 0s 7ms/step - loss: 0.0545 - accuracy:
1.0000 - val_loss: 0.0771 - val_accuracy: 0.9899
Epoch 16/16
7/7 [==============================] - 0s 7ms/step - loss: 0.0472 - accuracy:
1.0000 - val_loss: 0.0741 - val_accuracy: 0.9899
CPU times: user 2.24 s, sys: 273 ms, total: 2.52 s
Wall time: 1.63 s
So, what’s the final verdict?
[66]: model_B.evaluate(X_test_B, y_test_B)

63/63 [==============================] - 0s 761us/step - loss: 0.1408 -
accuracy: 0.9705

[66]: [0.1408407837152481, 0.9704999923706055]

[67]: model_B_on_A.evaluate(X_test_B, y_test_B)

63/63 [==============================] - 0s 759us/step - loss: 0.0683 -
accuracy: 0.9930

[67]: [0.0683266744017601, 0.9929999709129333]

Great! We got quite a bit of transfer: the test error rate dropped by roughly a factor of 4 (the ratio of the two error rates is computed below):
[68]: (100 - 96.95) / (100 - 99.25)

[68]: 4.066666666666663

5 Faster Optimizers
• The details are beyond the scope of the module
• Optimizers should
– be efficient
– avoid saddle points
– avoid local minima
• These desirable features are often mutually exclusive
• The design and empirical evaluation of optimizers is a large area of research

5.1 Momentum optimization


[69]: optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
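For intuition, here is a minimal NumPy sketch of the classical momentum update that this optimizer applies (a sketch only, not Keras's internal code): a velocity vector accumulates past gradients, so consistent gradients build up speed while oscillating ones partly cancel:

import numpy as np

def momentum_step(w, velocity, grad, learning_rate=0.001, momentum=0.9):
    # classical momentum: accumulate a velocity, then move the weights by it
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

# toy example: minimise f(w) = w**2, whose gradient is 2w
w, v = np.array([5.0]), np.array([0.0])
for step in range(200):
    w, v = momentum_step(w, v, grad=2 * w, learning_rate=0.05, momentum=0.9)
print(w)   # close to the minimum at 0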

5.2 Nesterov Accelerated Gradient
[70]: optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9,␣
,→nesterov=True)

5.3 AdaGrad
[71]: optimizer = keras.optimizers.Adagrad(learning_rate=0.001)

5.4 RMSProp
[72]: optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

5.5 Adam Optimization


[73]: optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

5.6 Adamax Optimization


[74]: optimizer = keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

5.7 Nadam Optimization


[75]: optimizer = keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

5.8 Learning Rate Scheduling


• The default in Keras is constant rate
• Using a schedule can speed up learning and avoid gradient issues
• Adaptive learning rates are also possible
• We'll look at performance scheduling
– the idea is to choose a rate that seems suitable for the current state of weight learning (a minimal sketch follows this list)
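A minimal sketch of performance scheduling in Keras (an illustration; the subsections below use power, exponential and piecewise-constant schedules instead, and the model and scaled data are assumed to be those defined in earlier cells): the ReduceLROnPlateau callback watches the validation loss and multiplies the learning rate by a factor whenever it stops improving:

# Hypothetical usage: halve the learning rate whenever val_loss has not improved
# for 5 consecutive epochs (reuses model / data variables from earlier cells).
lr_scheduler = keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5)
history = model.fit(X_train_scaled, y_train, epochs=25,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[lr_scheduler])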

5.8.1 Power Scheduling

lr = lr0 / (1 + steps / s)**c
• Keras uses c=1 and s = 1 / decay
[76]: optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)

[77]: model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

[78]: %%time
n_epochs = 25
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
validation_data=(X_valid_scaled, y_valid))

Epoch 1/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.5980 -
accuracy: 0.7934 - val_loss: 0.4029 - val_accuracy: 0.8592
Epoch 2/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.3831 -
accuracy: 0.8633 - val_loss: 0.3715 - val_accuracy: 0.8736
Epoch 3/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.3492 -
accuracy: 0.8774 - val_loss: 0.3752 - val_accuracy: 0.8746
Epoch 4/25
1719/1719 [==============================] - 2s 979us/step - loss: 0.3277 -
accuracy: 0.8816 - val_loss: 0.3501 - val_accuracy: 0.8800
Epoch 5/25
1719/1719 [==============================] - 2s 979us/step - loss: 0.3173 -
accuracy: 0.8859 - val_loss: 0.3446 - val_accuracy: 0.8788
Epoch 6/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.2923 -
accuracy: 0.8940 - val_loss: 0.3413 - val_accuracy: 0.8830
Epoch 7/25
1719/1719 [==============================] - 2s 998us/step - loss: 0.2871 -
accuracy: 0.8973 - val_loss: 0.3355 - val_accuracy: 0.8866
Epoch 8/25
1719/1719 [==============================] - 2s 990us/step - loss: 0.2721 -
accuracy: 0.9035 - val_loss: 0.3411 - val_accuracy: 0.8832
Epoch 9/25
1719/1719 [==============================] - 2s 981us/step - loss: 0.2729 -
accuracy: 0.9003 - val_loss: 0.3292 - val_accuracy: 0.8878
Epoch 10/25
1719/1719 [==============================] - 2s 984us/step - loss: 0.2585 -
accuracy: 0.9070 - val_loss: 0.3263 - val_accuracy: 0.8882
Epoch 11/25
1719/1719 [==============================] - 2s 993us/step - loss: 0.2529 -
accuracy: 0.9094 - val_loss: 0.3270 - val_accuracy: 0.8868
Epoch 12/25
1719/1719 [==============================] - 2s 999us/step - loss: 0.2485 -
accuracy: 0.9104 - val_loss: 0.3335 - val_accuracy: 0.8836
Epoch 13/25
1719/1719 [==============================] - 2s 980us/step - loss: 0.2419 -
accuracy: 0.9148 - val_loss: 0.3257 - val_accuracy: 0.8894
Epoch 14/25
1719/1719 [==============================] - 2s 989us/step - loss: 0.2372 -
accuracy: 0.9148 - val_loss: 0.3288 - val_accuracy: 0.8904
Epoch 15/25
1719/1719 [==============================] - 2s 983us/step - loss: 0.2363 -
accuracy: 0.9157 - val_loss: 0.3244 - val_accuracy: 0.8878
Epoch 16/25
1719/1719 [==============================] - 2s 980us/step - loss: 0.2310 -
accuracy: 0.9175 - val_loss: 0.3206 - val_accuracy: 0.8896
Epoch 17/25
1719/1719 [==============================] - 2s 979us/step - loss: 0.2235 -
accuracy: 0.9211 - val_loss: 0.3237 - val_accuracy: 0.8904
Epoch 18/25
1719/1719 [==============================] - 2s 995us/step - loss: 0.2248 -
accuracy: 0.9190 - val_loss: 0.3189 - val_accuracy: 0.8932
Epoch 19/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.2235 -
accuracy: 0.9213 - val_loss: 0.3228 - val_accuracy: 0.8902
Epoch 20/25
1719/1719 [==============================] - 2s 993us/step - loss: 0.2228 -
accuracy: 0.9225 - val_loss: 0.3210 - val_accuracy: 0.8916
Epoch 21/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.2194 -
accuracy: 0.9228 - val_loss: 0.3208 - val_accuracy: 0.8920
Epoch 22/25
1719/1719 [==============================] - 2s 984us/step - loss: 0.2164 -
accuracy: 0.9227 - val_loss: 0.3181 - val_accuracy: 0.8948
Epoch 23/25
1719/1719 [==============================] - 2s 998us/step - loss: 0.2130 -
accuracy: 0.9248 - val_loss: 0.3195 - val_accuracy: 0.8908
Epoch 24/25
1719/1719 [==============================] - 2s 987us/step - loss: 0.2077 -
accuracy: 0.9272 - val_loss: 0.3215 - val_accuracy: 0.8898
Epoch 25/25
1719/1719 [==============================] - 2s 986us/step - loss: 0.2103 -
accuracy: 0.9255 - val_loss: 0.3214 - val_accuracy: 0.8930
CPU times: user 1min 42s, sys: 24.5 s, total: 2min 7s
Wall time: 42.9 s

[79]: learning_rate = 0.01
decay = 1e-4
batch_size = 32
n_steps_per_epoch = len(X_train) // batch_size
epochs = np.arange(n_epochs)
lrs = learning_rate / (1 + decay * epochs * n_steps_per_epoch)

plt.plot(epochs, lrs, "o-")
plt.axis([0, n_epochs - 1, 0, 0.01])
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Power Scheduling", fontsize=14)
plt.grid(True)
plt.show()

5.8.2 Exponential Scheduling


lr = lr0 * 0.1**(epoch / s)
[80]: def exponential_decay_fn(epoch):
    return 0.01 * 0.1**(epoch / 20)

[81]: def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1**(epoch / s)
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

[82]: model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

n_epochs = 25

[83]: %%time
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
validation_data=(X_valid_scaled, y_valid),
callbacks=[lr_scheduler])

Epoch 1/25
1719/1719 [==============================] - 4s 2ms/step - loss: 1.1350 -
accuracy: 0.7298 - val_loss: 1.0083 - val_accuracy: 0.7442
Epoch 2/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.6934 -
accuracy: 0.7903 - val_loss: 0.5700 - val_accuracy: 0.8232
Epoch 3/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5910 -
accuracy: 0.8155 - val_loss: 0.6708 - val_accuracy: 0.8084
Epoch 4/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5455 -
accuracy: 0.8299 - val_loss: 0.5847 - val_accuracy: 0.8410
Epoch 5/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.5145 -
accuracy: 0.8355 - val_loss: 0.5314 - val_accuracy: 0.8590
Epoch 6/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.4288 -
accuracy: 0.8623 - val_loss: 0.5024 - val_accuracy: 0.8612
Epoch 7/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.4253 -
accuracy: 0.8658 - val_loss: 0.6026 - val_accuracy: 0.8406
Epoch 8/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.3863 -
accuracy: 0.8736 - val_loss: 0.5721 - val_accuracy: 0.8342
Epoch 9/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.3644 -
accuracy: 0.8785 - val_loss: 0.4727 - val_accuracy: 0.8616
Epoch 10/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.3240 -
accuracy: 0.8905 - val_loss: 0.4730 - val_accuracy: 0.8732
Epoch 11/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2990 -
accuracy: 0.8968 - val_loss: 0.4244 - val_accuracy: 0.8736
Epoch 12/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2759 -
accuracy: 0.9037 - val_loss: 0.4435 - val_accuracy: 0.8666
Epoch 13/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2609 -
accuracy: 0.9111 - val_loss: 0.5508 - val_accuracy: 0.8678
Epoch 14/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2503 -
accuracy: 0.9160 - val_loss: 0.4832 - val_accuracy: 0.8632
Epoch 15/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2353 -
accuracy: 0.9182 - val_loss: 0.4600 - val_accuracy: 0.8838
Epoch 16/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.2139 -
accuracy: 0.9276 - val_loss: 0.4832 - val_accuracy: 0.8898
Epoch 17/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.1895 -
accuracy: 0.9346 - val_loss: 0.4910 - val_accuracy: 0.8878
Epoch 18/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.1777 -
accuracy: 0.9388 - val_loss: 0.4717 - val_accuracy: 0.8902
Epoch 19/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.1704 -
accuracy: 0.9440 - val_loss: 0.4908 - val_accuracy: 0.8934
Epoch 20/25
1719/1719 [==============================] - 4s 3ms/step - loss: 0.1638 -
accuracy: 0.9450 - val_loss: 0.4904 - val_accuracy: 0.8916
Epoch 21/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.1519 -
accuracy: 0.9481 - val_loss: 0.5197 - val_accuracy: 0.8964
Epoch 22/25
1719/1719 [==============================] - 4s 2ms/step - loss: 0.1432 -
accuracy: 0.9519 - val_loss: 0.5392 - val_accuracy: 0.8856
Epoch 23/25
1719/1719 [==============================] - 4s 3ms/step - loss: 0.1336 -
accuracy: 0.9552 - val_loss: 0.5704 - val_accuracy: 0.8912
Epoch 24/25
1719/1719 [==============================] - 4s 3ms/step - loss: 0.1253 -
accuracy: 0.9591 - val_loss: 0.5957 - val_accuracy: 0.8942
Epoch 25/25
1719/1719 [==============================] - 4s 3ms/step - loss: 0.1209 -
accuracy: 0.9614 - val_loss: 0.6145 - val_accuracy: 0.8930
CPU times: user 6min 9s, sys: 2min 39s, total: 8min 48s
Wall time: 1min 43s

[84]: plt.plot(history.epoch, history.history["lr"], "o-")
plt.axis([0, n_epochs - 1, 0, 0.011])
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Exponential Scheduling", fontsize=14)
plt.grid(True)
plt.show()

The schedule function can take the current learning rate as a second argument:
[85]: def exponential_decay_fn(epoch, lr):
    return lr * 0.1**(1 / 20)

If you want to update the learning rate at each iteration rather than at each epoch, you must write
your own callback class:
[86]: K = keras.backend

class ExponentialDecay(keras.callbacks.Callback):
    def __init__(self, s=40000):
        super().__init__()
        self.s = s

    def on_batch_begin(self, batch, logs=None):
        # Note: the `batch` argument is reset at each epoch
        lr = K.get_value(self.model.optimizer.lr)
        K.set_value(self.model.optimizer.lr, lr * 0.1**(1 / self.s))

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        logs['lr'] = K.get_value(self.model.optimizer.lr)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])
lr0 = 0.01
optimizer = keras.optimizers.Nadam(lr=lr0)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

n_epochs = 25

/usr/local/lib/python3.9/site-
packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:374: UserWarning:
The `lr` argument is deprecated, use `learning_rate` instead.
warnings.warn(

[87]: s = 20 * len(X_train) // 32  # number of steps in 20 epochs (batch size = 32)

exp_decay = ExponentialDecay(s)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[exp_decay])

Epoch 1/25
1719/1719 [==============================] - 5s 3ms/step - loss: 1.1034 -
accuracy: 0.7405 - val_loss: 1.1014 - val_accuracy: 0.6896
Epoch 2/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.6554 -
accuracy: 0.7999 - val_loss: 0.6157 - val_accuracy: 0.8188
Epoch 3/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.5847 -
accuracy: 0.8169 - val_loss: 0.6588 - val_accuracy: 0.7998
Epoch 4/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.5607 -
accuracy: 0.8289 - val_loss: 0.5370 - val_accuracy: 0.8464
Epoch 5/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.5004 -
accuracy: 0.8451 - val_loss: 0.4278 - val_accuracy: 0.8660
Epoch 6/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.3996 -
accuracy: 0.8705 - val_loss: 0.4836 - val_accuracy: 0.8630
Epoch 7/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.3862 -
accuracy: 0.8766 - val_loss: 0.4362 - val_accuracy: 0.8594
Epoch 8/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.3457 -
accuracy: 0.8826 - val_loss: 0.4961 - val_accuracy: 0.8526
Epoch 9/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.3308 -
accuracy: 0.8902 - val_loss: 0.4187 - val_accuracy: 0.8666
Epoch 10/25
1719/1719 [==============================] - 6s 3ms/step - loss: 0.3019 -
accuracy: 0.8984 - val_loss: 0.4183 - val_accuracy: 0.8812
Epoch 11/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.2712 -
accuracy: 0.9066 - val_loss: 0.4769 - val_accuracy: 0.8720
Epoch 12/25
1719/1719 [==============================] - 6s 3ms/step - loss: 0.2563 -
accuracy: 0.9118 - val_loss: 0.4167 - val_accuracy: 0.8738
Epoch 13/25
1719/1719 [==============================] - 6s 3ms/step - loss: 0.2395 -
accuracy: 0.9171 - val_loss: 0.4461 - val_accuracy: 0.8868
Epoch 14/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.2223 -
accuracy: 0.9248 - val_loss: 0.4119 - val_accuracy: 0.8858
Epoch 15/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.2076 -
accuracy: 0.9285 - val_loss: 0.4286 - val_accuracy: 0.8832
Epoch 16/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.1905 -
accuracy: 0.9344 - val_loss: 0.4611 - val_accuracy: 0.8856
Epoch 17/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.1715 -
accuracy: 0.9414 - val_loss: 0.4713 - val_accuracy: 0.8836
Epoch 18/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.1606 -
accuracy: 0.9461 - val_loss: 0.4923 - val_accuracy: 0.8862
Epoch 19/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.1533 -
accuracy: 0.9493 - val_loss: 0.4746 - val_accuracy: 0.8908
Epoch 20/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.1421 -
accuracy: 0.9539 - val_loss: 0.4961 - val_accuracy: 0.8874
Epoch 21/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.1351 -
accuracy: 0.9560 - val_loss: 0.5376 - val_accuracy: 0.8880
Epoch 22/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.1233 -
accuracy: 0.9604 - val_loss: 0.5715 - val_accuracy: 0.8930
Epoch 23/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.1144 -
accuracy: 0.9633 - val_loss: 0.5564 - val_accuracy: 0.8906
Epoch 24/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.1069 -
accuracy: 0.9655 - val_loss: 0.5917 - val_accuracy: 0.8906
Epoch 25/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.1043 -
accuracy: 0.9666 - val_loss: 0.6143 - val_accuracy: 0.8928

[88]: n_steps = n_epochs * len(X_train) // 32
steps = np.arange(n_steps)
lrs = lr0 * 0.1**(steps / s)

[89]: plt.plot(steps, lrs, "-", linewidth=2)
plt.axis([0, n_steps - 1, 0, lr0 * 1.1])
plt.xlabel("Batch")
plt.ylabel("Learning Rate")
plt.title("Exponential Scheduling (per batch)", fontsize=14)
plt.grid(True)
plt.show()

5.8.3 Piecewise Constant Scheduling

[90]: def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001

[91]: def piecewise_constant(boundaries, values):
    boundaries = np.array([0] + boundaries)
    values = np.array(values)
    def piecewise_constant_fn(epoch):
        return values[np.argmax(boundaries > epoch) - 1]
    return piecewise_constant_fn

piecewise_constant_fn = piecewise_constant([5, 15], [0.01, 0.005, 0.001])

[92]: %%time
lr_scheduler = keras.callbacks.LearningRateScheduler(piecewise_constant_fn)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

n_epochs = 25
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[lr_scheduler])

Epoch 1/25
1719/1719 [==============================] - 6s 3ms/step - loss: 1.1657 -
accuracy: 0.7315 - val_loss: 0.8066 - val_accuracy: 0.7212
Epoch 2/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.7137 -
accuracy: 0.7839 - val_loss: 0.8368 - val_accuracy: 0.7528
Epoch 3/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.8323 -
accuracy: 0.7709 - val_loss: 1.0008 - val_accuracy: 0.6874
Epoch 4/25
1719/1719 [==============================] - 5s 3ms/step - loss: 1.0637 -
accuracy: 0.6757 - val_loss: 1.3762 - val_accuracy: 0.6172
Epoch 5/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.9993 -
accuracy: 0.6573 - val_loss: 1.1529 - val_accuracy: 0.5702
Epoch 6/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.9075 -
accuracy: 0.6371 - val_loss: 0.8080 - val_accuracy: 0.7176
Epoch 7/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.7330 -
accuracy: 0.7341 - val_loss: 0.6843 - val_accuracy: 0.7420
Epoch 8/25
1719/1719 [==============================] - 6s 3ms/step - loss: 0.6619 -
accuracy: 0.7560 - val_loss: 0.7742 - val_accuracy: 0.7308
Epoch 9/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.6794 -
accuracy: 0.7506 - val_loss: 0.6803 - val_accuracy: 0.7514
Epoch 10/25
1719/1719 [==============================] - 6s 3ms/step - loss: 0.6079 -
accuracy: 0.7689 - val_loss: 0.7341 - val_accuracy: 0.7580
Epoch 11/25
1719/1719 [==============================] - 6s 3ms/step - loss: 0.5980 -
accuracy: 0.7747 - val_loss: 0.8540 - val_accuracy: 0.7460
Epoch 12/25
1719/1719 [==============================] - 6s 3ms/step - loss: 0.6234 -
accuracy: 0.7746 - val_loss: 0.7887 - val_accuracy: 0.7626
Epoch 13/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.5759 -
accuracy: 0.7806 - val_loss: 0.7147 - val_accuracy: 0.7686
Epoch 14/25
1719/1719 [==============================] - 6s 3ms/step - loss: 0.6123 -
accuracy: 0.7703 - val_loss: 0.7174 - val_accuracy: 0.7562
Epoch 15/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.5889 -
accuracy: 0.7745 - val_loss: 0.7534 - val_accuracy: 0.7694
Epoch 16/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4829 -
accuracy: 0.7983 - val_loss: 0.6005 - val_accuracy: 0.8412
Epoch 17/25
1719/1719 [==============================] - 6s 4ms/step - loss: 0.3853 -
accuracy: 0.8753 - val_loss: 0.5423 - val_accuracy: 0.8534
Epoch 18/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3593 -
accuracy: 0.8871 - val_loss: 0.4990 - val_accuracy: 0.8626
Epoch 19/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3320 -
accuracy: 0.8961 - val_loss: 0.5499 - val_accuracy: 0.8588
Epoch 20/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.3187 -
accuracy: 0.8994 - val_loss: 0.5222 - val_accuracy: 0.8700
Epoch 21/25
1719/1719 [==============================] - 5s 3ms/step - loss: 0.3030 -
accuracy: 0.9015 - val_loss: 0.5297 - val_accuracy: 0.8768
Epoch 22/25
1719/1719 [==============================] - 6s 3ms/step - loss: 0.2926 -
accuracy: 0.9072 - val_loss: 0.5465 - val_accuracy: 0.8784
Epoch 23/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2748 -
accuracy: 0.9142 - val_loss: 0.5298 - val_accuracy: 0.8742
Epoch 24/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2653 -
accuracy: 0.9178 - val_loss: 0.5602 - val_accuracy: 0.8774
Epoch 25/25
1719/1719 [==============================] - 7s 4ms/step - loss: 0.2589 -
accuracy: 0.9187 - val_loss: 0.6008 - val_accuracy: 0.8786
CPU times: user 6min 17s, sys: 3min 24s, total: 9min 42s
Wall time: 2min 27s

[93]: plt.plot(history.epoch, [piecewise_constant_fn(epoch) for epoch in history.epoch], "o-")
plt.axis([0, n_epochs - 1, 0, 0.011])
plt.xlabel("Epoch")
plt.ylabel("Learning Rate")
plt.title("Piecewise Constant Scheduling", fontsize=14)
plt.grid(True)
plt.show()

5.8.4 Performance Scheduling

[94]: tf.random.set_seed(42)
np.random.seed(42)

[95]: %%time
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])
optimizer = keras.optimizers.SGD(learning_rate=0.02, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

n_epochs = 25
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[lr_scheduler])

Epoch 1/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.7109 -
accuracy: 0.7772 - val_loss: 0.4637 - val_accuracy: 0.8510
Epoch 2/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4888 -
accuracy: 0.8403 - val_loss: 0.5779 - val_accuracy: 0.8368
Epoch 3/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4976 -
accuracy: 0.8437 - val_loss: 0.5128 - val_accuracy: 0.8538
Epoch 4/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4981 -
accuracy: 0.8495 - val_loss: 0.5305 - val_accuracy: 0.8484
Epoch 5/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.5240 -
accuracy: 0.8470 - val_loss: 0.5335 - val_accuracy: 0.8396
Epoch 6/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.5144 -
accuracy: 0.8541 - val_loss: 0.5877 - val_accuracy: 0.8492
Epoch 7/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.3364 -
accuracy: 0.8852 - val_loss: 0.4007 - val_accuracy: 0.8746
Epoch 8/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.2494 -
accuracy: 0.9074 - val_loss: 0.4152 - val_accuracy: 0.8704
Epoch 9/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.2369 -
accuracy: 0.9111 - val_loss: 0.3974 - val_accuracy: 0.8870
Epoch 10/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.2164 -
accuracy: 0.9195 - val_loss: 0.3922 - val_accuracy: 0.8892
Epoch 11/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.1965 -
accuracy: 0.9264 - val_loss: 0.4122 - val_accuracy: 0.8814
Epoch 12/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.1906 -
accuracy: 0.9283 - val_loss: 0.4515 - val_accuracy: 0.8752
Epoch 13/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.1834 -
accuracy: 0.9332 - val_loss: 0.4425 - val_accuracy: 0.8838
Epoch 14/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.1747 -
accuracy: 0.9332 - val_loss: 0.4461 - val_accuracy: 0.8824
Epoch 15/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.1758 -
accuracy: 0.9336 - val_loss: 0.4602 - val_accuracy: 0.8830
Epoch 16/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.1220 -
accuracy: 0.9515 - val_loss: 0.4352 - val_accuracy: 0.8964
Epoch 17/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.0952 -
accuracy: 0.9632 - val_loss: 0.4543 - val_accuracy: 0.8894
Epoch 18/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.0900 -
accuracy: 0.9651 - val_loss: 0.4574 - val_accuracy: 0.8932
Epoch 19/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.0824 -
accuracy: 0.9680 - val_loss: 0.4707 - val_accuracy: 0.8962
Epoch 20/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.0761 -
accuracy: 0.9703 - val_loss: 0.5006 - val_accuracy: 0.8960
Epoch 21/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.0621 -
accuracy: 0.9777 - val_loss: 0.5000 - val_accuracy: 0.8992
Epoch 22/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.0533 -
accuracy: 0.9800 - val_loss: 0.5170 - val_accuracy: 0.8976
Epoch 23/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.0505 -
accuracy: 0.9817 - val_loss: 0.5193 - val_accuracy: 0.8986
Epoch 24/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.0453 -
accuracy: 0.9841 - val_loss: 0.5368 - val_accuracy: 0.8964
Epoch 25/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.0428 -
accuracy: 0.9851 - val_loss: 0.5464 - val_accuracy: 0.8994
CPU times: user 2min 2s, sys: 31 s, total: 2min 33s
Wall time: 47.9 s

[96]: plt.plot(history.epoch, history.history["lr"], "bo-")
plt.xlabel("Epoch")
plt.ylabel("Learning Rate", color='b')
plt.tick_params('y', colors='b')
plt.gca().set_xlim(0, n_epochs - 1)
plt.grid(True)

ax2 = plt.gca().twinx()
ax2.plot(history.epoch, history.history["val_loss"], "r^-")
ax2.set_ylabel('Validation Loss', color='r')
ax2.tick_params('y', colors='r')

plt.title("Reduce LR on Plateau", fontsize=14)
plt.show()

5.8.5 tf.keras schedulers


• class ExponentialDecay: A LearningRateSchedule that uses an exponential decay schedule.
• class InverseTimeDecay: A LearningRateSchedule that uses an inverse time decay schedule.
• class LearningRateSchedule: A serializable learning rate decay schedule.
• class PiecewiseConstantDecay: A LearningRateSchedule that uses a piecewise constant decay schedule.
• class PolynomialDecay: A LearningRateSchedule that uses a polynomial decay schedule.
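Any of these schedule classes can be passed to an optimizer in place of a fixed learning rate. As a minimal sketch, InverseTimeDecay could be wired in as below (the initial rate and decay settings here are illustrative assumptions, not tuned values); the cell that follows shows the same idea with ExponentialDecay:

[ ]: # sketch only: the decay settings are assumed, illustrative values
lr_schedule = keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.01,
    decay_steps=20 * len(X_train) // 32,  # decay over roughly 20 epochs (batch size = 32)
    decay_rate=1.0)
optimizer = keras.optimizers.SGD(lr_schedule)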
[97]: %%time
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])
s = 20 * len(X_train) // 32  # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

n_epochs = 25
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))

Epoch 1/25
1719/1719 [==============================] - 2s 1ms/step - loss: 0.5995 -
accuracy: 0.7923 - val_loss: 0.4092 - val_accuracy: 0.8604
Epoch 2/25
1719/1719 [==============================] - 2s 984us/step - loss: 0.3890 -
accuracy: 0.8614 - val_loss: 0.3739 - val_accuracy: 0.8692
Epoch 3/25
1719/1719 [==============================] - 2s 994us/step - loss: 0.3531 -
accuracy: 0.8776 - val_loss: 0.3733 - val_accuracy: 0.8686
Epoch 4/25
1719/1719 [==============================] - 2s 987us/step - loss: 0.3297 -
accuracy: 0.8815 - val_loss: 0.3493 - val_accuracy: 0.8802
Epoch 5/25
1719/1719 [==============================] - 2s 993us/step - loss: 0.3177 -
accuracy: 0.8868 - val_loss: 0.3430 - val_accuracy: 0.8794
Epoch 6/25
1719/1719 [==============================] - 2s 988us/step - loss: 0.2929 -
accuracy: 0.8953 - val_loss: 0.3415 - val_accuracy: 0.8808
Epoch 7/25
1719/1719 [==============================] - 2s 992us/step - loss: 0.2854 -
accuracy: 0.8988 - val_loss: 0.3354 - val_accuracy: 0.8818
Epoch 8/25
1719/1719 [==============================] - 2s 988us/step - loss: 0.2713 -
accuracy: 0.9037 - val_loss: 0.3364 - val_accuracy: 0.8818
Epoch 9/25
1719/1719 [==============================] - 2s 995us/step - loss: 0.2714 -
accuracy: 0.9042 - val_loss: 0.3264 - val_accuracy: 0.8854
Epoch 10/25
1719/1719 [==============================] - 2s 996us/step - loss: 0.2570 -
accuracy: 0.9084 - val_loss: 0.3240 - val_accuracy: 0.8848
Epoch 11/25
1719/1719 [==============================] - 2s 983us/step - loss: 0.2501 -
accuracy: 0.9116 - val_loss: 0.3252 - val_accuracy: 0.8858
Epoch 12/25
1719/1719 [==============================] - 2s 978us/step - loss: 0.2452 -
accuracy: 0.9147 - val_loss: 0.3300 - val_accuracy: 0.8810
Epoch 13/25
1719/1719 [==============================] - 2s 984us/step - loss: 0.2409 -
accuracy: 0.9153 - val_loss: 0.3217 - val_accuracy: 0.8864
Epoch 14/25
1719/1719 [==============================] - 2s 977us/step - loss: 0.2379 -
accuracy: 0.9154 - val_loss: 0.3220 - val_accuracy: 0.8868
Epoch 15/25
1719/1719 [==============================] - 2s 977us/step - loss: 0.2377 -
accuracy: 0.9165 - val_loss: 0.3207 - val_accuracy: 0.8874
Epoch 16/25
1719/1719 [==============================] - 2s 992us/step - loss: 0.2317 -
accuracy: 0.9192 - val_loss: 0.3182 - val_accuracy: 0.8890
Epoch 17/25
1719/1719 [==============================] - 2s 990us/step - loss: 0.2265 -
accuracy: 0.9215 - val_loss: 0.3194 - val_accuracy: 0.8892
Epoch 18/25
1719/1719 [==============================] - 2s 978us/step - loss: 0.2284 -
accuracy: 0.9184 - val_loss: 0.3166 - val_accuracy: 0.8906
Epoch 19/25
1719/1719 [==============================] - 2s 981us/step - loss: 0.2285 -
accuracy: 0.9202 - val_loss: 0.3194 - val_accuracy: 0.8884
Epoch 20/25
1719/1719 [==============================] - 2s 991us/step - loss: 0.2287 -
accuracy: 0.9215 - val_loss: 0.3167 - val_accuracy: 0.8894
Epoch 21/25
1719/1719 [==============================] - 2s 991us/step - loss: 0.2265 -
accuracy: 0.9209 - val_loss: 0.3177 - val_accuracy: 0.8910
Epoch 22/25
1719/1719 [==============================] - 2s 993us/step - loss: 0.2257 -
accuracy: 0.9201 - val_loss: 0.3161 - val_accuracy: 0.8912
Epoch 23/25
1719/1719 [==============================] - 2s 985us/step - loss: 0.2222 -
accuracy: 0.9228 - val_loss: 0.3169 - val_accuracy: 0.8896
Epoch 24/25
1719/1719 [==============================] - 2s 986us/step - loss: 0.2181 -
accuracy: 0.9245 - val_loss: 0.3164 - val_accuracy: 0.8906
Epoch 25/25
1719/1719 [==============================] - 2s 996us/step - loss: 0.2222 -
accuracy: 0.9230 - val_loss: 0.3163 - val_accuracy: 0.8900
CPU times: user 1min 42s, sys: 24.1 s, total: 2min 7s
Wall time: 42.8 s
For piecewise constant scheduling, try this:

[98]: learning_rate = keras.optimizers.schedules.PiecewiseConstantDecay(
boundaries=[5. * n_steps_per_epoch, 15. * n_steps_per_epoch],
values=[0.01, 0.005, 0.001])
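Note that n_steps_per_epoch is not defined anywhere in this notebook; a minimal, self-contained sketch of how the schedule might be wired up, assuming the batch size of 32 used throughout:

[ ]: # sketch only: n_steps_per_epoch is an assumption based on the batch size of 32 used above
n_steps_per_epoch = len(X_train) // 32
learning_rate = keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[5. * n_steps_per_epoch, 15. * n_steps_per_epoch],
    values=[0.01, 0.005, 0.001])
optimizer = keras.optimizers.SGD(learning_rate)  # the schedule replaces a fixed learning rate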

6 Avoiding Overfitting Through Regularization


6.1 ℓ1 and ℓ2 regularization
[99]: layer = keras.layers.Dense(100, activation="elu",
                                 kernel_initializer="he_normal",
                                 kernel_regularizer=keras.regularizers.l2(0.01))
# or l1(0.1) for ℓ1 regularization with a factor of 0.1
# or l1_l2(0.1, 0.01) for both ℓ1 and ℓ2 regularization, with factors 0.1 and 0.01 respectively

[100]: %%time
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="elu",
                       kernel_initializer="he_normal",
                       kernel_regularizer=keras.regularizers.l2(0.01)),
    keras.layers.Dense(100, activation="elu",
                       kernel_initializer="he_normal",
                       kernel_regularizer=keras.regularizers.l2(0.01)),
    keras.layers.Dense(10, activation="softmax",
                       kernel_regularizer=keras.regularizers.l2(0.01))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))

Epoch 1/2
1719/1719 [==============================] - 6s 3ms/step - loss: 3.1835 -
accuracy: 0.7944 - val_loss: 0.7197 - val_accuracy: 0.8302
Epoch 2/2
1719/1719 [==============================] - 6s 3ms/step - loss: 0.7294 -
accuracy: 0.8244 - val_loss: 0.6862 - val_accuracy: 0.8360
CPU times: user 37.3 s, sys: 18.3 s, total: 55.6 s
Wall time: 12.4 s

[101]: %%time
from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))

Epoch 1/2
1719/1719 [==============================] - 6s 3ms/step - loss: 3.3555 -
accuracy: 0.7931 - val_loss: 0.7201 - val_accuracy: 0.8308
Epoch 2/2
1719/1719 [==============================] - 5s 3ms/step - loss: 0.7282 -
accuracy: 0.8235 - val_loss: 0.6806 - val_accuracy: 0.8402
CPU times: user 37.1 s, sys: 16.9 s, total: 54.1 s
Wall time: 11.6 s

6.2 Dropout
• During training, each input value and/or weighted sum is randomly zeroed with probability p
• The idea is to reduce overfitting
• It would be great if we could average the predictions from thousands of neural nets
  – as done in Random Forests
• But this would be too expensive computationally
• So we mimic the randomness introduced in ensemble methods by dropping values here and there (a minimal NumPy sketch follows this list)
• As always, its use depends on the data, the architecture and expert judgement
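To make the mechanism concrete, here is a minimal NumPy sketch of (inverted) dropout; it is an illustration only, not the actual Keras implementation:

[ ]: import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, p=0.2, training=True):
    # zero each unit with probability p during training and rescale the
    # survivors by 1/(1 - p) so the expected activation is unchanged
    if not training:
        return activations                        # at test time the layer does nothing
    mask = rng.random(activations.shape) >= p     # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)

a = rng.normal(size=(3, 5))
print(dropout(a, p=0.2))

At test time the function returns its input unchanged, which is exactly how keras.layers.Dropout behaves when training=False.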
[103]: %%time
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))

Epoch 1/2
1719/1719 [==============================] - 7s 4ms/step - loss: 0.7216 -
accuracy: 0.7648 - val_loss: 0.3605 - val_accuracy: 0.8646
Epoch 2/2
1719/1719 [==============================] - 6s 4ms/step - loss: 0.4274 -
accuracy: 0.8413 - val_loss: 0.3425 - val_accuracy: 0.8698
CPU times: user 31.5 s, sys: 16.4 s, total: 47.9 s
Wall time: 13.3 s

6.3 Alpha Dropout


• Alpha dropout keeps the mean and variance of its inputs at their original values, so the self-normalizing property of SELU networks is preserved even after a dropout event (see the quick check below)
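As a quick, illustrative check (assuming standard-normal inputs, which is what the scaled data approximates), AlphaDropout should leave the mean near 0 and the standard deviation near 1 even with dropout active:

[ ]: # illustrative check only, using randomly generated standard-normal inputs
import tensorflow as tf
from tensorflow import keras

x = tf.random.normal((10000, 100))                             # mean ~0, std ~1
y = keras.layers.AlphaDropout(rate=0.2)(x, training=True)
print(float(tf.reduce_mean(y)), float(tf.math.reduce_std(y)))  # both should stay close to 0 and 1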
[104]: tf.random.set_seed(42)
np.random.seed(42)

[105]: %%time
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.AlphaDropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
optimizer = keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

n_epochs = 20
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))

/usr/local/lib/python3.9/site-
packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:374: UserWarning:
The `lr` argument is deprecated, use `learning_rate` instead.
warnings.warn(
Epoch 1/20
1719/1719 [==============================] - 3s 2ms/step - loss: 0.8023 -
accuracy: 0.7146 - val_loss: 0.5783 - val_accuracy: 0.8444
Epoch 2/20
1719/1719 [==============================] - 2s 1ms/step - loss: 0.5664 -
accuracy: 0.7901 - val_loss: 0.5205 - val_accuracy: 0.8508
Epoch 3/20
1719/1719 [==============================] - 3s 1ms/step - loss: 0.5263 -
accuracy: 0.8054 - val_loss: 0.4876 - val_accuracy: 0.8616
Epoch 4/20
1719/1719 [==============================] - 3s 1ms/step - loss: 0.5120 -
accuracy: 0.8094 - val_loss: 0.4800 - val_accuracy: 0.8616
Epoch 5/20
1719/1719 [==============================] - 3s 1ms/step - loss: 0.5069 -
accuracy: 0.8124 - val_loss: 0.4243 - val_accuracy: 0.8696
Epoch 6/20
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4792 -
accuracy: 0.8202 - val_loss: 0.4637 - val_accuracy: 0.8636
Epoch 7/20
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4726 -
accuracy: 0.8264 - val_loss: 0.4721 - val_accuracy: 0.8606
Epoch 8/20
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4569 -
accuracy: 0.8302 - val_loss: 0.4181 - val_accuracy: 0.8686
Epoch 9/20
1719/1719 [==============================] - 3s 1ms/step - loss: 0.4627 -
accuracy: 0.8274 - val_loss: 0.4336 - val_accuracy: 0.8744
Epoch 10/20
1719/1719 [==============================] - 3s 1ms/step - loss: 0.4542 -
accuracy: 0.8339 - val_loss: 0.4328 - val_accuracy: 0.8648
Epoch 11/20
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4471 -
accuracy: 0.8323 - val_loss: 0.4209 - val_accuracy: 0.8720
Epoch 12/20
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4415 -
accuracy: 0.8363 - val_loss: 0.5043 - val_accuracy: 0.8562
Epoch 13/20
1719/1719 [==============================] - 3s 1ms/step - loss: 0.4337 -
accuracy: 0.8398 - val_loss: 0.4415 - val_accuracy: 0.8740
Epoch 14/20
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4319 -
accuracy: 0.8390 - val_loss: 0.4405 - val_accuracy: 0.8686
Epoch 15/20
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4311 -
accuracy: 0.8399 - val_loss: 0.4470 - val_accuracy: 0.8690
Epoch 16/20
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4261 -
accuracy: 0.8398 - val_loss: 0.4096 - val_accuracy: 0.8804
Epoch 17/20
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4213 -
accuracy: 0.8432 - val_loss: 0.5317 - val_accuracy: 0.8580
Epoch 18/20
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4347 -
accuracy: 0.8399 - val_loss: 0.4837 - val_accuracy: 0.8680
Epoch 19/20
1719/1719 [==============================] - 3s 1ms/step - loss: 0.4276 -
accuracy: 0.8415 - val_loss: 0.4674 - val_accuracy: 0.8734
Epoch 20/20
1719/1719 [==============================] - 2s 1ms/step - loss: 0.4199 -
accuracy: 0.8419 - val_loss: 0.4521 - val_accuracy: 0.8744
CPU times: user 2min 5s, sys: 31.3 s, total: 2min 36s
Wall time: 50.4 s

[106]: model.evaluate(X_test_scaled, y_test)

313/313 [==============================] - 0s 665us/step - loss: 0.4833 -


accuracy: 0.8587

[106]: [0.48326751589775085, 0.8586999773979187]

[107]: model.evaluate(X_train_scaled, y_train)

1719/1719 [==============================] - 1s 669us/step - loss: 0.3603 -


accuracy: 0.8821

[107]: [0.36031341552734375, 0.8821091055870056]

[108]: history = model.fit(X_train_scaled, y_train)

1719/1719 [==============================] - 2s 1ms/step - loss: 0.4219 -


accuracy: 0.8437

6.4 MC Dropout
• For Monte Carlo (MC) dropout, dropout is kept active at test time as well as during training
• The justification comes from Bayesian statistics: averaging many stochastic forward passes approximates averaging over an ensemble of models
[109]: tf.random.set_seed(42)
np.random.seed(42)

[110]: y_probas = np.stack([model(X_test_scaled, training=True)
                            for sample in range(100)])
y_proba = y_probas.mean(axis=0)
y_std = y_probas.std(axis=0)

[111]: np.round(model.predict(X_test_scaled[:1]), 2)

[111]: array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.99]],
dtype=float32)

[112]: np.round(y_probas[:, :1], 2)

[112]: array([[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.45, 0. , 0.54]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.93, 0. , 0.05]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0. , 0. , 0.98]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.12, 0. , 0.87]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.48, 0. , 0.52]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.42, 0. , 0.57]],

[[0. , 0. , 0. , 0. , 0. , 0.07, 0. , 0.39, 0. , 0.54]],

[[0. , 0. , 0. , 0. , 0. , 0.1 , 0. , 0.14, 0. , 0.76]],

[[0. , 0. , 0. , 0. , 0. , 0.07, 0. , 0.03, 0. , 0.9 ]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.99]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.11, 0. , 0.88]],

[[0. , 0. , 0. , 0. , 0. , 0.04, 0. , 0.14, 0. , 0.82]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.18, 0. , 0.81]],

[[0. , 0. , 0. , 0. , 0. , 0.14, 0. , 0.1 , 0. , 0.75]],

[[0. , 0. , 0. , 0. , 0. , 0.06, 0. , 0.15, 0. , 0.79]],

[[0. , 0. , 0. , 0. , 0. , 0.12, 0. , 0.07, 0. , 0.81]],

[[0. , 0. , 0. , 0. , 0. , 0.04, 0. , 0. , 0. , 0.96]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.54, 0. , 0.44]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.5 , 0. , 0.49]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.02, 0. , 0.96]],

[[0. , 0. , 0. , 0. , 0. , 0.95, 0. , 0. , 0. , 0.05]],

[[0. , 0. , 0. , 0. , 0. , 0.03, 0. , 0.04, 0. , 0.93]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. ]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.1 , 0. , 0.9 ]],

[[0. , 0. , 0. , 0. , 0. , 0.04, 0. , 0.06, 0. , 0.89]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.74, 0. , 0.23]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.06, 0. , 0.92]],

[[0. , 0. , 0. , 0. , 0. , 0.13, 0. , 0.22, 0. , 0.65]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.04, 0. , 0.94]],

[[0. , 0. , 0. , 0. , 0. , 0.28, 0. , 0.46, 0. , 0.26]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.99]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. ]],

[[0. , 0. , 0. , 0. , 0. , 0.35, 0. , 0.03, 0. , 0.62]],

[[0. , 0. , 0. , 0. , 0. , 0.11, 0. , 0.83, 0. , 0.06]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.16, 0. , 0.81]],

[[0. , 0. , 0. , 0. , 0. , 0.05, 0. , 0.74, 0. , 0.22]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.98]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.2 , 0. , 0.78]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.22, 0. , 0.76]],

[[0. , 0. , 0. , 0. , 0. , 0.73, 0. , 0.02, 0. , 0.24]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.07, 0. , 0.92]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.98]],

[[0. , 0. , 0. , 0. , 0. , 0.26, 0. , 0.22, 0. , 0.52]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.99]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.99]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.98]],

[[0. , 0. , 0. , 0. , 0. , 0.3 , 0. , 0.16, 0. , 0.54]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.99]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.29, 0. , 0.71]],

[[0. , 0. , 0. , 0. , 0. , 0.03, 0. , 0.2 , 0. , 0.77]],

[[0. , 0. , 0. , 0. , 0. , 0.16, 0. , 0.35, 0. , 0.49]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.7 , 0. , 0.28]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.99]],

[[0. , 0. , 0. , 0. , 0. , 0.06, 0. , 0.07, 0. , 0.87]],

[[0. , 0. , 0. , 0. , 0. , 0.06, 0. , 0. , 0. , 0.94]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.08, 0. , 0.92]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.27, 0. , 0.73]],

[[0. , 0. , 0. , 0. , 0. , 0.03, 0. , 0.2 , 0. , 0.77]],

[[0. , 0. , 0. , 0. , 0. , 0.03, 0. , 0.15, 0. , 0.83]],

[[0. , 0. , 0. , 0. , 0. , 0.07, 0. , 0.09, 0. , 0.85]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.02, 0. , 0.97]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.11, 0. , 0.89]],

[[0. , 0. , 0. , 0. , 0. , 0.05, 0. , 0.02, 0. , 0.93]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.22, 0. , 0.76]],

[[0. , 0. , 0. , 0. , 0. , 0.3 , 0. , 0.59, 0. , 0.11]],

[[0. , 0. , 0. , 0. , 0. , 0.41, 0. , 0.38, 0. , 0.22]],

[[0. , 0. , 0. , 0. , 0. , 0.05, 0. , 0.3 , 0. , 0.64]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.99]],

[[0. , 0. , 0. , 0. , 0. , 0.03, 0. , 0.29, 0. , 0.69]],

[[0. , 0. , 0. , 0. , 0. , 0.1 , 0. , 0.54, 0. , 0.35]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. ]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.1 , 0. , 0.89]],

[[0. , 0. , 0. , 0. , 0. , 0.04, 0. , 0.13, 0. , 0.83]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.23, 0. , 0.76]],

[[0. , 0. , 0. , 0. , 0. , 0.08, 0. , 0.22, 0. , 0.7 ]],

[[0. , 0. , 0. , 0. , 0. , 0.07, 0. , 0.01, 0. , 0.92]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.1 , 0. , 0.88]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.6 , 0. , 0.39]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.39, 0. , 0.6 ]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.07, 0. , 0.93]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.15, 0. , 0.85]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.04, 0. , 0.94]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 1. ]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.99]],

[[0. , 0. , 0. , 0. , 0. , 0.03, 0. , 0.04, 0. , 0.93]],

[[0. , 0. , 0. , 0. , 0. , 0.05, 0. , 0.48, 0. , 0.47]],

[[0. , 0. , 0. , 0. , 0. , 0.88, 0. , 0.06, 0. , 0.06]],

[[0. , 0. , 0. , 0. , 0. , 0.21, 0. , 0.05, 0. , 0.75]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.04, 0. , 0.95]],

[[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.98]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.04, 0. , 0.96]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.06, 0. , 0.93]],

[[0. , 0. , 0. , 0. , 0. , 0.06, 0. , 0.19, 0. , 0.75]],

[[0. , 0. , 0. , 0. , 0. , 0.15, 0. , 0.14, 0. , 0.71]],

[[0. , 0. , 0. , 0. , 0. , 0.02, 0. , 0.14, 0. , 0.85]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.01, 0. , 0.98]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.24, 0. , 0.76]],

[[0. , 0. , 0. , 0. , 0. , 0.01, 0. , 0.06, 0. , 0.93]],

[[0. , 0. , 0. , 0. , 0. , 0.11, 0. , 0.15, 0. , 0.75]],

[[0. , 0. , 0. , 0. , 0. , 0.07, 0. , 0.07, 0. , 0.87]]],


dtype=float32)

[113]: np.round(y_proba[:1], 2)

[113]: array([[0. , 0. , 0. , 0. , 0. , 0.07, 0. , 0.18, 0. , 0.75]],


dtype=float32)

[114]: y_std = y_probas.std(axis=0)
np.round(y_std[:1], 2)

[114]: array([[0. , 0. , 0. , 0. , 0. , 0.16, 0. , 0.21, 0. , 0.26]],


dtype=float32)

[115]: y_pred = np.argmax(y_proba, axis=1)

[116]: accuracy = np.sum(y_pred == y_test) / len(y_test)
accuracy

[116]: 0.8673

[117]: class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

class MCAlphaDropout(keras.layers.AlphaDropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

[118]: tf.random.set_seed(42)
np.random.seed(42)

[119]: mc_model = keras.models.Sequential([
    MCAlphaDropout(layer.rate) if isinstance(layer, keras.layers.AlphaDropout) else layer
    for layer in model.layers
])

[120]: mc_model.summary()

Model: "sequential_22"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten_20 (Flatten) (None, 784) 0
_________________________________________________________________
mc_alpha_dropout (MCAlphaDro (None, 784) 0
_________________________________________________________________
dense_366 (Dense) (None, 300) 235500
_________________________________________________________________
mc_alpha_dropout_1 (MCAlphaD (None, 300) 0
_________________________________________________________________
dense_367 (Dense) (None, 100) 30100
_________________________________________________________________
mc_alpha_dropout_2 (MCAlphaD (None, 100) 0
_________________________________________________________________
dense_368 (Dense) (None, 10) 1010
=================================================================
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________

[121]: optimizer = keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)
mc_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

[122]: mc_model.set_weights(model.get_weights())

Now we can use the model with MC Dropout:


[123]: np.round(np.mean([mc_model.predict(X_test_scaled[:1]) for sample in range(100)], axis=0), 2)

[123]: array([[0. , 0. , 0. , 0. , 0. , 0.11, 0. , 0.21, 0. , 0.68]],


dtype=float32)

7 Exercises
7.1 1. to 7.
See appendix A.

7.2 8. Deep Learning on CIFAR10


7.2.1 a.
Exercise: Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the
point of this exercise). Use He initialization and the ELU activation function.

[ ]: keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100,
                                 activation="elu",
                                 kernel_initializer="he_normal"))

7.2.2 b.
Exercise: Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset.
You can load it with keras.datasets.cifar10.load_data(). The dataset is composed of 60,000
32 × 32–pixel color images (50,000 for training, 10,000 for testing) with 10 classes, so you’ll need
a softmax output layer with 10 neurons. Remember to search for the right learning rate each time
you change the model’s architecture or hyperparameters.
Let’s add the output layer to the model:
[ ]: model.add(keras.layers.Dense(10, activation="softmax"))

Let’s use a Nadam optimizer with a learning rate of 5e-5. I tried learning rates 1e-5, 3e-5, 1e-4, 3e-4,
1e-3, 3e-3 and 1e-2, and I compared their learning curves for 10 epochs each (using the TensorBoard
callback, below). The learning rates 3e-5 and 1e-4 were pretty good, so I tried 5e-5, which turned
out to be slightly better.
[ ]: optimizer = keras.optimizers.Nadam(lr=5e-5)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

Let’s load the CIFAR10 dataset. We also want to use early stopping, so we need a validation set.
Let’s use the first 5,000 images of the original training set as the validation set:

[ ]: (X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()

X_train = X_train_full[5000:]
y_train = y_train_full[5000:]
X_valid = X_train_full[:5000]
y_valid = y_train_full[:5000]

Now we can create the callbacks we need and train the model:
[ ]: early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("my_cifar10_model.h5", save_best_only=True)

run_index = 1  # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_logs", "run_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

[ ]: %tensorboard --logdir=./my_cifar10_logs --port=6006

[ ]: model.fit(X_train, y_train, epochs=100,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)

[ ]: model = keras.models.load_model("my_cifar10_model.h5")
model.evaluate(X_valid, y_valid)

The model with the lowest validation loss gets about 47% accuracy on the validation set. It took
39 epochs to reach the lowest validation loss, with roughly 10 seconds per epoch on my laptop
(without a GPU). Let’s see if we can improve performance using Batch Normalization.

7.2.3 c.
Exercise: Now try adding Batch Normalization and compare the learning curves: Is it converging
faster than before? Does it produce a better model? How does it affect training speed?
The code below is very similar to the code above, with a few changes:
• I added a BN layer after every Dense layer (before the activation function), except for the
output layer. I also added a BN layer before the first hidden layer.
• I changed the learning rate to 5e-4. I experimented with 1e-5, 3e-5, 5e-5, 1e-4, 3e-4, 5e-4,
1e-3 and 3e-3, and I chose the one with the best validation performance after 20 epochs.
• I renamed the run directories to run_bn_* and the model file name to
my_cifar10_bn_model.h5.
[ ]: keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
model.add(keras.layers.BatchNormalization())
for _ in range(20):
    model.add(keras.layers.Dense(100, kernel_initializer="he_normal"))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Activation("elu"))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(lr=5e-4)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("my_cifar10_bn_model.h5", save_best_only=True)

run_index = 1  # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_logs", "run_bn_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

model.fit(X_train, y_train, epochs=100,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)

model = keras.models.load_model("my_cifar10_bn_model.h5")
model.evaluate(X_valid, y_valid)

• Is the model converging faster than before? Much faster! The previous model took 39 epochs
to reach the lowest validation loss, while the new model with BN took 18 epochs. That’s
more than twice as fast as the previous model. The BN layers stabilized training and allowed
us to use a much larger learning rate, so convergence was faster.
• Does BN produce a better model? Yes! The final model is also much better, with 55% accuracy
instead of 47%. It’s still not a very good model, but at least it’s much better than before (a
Convolutional Neural Network would do much better, but that’s a different topic, see chapter
14).
• How does BN affect training speed? Although the model converged twice as fast, each epoch took about 16s instead of 10s because of the extra computations required by the BN layers. So overall, although the number of epochs was more than halved, the wall-clock training time was only shortened by roughly 30% (39 × 10 s ≈ 390 s versus 18 × 16 s ≈ 288 s), which is still a significant saving.

7.2.4 d.
Exercise: Try replacing Batch Normalization with SELU, and make the necessary adjustments to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).

[ ]: keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100,
                                 kernel_initializer="lecun_normal",
                                 activation="selu"))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(lr=7e-4)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("my_cifar10_selu_model.h5", save_best_only=True)

run_index = 1  # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_logs", "run_selu_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

X_means = X_train.mean(axis=0)
X_stds = X_train.std(axis=0)
X_train_scaled = (X_train - X_means) / X_stds
X_valid_scaled = (X_valid - X_means) / X_stds
X_test_scaled = (X_test - X_means) / X_stds

model.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid),
          callbacks=callbacks)

model = keras.models.load_model("my_cifar10_selu_model.h5")
model.evaluate(X_valid_scaled, y_valid)

[ ]: model = keras.models.load_model("my_cifar10_selu_model.h5")
model.evaluate(X_valid_scaled, y_valid)

We get 51.4% accuracy, which is better than the original model, but not quite as good as the model
using batch normalization. Moreover, it took 13 epochs to reach the best model, which is much
faster than both the original model and the BN model, plus each epoch took only 10 seconds, just
like the original model. So it’s by far the fastest model to train (both in terms of epochs and wall
time).

7.2.5 e.
Exercise: Try regularizing the model with alpha dropout. Then, without retraining your model, see
if you can achieve better accuracy using MC Dropout.
[ ]: keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100,
                                 kernel_initializer="lecun_normal",
                                 activation="selu"))
model.add(keras.layers.AlphaDropout(rate=0.1))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.Nadam(lr=5e-4)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("my_cifar10_alpha_dropout_model.h5", save_best_only=True)

run_index = 1  # increment every time you train the model
run_logdir = os.path.join(os.curdir, "my_cifar10_logs", "run_alpha_dropout_{:03d}".format(run_index))
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

X_means = X_train.mean(axis=0)
X_stds = X_train.std(axis=0)
X_train_scaled = (X_train - X_means) / X_stds
X_valid_scaled = (X_valid - X_means) / X_stds
X_test_scaled = (X_test - X_means) / X_stds

model.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid),
          callbacks=callbacks)

model = keras.models.load_model("my_cifar10_alpha_dropout_model.h5")
model.evaluate(X_valid_scaled, y_valid)

The model reaches 50.8% accuracy on the validation set. That’s very slightly worse than without
dropout (51.4%). With an extensive hyperparameter search, it might be possible to do better (I
tried dropout rates of 5%, 10%, 20% and 40%, and learning rates 1e-4, 3e-4, 5e-4, and 1e-3), but
probably not much better in this case.
Let’s use MC Dropout now. We will need the MCAlphaDropout class we used earlier, so let’s just
copy it here for convenience:
[ ]: class MCAlphaDropout(keras.layers.AlphaDropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

Now let’s create a new model, identical to the one we just trained (with the same weights), but
with MCAlphaDropout dropout layers instead of AlphaDropout layers:
[ ]: mc_model = keras.models.Sequential([
    MCAlphaDropout(layer.rate) if isinstance(layer, keras.layers.AlphaDropout) else layer
    for layer in model.layers
])

Then let's add a couple of utility functions. The first runs the model many times (10 by default) and returns the mean predicted class probabilities. The second uses these mean probabilities to predict the most likely class for each instance:
[ ]: def mc_dropout_predict_probas(mc_model, X, n_samples=10):
    Y_probas = [mc_model.predict(X) for sample in range(n_samples)]
    return np.mean(Y_probas, axis=0)

def mc_dropout_predict_classes(mc_model, X, n_samples=10):
    Y_probas = mc_dropout_predict_probas(mc_model, X, n_samples)
    return np.argmax(Y_probas, axis=1)

Now let’s make predictions for all the instances in the validation set, and compute the accuracy:
[ ]: keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

y_pred = mc_dropout_predict_classes(mc_model, X_valid_scaled)
accuracy = np.mean(y_pred == y_valid[:, 0])
accuracy

We get virtually no accuracy improvement in this case (from 50.8% to 50.9%).
So the best model we got in this exercise is the Batch Normalization model.

7.2.6 f.
Exercise: Retrain your model using 1cycle scheduling and see if it improves training speed and
model accuracy.
[ ]: keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100,
                                 kernel_initializer="lecun_normal",
                                 activation="selu"))
model.add(keras.layers.AlphaDropout(rate=0.1))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.SGD(lr=1e-3)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

[ ]: batch_size = 128
rates, losses = find_learning_rate(model, X_train_scaled, y_train, epochs=1, batch_size=batch_size)
plot_lr_vs_loss(rates, losses)
plt.axis([min(rates), max(rates), min(losses), (losses[0] + min(losses)) / 1.4])

[ ]: keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100,
                                 kernel_initializer="lecun_normal",
                                 activation="selu"))
model.add(keras.layers.AlphaDropout(rate=0.1))
model.add(keras.layers.Dense(10, activation="softmax"))

optimizer = keras.optimizers.SGD(lr=1e-2)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

[ ]: n_epochs = 15
onecycle = OneCycleScheduler(len(X_train_scaled) // batch_size * n_epochs, max_rate=0.05)
history = model.fit(X_train_scaled, y_train, epochs=n_epochs, batch_size=batch_size,
                    validation_data=(X_valid_scaled, y_valid),
                    callbacks=[onecycle])

One cycle allowed us to train the model in just 15 epochs, each taking only 3 seconds (thanks to the
larger batch size). This is over 3 times faster than the fastest model we trained so far. Moreover,
we improved the model’s performance (from 50.8% to 52.8%). The batch normalized model reaches
a slightly better performance, but it’s much slower to train.