Lec 6
Optimization Part 1
1
Story so far
• Neural networks are universal approximators
– Can model any odd thing
– Provided they have the right architecture
• We must train them to approximate any function
– Specify the architecture
– Learn their weights and biases
• Networks are trained to minimize total “loss” on a training
set
– We do so through empirical risk minimization
• We use variants of gradient descent to do so
• The gradient of the error with respect to network
parameters is computed through backpropagation
2
Recap: Gradient Descent Algorithm
• In order to minimize any function f(w) w.r.t. w
• Initialize:
– w^(0)
– k = 0
• Do
– w^(k+1) = w^(k) − η ∇f(w^(k))^T
– k = k + 1
• while |f(w^(k+1)) − f(w^(k))| > ε
3
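The loop above can be sketched in runnable form; the objective, function names, and constants here are illustrative stand-ins, not from the lecture:

```python
import numpy as np

def gradient_descent(grad_f, w0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Iterate w <- w - eta * grad_f(w) until the update stalls."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_f(w)
        w = w - step
        if np.linalg.norm(step) < tol:  # gradient ~ 0: fixed point reached
            break
    return w

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=[0.0])
```

The stopping test here uses the step norm as a proxy for the change in f, one of several reasonable convergence checks.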
Recap: Training Neural Nets by Gradient
Descent
Total training error:
Loss(W) = (1/T) Σ_t div(f(X_t; W), d_t)
6
Poll 0
Backpropagating from the kth layer, which is the derivative for the weights W_k?
• The product of the output of the (k−1)th layer and the derivative for the affine value of the kth layer (in that order) (correct: ∇_{W_k}Div = y_{k−1} ∇_{z_k}Div)
• The product of the derivative for the affine value at the kth layer and the output of the (k−1)th layer (in that order)
• The product of the transpose of the output of the (k−1)th layer and the derivative for the affine value of the kth layer (in that order)
• The product of the derivative for the affine value at the kth layer and the transpose of the output of the (k−1)th layer (in that order)
8
Onward
9
Onward
• Does backprop always work?
• Convergence of gradient descent
– Rates, restrictions
– Hessians
– Acceleration and Nesterov’s method
– Alternate approaches
• Modifying the approach: Stochastic gradients
• Speedup extensions: RMSprop, Adagrad
10
Does backprop do the right thing?
• Is backprop always right?
– Assuming it actually finds the minimum of the
divergence function?
11
Recap: The differentiable activation
[Figure: threshold activations at T1 and T2, and sigmoid activations shifted from T1 to T2]
• Threshold activation: Equivalent to counting errors
– Shifting the threshold from T1 to T2 does not change classification error
– Does not indicate if moving the threshold left was good or not
• Differentiable activation: Computes “distance to answer”
– “Distance” == divergence
– Perturbing the function changes this quantity
• Even if the classification error itself doesn’t change
12
Does backprop do the right thing?
• Is backprop always right?
– Assuming it actually finds the global minimum of the loss
(average divergence)?
13
Backprop fails to separate where perceptron succeeds
• Training data: (0,1) → +1, (−1,0) → −1, (1,0) → +1, plus a fourth point (0,−t) → +1
• Let every training point be fit with high confidence
– E.g. u = f⁻¹(0.99), representing a 99% confidence in the class
• From the first three points we get three independent equations, which fully determine the weights
• Consider backprop: the contribution of the fourth point (0,−t) to the derivative of the L2 error shrinks as t grows
– The sigmoid saturates, so the gradient from this point vanishes even when the point is poorly fit
18
Backprop
Notation:
• f(·) = logistic activation
• Output: y = f(w₁x₁ + w₂x₂ + b)
• Divergence: L2 error, (y − d)²
19
Backprop
[Figure: the training points (0,1) → +1, (−1,0) → −1, (1,0) → +1, (0,−t) → +1, shown as the fourth point moves down]
30
Backprop fails to separate even when
possible
31
Backpropagation: Finding the separator
• Backpropagation will often not find a separating
solution even though the solution is within the
class of functions learnable by the network
Poll 1
Minimizing the (differentiable) loss function will also minimize classification error, true or false?
• True
• False (correct)
34
The Loss Surface
• The example (and statements) earlier assumed the loss objective had a single global optimum that could be found
– The statement about variance assumes the global optimum is reached
35
The Loss Surface
• Popular hypothesis:
– In large networks, saddle points are far more
common than local minima
• Frequency of occurrence exponential in network size
– Most local minima are equivalent
• And close to global minimum
– This is not true for small networks
• Dauphin et al. (2014), “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”: an exponential number of saddle points in large networks
• For large networks, the loss function may have a large number of
unpleasant saddle points or local minima
– Which backpropagation may find
38
Convergence
• In the discussion so far we have assumed the
training arrives at a local minimum
39
A quick tour of (convex) optimization
40
Convex Loss Functions
• A surface is “convex” if it is continuously curving upward
– The line segment connecting any two points on the surface lies on or above the surface
[Figure: contour plot of a convex function]
41
Convergence of gradient descent
• An iterative algorithm is said to converge to a solution if the value updates arrive at a fixed point
– Where the gradient is 0 and further updates do not change the estimate
• The algorithm may not actually converge
– It may jitter around the local minimum
– It may even diverge
[Figure: trajectories illustrating converging, jittering, and diverging behavior]
42
Convergence and convergence rate
• Convergence rate: How fast the iterations arrive at the solution
• Generally quantified as
R = |f(x^(k+1)) − f(x*)| / |f(x^(k)) − f(x*)|
– where x^(k) is the k-th iteration and x* is the optimum
43
Convergence for quadratic surfaces
• Gradient descent on a quadratic: w^(k+1) = w^(k) − η E′(w^(k))
44
Convergence for quadratic surfaces
• Any quadratic objective can be written as
E(w) = E(w^(k)) + E′(w^(k))(w − w^(k)) + ½ E″(w^(k))(w − w^(k))²
– Taylor expansion
• Note: the minimum of this expansion is at w = w^(k) − E′(w^(k))/E″(w^(k)), so the optimal step size is η_opt = 1/E″(w^(k))
• For η < η_opt we have monotonic convergence
• For η_opt < η < 2η_opt we have oscillating convergence
• For η > 2η_opt we get divergence
46
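The three step-size regimes above can be checked numerically on a small quadratic; the coefficient and rates below are illustrative choices, not the lecture's:

```python
# E(w) = (a/2) w^2, so E''(w) = a and eta_opt = 1/a.
def gd_trajectory(a, eta, w0=1.0, steps=50):
    w = w0
    traj = [w]
    for _ in range(steps):
        w = w - eta * a * w   # gradient of (a/2) w^2 is a*w
        traj.append(w)
    return traj

a = 2.0                            # second derivative, eta_opt = 0.5
mono = gd_trajectory(a, eta=0.25)  # eta < 1/a: monotonic convergence
osc = gd_trajectory(a, eta=0.75)   # 1/a < eta < 2/a: oscillating convergence
div = gd_trajectory(a, eta=1.25)   # eta > 2/a: divergence
```

Each update multiplies w by (1 − ηa): 0.5, −0.5, and −1.5 respectively, which makes the three behaviors explicit.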
For generic differentiable convex objectives
• Any differentiable convex objective can be approximated as
E(w) ≈ E(w^(k)) + ∇E(w^(k))(w − w^(k)) + ½(w − w^(k))^T H(w^(k))(w − w^(k))
– Taylor expansion, with Hessian H
• Using the same logic as before, we get (Newton’s method)
w^(k+1) = w^(k) − H(w^(k))⁻¹ ∇E(w^(k))^T
• When H is diagonal, each component has its own optimal step size η_i = 1/h_ii
50
Multivariate Quadratic with Diagonal A
• E(w) = ½ w^T A w + b^T w + c, with diagonal A
51
“Descents” are uncoupled
• With diagonal A, the objective separates: E = Σ_i (½ a_ii w_i² + b_i w_i) + c
• Each w_i descends independently, with per-component optimal step size η_i,opt = 1/a_ii
54
Dependence on learning rate
• With a single global η applied to all components, convergence along each direction depends on the ratio of η to that direction’s optimal rate η_i,opt = 1/a_ii
– η < η_i,opt: monotonic convergence along direction i
– η_i,opt < η < 2η_i,opt: oscillating convergence along direction i
– η > 2η_i,opt: divergence along direction i
55
Problem with vector update rule
• The vector update rule forces the same η on every component, so it must satisfy the most restrictive direction: η < 2 min_i η_i,opt
56
Dependence on learning rate
• Convergence along the shallowest direction is then very slow
57
Generic differentiable multivariate convex functions
• For generic convex multivariate functions (not necessarily quadratic), we can employ quadratic Taylor series expansions and much of the analysis still applies
• Taylor expansion
E(w) ≈ E(w^(k)) + ∇E(w^(k))(w − w^(k)) + ½(w − w^(k))^T H_E(w^(k))(w − w^(k))
• The optimal step size is inversely proportional to the eigenvalues of the Hessian
– The second derivative along the orthogonal coordinates
– For the smoothest convergence, these must all be equal
58
Convergence
• Convergence behaviors become increasingly unpredictable as dimensions increase
• For the fastest convergence, ideally, the learning rate must suit both the largest eigenvalue λ_max and the smallest eigenvalue λ_min of the Hessian
– To ensure convergence in every direction
– Generally infeasible
• Convergence is particularly slow if the condition number λ_max/λ_min is large
• The convergence for other kinds of functions can be viewed against this benchmark
• Actual losses will not be quadratic, but may locally have other structure
– Local quadratic structure between the current location and the nearest local minimum
60
Quadratic convexity
[Figure: strongly convex function bounded below by a quadratic; from Wikipedia]
• Most functions are not strongly convex (if they are convex at all)
• Instead we will talk in terms of Lipschitz smoothness
• But first: a definition
• Lipschitz continuous: The function always lies outside a cone
– The slope of the outer surface is the Lipschitz constant K
– |f(x) − f(y)| ≤ K|x − y|
64
Lipschitz smoothness
• A function is Lipschitz smooth if its gradient is Lipschitz continuous
• A function can be convex and Lipschitz smooth, but not strongly convex
– Convex, but with an upper bound on the second derivative
– Weaker convergence guarantees, if any (at best linear)
– This is often a reasonable assumption for the local structure of your loss function
68
Convergence Problems
• For quadratic (strongly) convex functions, gradient descent is exponentially fast
– Linear convergence
• Assuming the learning rate is non-divergent
• Second-order methods will locally convert the loss function to quadratic
– Convergence behavior will still depend on the nature of the original function
• A fix: scale (and rotate) the axes such that all of them have identical (identity) “spread”
– Equal-value contours are circular
– Movement along the coordinate axes becomes independent
• Note: the equation of a quadratic surface with circular equal-value contours can be written as
E(ŵ) = ½ ŵ^T ŵ + b̂^T ŵ + c
72
Scaling the axes
• Original equation:
E = ½ w^T A w + b^T w + c
• We want a transformed variable ŵ = Sw such that
E = ½ ŵ^T ŵ + b̂^T ŵ + c
• By inspection, equating the two forms:
ŵ^T ŵ = w^T A w and b̂^T ŵ = b^T w
• Solving: S^T S = A, b̂ = S^(−T) b
• We have the eigen decomposition A = UΛU^T
– U is an orthogonal matrix
– Λ is a diagonal matrix of non-zero diagonal entries
• Defining S = Λ^(1/2) U^T
– Check: S^T S = U Λ^(1/2) Λ^(1/2) U^T = A
• Defining ŵ = Sw and b̂ = S^(−T) b
– Check: ŵ^T ŵ = w^T S^T S w = w^T A w
78
Returning to our problem
• Gradient descent in the transformed space:
ŵ^(k+1) = ŵ^(k) − η ∇_ŵ E(ŵ^(k))^T
80
Modified update rule
• Using ŵ = Sw, ∇_ŵE = ∇_wE S⁻¹, and S^T S = A
• Leads to the modified gradient descent rule
w^(k+1) = w^(k) − η A⁻¹ ∇_w E(w^(k))^T
– Scaling by the inverse of the quadratic’s Hessian
81
For non-axis-aligned quadratics..
• Taylor expansion
E(w) = E(w^(k)) + ∇E(w^(k))(w − w^(k)) + ½(w − w^(k))^T A (w − w^(k))
• The same modified update rule applies, with the full (non-diagonal) A
84
Generic differentiable multivariate convex functions
• Taylor expansion
E(w) ≈ E(w^(k)) + ∇E(w^(k))(w − w^(k)) + ½(w − w^(k))^T H_E(w^(k))(w − w^(k))
– H_E(w^(k)) is the Hessian of E at w^(k)
86
Minimization by Newton’s method
• Iterated, locally optimal updates using the local quadratic approximation:
w^(k+1) = w^(k) − η H_E(w^(k))⁻¹ ∇E(w^(k))^T
– With η = 1, a single step lands on the minimum of the local quadratic approximation
– Repeating the step converges to the minimum of a convex function
96
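The Newton update above can be demonstrated on a small quadratic, where a single step with η = 1 is exact; the matrix and vector below are illustrative:

```python
import numpy as np

# E(w) = 1/2 w^T A w + b^T w, so grad E = A w + b and the Hessian is A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])       # Hessian (constant for a quadratic)
b = np.array([-1.0, 4.0])

def grad_E(w):
    return A @ w + b

w = np.zeros(2)
# One Newton step: w <- w - A^{-1} grad_E(w); solve instead of inverting.
w = w - np.linalg.solve(A, grad_E(w))
```

After the step, the gradient A w + b is zero: the minimum has been reached exactly, which is the sense in which Newton's method "converts" the problem to a single step on a quadratic.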
Issues: 1. The Hessian
• Normalized update rule
w^(k+1) = w^(k) − η H_E(w^(k))⁻¹ ∇E(w^(k))^T
101
Issues: 1. The Hessian
• For a network with N parameters, the Hessian is an N×N matrix
– Computing, storing, and inverting it is infeasible for large networks
101
Issues: 2. The learning rate
• A fixed step size may never settle into the minimum; a decaying step size can
103
Decaying learning rate
• Typical decay schedules
– Linear decay: η_k = η₀/(1 + k)
– Quadratic decay: η_k = η₀/(1 + k)²
– Exponential decay: η_k = η₀ e^(−βk), where β > 0
104
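The three schedules above are easy to write down directly; η₀ and β below are hypothetical values for illustration:

```python
import math

eta_0, beta = 0.1, 0.05   # hypothetical initial rate and decay constant

def linear_decay(k):
    return eta_0 / (1 + k)

def quadratic_decay(k):
    return eta_0 / (1 + k) ** 2

def exponential_decay(k):
    return eta_0 * math.exp(-beta * k)
```

Linear decay shrinks slowly enough to keep making progress; quadratic and exponential decay shut the step size down much faster, trading exploration for quicker settling.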
Story so far : Convergence
• Gradient descent can miss obvious answers
– And this may be a good thing
106
Poll 2
Mark all true statements (refer to slide 117):
• Step sizes greater than twice the inverse of the second derivative can cause gradient descent to diverge (correct)
• This is always a bad thing
• Gradient descent will not converge without decaying learning rates
107
Story so far : Second-order methods
• Second-order methods “normalize” the variation
along the components to mitigate the problem of
different optimal learning rates for different
components
– But this requires computation of inverses of second-
order derivative matrices
– Computationally infeasible
– Not stable in non-convex regions of the loss surface
– Approximate methods address these issues, but
simpler solutions may be better
108
Story so far : Learning rate
• Divergence-causing learning rates may not be a
bad thing
– Particularly for ugly loss functions
• Decaying learning rates provide good
compromise between escaping poor local minima
and convergence
• Alternative approaches use adaptive, per-parameter step sizes:
– Rprop
– Quickprop
111
RProp
• Resilient propagation
• Simple algorithm, to be followed independently for each
component
– I.e. steps in different directions are not coupled
• At each time
– If the derivative at the current location recommends continuing in the
same direction as before (i.e. has not changed sign from earlier):
• increase the step, and continue in the same direction
– If the derivative has changed sign (i.e. we’ve overshot a minimum)
• reduce the step and reverse direction
112
Rprop
• For each parameter w(l,i,j), maintain a per-parameter step Δw(l,i,j), initialized to Δ₀, and the previous derivative D(l,i,j)
• While not converged, independently for each parameter:
– Compute the derivative D′(l,i,j) = dLoss/dw(l,i,j) (obtained via backprop)
– If D′(l,i,j) has the same sign as D(l,i,j): Δw(l,i,j) = min(α·Δw(l,i,j), Δ_max), with α > 1
– If the sign has flipped: Δw(l,i,j) = max(β·Δw(l,i,j), Δ_min), with β < 1
• Ceiling and floor on the step
– Update: w(l,i,j) = w(l,i,j) − sign(D′(l,i,j))·Δw(l,i,j)
– Set D(l,i,j) = D′(l,i,j)
• Note: different parameters are updated independently
122
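A runnable sketch of the rule above for a parameter vector. The constants (α = 1.2, β = 0.5) are common choices that the slide does not pin down, and for brevity this is the simplified variant that shrinks the step on a sign flip without reverting the weight:

```python
import numpy as np

def rprop(grad_f, w0, step0=0.1, alpha=1.2, beta=0.5,
          step_min=1e-6, step_max=1.0, iters=100):
    w = np.asarray(w0, dtype=float)
    step = np.full_like(w, step0)       # one step size per parameter
    prev_sign = np.zeros_like(w)
    for _ in range(iters):
        g = grad_f(w)
        sign = np.sign(g)
        same = sign * prev_sign > 0     # derivative kept its sign: grow
        flipped = sign * prev_sign < 0  # overshot a minimum: shrink
        step[same] = np.minimum(step[same] * alpha, step_max)    # ceiling
        step[flipped] = np.maximum(step[flipped] * beta, step_min)  # floor
        w = w - step * sign             # each component moves independently
        prev_sign = sign
    return w

# Example: minimize ||w - (1, -2)||^2, whose gradient is 2(w - target).
w = rprop(lambda w: 2 * (w - np.array([1.0, -2.0])), w0=[0.0, 0.0])
```

Only the sign of the derivative is used; its magnitude never enters the update, which is what makes the per-component steps fully uncoupled.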
Poll 3
The derivative of the loss w.r.t. a parameter w, computed at the current estimate, is positive. After taking a step (updating the parameter by an increment dw) the sign of the derivative becomes negative. Mark all true statements:
• Rprop will revert to the earlier estimate and take a smaller step (correct)
• Rprop will change direction and begin taking steps in the opposite direction
123
QuickProp
125
QuickProp: Modification 1
• Within each component, approximate E(w) by a quadratic fitted to the derivatives at the current and previous estimates
[Figure: E(w) vs. w, with a quadratic fit through w^(k−1) and w^(k)]
• The second derivative is estimated empirically from successive first derivatives:
E″(w^(k)) ≈ (E′(w^(k−1)) − E′(w^(k))) / (w^(k−1) − w^(k))
128
QuickProp
• Newton update with the empirical second derivative (first derivatives computed using backprop):
Δw^(k) = Δw^(k−1) · E′(w^(k)) / (E′(w^(k−1)) − E′(w^(k)))
129
Quickprop
• Employs Newton updates with empirically derived derivatives
130
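The secant-style update above can be sketched for a single parameter; the bootstrap rate, iteration count, and stopping guard are my additions for illustration:

```python
def quickprop(grad_f, w0, eta=0.1, iters=20):
    w = w0
    g_prev = grad_f(w)
    dw = -eta * g_prev          # bootstrap with an ordinary gradient step
    w += dw
    for _ in range(iters):
        g = grad_f(w)
        if abs(g) < 1e-12 or g == g_prev:  # at the minimum / flat: stop
            break
        # Newton step with the empirical second derivative
        # (g_prev - g) / (w_prev - w), folded into the ratio below.
        dw = dw * g / (g_prev - g)
        w += dw
        g_prev = g
    return w

# Example: minimize (w - 3)^2, whose gradient is 2(w - 3).
w = quickprop(lambda w: 2 * (w - 3.0), w0=0.0)
```

On a true quadratic, the empirical second derivative is exact, so the very first secant step lands on the minimum.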
Story so far : Convergence
• Gradient descent can miss obvious answers
– And this may be a good thing
138
Momentum Update
• Maintain a running average of past gradients and step along it:
ΔW^(k) = β ΔW^(k−1) − η ∇E(W^(k−1))^T
W^(k) = W^(k−1) + ΔW^(k)
139
Momentum Update
• Typical β ≈ 0.9
– Steps grow in directions of consistent gradient sign, and shrink in directions that oscillate
142
Nesterov’s Accelerated Gradient
• Nesterov’s method: first extend the previous step, then compute the gradient at the resulting look-ahead position
ΔW^(k) = β ΔW^(k−1) − η ∇E(W^(k−1) + β ΔW^(k−1))^T
W^(k) = W^(k−1) + ΔW^(k)
147
Nesterov’s Accelerated Gradient
• For every layer: compute the gradient at the look-ahead point, then apply the update above
150
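The two updates above differ only in where the gradient is evaluated; this sketch shows both on an illustrative quadratic, with hypothetical η and β:

```python
import numpy as np

def momentum(grad_f, w0, eta=0.1, beta=0.9, iters=200):
    w = np.asarray(w0, dtype=float)
    dw = np.zeros_like(w)
    for _ in range(iters):
        dw = beta * dw - eta * grad_f(w)             # gradient at current w
        w = w + dw
    return w

def nesterov(grad_f, w0, eta=0.1, beta=0.9, iters=200):
    w = np.asarray(w0, dtype=float)
    dw = np.zeros_like(w)
    for _ in range(iters):
        dw = beta * dw - eta * grad_f(w + beta * dw)  # gradient at look-ahead
        w = w + dw
    return w

target = np.array([1.0, -2.0])
g = lambda w: 2 * (w - target)   # gradient of ||w - target||^2
wm = momentum(g, [0.0, 0.0])
wn = nesterov(g, [0.0, 0.0])
```

Evaluating the gradient at the look-ahead point lets Nesterov's method partially correct a step that is about to overshoot, which is why it tends to oscillate less than plain momentum.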
Poll 4
On a flat surface of constant slope, momentum methods will converge faster than vanilla gradient descent, true or false?
• True
• False (correct) – momentum only changes the step size
152
Story so far
• Gradient descent can miss obvious answers
– And this may be a good thing
154