Figure 8.12: Illustration of the effect of condition number on the convergence speed of steepest descent with exact line searches. (a) Large condition number. (b) Small condition number. Generated by lineSearchConditionNum.ipynb.
8.2.4.1 Momentum
One simple heuristic, known as the heavy ball or momentum method [Ber99], is to move faster
along directions that were previously good, and to slow down along directions where the gradient has
suddenly changed, just like a ball rolling downhill. This can be implemented as follows:
$$m_t = \beta m_{t-1} + g_{t-1} \tag{8.30}$$
$$\theta_t = \theta_{t-1} - \eta_t m_t \tag{8.31}$$
where $m_t$ is the momentum (mass times velocity) and $0 < \beta < 1$. A typical value of $\beta$ is 0.9. For $\beta = 0$, the method reduces to gradient descent.
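As a minimal sketch (using an illustrative ill-conditioned quadratic loss, a hand-picked step size, and a fixed iteration budget, none of which come from the text above), the heavy ball update of Equations (8.30)-(8.31) can be written in NumPy as follows:

```python
import numpy as np

# Heavy ball (momentum) gradient descent on an illustrative quadratic loss
# L(theta) = 0.5 * theta^T A theta, whose minimum is at the origin.
A = np.diag([10.0, 1.0])            # condition number kappa = 10
grad = lambda theta: A @ theta      # gradient of the quadratic loss

theta = np.array([1.0, 1.0])        # initial parameters
m = np.zeros_like(theta)            # momentum vector, m_0 = 0
beta, lr = 0.9, 0.05                # 0 < beta < 1; beta = 0 recovers plain gradient descent

for t in range(200):
    g = grad(theta)                 # g_{t-1}
    m = beta * m + g                # Eq. (8.30): m_t = beta * m_{t-1} + g_{t-1}
    theta = theta - lr * m          # Eq. (8.31): theta_t = theta_{t-1} - eta_t * m_t

print(theta)                        # close to the minimizer at the origin
```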
We see that $m_t$ is like an exponentially weighted moving average of the past gradients (see Section 4.4.2.2):
$$m_t = \beta m_{t-1} + g_{t-1} = \beta^2 m_{t-2} + \beta g_{t-2} + g_{t-1} = \cdots = \sum_{\tau=0}^{t-1} \beta^{\tau} g_{t-\tau-1} \tag{8.32}$$
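The unrolling in Equation (8.32) is easy to check numerically; the sketch below (with arbitrary random vectors standing in for the gradients, which are assumptions made for illustration only) compares the recursion against the closed-form weighted sum:

```python
import numpy as np

# Check Eq. (8.32): unrolling m_t = beta*m_{t-1} + g_{t-1} (with m_0 = 0)
# gives an exponentially weighted sum of the past gradients.
rng = np.random.default_rng(0)
beta, T = 0.9, 5
grads = [rng.standard_normal(3) for _ in range(T)]    # stand-ins for g_0, ..., g_{T-1}

m = np.zeros(3)
for g in grads:                                        # recursive form
    m = beta * m + g

# closed form: m_T = sum_{tau=0}^{T-1} beta^tau * g_{T-tau-1}
m_closed = sum(beta**tau * grads[T - tau - 1] for tau in range(T))
assert np.allclose(m, m_closed)
```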
Figure 8.13: Illustration of the Nesterov update. Adapted from Figure 11.6 of [Gér19].
Thus in the limit, since $\sum_{\tau=0}^{\infty} \beta^{\tau} = 1/(1-\beta)$, we effectively multiply the gradient by $1/(1-\beta)$. For example, if $\beta = 0.9$, we scale the gradient up by a factor of 10.
Since we update the parameters using the gradient average $m_t$, rather than just the most recent gradient, $g_{t-1}$, we see that past gradients can exert some influence on the present. Furthermore,
when momentum is combined with SGD, discussed in Section 8.4, we will see that it can simulate
the effects of a larger minibatch, without the computational cost.
The Nesterov accelerated gradient method is essentially a form of one-step "look ahead" that can reduce the amount of oscillation, as illustrated in Figure 8.13.
Nesterov accelerated gradient can also be written in the same format as standard momentum. In this case, the momentum term is updated using the gradient at the predicted new location:
$$m_{t+1} = \beta m_t - \eta_t \nabla \mathcal{L}(\theta_t + \beta m_t)$$
$$\theta_{t+1} = \theta_t + m_{t+1}$$
This explains why the Nesterov accelerated gradient method is sometimes called Nesterov momentum.
It also shows how this method can be faster than standard momentum: the momentum vector
is already roughly pointing in the right direction, so measuring the gradient at the new location,
is already roughly pointing in the right direction, so measuring the gradient at the new location, $\theta_t + \beta m_t$, rather than the current location, $\theta_t$, can be more accurate.
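A minimal sketch of this form of Nesterov momentum, reusing the same illustrative quadratic loss and hand-picked hyperparameters as before (assumptions for illustration, not values from the text), is:

```python
import numpy as np

# Nesterov momentum written in the same format as standard momentum:
# the gradient is evaluated at the look-ahead point theta + beta*m, not at theta.
A = np.diag([10.0, 1.0])               # illustrative quadratic loss L(theta) = 0.5 * theta^T A theta
grad = lambda theta: A @ theta

theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)
beta, lr = 0.9, 0.05

for t in range(200):
    g_ahead = grad(theta + beta * m)   # gradient at the predicted new location
    m = beta * m - lr * g_ahead        # m_{t+1} = beta*m_t - eta_t * grad(theta_t + beta*m_t)
    theta = theta + m                  # theta_{t+1} = theta_t + m_{t+1}

print(theta)                           # converges toward the minimum at the origin
```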
The Nesterov accelerated gradient method is provably faster than steepest descent for convex functions when $\beta$ and $\eta_t$ are chosen appropriately. It is called "accelerated" because of this improved convergence rate.