
Figure 8.12: Illustration of the effect of condition number $\kappa$ on the convergence speed of steepest descent with exact line searches. (a) Large $\kappa$. (b) Small $\kappa$. Generated by lineSearchConditionNum.ipynb.

8.2.4 Momentum methods


Gradient descent can move very slowly along flat regions of the loss landscape, as we illustrated in
Figure 8.11. We discuss some solutions to this below.

8.2.4.1 Momentum
One simple heuristic, known as the heavy ball or momentum method [Ber99], is to move faster
along directions that were previously good, and to slow down along directions where the gradient has
suddenly changed, just like a ball rolling downhill. This can be implemented as follows:

$m_t = \beta m_{t-1} + g_{t-1}$   (8.30)
$\theta_t = \theta_{t-1} - \eta_t m_t$   (8.31)

where $m_t$ is the momentum (mass times velocity) and $0 < \beta < 1$. A typical value of $\beta$ is 0.9. For $\beta = 0$, the method reduces to gradient descent.
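
As a concrete illustration, here is a minimal sketch of this update in Python/NumPy. The function name (gd_momentum), the gradient callback grad_fn, and the diagonal quadratic test loss are our own illustrative choices, not code from the book:

import numpy as np

def gd_momentum(grad_fn, theta0, lr=0.1, beta=0.9, n_steps=100):
    """Heavy-ball (momentum) gradient descent, following Eqs. (8.30)-(8.31)."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)           # momentum buffer, m_0 = 0
    for _ in range(n_steps):
        g = grad_fn(theta)             # gradient g_{t-1} at the current iterate
        m = beta * m + g               # Eq. (8.30): m_t = beta * m_{t-1} + g_{t-1}
        theta = theta - lr * m         # Eq. (8.31): theta_t = theta_{t-1} - eta_t * m_t
    return theta

# Illustrative elongated quadratic loss L(theta) = 0.5 * theta^T A theta
A = np.diag([10.0, 1.0])
grad = lambda th: A @ th
print(gd_momentum(grad, theta0=[1.0, 1.0], lr=0.05, beta=0.9))  # approaches the minimum at the origin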
We see that $m_t$ is like an exponentially weighted moving average of the past gradients (see Section 4.4.2.2):

$m_t = \beta m_{t-1} + g_{t-1} = \beta^2 m_{t-2} + \beta g_{t-2} + g_{t-1} = \cdots = \sum_{\tau=0}^{t-1} \beta^{\tau} g_{t-\tau-1}$   (8.32)

If all the past gradients are a constant, say $g$, this simplifies to

$m_t = g \sum_{\tau=0}^{t-1} \beta^{\tau}$   (8.33)

The scaling factor is a geometric series, whose infinite sum is given by


$1 + \beta + \beta^2 + \cdots = \sum_{i=0}^{\infty} \beta^{i} = \frac{1}{1-\beta}$   (8.34)


Figure 8.13: Illustration of the Nesterov update. Adapted from Figure 11.6 of [Gér19].

Thus in the limit, we multiply the gradient by $1/(1-\beta)$. For example, if $\beta = 0.9$, we scale the gradient up by 10.
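
As a quick numerical check of this limiting scale factor, here is a short standalone snippet (not from the book's code). With $\beta = 0.9$ and a constant unit gradient, the momentum converges to $1/(1-\beta) = 10$:

beta, g = 0.9, 1.0                # constant gradient, as in Eq. (8.33)
m = 0.0
for _ in range(200):              # iterate the EWMA until it has effectively converged
    m = beta * m + g              # m_t = beta * m_{t-1} + g
print(m, 1.0 / (1.0 - beta))      # both print ~10: the gradient is scaled by 1/(1 - beta)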
Since we update the parameters using the gradient average $m_{t-1}$, rather than just the most recent gradient, $g_{t-1}$, we see that past gradients can exhibit some influence on the present. Furthermore, when momentum is combined with SGD, discussed in Section 8.4, we will see that it can simulate the effects of a larger minibatch, without the computational cost.

8.2.4.2 Nesterov momentum


One problem with the standard momentum method is that it may not slow down enough at the
bottom of a valley, causing oscillation. The Nesterov accelerated gradient method of [Nes04]
instead modifies the gradient descent to include an extrapolation step, as follows:
$\tilde{\theta}_{t+1} = \theta_t + \beta(\theta_t - \theta_{t-1})$   (8.35)
$\theta_{t+1} = \tilde{\theta}_{t+1} - \eta_t \nabla \mathcal{L}(\tilde{\theta}_{t+1})$   (8.36)

This is essentially a form of one-step “look ahead”, that can reduce the amount of oscillation, as
illustrated in Figure 8.13.
Nesterov accelerated gradient can also be rewritten in the same format as standard momentum. In
this case, the momentum term is updated using the gradient at the predicted new location,

$m_{t+1} = \beta m_t - \eta_t \nabla \mathcal{L}(\theta_t + \beta m_t)$   (8.37)
$\theta_{t+1} = \theta_t + m_{t+1}$   (8.38)

This explains why the Nesterov accelerated gradient method is sometimes called Nesterov momentum. It also shows how this method can be faster than standard momentum: the momentum vector is already roughly pointing in the right direction, so measuring the gradient at the new location, $\theta_t + \beta m_t$, rather than the current location, $\theta_t$, can be more accurate.
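
A minimal sketch of this momentum-form update, in the same illustrative setup as the earlier momentum snippet (the name nesterov_momentum and the quadratic objective are our own, not the book's code):

import numpy as np

def nesterov_momentum(grad_fn, theta0, lr=0.1, beta=0.9, n_steps=100):
    """Nesterov accelerated gradient in momentum form, following Eqs. (8.37)-(8.38)."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    for _ in range(n_steps):
        lookahead = theta + beta * m              # predicted new location theta_t + beta * m_t
        m = beta * m - lr * grad_fn(lookahead)    # Eq. (8.37): gradient measured at the look-ahead point
        theta = theta + m                         # Eq. (8.38): theta_{t+1} = theta_t + m_{t+1}
    return theta

# Same illustrative quadratic as before
A = np.diag([10.0, 1.0])
grad = lambda th: A @ th
print(nesterov_momentum(grad, theta0=[1.0, 1.0], lr=0.05, beta=0.9))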
The Nesterov accelerated gradient method is provably faster than steepest descent for convex functions when $\beta$ and $\eta_t$ are chosen appropriately. It is called "accelerated" because of this improved convergence rate.