Figure 8.12: Illustration of the effect of condition number on the convergence speed of steepest descent with exact line searches. (a) Large condition number. (b) Small condition number. Generated by lineSearchConditionNum.ipynb.
8.2.4.1 Momentum
One simple heuristic, known as the heavy ball or momentum method [Ber99], is to move faster
along directions that were previously good, and to slow down along directions where the gradient has
suddenly changed, just like a ball rolling downhill. This can be implemented as follows:
$$m_t = \beta m_{t-1} + g_{t-1} \tag{8.30}$$
$$\theta_t = \theta_{t-1} - \eta_t m_t \tag{8.31}$$
where $m_t$ is the momentum (mass times velocity) and $0 < \beta < 1$. A typical value of $\beta$ is 0.9. For $\beta = 0$, the method reduces to gradient descent.
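As a minimal sketch (using an illustrative ill-conditioned quadratic loss, a hand-picked step size, and a fixed iteration budget, none of which come from the text above), the heavy ball update of Equations (8.30)-(8.31) can be written in NumPy as follows:

```python
import numpy as np

# Heavy ball (momentum) gradient descent on an illustrative quadratic loss
# L(theta) = 0.5 * theta^T A theta, whose minimum is at the origin.
A = np.diag([10.0, 1.0])            # condition number kappa = 10
grad = lambda theta: A @ theta      # gradient of the quadratic loss

theta = np.array([1.0, 1.0])        # initial parameters
m = np.zeros_like(theta)            # momentum vector, m_0 = 0
beta, lr = 0.9, 0.05                # 0 < beta < 1; beta = 0 recovers plain gradient descent

for t in range(200):
    g = grad(theta)                 # g_{t-1}
    m = beta * m + g                # Eq. (8.30): m_t = beta * m_{t-1} + g_{t-1}
    theta = theta - lr * m          # Eq. (8.31): theta_t = theta_{t-1} - eta_t * m_t

print(theta)                        # close to the minimizer at the origin
```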
We see that $m_t$ is like an exponentially weighted moving average of the past gradients (see Section 4.4.2.2):
$$m_t = \beta m_{t-1} + g_{t-1} = \beta^2 m_{t-2} + \beta g_{t-2} + g_{t-1} = \cdots = \sum_{\tau=0}^{t-1} \beta^{\tau} g_{t-\tau-1} \tag{8.32}$$
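The unrolling in Equation (8.32) is easy to check numerically; the sketch below (with arbitrary random vectors standing in for the gradients, which are assumptions made for illustration only) compares the recursion against the closed-form weighted sum:

```python
import numpy as np

# Check Eq. (8.32): unrolling m_t = beta*m_{t-1} + g_{t-1} (with m_0 = 0)
# gives an exponentially weighted sum of the past gradients.
rng = np.random.default_rng(0)
beta, T = 0.9, 5
grads = [rng.standard_normal(3) for _ in range(T)]    # stand-ins for g_0, ..., g_{T-1}

m = np.zeros(3)
for g in grads:                                        # recursive form
    m = beta * m + g

# closed form: m_T = sum_{tau=0}^{T-1} beta^tau * g_{T-tau-1}
m_closed = sum(beta**tau * grads[T - tau - 1] for tau in range(T))
assert np.allclose(m, m_closed)
```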
Figure 8.13: Illustration of the Nesterov update. Adapted from Figure 11.6 of [Gér19].
Thus in the limit, since $\sum_{\tau=0}^{\infty} \beta^{\tau} = 1/(1-\beta)$, we effectively multiply the gradient by $1/(1-\beta)$. For example, if $\beta = 0.9$, we scale the gradient up by a factor of 10.
Since we update the parameters using the gradient average $m_t$, rather than just the most recent gradient, $g_{t-1}$, we see that past gradients can exert some influence on the present. Furthermore,
when momentum is combined with SGD, discussed in Section 8.4, we will see that it can simulate
the effects of a larger minibatch, without the computational cost.
The Nesterov accelerated gradient method is essentially a form of one-step "look ahead" that can reduce the amount of oscillation, as illustrated in Figure 8.13.
Nesterov accelerated gradient can also be written in the same format as standard momentum. In this case, the momentum term is updated using the gradient at the predicted new location:
$$m_{t+1} = \beta m_t - \eta_t \nabla \mathcal{L}(\theta_t + \beta m_t)$$
$$\theta_{t+1} = \theta_t + m_{t+1}$$
This explains why the Nesterov accelerated gradient method is sometimes called Nesterov momentum.
It also shows how this method can be faster than standard momentum: the momentum vector
is already roughly pointing in the right direction, so measuring the gradient at the new location,
is already roughly pointing in the right direction, so measuring the gradient at the new location, $\theta_t + \beta m_t$, rather than the current location, $\theta_t$, can be more accurate.
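A minimal sketch of this form of Nesterov momentum, reusing the same illustrative quadratic loss and hand-picked hyperparameters as before (assumptions for illustration, not values from the text), is:

```python
import numpy as np

# Nesterov momentum written in the same format as standard momentum:
# the gradient is evaluated at the look-ahead point theta + beta*m, not at theta.
A = np.diag([10.0, 1.0])               # illustrative quadratic loss L(theta) = 0.5 * theta^T A theta
grad = lambda theta: A @ theta

theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)
beta, lr = 0.9, 0.05

for t in range(200):
    g_ahead = grad(theta + beta * m)   # gradient at the predicted new location
    m = beta * m - lr * g_ahead        # m_{t+1} = beta*m_t - eta_t * grad(theta_t + beta*m_t)
    theta = theta + m                  # theta_{t+1} = theta_t + m_{t+1}

print(theta)                           # converges toward the minimum at the origin
```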
The Nesterov accelerated gradient method is provably faster than steepest descent for convex functions when $\beta$ and $\eta_t$ are chosen appropriately. It is called "accelerated" because of this improved convergence rate.