
DEEP LEARNING

MODULE 3
OPTIMIZATION FOR TRAINING DEEP MODELS
8.1 How Learning Differs from Pure Optimization

• Optimization algorithms used for training deep models differ from traditional optimization algorithms in several ways.
Machine learning usually acts indirectly.
• In most machine learning scenarios, we care about some performance measure P, which is defined with respect to the test set
and may also be intractable. We therefore optimize P only indirectly.
• We reduce a different cost function J(θ) in the hope that doing so will improve P. This is in contrast to pure optimization,
where minimizing J is a goal in and of itself.
• Optimization algorithms for training deep models also typically include some specialization on the specific structure of
machine learning objective functions.
Typically, the cost function can be written as an average over the training set, such as
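A plausible reconstruction of the missing equation 8.1, using the standard empirical-risk notation from the source textbook (L is the per-example loss, f(x; θ) the model's prediction, and p̂_data the empirical distribution over the m training examples):

$$J(\theta) = \mathbb{E}_{(\mathbf{x}, y) \sim \hat{p}_{\mathrm{data}}} L(f(\mathbf{x}; \theta), y) = \frac{1}{m} \sum_{i=1}^{m} L\!\left(f(\mathbf{x}^{(i)}; \theta), y^{(i)}\right)$$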
• Equation 8.1 defines an objective function with respect to the training set. We would usually prefer to minimize the
corresponding objective function where the expectation is taken across the data generating distribution p_data rather than just
over the finite training set:
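Under the same notation, the corresponding expectation over the data generating distribution (equation 8.2) would read:

$$J^*(\theta) = \mathbb{E}_{(\mathbf{x}, y) \sim p_{\mathrm{data}}} L(f(\mathbf{x}; \theta), y)$$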

8.1.1 Empirical Risk Minimization


• The goal of a machine learning algorithm is to reduce the expected generalization error given by equation 8.2. This quantity
is known as the risk. We emphasize here that the expectation is taken over the true underlying distribution p_data. If we knew
the true distribution p_data(x, y), risk minimization would be an optimization task solvable by an optimization algorithm.
However, when we do not know p_data(x, y) but only have a training set of samples, we have a machine learning problem.

• The training process based on minimizing this average training error is known as empirical risk minimization. However,
empirical risk minimization is prone to overfitting. Models with high capacity can simply memorize the training set. In
many cases, empirical risk minimization is not really feasible.
8.1.2 Surrogate Loss Functions and Early Stopping
• Sometimes, the loss function we actually care about (say classification error) is not one that can be optimized efficiently.
For example, exactly minimizing expected 0-1 loss is typically intractable (exponential in the input dimension), even for a
linear classifier (Marcotte and Savard, 1992).
• In such situations, one typically optimizes a surrogate loss function instead, which acts as a proxy but has advantages. For
example, the negative log-likelihood of the correct class is typically used as a surrogate for the 0-1 loss.
• The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input, and if
the model can do that well, then it can pick the classes that yield the least classification error in expectation.
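As an illustrative sketch (not from the text), the following numpy snippet contrasts the non-differentiable 0-1 loss with the negative log-likelihood surrogate for a single example; the logit values and class indices are made up for the example:

```python
import numpy as np

def zero_one_loss(logits, target):
    # 1 if the predicted class is wrong, 0 otherwise; piecewise constant,
    # so its gradient is zero almost everywhere and useless for training.
    return float(np.argmax(logits) != target)

def nll_surrogate(logits, target):
    # Negative log-likelihood of the correct class (softmax cross-entropy):
    # smooth in the logits, so gradient-based optimization can make progress.
    shifted = logits - logits.max()                 # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

logits = np.array([2.0, 0.5, -1.0])      # hypothetical scores for 3 classes
print(zero_one_loss(logits, target=1))   # 1.0 (prediction is class 0)
print(nll_surrogate(logits, target=1))   # finite, differentiable penalty
```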
• A very important difference between optimization in general and optimization as we use it for training algorithms is that
training algorithms do not usually halt at a local minimum.
• Instead, a machine learning algorithm usually minimizes a surrogate loss function but halts when a convergence criterion
based on early stopping (section 7.8) is satisfied.
8.1.3 Batch and Minibatch Algorithms
• One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective
function usually decomposes as a sum over the training examples.
• Optimization algorithms for machine learning typically compute each update to the parameters based on an expected value
of the cost function estimated using only a subset of the terms of the full cost function.
• Optimization algorithms that use the entire training set are called batch or deterministic gradient methods, because they
process all of the training examples simultaneously in a large batch.
• This terminology can be somewhat confusing because the word “batch” is also often used to describe the minibatch used by
minibatch stochastic gradient descent.
• Typically the term “batch gradient descent” implies the use of the full training set, while the use of the term “batch” to
describe a group of examples does not. For example, it is very common to use the term “batch size” to describe the size of a
minibatch.
• Optimization algorithms that use only a single example at a time are sometimes called stochastic or sometimes online
methods. The term online is usually reserved for the case where the examples are drawn from a stream of continually
created examples rather than from a fixed-size training set over which several passes are made.
• Most algorithms used for deep learning fall somewhere in between, using more than one but less than all of the training
examples. These were traditionally called minibatch or minibatch stochastic methods and it is now common to simply call
them stochastic methods.
• The fact that stochastic gradient descent minimizes generalization error is easiest to see in the online learning case, where
examples or minibatches are drawn from a stream of data.
• In other words, instead of receiving a fixed-size training set, the learner is similar to a living being who sees a new example
at each instant, with every example (x, y) coming from the data generating distribution p_data(x, y). In this scenario, examples
are never repeated; every experience is a fair sample from p_data.
8.2 Challenges in Neural Network Optimization
• Optimization in general is an extremely difficult task. Traditionally, machine learning has avoided the difficulty of general
optimization by carefully designing the objective function and constraints to ensure that the optimization problem is
convex.
• When training neural networks, we must confront the general non-convex case. Even convex optimization is not without its
complications. In this section, we summarize several of the most prominent challenges involved in optimization for training
deep models.

8.2.1 Ill-Conditioning
• Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the
Hessian matrix H. This is a very general problem in most numerical optimization, convex or otherwise, and is described in
more detail in section 4.3.1.
• The ill-conditioning problem is generally believed to be present in neural network training problems. Ill-conditioning can
manifest by causing SGD to get “stuck” in the sense that even very small steps increase the cost function. Recall from
equation 4.9 that a second-order Taylor series expansion of the cost function predicts that a gradient descent step of −εg
will add

½ε²gᵀHg − εgᵀg

to the cost. Ill-conditioning of the gradient becomes a problem when ½ε²gᵀHg exceeds εgᵀg.
• To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient
norm gᵀg and the gᵀHg term. In many cases, the gradient norm does not shrink significantly throughout learning, but the
gᵀHg term grows by more than an order of magnitude.
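A minimal sketch of this diagnostic on a toy quadratic cost (the Hessian, starting point, and learning rate are all made up for illustration):

```python
import numpy as np

# Toy quadratic cost J(θ) = ½ θᵀHθ with an ill-conditioned Hessian
# (eigenvalues 1 and 100, condition number 100).
H = np.diag([1.0, 100.0])
theta = np.array([1.0, 1.0])

def grad(theta):
    return H @ theta

eps = 0.025                 # learning rate
g = grad(theta)
gTg = g @ g                 # squared gradient norm
gTHg = g @ H @ g            # curvature term along the gradient direction

# Predicted change in cost from a step of -eps * g (second-order Taylor):
predicted_change = 0.5 * eps**2 * gTHg - eps * gTg
print(gTg, gTHg, predicted_change)
# When ½ε²gᵀHg exceeds εgᵀg, the predicted change is positive: the step
# would increase the cost despite following the gradient downhill.
```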
• The result is that learning becomes very slow despite the presence of a strong gradient because the learning rate must be
shrunk to compensate for even stronger curvature. Figure 8.1 shows an example of the gradient increasing significantly
during the successful training of a neural network.
8.2.2 Local Minima
• One of the most prominent features of a convex optimization problem is that it can be reduced to the problem of finding a
local minimum. Any local minimum is guaranteed to be a global minimum.
• Some convex functions have a flat region at the bottom rather than a single global minimum point, but any point within
such a flat region is an acceptable solution. When optimizing a convex function, we know that we have reached a good
solution if we find a critical point of any kind.
• With non-convex functions, such as neural nets, it is possible to have many local minima. Indeed, nearly any deep model is
essentially guaranteed to have an extremely large number of local minima. However, as we will see, this is not necessarily a
major problem.
• Neural networks and any models with multiple equivalently parametrized latent variables all have multiple local minima
because of the model identifiability problem.
• A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model’s parameters.
Models with latent variables are often not identifiable because we can obtain equivalent models by exchanging latent
variables with each other.
• For example, we could take a neural network and modify layer 1 by swapping the incoming weight vector for unit i with the
incoming weight vector for unit j, then doing the same for the outgoing weight vectors. If we have m layers with n units
each, then there are n!^m ways of arranging the hidden units. This kind of non-identifiability is known as weight space
symmetry.
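A small numpy sketch of weight space symmetry (the layer sizes and random values are made up for illustration): swapping two hidden units, i.e. permuting rows of the first layer's weights and the matching columns of the second layer's weights, leaves the network function unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 1-hidden-layer network: input (3) -> hidden (4) -> output (2).
W1 = rng.normal(size=(4, 3))
b1 = rng.normal(size=4)
W2 = rng.normal(size=(2, 4))

def forward(x, W1, b1, W2):
    h = np.tanh(W1 @ x + b1)
    return W2 @ h

# Swap hidden units 0 and 1: permute rows of W1/b1 and columns of W2.
perm = np.array([1, 0, 2, 3])
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=3)
print(np.allclose(forward(x, W1, b1, W2), forward(x, W1p, b1p, W2p)))  # True
```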
8.2.3 Plateaus, Saddle Points and Other Flat Regions
• For many high-dimensional non-convex functions, local minima (and maxima) are in fact rare compared to another kind of
point with zero gradient: a saddle point. Some points around a saddle point have greater cost than the saddle point, while
others have a lower cost.
• At a saddle point, the Hessian matrix has both positive and negative eigenvalues. Points lying along eigenvectors associated
with positive eigenvalues have greater cost than the saddle point, while points lying along eigenvectors associated with
negative eigenvalues have lower cost.
• We can think of a saddle point as being a local minimum along one cross-section of the cost function and a local maximum
along another cross-section. For example, f(x, y) = x² − y² has a saddle point at the origin: a minimum along the x-axis and
a maximum along the y-axis, with Hessian eigenvalues 2 and −2.
• What are the implications of the proliferation of saddle points for training algorithms? For first-order optimization
algorithms that use only gradient information, the situation is unclear. The gradient can often become very small near a
saddle point.
• On the other hand, gradient descent empirically seems to be able to escape saddle points in many cases. Goodfellow et al.
(2015) provided visualizations of several learning trajectories of state-of-the-art neural networks, with an example given in
figure 8.2.
• These visualizations show a flattening of the cost function near a prominent saddle point where the weights are all zero, but
they also show the gradient descent trajectory rapidly escaping this region.
• Goodfellow et al. (2015) also argue that continuous-time gradient descent may be shown analytically to be repelled from,
rather than attracted to, a nearby saddle point, but the situation may be different for more realistic uses of gradient descent.
• Gradient descent is designed to move “downhill” and is not explicitly designed to seek a critical point. Newton’s
method, however, is designed to solve for a point where the gradient is zero. Without appropriate modification, it
can jump to a saddle point.
• The proliferation of saddle points in high dimensional spaces presumably explains why second-order methods
have not succeeded in replacing gradient descent for neural network training. Dauphin et al. (2014) introduced a
saddle-free Newton method for second-order optimization and showed that it improves significantly over the
traditional version. Second-order methods remain difficult to scale to large neural networks, but this saddle-free
approach holds promise if it could be scaled.
8.2.4 Cliffs and Exploding Gradients
• Neural networks with many layers often have extremely steep regions resembling cliffs, as illustrated in figure 8.3. These
result from the multiplication of several large weights together.
• On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually
jumping off of the cliff structure altogether.

• The cliff can be dangerous whether we approach it from above or from below, but fortunately its most serious consequences
can be avoided using the gradient clipping heuristic. The basic idea is to recall that the gradient does not specify the
optimal step size, but only the optimal direction within an infinitesimal region.
Key Terms (Reference)
• Neural Networks and Layers: Neural networks are composed of multiple layers, each containing units (neurons)
that transform input data. Deep neural networks have many such layers.
• Steep Regions ("Cliffs"): In the loss landscape of a neural network, steep regions (or "cliffs") can occur due to the
multiplication of large weights. This can make the landscape highly non-linear, causing rapid changes in the loss
function with small changes in the weights.
• Gradient Update Steps: The gradient of the loss function indicates the direction to adjust the weights to minimize
the loss. However, in regions with extreme steepness, the update step based on the gradient can be very large,
potentially leading to overshooting—meaning the update can move the weights too far, possibly even jumping
out of the steep region entirely.
• Cliff Dangers: Approaching these cliffs can be problematic whether we come at them from above or from below in the loss
landscape. The overshooting can cause instability in training and prevent the model from converging to a
good solution.
• Gradient Clipping: To mitigate the issues caused by these steep gradients, a technique called gradient clipping is
often used. This involves setting a threshold for the size of the gradient. If the gradient exceeds this threshold, it
is scaled down to maintain a more manageable update step. This helps keep the training process stable by
ensuring that updates remain within a reasonable range.
• Optimal Step Size: It's important to note that while the gradient gives the direction to move in, it doesn't dictate
how far to move. Gradient clipping focuses on controlling the size of the step, ensuring it remains within a
reasonable limit, thus improving the stability and performance of the training process.
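A minimal sketch of gradient clipping by norm (the threshold value and gradient are made up for illustration; deep learning frameworks ship their own built-in versions of this operation):

```python
import numpy as np

def clip_gradient_by_norm(grad, threshold):
    # If the gradient norm exceeds the threshold, rescale the gradient so its
    # norm equals the threshold; the direction is preserved, only the step
    # size is limited.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([300.0, -400.0])          # hypothetical exploding gradient (norm 500)
print(clip_gradient_by_norm(g, 5.0))   # [ 3. -4.]  (norm 5, same direction)
```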
8.2.5 Long-Term Dependencies
• Another difficulty that neural network optimization algorithms must overcome arises when the computational
graph becomes extremely deep. Feedforward networks with many layers have such deep computational
graphs.
• For example, suppose that a computational graph contains a path that consists of repeatedly multiplying by a
matrix W. After t steps, this is equivalent to multiplying by Wᵗ. Suppose that W has an eigendecomposition
W = V diag(λ) V⁻¹. In this simple case, it is straightforward to see that

Wᵗ = (V diag(λ) V⁻¹)ᵗ = V diag(λ)ᵗ V⁻¹
• Any eigenvalues λᵢ that are not near an absolute value of 1 will either explode if they are greater than 1 in
magnitude or vanish if they are less than 1 in magnitude. The vanishing and exploding gradient problem
refers to the fact that gradients through such a graph are also scaled according to diag(λ)ᵗ. Vanishing
gradients make it difficult to know which direction the parameters should move to improve the cost function,
while exploding gradients can make learning unstable.
• The repeated multiplication by W at each time step described here is very similar to the power method
algorithm used to find the largest eigenvalue of a matrix W and the corresponding eigenvector. From this
point of view it is not surprising that xᵀWᵗ will eventually discard all components of x that are orthogonal to
the principal eigenvector of W.
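An illustrative numpy sketch (the matrix values are made up) of how repeated multiplication by W scales each eigen-direction by λᵢᵗ, so components with |λ| > 1 explode and components with |λ| < 1 vanish:

```python
import numpy as np

# A diagonal W makes the eigenvalues explicit: 1.1 (explodes) and 0.9 (vanishes).
W = np.diag([1.1, 0.9])
x = np.array([1.0, 1.0])

for t in [1, 10, 50]:
    xt = np.linalg.matrix_power(W, t) @ x
    print(t, xt)
# At t=50: first component ≈ 117.4, second component ≈ 0.005; after many
# steps only the direction of the largest eigenvalue survives, as with the
# power method.
```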
8.2.6 Inexact Gradients
• Most optimization algorithms are designed with the assumption that we have access to the exact gradient or Hessian matrix.
In practice, we usually only have a noisy or even biased estimate of these quantities.
• Nearly every deep learning algorithm relies on sampling-based estimates at least insofar as using a minibatch of training
examples to compute the gradient.
• In other cases, the objective function to minimize is actually intractable. When the objective function is intractable,
typically its gradient is intractable as well. In such cases we can only approximate the gradient. These issues mostly arise
with the more advanced models.

8.2.7 Poor Correspondence between Local and Global Structure


• Many of the problems we have discussed so far correspond to properties of the loss function at a single point—it can be
difficult to make a single step if J(θ) is poorly conditioned at the current point θ, or if θ lies on a cliff, or if θ is a saddle
point hiding the opportunity to make progress downhill from the gradient.
• It is possible to overcome all of these problems at a single point and still perform poorly if the direction that results in the
most improvement locally does not point toward distant regions of much lower cost.
• See figure 8.4 for an example of a failure of local optimization to find a good cost function value even in the absence of any
local minima or saddle points. Future research will need to develop further understanding of the factors that influence the
length of the learning trajectory and better characterize the outcome of the process.
8.2.8 Theoretical Limits of Optimization
• Several theoretical results show that there are limits on the performance of any optimization algorithm that is
designed for neural networks. Typically these results have little bearing on the use of neural networks in practice.
Some theoretical results apply only to the case where the units of a neural network output discrete values.
• However, most neural network units output smoothly increasing values that make optimization via local search
feasible. Some theoretical results show that there exist problem classes that are intractable, but it can be difficult to
tell whether a particular problem falls into that class.
8.3 Basic Algorithms

8.3.1 Stochastic Gradient Descent


• Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning
in general and for deep learning in particular. It is possible to obtain an unbiased estimate of the gradient by taking the
average gradient on a minibatch of m examples drawn i.i.d from the data generating distribution.

• A crucial parameter for the SGD algorithm is the learning rate. Previously, we have described SGD as using a fixed learning
rate ε. In practice, it is necessary to gradually decrease the learning rate over time, so we now denote the learning rate at
iteration k as εₖ.
Algorithm explanation: at each iteration, SGD samples a minibatch of m examples, estimates the gradient as the minibatch average, and applies the update θ ← θ − εₖĝ, as sketched below.
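A minimal numpy sketch of the SGD update loop; the loss (mean squared error on a synthetic linear-regression problem), the dataset, and the learning-rate schedule are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # synthetic inputs
true_theta = np.arange(1.0, 6.0)
y = X @ true_theta + 0.1 * rng.normal(size=1000)  # synthetic targets

def minibatch_grad(theta, X_batch, y_batch):
    # Gradient of mean squared error over the minibatch: an unbiased
    # estimate of the full-dataset gradient when examples are drawn i.i.d.
    err = X_batch @ theta - y_batch
    return 2.0 * X_batch.T @ err / len(y_batch)

theta = np.zeros(5)
for k in range(1, 2001):
    idx = rng.integers(0, len(y), size=32)        # minibatch of m = 32 examples
    g = minibatch_grad(theta, X[idx], y[idx])
    eps_k = 0.1 / (1 + 0.01 * k)                  # decaying learning rate ε_k
    theta -= eps_k * g                            # SGD update: θ ← θ − ε_k ĝ

print(np.round(theta, 2))   # should approach [1, 2, 3, 4, 5]
```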
8.3.2 Momentum
• While stochastic gradient descent remains a very popular optimization strategy, learning with it can sometimes be slow.
The method of momentum is designed to accelerate learning, especially in the face of high curvature, small but consistent
gradients, or noisy gradients.
• The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move
in their direction.
• Formally, the momentum algorithm introduces a variable v that plays the role of velocity—it is the direction and speed at
which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the
negative gradient.
• The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through
parameter space, according to Newton’s laws of motion. Momentum in physics is mass times velocity.
• In the momentum learning algorithm, we assume unit mass, so the velocity vector v may also be regarded as the momentum
of the particle. A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients
exponentially decay. The update rule is given by:
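The update rule, reconstructed here in the standard form (ε is the learning rate, α the momentum coefficient, and the gradient is averaged over a minibatch of m examples):

$$\mathbf{v} \leftarrow \alpha \mathbf{v} - \epsilon \nabla_{\boldsymbol\theta}\!\left( \frac{1}{m} \sum_{i=1}^{m} L\!\left(f(\mathbf{x}^{(i)}; \boldsymbol\theta), y^{(i)}\right) \right), \qquad \boldsymbol\theta \leftarrow \boldsymbol\theta + \mathbf{v}$$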
• Previously, the size of the step was simply the norm of the gradient multiplied by the learning rate. Now, the size of
the step depends on how large and how aligned a sequence of gradients are. The step size is largest when many
successive gradients point in exactly the same direction. If the momentum algorithm always observes gradient g,
then it will accelerate in the direction of −g, until reaching a terminal velocity where the size of each step is ε‖g‖/(1 − α).
8.3.3 Nesterov Momentum
• Sutskever et al. (2013) introduced a variant of the momentum algorithm that was inspired by Nesterov’s accelerated
gradient method (Nesterov, 1983, 2004). The update rules in this case are given by:
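Reconstructed in the standard form, the Nesterov variant evaluates the gradient after the current velocity has been applied (an interim point θ + αv), which is the only difference from standard momentum:

$$\mathbf{v} \leftarrow \alpha \mathbf{v} - \epsilon \nabla_{\boldsymbol\theta}\!\left( \frac{1}{m} \sum_{i=1}^{m} L\!\left(f(\mathbf{x}^{(i)}; \boldsymbol\theta + \alpha\mathbf{v}), y^{(i)}\right) \right), \qquad \boldsymbol\theta \leftarrow \boldsymbol\theta + \mathbf{v}$$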
8.4 Parameter Initialization Strategies
• Some optimization algorithms are non-iterative, while others are iterative and converge well regardless of initialization.
However, deep learning training algorithms are typically iterative and highly sensitive to initialization.
• The initial point for training can significantly impact whether the algorithm converges, how quickly it converges, and the
quality of the solution (in terms of cost and generalization error).
• Designing improved initialization strategies is challenging due to limited understanding of neural network optimization. An
initialization strategy beneficial for optimization might be detrimental for generalization.
• Initial parameters need to "break symmetry" between different units; otherwise, units with the same parameters would
perform identically, reducing the model's capacity to learn diverse features.
• Random initialization from a high-entropy distribution is preferred because it is computationally cheaper and ensures that
each unit computes a different function. While more structured methods like Gram-Schmidt orthogonalization exist, they
are often costly.
• Biases and extra parameters (like those encoding conditional variance) are usually initialized to heuristically chosen
constants, while weights are initialized randomly (typically from a Gaussian or uniform distribution).
• The scale of the initial weight distribution has a significant impact. Larger initial weights help break symmetry and
propagate signals better but can lead to issues like exploding gradients, chaos in recurrent networks, or saturation of
activation functions.
• Gradient descent with early stopping is somewhat analogous to weight decay, favoring solutions closer to the initial
parameters. This suggests that choosing an initial parameter set close to zero can serve as a form of regularization.
• Some heuristics are available for choosing the initial scale of the weights. One heuristic is to initialize the weights of a fully
connected layer with m inputs and n outputs by sampling each weight from U(−1/√m, 1/√m), while Glorot and Bengio
(2010) suggest using the normalized initialization:
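The normalized (Glorot) initialization referred to above is commonly written, for a layer with m inputs and n outputs, as:

$$W_{i,j} \sim U\!\left(-\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}}\right)$$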

• One drawback to scaling rules that set all of the initial weights to have the same standard deviation, such as 1/√m, is that
every individual weight becomes extremely small when the layers become large. Martens (2010) introduced an alternative
initialization scheme called sparse initialization in which each unit is initialized to have exactly k non-zero weights.
8.5 Algorithms with Adaptive Learning Rates
• Neural network researchers have long realized that the learning rate is reliably one of the hyperparameters that is most
difficult to set, because it has a significant impact on model performance.
• The momentum algorithm can mitigate these issues somewhat, but does so at the expense of introducing another
hyperparameter. In the face of this, it is natural to ask if there is another way.
• If we believe that the directions of sensitivity are somewhat axis-aligned, it can make sense to use a separate learning rate
for each parameter, and automatically adapt these learning rates throughout the course of learning.
• The delta-bar-delta algorithm (Jacobs, 1988) is an early heuristic approach to adapting individual learning rates for model
parameters during training. The approach is based on a simple idea: if the partial derivative of the loss, with respect to a
given model parameter, keeps the same sign, then the learning rate should increase; if it changes sign, the learning rate should decrease.
8.5.1 AdaGrad
• The AdaGrad algorithm, shown in algorithm 8.4, individually adapts the learning rates of all model parameters by scaling
them inversely proportional to the square root of the sum of all of their historical squared values (Duchi et al., 2011). The
parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while
parameters with small partial derivatives have a relatively small decrease in their learning rate.
• The net effect is greater progress in the more gently sloped directions of parameter space. In the context of convex
optimization, the AdaGrad algorithm enjoys some desirable theoretical properties. However, empirically it has been found
that—for training deep neural network models—the accumulation of squared gradients from the beginning of training can
result in a premature and excessive decrease in the effective learning rate. AdaGrad performs well for some but not all deep
learning models.
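A minimal numpy sketch of the AdaGrad per-parameter update (the toy objective, its gradient, and the constants are illustrative; the small constant δ prevents division by zero):

```python
import numpy as np

def adagrad_step(theta, grad, r, eps=0.01, delta=1e-7):
    # Accumulate the sum of squared gradients for each parameter, then scale
    # the learning rate inversely to the square root of that accumulated sum.
    r += grad * grad
    theta -= eps * grad / (delta + np.sqrt(r))
    return theta, r

theta = np.array([1.0, 1.0])
r = np.zeros_like(theta)                    # running sum of squared gradients
for _ in range(100):
    grad = np.array([2.0, 0.02]) * theta    # toy gradient: one steep, one flat direction
    theta, r = adagrad_step(theta, grad, r)
print(theta)  # the steep coordinate's effective learning rate shrinks fastest
```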
8.5.2 RMSProp
• The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting by changing the
gradient accumulation into an exponentially weighted moving average. AdaGrad is designed to converge rapidly when
applied to a convex function.
• When applied to a non-convex function to train a neural network, the learning trajectory may pass through many different
structures and eventually arrive at a region that is a locally convex bowl.
• RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly
after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.
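A corresponding sketch of the RMSProp step (the decay rate ρ and other constants shown are illustrative defaults); the only change from AdaGrad is that the squared gradients are accumulated as an exponentially weighted moving average:

```python
import numpy as np

def rmsprop_step(theta, grad, r, eps=0.001, rho=0.9, delta=1e-6):
    # Exponentially decaying average of squared gradients, so history from
    # the extreme past is discarded rather than accumulated forever.
    r = rho * r + (1.0 - rho) * grad * grad
    theta -= eps * grad / np.sqrt(delta + r)
    return theta, r
```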
8.5.3 Adam
• Adam (Kingma and Ba, 2014) is yet another adaptive learning rate optimization algorithm and is
presented in algorithm 8.7. The name “Adam” derives from the phrase “adaptive moments.” In the
context of the earlier algorithms, it is perhaps best seen as a variant on the combination of RMSProp and
momentum with a few important distinctions.
• First, in Adam, momentum is incorporated directly as an estimate of the first order moment (with
exponential weighting) of the gradient. The most straightforward way to add momentum to RMSProp is
to apply momentum to the rescaled gradients.
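A hedged sketch of the Adam step described in the cited paper (Kingma and Ba, 2014): an exponentially weighted first-moment (momentum-like) estimate combined with an RMSProp-style second-moment estimate, both bias-corrected because they are initialized at zero; the constants shown are the commonly cited defaults.

```python
import numpy as np

def adam_step(theta, grad, s, r, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    # First moment (momentum-like) and second moment (RMSProp-like) estimates.
    s = rho1 * s + (1.0 - rho1) * grad
    r = rho2 * r + (1.0 - rho2) * grad * grad
    # Bias correction compensates for both moments starting at zero.
    s_hat = s / (1.0 - rho1 ** t)
    r_hat = r / (1.0 - rho2 ** t)
    theta -= eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

# Usage: t starts at 1 and increases each step; s and r start as zero arrays.
```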
8.5.4 Choosing the Right Optimization Algorithm
• In this section, we discussed a series of related algorithms that each seek to address the challenge of optimizing deep
models by adapting the learning rate for each model parameter.
• At this point, a natural question is: which algorithm should one choose? Unfortunately, there is currently no consensus on
this point. Schaul et al. (2014) presented a valuable comparison of a large number of optimization algorithms across a wide
range of learning tasks.
• While the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and
AdaDelta) performed fairly robustly, no single best algorithm has emerged.
• Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp,
RMSProp with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend
largely on the user’s familiarity with the algorithm (for ease of hyperparameter tuning).
