
DEEP LEARNING

MODULE 3
OPTIMIZATION FOR TRAINING DEEP MODELS
8.1 How Learning Differs from Pure Optimization

• Optimization algorithms used for training deep models differ from traditional optimization algorithms in several ways.
Machine learning usually acts indirectly.
• In most machine learning scenarios, we care about some performance measure P, which is defined with respect to the test set
and may also be intractable. We therefore optimize P only indirectly.
• We reduce a different cost function J(θ) in the hope that doing so will improve P. This is in contrast to pure optimization,
where minimizing J is a goal in and of itself.
• Optimization algorithms for training deep models also typically include some specialization on the specific structure of
machine learning objective functions.
Typically, the cost function can be written as an average over the training set, such as
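A plausible reconstruction of the missing equation 8.1, using the standard empirical-risk notation from the source textbook (L is the per-example loss, f(x; θ) the model's prediction, and p̂_data the empirical distribution over the m training examples):

$$J(\theta) = \mathbb{E}_{(\mathbf{x}, y) \sim \hat{p}_{\mathrm{data}}} L(f(\mathbf{x}; \theta), y) = \frac{1}{m} \sum_{i=1}^{m} L\!\left(f(\mathbf{x}^{(i)}; \theta), y^{(i)}\right)$$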
• Equation 8.1 defines an objective function with respect to the training set. We would usually prefer to minimize the
corresponding objective function where the expectation is taken across the data generating distribution p_data rather than just
over the finite training set:
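Under the same notation, the corresponding expectation over the data generating distribution (equation 8.2) would read:

$$J^*(\theta) = \mathbb{E}_{(\mathbf{x}, y) \sim p_{\mathrm{data}}} L(f(\mathbf{x}; \theta), y)$$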

8.1.1 Empirical Risk Minimization


• The goal of a machine learning algorithm is to reduce the expected generalization error given by equation 8.2. This quantity
is known as the risk. We emphasize here that the expectation is taken over the true underlying distribution p_data. If we knew
the true distribution p_data(x, y), risk minimization would be an optimization task solvable by an optimization algorithm.
However, when we do not know p_data(x, y) but only have a training set of samples, we have a machine learning problem.

• The training process based on minimizing this average training error is known as empirical risk minimization. However,
empirical risk minimization is prone to overfitting. Models with high capacity can simply memorize the training set. In
many cases, empirical risk minimization is not really feasible.
8.1.2 Surrogate Loss Functions and Early Stopping
• Sometimes, the loss function we actually care about (say classification error) is not one that can be optimized efficiently.
For example, exactly minimizing expected 0-1 loss is typically intractable (exponential in the input dimension), even for a
linear classifier (Marcotte and Savard, 1992).
• In such situations, one typically optimizes a surrogate loss function instead, which acts as a proxy but has advantages. For
example, the negative log-likelihood of the correct class is typically used as a surrogate for the 0-1 loss.
• The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input, and if
the model can do that well, then it can pick the classes that yield the least classification error in expectation.
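As an illustrative sketch (not from the text), the following numpy snippet contrasts the non-differentiable 0-1 loss with the negative log-likelihood surrogate for a single example; the logit values and class indices are made up for the example:

```python
import numpy as np

def zero_one_loss(logits, target):
    # 1 if the predicted class is wrong, 0 otherwise; piecewise constant,
    # so its gradient is zero almost everywhere and useless for training.
    return float(np.argmax(logits) != target)

def nll_surrogate(logits, target):
    # Negative log-likelihood of the correct class (softmax cross-entropy):
    # smooth in the logits, so gradient-based optimization can make progress.
    shifted = logits - logits.max()                 # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

logits = np.array([2.0, 0.5, -1.0])      # hypothetical scores for 3 classes
print(zero_one_loss(logits, target=1))   # 1.0 (prediction is class 0)
print(nll_surrogate(logits, target=1))   # finite, differentiable penalty
```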
• A very important difference between optimization in general and optimization as we use it for training algorithms is that
training algorithms do not usually halt at a local minimum.
• Instead, a machine learning algorithm usually minimizes a surrogate loss function but halts when a convergence criterion
based on early stopping (section 7.8) is satisfied.
8.1.3 Batch and Minibatch Algorithms
• One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective
function usually decomposes as a sum over the training examples.
• Optimization algorithms for machine learning typically compute each update to the parameters based on an expected value
of the cost function estimated using only a subset of the terms of the full cost function.
• Optimization algorithms that use the entire training set are called batch or deterministic gradient methods, because they
process all of the training examples simultaneously in a large batch.
• This terminology can be somewhat confusing because the word “batch” is also often used to describe the minibatch used by
minibatch stochastic gradient descent.
• Typically the term “batch gradient descent” implies the use of the full training set, while the use of the term “batch” to
describe a group of examples does not. For example, it is very common to use the term “batch size” to describe the size of a
minibatch.
• Optimization algorithms that use only a single example at a time are sometimes called stochastic or sometimes online
methods. The term online is usually reserved for the case where the examples are drawn from a stream of continually
created examples rather than from a fixed-size training set over which several passes are made.
• Most algorithms used for deep learning fall somewhere in between, using more than one but less than all of the training
examples. These were traditionally called minibatch or minibatch stochastic methods and it is now common to simply call
them stochastic methods.
• The fact that stochastic gradient descent minimizes generalization error is easiest to see in the online learning case, where
examples or minibatches are drawn from a stream of data.
• In other words, instead of receiving a fixed-size training set, the learner is similar to a living being who sees a new example
at each instant, with every example (x, y) coming from the data generating distribution p_data(x, y). In this scenario, examples
are never repeated; every experience is a fair sample from p_data.
8.2 Challenges in Neural Network Optimization
• Optimization in general is an extremely difficult task. Traditionally, machine learning has avoided the difficulty of general
optimization by carefully designing the objective function and constraints to ensure that the optimization problem is
convex.
• When training neural networks, we must confront the general non-convex case. Even convex optimization is not without its
complications. In this section, we summarize several of the most prominent challenges involved in optimization for training
deep models.

8.2.1 Ill-Conditioning
• Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the
Hessian matrix H. This is a very general problem in most numerical optimization, convex or otherwise, and is described in
more detail in section 4.3.1.
• The ill-conditioning problem is generally believed to be present in neural network training problems. Ill-conditioning can
manifest by causing SGD to get “stuck” in the sense that even very small steps increase the cost function. Recall from
equation 4.9 that a second-order Taylor series expansion of the cost function predicts that a gradient descent step of −εg
will add

½ε²gᵀHg − εgᵀg

to the cost. Ill-conditioning of the gradient becomes a problem when ½ε²gᵀHg exceeds εgᵀg.
• To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient
norm gᵀg and the gᵀHg term. In many cases, the gradient norm does not shrink significantly throughout learning, but the
gᵀHg term grows by more than an order of magnitude.
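A minimal sketch of this diagnostic on a toy quadratic cost (the Hessian, starting point, and learning rate are all made up for illustration):

```python
import numpy as np

# Toy quadratic cost J(θ) = ½ θᵀHθ with an ill-conditioned Hessian
# (eigenvalues 1 and 100, condition number 100).
H = np.diag([1.0, 100.0])
theta = np.array([1.0, 1.0])

def grad(theta):
    return H @ theta

eps = 0.025                 # learning rate
g = grad(theta)
gTg = g @ g                 # squared gradient norm
gTHg = g @ H @ g            # curvature term along the gradient direction

# Predicted change in cost from a step of -eps * g (second-order Taylor):
predicted_change = 0.5 * eps**2 * gTHg - eps * gTg
print(gTg, gTHg, predicted_change)
# When ½ε²gᵀHg exceeds εgᵀg, the predicted change is positive: the step
# would increase the cost despite following the gradient downhill.
```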
• The result is that learning becomes very slow despite the presence of a strong gradient because the learning rate must be
shrunk to compensate for even stronger curvature. Figure 8.1 shows an example of the gradient increasing significantly
during the successful training of a neural network.
8.2.2 Local Minima
• One of the most prominent features of a convex optimization problem is that it can be reduced to the problem of finding a
local minimum. Any local minimum is guaranteed to be a global minimum.
• Some convex functions have a flat region at the bottom rather than a single global minimum point, but any point within
such a flat region is an acceptable solution. When optimizing a convex function, we know that we have reached a good
solution if we find a critical point of any kind.
• With non-convex functions, such as neural nets, it is possible to have many local minima. Indeed, nearly any deep model is
essentially guaranteed to have an extremely large number of local minima. However, as we will see, this is not necessarily a
major problem.
• Neural networks and any models with multiple equivalently parametrized latent variables all have multiple local minima
because of the model identifiability problem.
• A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model’s parameters.
Models with latent variables are often not identifiable because we can obtain equivalent models by exchanging latent
variables with each other.
• For example, we could take a neural network and modify layer 1 by swapping the incoming weight vector for unit i with the
incoming weight vector for unit j, then doing the same for the outgoing weight vectors. If we have m layers with n units
each, then there are n!^m ways of arranging the hidden units. This kind of non-identifiability is known as weight space
symmetry.
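A small numpy sketch of weight space symmetry (the layer sizes and random values are made up for illustration): swapping two hidden units, i.e. permuting rows of the first layer's weights and the matching columns of the second layer's weights, leaves the network function unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 1-hidden-layer network: input (3) -> hidden (4) -> output (2).
W1 = rng.normal(size=(4, 3))
b1 = rng.normal(size=4)
W2 = rng.normal(size=(2, 4))

def forward(x, W1, b1, W2):
    h = np.tanh(W1 @ x + b1)
    return W2 @ h

# Swap hidden units 0 and 1: permute rows of W1/b1 and columns of W2.
perm = np.array([1, 0, 2, 3])
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=3)
print(np.allclose(forward(x, W1, b1, W2), forward(x, W1p, b1p, W2p)))  # True
```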
8.2.3 Plateaus, Saddle Points and Other Flat Regions
• For many high-dimensional non-convex functions, local minima (and maxima) are in fact rare compared to another kind of
point with zero gradient: a saddle point. Some points around a saddle point have greater cost than the saddle point, while
others have a lower cost.
• At a saddle point, the Hessian matrix has both positive and negative eigenvalues. Points lying along eigenvectors associated
with positive eigenvalues have greater cost than the saddle point, while points lying along eigenvectors associated with
negative eigenvalues have lower cost.
• We can think of a saddle point as being a local minimum along one cross-section of the cost function and a local maximum
along another cross-section. For example, f(x, y) = x² − y² has a saddle point at the origin: a minimum along the x-axis and
a maximum along the y-axis, with Hessian eigenvalues 2 and −2.
• What are the implications of the proliferation of saddle points for training algorithms? For first-order optimization
algorithms that use only gradient information, the situation is unclear. The gradient can often become very small near a
saddle point.
• On the other hand, gradient descent empirically seems to be able to escape saddle points in many cases. Goodfellow et al.
(2015) provided visualizations of several learning trajectories of state-of-the-art neural networks, with an example given in
figure 8.2.
• These visualizations show a flattening of the cost function near a prominent saddle point where the weights are all zero, but
they also show the gradient descent trajectory rapidly escaping this region.
• Goodfellow et al. (2015) also argue that continuous-time gradient descent may be shown analytically to be repelled from,
rather than attracted to, a nearby saddle point, but the situation may be different for more realistic uses of gradient descent.
• Gradient descent is designed to move “downhill” and is not explicitly designed to seek a critical point. Newton’s
method, however, is designed to solve for a point where the gradient is zero. Without appropriate modification, it
can jump to a saddle point.
• The proliferation of saddle points in high dimensional spaces presumably explains why second-order methods
have not succeeded in replacing gradient descent for neural network training. Dauphin et al. (2014) introduced a
saddle-free Newton method for second-order optimization and showed that it improves significantly over the
traditional version. Second-order methods remain difficult to scale to large neural networks, but this saddle-free
approach holds promise if it could be scaled.
8.2.4 Cliffs and Exploding Gradients
• Neural networks with many layers often have extremely steep regions resembling cliffs, as illustrated in figure 8.3. These
result from the multiplication of several large weights together.
• On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually
jumping off of the cliff structure altogether.

• The cliff can be dangerous whether we approach it from above or from below, but fortunately its most serious consequences
can be avoided using the gradient clipping heuristic. The basic idea is to recall that the gradient does not specify the
optimal step size, but only the optimal direction within an infinitesimal region.
Key Terms (Reference)
• Neural Networks and Layers: Neural networks are composed of multiple layers, each containing units (neurons)
that transform input data. Deep neural networks have many such layers.
• Steep Regions ("Cliffs"): In the loss landscape of a neural network, steep regions (or "cliffs") can occur due to the
multiplication of large weights. This can make the landscape highly non-linear, causing rapid changes in the loss
function with small changes in the weights.
• Gradient Update Steps: The gradient of the loss function indicates the direction to adjust the weights to minimize
the loss. However, in regions with extreme steepness, the update step based on the gradient can be very large,
potentially leading to overshooting—meaning the update can move the weights too far, possibly even jumping
out of the steep region entirely.
• Cliff Dangers: Approaching these cliffs can be problematic whether we come at them from above or from below in the loss
landscape. The overshooting can cause instability in training and prevent the model from converging to a
good solution.
• Gradient Clipping: To mitigate the issues caused by these steep gradients, a technique called gradient clipping is
often used. This involves setting a threshold for the size of the gradient. If the gradient exceeds this threshold, it
is scaled down to maintain a more manageable update step. This helps keep the training process stable by
ensuring that updates remain within a reasonable range.
• Optimal Step Size: It's important to note that while the gradient gives the direction to move in, it doesn't dictate
how far to move. Gradient clipping focuses on controlling the size of the step, ensuring it remains within a
reasonable limit, thus improving the stability and performance of the training process.
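A minimal sketch of gradient clipping by norm (the threshold value and gradient are made up for illustration; deep learning frameworks ship their own built-in versions of this operation):

```python
import numpy as np

def clip_gradient_by_norm(grad, threshold):
    # If the gradient norm exceeds the threshold, rescale the gradient so its
    # norm equals the threshold; the direction is preserved, only the step
    # size is limited.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([300.0, -400.0])          # hypothetical exploding gradient (norm 500)
print(clip_gradient_by_norm(g, 5.0))   # [ 3. -4.]  (norm 5, same direction)
```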
8.2.5 Long-Term Dependencies
• Another difficulty that neural network optimization algorithms must overcome arises when the computational
graph becomes extremely deep. Feedforward networks with many layers have such deep computational
graphs.
• For example, suppose that a computational graph contains a path that consists of repeatedly multiplying by a
matrix W. After t steps, this is equivalent to multiplying by Wᵗ. Suppose that W has an eigendecomposition
W = V diag(λ) V⁻¹. In this simple case, it is straightforward to see that

Wᵗ = (V diag(λ) V⁻¹)ᵗ = V diag(λ)ᵗ V⁻¹
• Any eigenvalues λᵢ that are not near an absolute value of 1 will either explode if they are greater than 1 in
magnitude or vanish if they are less than 1 in magnitude. The vanishing and exploding gradient problem
refers to the fact that gradients through such a graph are also scaled according to diag(λ)ᵗ. Vanishing
gradients make it difficult to know which direction the parameters should move to improve the cost function,
while exploding gradients can make learning unstable.
• The repeated multiplication by W at each time step described here is very similar to the power method
algorithm used to find the largest eigenvalue of a matrix W and the corresponding eigenvector. From this
point of view it is not surprising that xᵀWᵗ will eventually discard all components of x that are orthogonal to
the principal eigenvector of W.
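An illustrative numpy sketch (the matrix values are made up) of how repeated multiplication by W scales each eigen-direction by λᵢᵗ, so components with |λ| > 1 explode and components with |λ| < 1 vanish:

```python
import numpy as np

# A diagonal W makes the eigenvalues explicit: 1.1 (explodes) and 0.9 (vanishes).
W = np.diag([1.1, 0.9])
x = np.array([1.0, 1.0])

for t in [1, 10, 50]:
    xt = np.linalg.matrix_power(W, t) @ x
    print(t, xt)
# At t=50: first component ≈ 117.4, second component ≈ 0.005; after many
# steps only the direction of the largest eigenvalue survives, as with the
# power method.
```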
8.2.6 Inexact Gradients
• Most optimization algorithms are designed with the assumption that we have access to the exact gradient or Hessian matrix.
In practice, we usually only have a noisy or even biased estimate of these quantities.
• Nearly every deep learning algorithm relies on sampling-based estimates at least insofar as using a minibatch of training
examples to compute the gradient.
• In other cases, the objective function to minimize is actually intractable. When the objective function is intractable,
typically its gradient is intractable as well. In such cases we can only approximate the gradient. These issues mostly arise
with the more advanced models.

8.2.7 Poor Correspondence between Local and Global Structure


• Many of the problems we have discussed so far correspond to properties of the loss function at a single point—it can be
difficult to make a single step if J(θ) is poorly conditioned at the current point θ, or if θ lies on a cliff, or if θ is a saddle
point hiding the opportunity to make progress downhill from the gradient.
• It is possible to overcome all of these problems at a single point and still perform poorly if the direction that results in the
most improvement locally does not point toward distant regions of much lower cost.
• See figure 8.4 for an example of a failure of local optimization to find a good cost function value even in the absence of any
local minima or saddle points. Future research will need to develop further understanding of the factors that influence the
length of the learning trajectory and better characterize the outcome of the process.
8.2.8 Theoretical Limits of Optimization
• Several theoretical results show that there are limits on the performance of any optimization algorithm that is
designed for neural networks. Typically these results have little bearing on the use of neural networks in practice.
Some theoretical results apply only to the case where the units of a neural network output discrete values.
• However, most neural network units output smoothly increasing values that make optimization via local search
feasible. Some theoretical results show that there exist problem classes that are intractable, but it can be difficult to
tell whether a particular problem falls into that class.
8.3 Basic Algorithms

8.3.1 Stochastic Gradient Descent


• Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning
in general and for deep learning in particular. It is possible to obtain an unbiased estimate of the gradient by taking the
average gradient on a minibatch of m examples drawn i.i.d from the data generating distribution.

• A crucial parameter for the SGD algorithm is the learning rate. Previously, we have described SGD as using a fixed learning
rate ε. In practice, it is necessary to gradually decrease the learning rate over time, so we now denote the learning rate at
iteration k as εₖ.
Algorithm explanation: at each iteration, SGD samples a minibatch of m examples, estimates the gradient as the minibatch average, and applies the update θ ← θ − εₖĝ, as sketched below.
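A minimal numpy sketch of the SGD update loop; the loss (mean squared error on a synthetic linear-regression problem), the dataset, and the learning-rate schedule are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # synthetic inputs
true_theta = np.arange(1.0, 6.0)
y = X @ true_theta + 0.1 * rng.normal(size=1000)  # synthetic targets

def minibatch_grad(theta, X_batch, y_batch):
    # Gradient of mean squared error over the minibatch: an unbiased
    # estimate of the full-dataset gradient when examples are drawn i.i.d.
    err = X_batch @ theta - y_batch
    return 2.0 * X_batch.T @ err / len(y_batch)

theta = np.zeros(5)
for k in range(1, 2001):
    idx = rng.integers(0, len(y), size=32)        # minibatch of m = 32 examples
    g = minibatch_grad(theta, X[idx], y[idx])
    eps_k = 0.1 / (1 + 0.01 * k)                  # decaying learning rate ε_k
    theta -= eps_k * g                            # SGD update: θ ← θ − ε_k ĝ

print(np.round(theta, 2))   # should approach [1, 2, 3, 4, 5]
```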
8.3.2 Momentum
• While stochastic gradient descent remains a very popular optimization strategy, learning with it can sometimes be slow.
The method of momentum is designed to accelerate learning, especially in the face of high curvature, small but consistent
gradients, or noisy gradients.
• The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move
in their direction.
• Formally, the momentum algorithm introduces a variable v that plays the role of velocity—it is the direction and speed at
which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the
negative gradient.
• The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through
parameter space, according to Newton’s laws of motion. Momentum in physics is mass times velocity.
• In the momentum learning algorithm, we assume unit mass, so the velocity vector v may also be regarded as the momentum
of the particle. A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients
exponentially decay. The update rule is given by:
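The update rule, reconstructed here in the standard form (ε is the learning rate, α the momentum coefficient, and the gradient is averaged over a minibatch of m examples):

$$\mathbf{v} \leftarrow \alpha \mathbf{v} - \epsilon \nabla_{\boldsymbol\theta}\!\left( \frac{1}{m} \sum_{i=1}^{m} L\!\left(f(\mathbf{x}^{(i)}; \boldsymbol\theta), y^{(i)}\right) \right), \qquad \boldsymbol\theta \leftarrow \boldsymbol\theta + \mathbf{v}$$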
• Previously, the size of the step was simply the norm of the gradient multiplied by the learning rate. Now, the size of
the step depends on how large and how aligned a sequence of gradients are. The step size is largest when many
successive gradients point in exactly the same direction. If the momentum algorithm always observes gradient g,
then it will accelerate in the direction of −g, until reaching a terminal velocity where the size of each step is ε‖g‖/(1 − α).
8.3.3 Nesterov Momentum
• Sutskever et al. (2013) introduced a variant of the momentum algorithm that was inspired by Nesterov’s accelerated
gradient method (Nesterov, 1983, 2004). The update rules in this case are given by:
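Reconstructed in the standard form, the Nesterov variant evaluates the gradient after the current velocity has been applied (an interim point θ + αv), which is the only difference from standard momentum:

$$\mathbf{v} \leftarrow \alpha \mathbf{v} - \epsilon \nabla_{\boldsymbol\theta}\!\left( \frac{1}{m} \sum_{i=1}^{m} L\!\left(f(\mathbf{x}^{(i)}; \boldsymbol\theta + \alpha\mathbf{v}), y^{(i)}\right) \right), \qquad \boldsymbol\theta \leftarrow \boldsymbol\theta + \mathbf{v}$$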
8.4 Parameter Initialization Strategies
• Some optimization algorithms are non-iterative, while others are iterative and converge well regardless of initialization.
However, deep learning training algorithms are typically iterative and highly sensitive to initialization.
• The initial point for training can significantly impact whether the algorithm converges, how quickly it converges, and the
quality of the solution (in terms of cost and generalization error).
• Designing improved initialization strategies is challenging due to limited understanding of neural network optimization. An
initialization strategy beneficial for optimization might be detrimental for generalization.
• Initial parameters need to "break symmetry" between different units; otherwise, units with the same parameters would
perform identically, reducing the model's capacity to learn diverse features.
• Random initialization from a high-entropy distribution is preferred because it is computationally cheaper and ensures that
each unit computes a different function. While more structured methods like Gram-Schmidt orthogonalization exist, they
are often costly.
• Biases and extra parameters (like those encoding conditional variance) are usually initialized to heuristically chosen
constants, while weights are initialized randomly (typically from a Gaussian or uniform distribution).
• The scale of the initial weight distribution has a significant impact. Larger initial weights help break symmetry and
propagate signals better but can lead to issues like exploding gradients, chaos in recurrent networks, or saturation of
activation functions.
• Gradient descent with early stopping is somewhat analogous to weight decay, favoring solutions closer to the initial
parameters. This suggests that choosing an initial parameter set close to zero can serve as a form of regularization.
• Some heuristics are available for choosing the initial scale of the weights. One heuristic is to initialize the weights of a fully
connected layer with m inputs and n outputs by sampling each weight from U(−1/√m, 1/√m), while Glorot and Bengio
(2010) suggest using the normalized initialization:
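The normalized (Glorot) initialization referred to above is commonly written, for a layer with m inputs and n outputs, as:

$$W_{i,j} \sim U\!\left(-\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}}\right)$$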

• One drawback to scaling rules that set all of the initial weights to have the same standard deviation, such as 1/√m, is that
every individual weight becomes extremely small when the layers become large. Martens (2010) introduced an alternative
initialization scheme called sparse initialization in which each unit is initialized to have exactly k non-zero weights.
8.5 Algorithms with Adaptive Learning Rates
• Neural network researchers have long realized that the learning rate is reliably one of the hyperparameters that is most
difficult to set, because it has a significant impact on model performance.
• The momentum algorithm can mitigate these issues somewhat, but does so at the expense of introducing another
hyperparameter. In the face of this, it is natural to ask if there is another way.
• If we believe that the directions of sensitivity are somewhat axis-aligned, it can make sense to use a separate learning rate
for each parameter, and automatically adapt these learning rates throughout the course of learning.
• The delta-bar-delta algorithm (Jacobs, 1988) is an early heuristic approach to adapting individual learning rates for model
parameters during training. The approach is based on a simple idea: if the partial derivative of the loss, with respect to a
given model parameter, keeps the same sign, then the learning rate should increase; if it changes sign, the learning rate should decrease.
8.5.1 AdaGrad
• The AdaGrad algorithm, shown in algorithm 8.4, individually adapts the learning rates of all model parameters by scaling
them inversely proportional to the square root of the sum of all of their historical squared values (Duchi et al., 2011). The
parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while
parameters with small partial derivatives have a relatively small decrease in their learning rate.
• The net effect is greater progress in the more gently sloped directions of parameter space. In the context of convex
optimization, the AdaGrad algorithm enjoys some desirable theoretical properties. However, empirically it has been found
that—for training deep neural network models—the accumulation of squared gradients from the beginning of training can
result in a premature and excessive decrease in the effective learning rate. AdaGrad performs well for some but not all deep
learning models.
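A minimal numpy sketch of the AdaGrad per-parameter update (the toy objective, its gradient, and the constants are illustrative; the small constant δ prevents division by zero):

```python
import numpy as np

def adagrad_step(theta, grad, r, eps=0.01, delta=1e-7):
    # Accumulate the sum of squared gradients for each parameter, then scale
    # the learning rate inversely to the square root of that accumulated sum.
    r += grad * grad
    theta -= eps * grad / (delta + np.sqrt(r))
    return theta, r

theta = np.array([1.0, 1.0])
r = np.zeros_like(theta)                    # running sum of squared gradients
for _ in range(100):
    grad = np.array([2.0, 0.02]) * theta    # toy gradient: one steep, one flat direction
    theta, r = adagrad_step(theta, grad, r)
print(theta)  # the steep coordinate's effective learning rate shrinks fastest
```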
8.5.2 RMSProp
• The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting by changing the
gradient accumulation into an exponentially weighted moving average. AdaGrad is designed to converge rapidly when
applied to a convex function.
• When applied to a non-convex function to train a neural network, the learning trajectory may pass through many different
structures and eventually arrive at a region that is a locally convex bowl.
• RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly
after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.
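A corresponding sketch of the RMSProp step (the decay rate ρ and other constants shown are illustrative defaults); the only change from AdaGrad is that the squared gradients are accumulated as an exponentially weighted moving average:

```python
import numpy as np

def rmsprop_step(theta, grad, r, eps=0.001, rho=0.9, delta=1e-6):
    # Exponentially decaying average of squared gradients, so history from
    # the extreme past is discarded rather than accumulated forever.
    r = rho * r + (1.0 - rho) * grad * grad
    theta -= eps * grad / np.sqrt(delta + r)
    return theta, r
```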
8.5.3 Adam
• Adam (Kingma and Ba, 2014) is yet another adaptive learning rate optimization algorithm and is
presented in algorithm 8.7. The name “Adam” derives from the phrase “adaptive moments.” In the
context of the earlier algorithms, it is perhaps best seen as a variant on the combination of RMSProp and
momentum with a few important distinctions.
• First, in Adam, momentum is incorporated directly as an estimate of the first order moment (with
exponential weighting) of the gradient. The most straightforward way to add momentum to RMSProp is
to apply momentum to the rescaled gradients.
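A hedged sketch of the Adam step described in the cited paper (Kingma and Ba, 2014): an exponentially weighted first-moment (momentum-like) estimate combined with an RMSProp-style second-moment estimate, both bias-corrected because they are initialized at zero; the constants shown are the commonly cited defaults.

```python
import numpy as np

def adam_step(theta, grad, s, r, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    # First moment (momentum-like) and second moment (RMSProp-like) estimates.
    s = rho1 * s + (1.0 - rho1) * grad
    r = rho2 * r + (1.0 - rho2) * grad * grad
    # Bias correction compensates for both moments starting at zero.
    s_hat = s / (1.0 - rho1 ** t)
    r_hat = r / (1.0 - rho2 ** t)
    theta -= eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

# Usage: t starts at 1 and increases each step; s and r start as zero arrays.
```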
8.5.4 Choosing the Right Optimization Algorithm
• In this section, we discussed a series of related algorithms that each seek to address the challenge of optimizing deep
models by adapting the learning rate for each model parameter.
• At this point, a natural question is: which algorithm should one choose? Unfortunately, there is currently no consensus on
this point. Schaul et al. (2014) presented a valuable comparison of a large number of optimization algorithms across a wide
range of learning tasks.
• While the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and
AdaDelta) performed fairly robustly, no single best algorithm has emerged.
• Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp,
RMSProp with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend
largely on the user’s familiarity with the algorithm (for ease of hyperparameter tuning).
