CS601 Machine Learning Unit 2 Notes 1672759753

The document discusses key concepts in machine learning including linearity vs non-linearity, activation functions like sigmoid and ReLU, weights and bias, loss functions, gradient descent, multilayer networks, backpropagation, and regularization techniques. It provides details on these topics with examples and figures to illustrate concepts like different activation functions and types of gradient descent.

Uploaded by

Xmax

Chameli Devi Group of Institutions, Indore

Department of Computer Science and Engineering

Subject Notes
CS 601- Machine Learning
UNIT-II

Syllabus: Linearity vs non linearity, activation functions like sigmoid, ReLU, etc., weights and
bias, loss function, gradient descent, multilayer network, back propagation, weight
initialization, training, testing, unstable gradient problem, auto encoders, batch normalization,
dropout, L1 and L2 regularization, momentum, tuning hyper parameters

Linearity vs non linearity

A linear model uses a linear function for its prediction function or as a crucial part of its
prediction function.
A linear function takes a fixed number of numerical inputs x1, x2, …, xn and combines them using weights w0, …, wn,
which are the parameters of the model:
$$w_0 + \sum_{i=1}^{n} w_i x_i$$
If the prediction function is a linear function, we can perform regression, i.e. predicting a
numerical label. We can also take a linear function and return the sign of the result (whether
the result is positive or not) and perform binary classification that way: all examples with a
positive output receive label A, all others receive label B. There are various other (more
complex) options for a response function on top of the linear function. The logistic function is
very commonly used; it leads to logistic regression, which predicts a number between 0 and 1
and is typically used to learn the probability of a binary outcome in a noisy setting.
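The three uses of a linear function described above (regression, sign-based classification, and a logistic response) can be sketched in a few lines of Python. The weights and inputs below are made-up values chosen only for illustration, and the function names are assumptions, not part of the notes.

```python
import math

def linear(x, w):
    """Return w[0] + sum of w[i] * x[i-1]; w has one more entry than x."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def classify(x, w):
    """Binary classification by sign: label 'A' if positive, otherwise 'B'."""
    return "A" if linear(x, w) > 0 else "B"

def logistic(z):
    """Logistic response function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

w = [0.5, 2.0, -1.0]    # hypothetical weights w0, w1, w2
x = [1.0, 3.0]          # hypothetical inputs x1, x2

score = linear(x, w)    # regression output: 0.5 + 2.0*1.0 - 1.0*3.0
label = classify(x, w)  # sign-based class label
prob = logistic(score)  # logistic-regression style probability
print(score, label, prob)
```

The same linear score feeds all three outputs; only the response function applied on top of it changes.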
A non-linear model is a model which is not a linear model. Typically these are more powerful
(they can represent a larger class of functions) but much harder to train.
Nonlinear regression is a statistical technique that helps describe nonlinear relationships in
experimental data. Nonlinear regression models are generally assumed to be parametric,
where the model is described as a nonlinear equation. Typically machine learning methods are
used for non-parametric nonlinear regression.
Parametric nonlinear regression models the dependent variable (also called the response) as a
function of a combination of nonlinear parameters and one or more independent variables
(called predictors). The model can be univariate (single response variable) or multivariate
(multiple response variables).
The parameters can take the form of an exponential, trigonometric, power, or any other
nonlinear function. To determine the nonlinear parameter estimates, an iterative algorithm is
typically used.
y=f(X,β)+ϵ
where, β represents nonlinear parameter estimates to be computed and ϵ represents the error
terms.

Activation functions like Sigmoid, ReLU

A neural network is comprised of layers of nodes and learns to map examples of inputs to
outputs.
For a given node, the inputs are multiplied by the weights in a node and summed together. This
value is referred to as the summed activation of the node. The summed activation is then
transformed via an activation function and defines the specific output or “activation” of the
node. It is also known as Transfer Function.
The activation function decides whether a neuron should be activated, based on the weighted
sum of its inputs plus the bias. Its purpose is to introduce non-linearity into
the output of a neuron.

Activation functions can be divided into 2 types:

• Linear Activation Function
• Non-linear Activation Functions

Sigmoid
Sigmoid takes a real value as input and outputs another value between 0 and 1. It’s easy to
work with and has all the nice properties of activation functions: it’s non-linear, continuously
differentiable, monotonic, and has a fixed output range.

Figure 2.1: Sigmoid Activation Function

ReLU
ReLU stands for Rectified Linear Unit. The formula is deceptively
simple: max(0,z). Despite its name and appearance, it is not linear. It provides the same
non-linearity benefits as Sigmoid, is cheaper to compute, and usually trains faster in deep networks.

Figure 2.2: ReLU Activation Function
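As a small illustration of the two activation functions above, the sketch below implements Sigmoid and ReLU together with their derivatives. The derivative helpers are an addition for illustration (they are what gradient-based training uses) and the sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    """Sigmoid: maps any real z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid, sigmoid(z) * (1 - sigmoid(z)); at most 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    """ReLU: max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive z, 0 otherwise (0 chosen at z = 0)."""
    return 1.0 if z > 0 else 0.0

values = [-2.0, 0.0, 2.0]   # arbitrary sample inputs
sig_out = [sigmoid(z) for z in values]
relu_out = [relu(z) for z in values]
print(sig_out, relu_out)
```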

Weights
A weight represents the strength of the connection between units. If the weight from node 1 to
node 2 has greater magnitude, it means that neuron 1 has greater influence over neuron 2. A
weight scales the importance of the input value. Weights near zero mean that changing this
input will barely change the output. Negative weights mean that increasing this input will decrease the
output. In short, a weight decides how much influence the input will have on the output.
A neuron’s input equals the sum of weighted outputs from all neurons in the previous layer.
Each input is multiplied by the weight associated with the synapse connecting the input to the
current neuron. If there are 3 inputs or neurons in the previous layer, each neuron in the
current layer will have 3 distinct weights — one for each synapse.

Bias
The bias can be viewed as an extra input to a neuron that is always 1 and has its own connection weight. This makes
sure that even when all the inputs are zero there can still be an activation in the neuron.
Bias terms are additional constants attached to neurons and added to the weighted input
before the activation function is applied. Bias terms help models represent patterns that do not
necessarily pass through the origin. For example, if all your features were 0, would your output
also be zero? Is it possible there is some base value upon which your features have an effect?
Bias terms typically accompany weights and must also be learned by your model.
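A minimal sketch of a single neuron, assuming a sigmoid activation and made-up weights, shows the effects described above: a zero weight makes its input irrelevant, a negative weight lowers the output as the input grows, and the bias gives a non-trivial output even when every input is 0.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

weights = [0.8, 0.0, -0.5]   # hypothetical: one weight per input (3 inputs)
bias = 1.0                   # hypothetical bias term

all_zero = neuron([0.0, 0.0, 0.0], weights, bias)   # only the bias contributes
ignored = neuron([0.0, 5.0, 0.0], weights, bias)    # input 2 has weight 0
lowered = neuron([0.0, 0.0, 1.0], weights, bias)    # input 3 has a negative weight
raised = neuron([1.0, 0.0, 0.0], weights, bias)     # input 1 has a positive weight
print(all_zero, ignored, lowered, raised)
```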

Loss Functions
A loss function, or cost function, is a wrapper around our model's predict function that tells us
“how good” the model is at making predictions for a given set of parameters. The loss function
has its own curve and its own derivatives. The slope of this curve tells us how to change our
parameters to make the model more accurate. We use the model to make predictions. We use
the cost function to update our parameters. Our cost function can take a variety of forms, as
there are many different cost functions available. Popular loss functions include MSE (L2) and
cross-entropy loss.
Strictly speaking, the loss function computes the error for a single training example, while the cost function is the
average of the losses over the entire training set.
Common loss identifiers (as used, for example, in Keras) are:
• ‘mse’: for mean squared error.
• ‘binary_crossentropy’: for binary logarithmic loss (logloss).
• ‘categorical_crossentropy’: for multi-class logarithmic loss (logloss).
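To make the losses listed above concrete, here is a plain-Python sketch of mean squared error and binary cross-entropy, averaged over a small batch. It is a hand-written illustration, not the implementation of any particular library; the `eps` clipping value is an assumption added to avoid log(0).

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_crossentropy(y_true, y_pred, eps=1e-12):
    """Binary log loss, averaged over examples; y_pred are probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

print(mse([1.0, 2.0], [1.0, 4.0]))
print(binary_crossentropy([1, 0], [0.9, 0.1]))   # confident and correct: low loss
print(binary_crossentropy([1, 0], [0.1, 0.9]))   # confident and wrong: high loss
```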

Gradient Descent
Optimization is a big part of machine learning. Almost every machine learning algorithm has an
optimization algorithm at its core.
Gradient descent is an optimization algorithm used to find the values of parameters
(coefficients) of a function (f) that minimizes a cost function (cost).
Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using
linear algebra) and must be searched for by an optimization algorithm.
Gradient Descent Procedure
The procedure starts off with initial values for the coefficient or coefficients for the function.
These could be 0.0 or a small random value.
coefficient = 0.0
The cost of the coefficients is evaluated by plugging them into the function and calculating the
cost.
cost = f(coefficient)
or
cost = evaluate(f(coefficient))
The derivative of the cost is calculated. The derivative is a concept from calculus and refers to
the slope of the function at a given point. We need to know the slope so that we know the
direction (sign) to move the coefficient values in order to get a lower cost on the next iteration.
delta = derivative(cost)
Now that we know from the derivative which direction is downhill, we can now update the
coefficient values. A learning rate parameter (alpha) must be specified that controls how much
the coefficients can change on each update.
coefficient = coefficient - (alpha * delta)
This process is repeated until the cost of the coefficients (cost) is 0.0 or close enough to zero to
be good enough.
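The procedure above can be put together into a short runnable sketch. The cost function here, (c - 3)^2, is a hypothetical example chosen because its minimum (c = 3) is known; the learning rate and stopping threshold are also illustrative choices.

```python
def cost(coefficient):
    """Hypothetical cost function with its minimum at coefficient = 3."""
    return (coefficient - 3.0) ** 2

def derivative(coefficient):
    """Slope of the cost at the given coefficient."""
    return 2.0 * (coefficient - 3.0)

coefficient = 0.0   # initial value
alpha = 0.1         # learning rate

for step in range(200):
    delta = derivative(coefficient)                 # which way is downhill
    coefficient = coefficient - (alpha * delta)     # update rule
    if cost(coefficient) < 1e-12:                   # close enough to zero
        break

print(step, coefficient, cost(coefficient))
```

A larger alpha takes bigger steps and can overshoot the minimum; a smaller one is safer but needs more iterations.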

Types of gradient Descent:

1. Batch Gradient Descent: This type of gradient descent processes all the
training examples in each iteration. If the number of training
examples is large, each iteration is computationally very expensive, so batch gradient descent
is not preferred for large training sets.
Instead, we prefer to use stochastic gradient descent or mini-batch gradient descent.
2. Stochastic Gradient Descent: This type of gradient descent processes 1
training example per iteration, so the parameters are updated after every
single example. Each update is therefore much cheaper
than in batch gradient descent. However, when the number of training examples is large,
processing only one example at a time leads to a very large number of iterations (and noisy
updates), which can be additional overhead for the system.
3. Mini Batch gradient descent: This type of gradient descent usually works faster than
both batch gradient descent and stochastic gradient descent. Here b examples,
where b<m and m is the number of training examples, are processed per iteration. So even if the
number of training examples is large, it is processed in batches of b training examples in one go. Thus, it scales to
large training sets with fewer iterations than stochastic gradient descent.
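The three variants differ only in how many examples feed each update. The sketch below, which assumes a toy one-parameter model y = w * x and made-up data, uses a single `batch_size` argument to switch between them and counts the number of parameter updates each variant performs.

```python
import random

def train(xs, ys, batch_size, epochs=50, alpha=0.01, seed=0):
    """Fit y = w * x by gradient descent on mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w*x - y)^2 over the current batch
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= alpha * grad
            updates += 1
    return w, updates

xs = [float(i) for i in range(1, 9)]   # toy inputs 1..8
ys = [2.0 * x for x in xs]             # toy targets, true w = 2
m = len(xs)

w_batch, n_batch = train(xs, ys, batch_size=m)   # batch: all m examples per update
w_sgd, n_sgd = train(xs, ys, batch_size=1)       # stochastic: 1 example per update
w_mini, n_mini = train(xs, ys, batch_size=4)     # mini-batch: b = 4 < m
print(n_batch, n_sgd, n_mini)
```

For the same number of passes over the data, batch gradient descent makes the fewest (but most expensive) updates and stochastic gradient descent the most.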

Multilayer network
A fully connected multi-layer neural network is called a Multilayer Perceptron (MLP).

Figure 2.3: Multilayer Perceptron Network

It has 3 layers including one hidden layer. If it has more than 1 hidden layer, it is called a deep
ANN.
An MLP is a typical example of a feed-forward artificial neural network.
In this figure, the i-th activation unit in the l-th layer is denoted as $a_i^{(l)}$.
The number of layers and the number of neurons are referred to as hyper parameters of a
neural network, and these need tuning. Cross-validation techniques must be used to find ideal
values for these.
The weight adjustment training is done via back propagation.

Back propagation
Back propagation is a supervised learning technique for neural networks that calculates the
gradient of the error function with respect to the network's weights, for use in gradient descent.
It’s short for the backward propagation of
errors, since the error is computed at the output and distributed backwards throughout the
network’s layers.
When the network's output differs from the target, the algorithm calculates the gradient of
the error function with respect to the network’s various weights. The gradient for the final layer of
weights is calculated first, with the first layer’s gradient of weights calculated last. Partial
calculations of the gradient from one layer are reused to determine the gradient for the
previous layer. The point of this backwards flow of the error is to calculate the gradient at each
layer more efficiently than the naive approach of calculating each layer’s
gradient separately.

Weight Initialization
• The weights of a network to be trained by backprop must be initialized to some non-zero
values.
• The usual thing to do is to initialize the weights to small random values.
• The reason for this is that sometimes backprop training runs become "lost" on a plateau
in weight-space, or for some other reason backprop cannot find a good minimum error
value.
• Using small random values means different starting points for each training run, so that
subsequent training runs have a good chance of finding a suitable minimum.
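A minimal sketch of small random initialization follows. The function name, the uniform distribution and the `scale` of 0.1 are illustrative assumptions; the notes only require small, non-zero random values.

```python
import random

def init_weights(n_in, n_out, scale=0.1, seed=None):
    """Return an n_out x n_in matrix of small random weights in [-scale, scale]."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)]
            for _ in range(n_out)]

# Two training runs with different seeds start from different points
run_a = init_weights(n_in=3, n_out=2, seed=1)
run_b = init_weights(n_in=3, n_out=2, seed=2)
print(run_a)
print(run_b)
```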

Training and Testing

• The basic algorithm can be summed up in the following equation (the delta rule) for the
change to the weight $w_{ji}$ from node i to node j:

$$\Delta w_{ji} = \eta \times \delta_j \times y_i$$

Here $\Delta w_{ji}$ is the weight change, $\eta$ is the learning rate, $\delta_j$ is the local gradient, and $y_i$ is the input signal to node j.

• The local gradient $\delta_j$ is defined as follows:

1. If node j is an output node, then $\delta_j$ is the product of $\varphi'(v_j)$ and the error signal $e_j$, where
$\varphi(\cdot)$ is the logistic function, $v_j$ is the total input to node j (i.e. $\sum_i w_{ji} y_i$), and $e_j$ is the
error signal for node j (i.e. the difference between the desired output and the actual
output).
2. If node j is a hidden node, then $\delta_j$ is the product of $\varphi'(v_j)$ and the weighted sum of the
$\delta$'s computed for the nodes in the next hidden or output layer that are connected to
node j.
3. The formula is $\delta_j = \varphi'(v_j) \sum_k \delta_k w_{kj}$, where k ranges over those nodes for which $w_{kj}$ is
non-zero (i.e. nodes k that actually have connections from node j). The $\delta_k$ values have
already been computed, as they are in the output layer (or a layer closer to the output
layer than node j).
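The delta rule and the two cases for the local gradient can be traced in a tiny network. The sketch below assumes 2 inputs, 2 hidden nodes and 1 output node, no bias terms, a logistic activation, and made-up starting weights and training example; it is an illustration of the rule above rather than a complete training routine.

```python
import math

def phi(v):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-v))

def phi_prime(v):
    """Derivative of the logistic function."""
    s = phi(v)
    return s * (1.0 - s)

def train_step(x, target, w_hidden, w_out, eta=0.5):
    # Forward pass: v is the total input to a node, y its output
    v_h = [sum(w * xi for w, xi in zip(row, x)) for row in w_hidden]
    y_h = [phi(v) for v in v_h]
    v_o = sum(w * y for w, y in zip(w_out, y_h))
    y_o = phi(v_o)

    e = target - y_o                          # error signal at the output
    delta_o = phi_prime(v_o) * e              # case 1: output node
    delta_h = [phi_prime(v) * delta_o * w     # case 2: hidden nodes, using the
               for v, w in zip(v_h, w_out)]   # delta already computed downstream

    # Delta rule: weight change = eta * delta_j * y_i
    new_w_out = [w + eta * delta_o * y for w, y in zip(w_out, y_h)]
    new_w_hidden = [[w + eta * d * xi for w, xi in zip(row, x)]
                    for row, d in zip(w_hidden, delta_h)]
    return new_w_hidden, new_w_out, e

w_hidden = [[0.1, -0.2], [0.3, 0.4]]   # hypothetical initial weights
w_out = [0.2, -0.1]
x, target = [1.0, 0.5], 0.9            # hypothetical training example

errors = []
for _ in range(200):
    w_hidden, w_out, e = train_step(x, target, w_hidden, w_out)
    errors.append(e)
print(errors[0], errors[-1])
```

Note that the hidden-layer deltas are computed from the output weights before those weights are updated, so both layers use gradients from the same forward pass.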

Figure 2.4: Error Back Propagation Network

Unstable gradient problem

The unstable gradient problem is a fundamental problem in deep neural networks: the
gradient tends to either explode or vanish in early layers.
The unstable gradient problem is not necessarily the vanishing gradient problem or the
exploding gradient problem specifically; rather, it arises because the gradient in early layers is the
product of terms from all later layers. More layers make the network intrinsically
more unstable. Balancing all these products of terms is the only way each layer in a neural network
can learn at the same speed and avoid vanishing or exploding gradients. A balanced product of
terms occurring by chance becomes more and more unlikely with more layers. Neural networks
therefore have layers that learn at different speeds, unless they are given some mechanism
for balancing learning speeds.
When the magnitudes of gradients accumulate in this way, training becomes unstable, which is
a cause of poor prediction results.
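A back-of-the-envelope sketch makes the product-of-terms argument concrete. It assumes a deliberately simplified chain with one sigmoid neuron per layer, the same weight in every layer, and a pre-activation of 0 everywhere (where the sigmoid derivative takes its maximum value, 0.25); real networks are messier, but the multiplicative effect is the same.

```python
import math

def sigmoid_prime(v):
    """Derivative of the sigmoid; its maximum is 0.25 at v = 0."""
    s = 1.0 / (1.0 + math.exp(-v))
    return s * (1.0 - s)

def gradient_factor(weight, pre_activation, n_layers):
    """Product of the per-layer terms weight * sigmoid'(v) across n_layers."""
    factor = 1.0
    for _ in range(n_layers):
        factor *= weight * sigmoid_prime(pre_activation)
    return factor

# Each layer contributes 1.0 * 0.25, so the product shrinks geometrically
vanishing = gradient_factor(weight=1.0, pre_activation=0.0, n_layers=10)
# Each layer contributes 8.0 * 0.25 = 2.0, so the product grows geometrically
exploding = gradient_factor(weight=8.0, pre_activation=0.0, n_layers=10)
print(vanishing, exploding)
```

Only when each per-layer term is close to 1 does the gradient keep a similar magnitude across layers, which is the balance the text says is unlikely to occur by chance.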

Auto encoders
An auto encoder neural network is an Unsupervised Machine learning algorithm that applies
back propagation, setting the target values to be equal to the inputs. Auto encoders are used to
reduce the size of our inputs into a smaller representation. If anyone needs the original data,
they can reconstruct it from the compressed data.
Auto encoders are preferred over PCA because:
• An auto encoder can learn non-linear transformations with a non-linear activation
function and multiple layers.
• It doesn’t have to use dense layers. It can use convolutional layers, which are
better suited to video, image and series data.
• It is more efficient to learn several layers with an auto encoder rather than learn one
huge transformation with PCA.
• An auto encoder provides a representation of each layer as the output.
• It can make use of pre-trained layers from another model to apply transfer learning to
enhance the encoder/decoder.
Applications of Auto encoders
1. Image Coloring
Auto encoders are used for converting any black and white picture into a colored image.
Depending on what is in the picture, it is possible to tell what the color should be.
2. Feature variation
It extracts only the required features of an image and generates the output by removing any
noise or unnecessary interruption.
3. Dimensionality Reduction
The reconstructed image is the same as our input but with reduced dimensions. It helps in
providing the similar image with a reduced pixel value.
4. Denoising Image
The input seen by the auto encoder is not the raw input but a stochastically corrupted version.
A denoising auto encoder is thus trained to reconstruct the original input from the noisy
version.
5. Watermark Removal
It is also used for removing watermarks from images or to remove any object while filming a
video or a movie.
Architecture of Auto encoders: An Auto encoder consists of three parts:
1. Encoder
2. Code
3. Decoder

Figure 2.5: Architecture of Auto encoder

• Encoder: This part of the network compresses the input into a latent space
representation. The encoder layer encodes the input image as a compressed
representation in a reduced dimension. The compressed image is the distorted version
of the original image.
• Code: This part of the network represents the compressed input which is fed to the
decoder.
• Decoder: This layer decodes the encoded image back to the original dimension. The
decoded image is a lossy reconstruction of the original image and it is reconstructed
from the latent space representation.
Types of Auto encoders
1. Convolution Auto encoders
Auto encoders in their traditional formulation do not take into account the fact that a signal can
be seen as a sum of other signals. Convolutional Auto encoders use the convolution operator to
exploit this observation. They learn to encode the input in a set of simple signals and then try to
reconstruct the input from them, or to modify the geometry or the reflectance of the image.
2. Sparse Auto encoders
Sparse auto encoders offer us an alternative method for introducing an information
bottleneck without requiring a reduction in the number of nodes at our hidden layers.
3. Deep Auto encoders
The extension of the simple Auto encoder is the Deep Auto encoder. The first layer of the Deep
Auto encoder is used for first-order features in the raw input. The second layer is used for
second-order features corresponding to patterns in the appearance of first-order features.
Deeper layers of the Deep Auto encoder tend to learn even higher-order features.
A deep auto encoder is composed of two symmetrical deep-belief networks:
1. The first four or five shallow layers, representing the encoding half of the net.
2. The second set of four or five layers, which make up the decoding half.

Batch normalization
Batch normalization is an important component that we add to a model. It acts as a
regularizer, normalizes the inputs to each layer, helps the back propagation process, and can be
added to most models to help them converge better.
How Does Batch Normalization work?
Batch normalization is a feature that we add between the layers of the neural network and it
continuously takes the output from the previous layer and normalizes it before sending it to the
next layer. This has the effect of stabilizing the neural network. Batch normalization is also used
to maintain the distribution of the data.
Figure 2.6: Working of Batch Normalization

A problem we have in neural networks is internal covariate shift. While we are training
our neural network, the distribution of the data reaching each layer changes, and the model trains
more slowly. This problem is known as internal covariate shift. To maintain a similar distribution
of data, we use batch normalization, normalizing the outputs to mean 0 and standard deviation 1
(μ=0, σ=1). With this technique, the model trains faster, and its accuracy often improves
compared to a model that does not use batch normalization.
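The normalization step can be sketched for a single feature across one mini-batch. The learnable scale `gamma` and shift `beta`, and the small constant `eps` for numerical stability, are standard parts of batch normalization that the notes do not mention; they are included here as labeled assumptions with defaults that leave the result at mean 0 and standard deviation 1.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch to mean 0, std 1,
    then apply the (assumed) learnable scale gamma and shift beta."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [2.0, 4.0, 6.0, 8.0]   # hypothetical outputs of the previous layer
normed = batch_norm(batch)

out_mean = sum(normed) / len(normed)
out_var = sum((v - out_mean) ** 2 for v in normed) / len(normed)
print(normed, out_mean, out_var)
```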

Dropout
Dropout is implemented per-layer in a neural network.
It can be used with most types of layers, such as dense fully connected layers, convolutional
layers, and recurrent layers such as the long short-term memory network layer.
Dropout may be implemented on any or all hidden layers in the network as well as the visible or
input layer. It is not used on the output layer.
The term “dropout” refers to dropping out units (hidden and visible) in a neural network.
Simply put, dropout refers to ignoring a randomly chosen set of units (i.e. neurons) during the
training phase. By “ignoring”, we mean that these units are not considered during a
particular forward or backward pass.
More technically, at each training stage, individual nodes are either dropped out of the net with
probability 1-p or kept with probability p, so that a reduced network is left; incoming and
outgoing edges to a dropped-out node are also removed.
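The keep/drop rule just described can be sketched for one layer's activations. Dividing the kept activations by p (so the expected activation is unchanged) is the common "inverted dropout" variant and is an assumption added here, as are the function name and the example values.

```python
import random

def dropout(activations, p_keep, rng, training=True):
    """Keep each unit with probability p_keep, drop it (set to 0) otherwise.
    Kept units are scaled by 1/p_keep (inverted dropout, an assumed variant)."""
    if not training:
        return list(activations)   # no units are dropped at test time
    return [a / p_keep if rng.random() < p_keep else 0.0 for a in activations]

rng = random.Random(0)
layer = [1.0] * 1000               # hypothetical layer activations
out = dropout(layer, p_keep=0.8, rng=rng)

kept = sum(1 for o in out if o != 0.0)
print(kept, "of", len(layer), "units kept")
```

A fresh random mask is drawn on every call, so each training step effectively trains a different reduced network.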
Neural networks are the building blocks of deep learning architectures. They consist of
one input layer, one or more hidden layers, and an output layer.
When we train our neural network (or model) by updating each of its weights, it might
become too dependent on the dataset we are using. Then, when this model has to make a
prediction or classification on new data, it will not give satisfactory results. This is known as
over-fitting. We might understand this problem through a real-world example: if a student of
mathematics learns only one chapter of a book and then takes a test on the whole syllabus, he will
probably fail.
To overcome this problem, we use a technique that was introduced by Geoffrey Hinton in 2012.
This technique is known as dropout.

L1 regularization (Lasso Regression)
Lasso regression performs L1 regularization. Unlike Ridge Regression, which adds a squared
penalty, it modifies the RSS (residual sum of squares) by adding a penalty (shrinkage quantity)
equal to the sum of the absolute values of the coefficients.
Looking at the equation below, we can observe that similar to Ridge Regression, Lasso (Least
Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression
coefficients. In addition to this, it is quite capable of reducing the variability and improving the
accuracy of linear regression models.
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
Limitation of Lasso Regression:
• If the number of predictors (p) is greater than the number of observations (n), Lasso will
pick at most n predictors as non-zero, even if all predictors are relevant (or may be used
in the test set). Lasso therefore tends to struggle with this type of data.
• If there are two or more highly collinear variables, Lasso regression selects one of
them more or less arbitrarily, which is not good for the interpretation of the data.
Lasso regression differs from ridge regression in that it uses absolute values within the
penalty function, rather than squares. This penalizes (or, equivalently,
constrains) the sum of the absolute values of the estimates, which causes some of the
parameter estimates to turn out exactly zero. The larger the penalty applied, the more the
estimates are shrunk towards zero. This provides variable selection from the given set of
n variables.
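The "exactly zero" behaviour can be seen in the simplest possible case. The sketch below assumes a single predictor with no intercept, for which minimizing sum((y - x*b)^2) + λ|b| has a closed-form "soft-thresholding" solution; the data are made up and the function name is an illustrative assumption.

```python
def lasso_1d(xs, ys, lam):
    """Minimize sum((y - x*b)**2) + lam * abs(b) for a single coefficient b.
    Closed form: soft-threshold sum(x*y) by lam/2, then divide by sum(x*x)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)   # never crosses zero
    sign = 1.0 if sxy > 0 else -1.0
    return sign * shrunk / sxx

xs = [1.0, 2.0, 3.0]   # toy predictor
ys = [2.0, 4.0, 6.0]   # toy response, least-squares slope is 2

for lam in [0.0, 14.0, 56.0, 100.0]:
    print(lam, lasso_1d(xs, ys, lam))
```

Once λ is large enough, the coefficient is set to exactly 0 and stays there, which is the variable-selection effect described above.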

L2 regularization (Ridge Regression)

This technique performs L2 regularization. The main idea behind it is to modify the RSS
by adding a penalty equal to the sum of the squares of the magnitudes of the coefficients.
It is commonly used when the data suffers from multicollinearity
(independent variables are highly correlated). Under multicollinearity, even though the least
squares (OLS) estimates are unbiased, their variances are large, which moves the estimated values
far from the true values. By adding a degree of bias to the regression estimates, ridge
regression reduces the standard errors. It addresses the multicollinearity problem through the
shrinkage parameter λ.
$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
In this equation, we have two components. The first one is the least squares term and
the second one is λ times the sum of β² (beta squared), where β is the coefficient vector. The penalty is
added to the least squares term in order to shrink the parameters so that they have a low variance.
Every technique has pros and cons, and so does ridge regression. It decreases the complexity of a
model but does not reduce the number of variables, since it never sets a coefficient
exactly to zero but only shrinks it. Hence, this model is not a good fit for feature reduction.
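The shrink-but-never-zero behaviour of the L2 penalty can be checked in the simplest case. The sketch assumes a single predictor with no intercept, for which minimizing sum((y - x*b)^2) + λb² gives b = sum(x*y) / (sum(x*x) + λ); the data are made up for illustration.

```python
def ridge_1d(xs, ys, lam):
    """Minimize sum((y - x*b)**2) + lam * b**2 for a single coefficient b."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)   # lam inflates the denominator, shrinking b

xs = [1.0, 2.0, 3.0]   # toy predictor
ys = [2.0, 4.0, 6.0]   # toy response, least-squares slope is 2

for lam in [0.0, 14.0, 1000.0]:
    print(lam, ridge_1d(xs, ys, lam))
```

However large λ becomes, the coefficient only gets smaller; it does not reach 0, so no variable is removed from the model.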

Momentum

Figure 2.7: Momentum

Momentum methods in the context of machine learning refer to a group of tricks and
techniques designed to speed up convergence of first order optimization methods like gradient
descent (and its many variants).
They essentially work by adding what’s called the momentum term to the update formula for
gradient descent, thereby ameliorating its natural “zigzagging behavior,” especially in long
narrow valleys of the cost function.
The figure shows the progress of gradient descent - with and without momentum - towards
reaching the minimum of a quadratic cost function, located at the center of the concentric
elliptical contours.
Let’s say your first update to the weights is a vector $\theta_1$. For the second update (which would
be $\theta_2$ without momentum) you update by $\theta_2 + \alpha\theta_1$. For the next one, you update
by $\theta_3 + \alpha\theta_2 + \alpha^2\theta_1$, and so on. Here the parameter $0 \le \alpha < 1$ indicates the
amount of momentum we want.
The practical way of doing that is keeping an update vector $v_i$ and updating it
as $v_{i+1} = \alpha v_i + \theta_{i+1}$.
The reason we do this is to avoid the algorithm getting stuck in a local minimum. Think of it as a
marble rolling around on a curved surface. We want to get to the lowest point. The marble
having momentum will allow it to avoid a lot of small dips and make it more likely to find a
better local solution.
Having momentum too high means you will be more likely to overshoot (the marble goes
through the local minimum but the momentum carries it back upwards for a bit). This will lead
to longer learning times. Finding the correct value of the momentum will depend on the
particular problem: the smoothness of the function, how many local minima you expect, how
“deep” the sub-optimal local minima are expected to be, etc.
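The update vector rule can be tried on a small example. The sketch assumes a hypothetical quadratic cost 0.5 * (w0² + 25 * w1²), whose contours form a long narrow valley, and illustrative values for the learning rate, momentum α and starting point; θ here is the plain gradient-descent step.

```python
import math

def grad(w):
    """Gradient of the hypothetical cost 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

def descend(alpha, lr=0.03, steps=100):
    """Gradient descent with momentum alpha (alpha = 0 means no momentum)."""
    w = [10.0, 1.0]    # starting point
    v = [0.0, 0.0]     # update vector
    for _ in range(steps):
        theta = [-lr * g for g in grad(w)]                   # plain update
        v = [alpha * vi + ti for vi, ti in zip(v, theta)]    # v = alpha*v + theta
        w = [wi + vi for wi, vi in zip(w, v)]
    return w

def distance_to_minimum(w):
    return math.sqrt(w[0] ** 2 + w[1] ** 2)   # the minimum is at the origin

plain = distance_to_minimum(descend(alpha=0.0))
with_momentum = distance_to_minimum(descend(alpha=0.5))
print(plain, with_momentum)
```

With the same learning rate and number of steps, the run with momentum ends closer to the minimum, because the accumulated updates speed up progress along the shallow direction of the valley.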

Tuning hyper parameters
Hyper parameters cannot be directly learned from the regular training process and are usually
fixed before the actual training process begins. These parameters express important properties
of the model such as its complexity or how fast it should learn.
Some examples of model hyper parameters include:
1. The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
2. The learning rate for training a neural network.
3. The C and sigma hyper parameters for support vector machines.
4. The k in k-nearest neighbours.
Models can have many hyper parameters, and finding the best combination of parameters can
be treated as a search problem. Two common strategies for hyper parameter tuning are:
1. GridSearchCV
2. RandomizedSearchCV
1. GridSearchCV
In the GridSearchCV approach, the machine learning model is evaluated for a range of hyper parameter
values. This approach is called GridSearchCV because it searches for the best set of hyper
parameters from a grid of hyper parameter values.
For example, suppose we want to set two hyper parameters, C and Alpha, of a Logistic Regression
Classifier model, each with a different set of candidate values. The grid search technique will construct many
versions of the model with all possible combinations of hyper parameters, and will return the
best one.

Figure 2.8: Example of Hyper parameter

As in the image, take C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4].
For the combination C=0.3 and Alpha=0.2, the performance score comes out to be 0.726 (the highest),
therefore it is selected.
Drawback: GridSearchCV will go through all the intermediate combinations of hyper
parameters which makes grid search computationally very expensive.
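The idea of exhaustive grid search can be sketched without any library. This is not scikit-learn's GridSearchCV, only an illustration of the loop behind it: `toy_score` is a hypothetical stand-in for a cross-validated score, constructed so that it peaks at C=0.3 and Alpha=0.2 with the value 0.726 quoted in the example.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Evaluate score_fn on every combination in the grid; return the best."""
    names = list(grid)
    best_params, best_score, tried = None, float("-inf"), 0
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, tried

def toy_score(C, Alpha):
    """Hypothetical score surface (a real one would train and validate a model)."""
    return 0.726 - (C - 0.3) ** 2 - (Alpha - 0.2) ** 2

grid = {"C": [0.1, 0.2, 0.3, 0.4, 0.5], "Alpha": [0.1, 0.2, 0.3, 0.4]}
best_params, best_score, tried = grid_search(toy_score, grid)
print(best_params, best_score, tried)
```

The count of evaluated combinations is the product of the list lengths (5 x 4 here), which is why the cost grows quickly as more hyper parameters or values are added.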
2. RandomizedSearchCV
RandomizedSearchCV addresses the drawback of GridSearchCV, as it goes through only a fixed
number of hyper parameter settings. It moves within the grid in random fashion to find the best
set of hyper parameters. This approach reduces unnecessary computation.
RandomizedSearchCV implements a “fit” and a “score” method. It also implements
“score_samples”, “predict”, “predict_proba”, “decision_function”, “transform” and
“inverse_transform” if they are implemented in the estimator used.
The parameters of the estimator used to apply these methods are optimized by cross-validated
search over parameter settings.
In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of
parameter settings is sampled from the specified distributions. The number of parameter
settings that are tried is given by n_iter.
If all parameters are presented as a list, sampling without replacement is performed. If at least
one parameter is given as a distribution, sampling with replacement is used. It is highly
recommended to use continuous distributions for continuous parameters.
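Random search can be sketched in the same library-free way. Again, this is not scikit-learn's RandomizedSearchCV, only an illustration of sampling a fixed number of settings (`n_iter`) from distributions; the scoring function, the uniform ranges and the seed are hypothetical choices.

```python
import random

def random_search(score_fn, distributions, n_iter, seed=0):
    """Sample n_iter parameter settings and return the best one found."""
    rng = random.Random(seed)
    best_params, best_score, tried = None, float("-inf"), 0
    for _ in range(n_iter):
        # Each distribution is a function that draws one value using rng
        params = {name: draw(rng) for name, draw in distributions.items()}
        score = score_fn(**params)
        tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, tried

def toy_score(C, Alpha):
    """Hypothetical score surface (a real one would train and validate a model)."""
    return 0.726 - (C - 0.3) ** 2 - (Alpha - 0.2) ** 2

distributions = {
    "C": lambda rng: rng.uniform(0.1, 0.5),       # continuous distribution
    "Alpha": lambda rng: rng.uniform(0.1, 0.4),   # continuous distribution
}
best_params, best_score, tried = random_search(toy_score, distributions, n_iter=10)
print(best_params, best_score, tried)
```

Only `n_iter` settings are evaluated regardless of how many hyper parameters there are, at the price of possibly missing the exact best combination.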

Chemistry Worksheet Acid and Bases - 240217 - 144437
18 pages
Rubrics-Taking Reservation Through Phone Call
No ratings yet
Rubrics-Taking Reservation Through Phone Call
1 page
Guide Techview User S Guide en 132476
No ratings yet
Guide Techview User S Guide en 132476
130 pages
Python3 Quiz - 2.py 0 4
No ratings yet
Python3 Quiz - 2.py 0 4
2 pages
DB Link On Different Instances and PG - AUDIT
No ratings yet
DB Link On Different Instances and PG - AUDIT
10 pages
Kendall's Coefficient of Concordance W
100% (1)
Kendall's Coefficient of Concordance W
8 pages
JJ512 Pneumatic PH 4 Lab Sheet
No ratings yet
JJ512 Pneumatic PH 4 Lab Sheet
4 pages
590SL Brakes 5
No ratings yet
590SL Brakes 5
3 pages
NSCP 2015 - Wind Load Design PDF
No ratings yet
NSCP 2015 - Wind Load Design PDF
6 pages
Vibration Specifications Standards Electrical Motors With Alarm Limits
No ratings yet
Vibration Specifications Standards Electrical Motors With Alarm Limits
3 pages
1 s2.0 S0208521617301973 Main
No ratings yet
1 s2.0 S0208521617301973 Main
11 pages
A Practical Handbook of Speech Coders
No ratings yet
A Practical Handbook of Speech Coders
14 pages
SPiiPlus Utilities Users Guide (V4-20)
No ratings yet
SPiiPlus Utilities Users Guide (V4-20)
19 pages
Momentum Theory - A New Calculation of Blast Design and Assessment of Blast Vibrations
No ratings yet
Momentum Theory - A New Calculation of Blast Design and Assessment of Blast Vibrations
11 pages
Advance Python Assignment 2024 IT
No ratings yet
Advance Python Assignment 2024 IT
1 page
Transport Layer and Security Protocols For Ad Hoc Wireless Networks
No ratings yet
Transport Layer and Security Protocols For Ad Hoc Wireless Networks
54 pages
Good Practice Guide For Strategic Noise Mapping and The Production of Associated Data On Noise Exposure
100% (2)
Good Practice Guide For Strategic Noise Mapping and The Production of Associated Data On Noise Exposure
129 pages
Anatomy of Thoracic Wall & Pleura-Dikonversi
No ratings yet
Anatomy of Thoracic Wall & Pleura-Dikonversi
69 pages
sap cpi main interview qtns & answers
No ratings yet
sap cpi main interview qtns & answers
12 pages
Manuscript Template ENG
No ratings yet
Manuscript Template ENG
3 pages
Mock Exam G6 Model
No ratings yet
Mock Exam G6 Model
2 pages
EUPAVE Guide For The Design of Jointed Plain Concrete Pavements April 2020
No ratings yet
EUPAVE Guide For The Design of Jointed Plain Concrete Pavements April 2020
40 pages
10.attribution Accuracy When Using Anonymity in Group Support Systems
No ratings yet
10.attribution Accuracy When Using Anonymity in Group Support Systems
24 pages


