UNIT 4 - Perceptron and DL

The document discusses the fundamentals of perceptrons and deep learning, covering topics such as multilayer perceptrons, activation functions, and training algorithms like gradient descent and stochastic gradient descent. It explains the structure of perceptrons, including input layers, weights, biases, and activation functions, as well as the training process involving forward propagation, loss computation, and backpropagation. Additionally, it addresses challenges like the vanishing gradient problem and compares batch and stochastic gradient descent methods.

Uploaded by

venkatasubramanian Srinivasan

UNIT 4 - Perceptron and DL

Multilayer perceptron, activation functions, network training: gradient descent optimization, stochastic gradient descent, error backpropagation; from shallow networks to deep networks; unit saturation (aka the vanishing gradient problem); rectified linear unit (ReLU); hyperparameter tuning, batch normalization, regularization, dropout.

ML ( V.S) 1
Perceptron in Machine Learning
• Introduced by Frank Rosenblatt.
• A perceptron is an artificial neuron.
• It is the simplest possible neural network.
• Neural networks are an important building block of machine learning.
ANN (Artificial Neural Network)
• An artificial neuron is a mathematical function based on a model of biological neurons: each neuron takes inputs, weights them separately, sums them up, and passes this sum through a nonlinear function to produce an output.
Components of Perceptron

Input Layer: The input layer consists of one or more input neurons, which receive input signals from the external world or from other layers of the neural network.
Weights: Each input is associated with a weight, which represents the strength of the connection between that input and the output neuron.
Bias: A bias term is added to the weighted sum of the inputs. It shifts the decision threshold and gives the perceptron additional flexibility in fitting the input data.
Activation Function: The activation function determines the output of the perceptron based on the weighted sum of the inputs plus the bias term. The classic perceptron uses the step function; the sigmoid and ReLU functions are common in multilayer networks.
Output: The output of the classic perceptron is a single binary value, either 0 or 1, which indicates the class or category to which the input data belongs.
Training Algorithm: The perceptron is typically trained using a supervised learning algorithm such as the perceptron learning algorithm; multilayer networks are trained with backpropagation. During training, the weights and biases are adjusted to minimize the error between the predicted output and the true output for a given set of training examples.
Types of Perceptron models
• Single-Layer Perceptron model: One of the simplest types of ANN (Artificial Neural Network), consisting of a feed-forward network with no hidden layers. A single-layer perceptron can learn only linearly separable patterns.

• Multi-Layered Perceptron model: Similar to the single-layer perceptron model, but with one or more hidden layers between the input and the output.

Stages of Multi-Layered Perceptron model

o Forward Stage: Activations are computed layer by layer, starting at the input layer and ending at the output layer.
o Backward Stage: The weight and bias values are adjusted as the model requires. The error between the actual output and the desired output is propagated backward, starting at the output layer and ending at the input layer.

Example
• Imagine a perceptron (in your brain).
• The perceptron tries to decide if you should go to a movie.
• Is the hero good? Is the weather good?

Criteria            Input          Weight
Artist is Good      x1 = 0 or 1    w1 = 0.7
Weather is Good     x2 = 0 or 1    w2 = 0.6
Friend will Come    x3 = 0 or 1    w3 = 0.5
Food is Served      x4 = 0 or 1    w4 = 0.3
Coffee is Served    x5 = 0 or 1    w5 = 0.4

The Perceptron Algorithm

1. Set a threshold value
2. Multiply each input by its weight
3. Sum all the results
4. Activate the output

Example (continued)
• 1. Set a threshold value:
   • Threshold = 1.5
• 2. Multiply each input by its weight:
   • x1 * w1 = 1 * 0.7 = 0.7
   • x2 * w2 = 0 * 0.6 = 0
   • x3 * w3 = 1 * 0.5 = 0.5
   • x4 * w4 = 0 * 0.3 = 0
   • x5 * w5 = 1 * 0.4 = 0.4
• 3. Sum all the results:
   • 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (the weighted sum)
• 4. Activate the output:
   • Return true if the sum > 1.5 ("Yes, I will go to the movie"). Here 1.6 > 1.5, so the perceptron outputs true.
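The four steps of the movie example can be sketched in a few lines of Python. The weights and the threshold come from the example; the function name `perceptron_decision` is only an illustrative choice.

```python
# Weights and threshold taken from the movie example.
WEIGHTS = [0.7, 0.6, 0.5, 0.3, 0.4]
THRESHOLD = 1.5

def perceptron_decision(inputs, weights=WEIGHTS, threshold=THRESHOLD):
    """Return (weighted_sum, fired) for a list of 0/1 inputs."""
    # Steps 2 and 3: multiply each input by its weight and sum the results.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Step 4: activate the output when the sum exceeds the threshold.
    return weighted_sum, weighted_sum > threshold

# Hero good, weather bad, friend comes, no food, coffee served.
total, go = perceptron_decision([1, 0, 1, 0, 1])
print(round(total, 2), go)  # 1.6 True
```

Changing any single input from 1 to 0 in this case drops the sum to 1.2 or lower, so the decision flips to "no".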
Perceptron vs DNN
[Figure: perceptron compared with a deep neural network]
Activation function

• An activation function (transfer function) in a neural network defines how the weighted sum of the inputs is transformed into an output from a node or nodes in a layer of the network.
• It determines the output of a neuron, for example a yes/no decision. Depending on the function, it maps the weighted sum into a range such as 0 to 1 or -1 to 1.

Activation functions can be divided into 2 types:

• Linear activation function
• Non-linear activation functions
Activation functions ( Transfer Functions )
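As a rough illustration, here are plain-Python versions of activation functions mentioned in this unit (step, sigmoid, tanh, ReLU) plus the linear (identity) function. Placing the step threshold at 0 is an assumption made for this sketch.

```python
import math

def linear(z):
    # Linear (identity) activation: output equals input.
    return z

def step(z):
    # Binary step: 1 when z reaches the threshold (assumed to be 0 here).
    return 1 if z >= 0 else 0

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # Squashes any real number into the open interval (-1, 1).
    return math.tanh(z)

def relu(z):
    # Rectified linear unit: max(0, z).
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, step(z), round(sigmoid(z), 3), round(tanh(z), 3), relu(z))
```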

Neural network training
• Neural network training is the process of fitting a neural network model so that it learns patterns and makes predictions from input data. The goal is to minimize the prediction error, which is achieved by adjusting the model's parameters (weights and biases) through an iterative optimization process.
• The training process involves the following steps:
1. Data preparation: The input data is preprocessed and split into training, validation, and testing sets.
2. Model architecture: The neural network model is designed with appropriate layers, activation functions, and a loss function.
3. Initialization: The model parameters (weights and biases) are initialized randomly.
4. Forward propagation: The input data is fed into the model, and the output is computed using the current parameter values.
5. Loss computation: The difference between the predicted output and the actual output is calculated using the chosen loss function.
6. Backpropagation: The error is propagated backwards through the network to compute the gradient of the loss function with respect to each parameter.
7. Parameter update: The parameters are updated using an optimization algorithm such as stochastic gradient descent (SGD) or Adam.
8. Repeat: Steps 4-7 are repeated until the model converges to a satisfactory level of performance.
9. Evaluation: The performance of the trained model is evaluated on a held-out testing set.
• Neural network training can be computationally intensive, especially for large datasets and complex models.
• GPUs are often used to accelerate the training process.
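The steps above can be sketched end to end on a deliberately tiny problem. This is a minimal illustration, not a production training loop: the "network" is a single sigmoid neuron, the dataset is synthetic (label 1 when x1 + x2 > 1), the validation split is omitted, and all names and hyperparameter values are assumptions made for the sketch.

```python
import math
import random

random.seed(0)

# 1. Data preparation: synthetic points, split into training and test sets.
points = [(random.random(), random.random()) for _ in range(200)]
data = [((x1, x2), 1.0 if x1 + x2 > 1.0 else 0.0) for x1, x2 in points]
train, test = data[:160], data[160:]

# 2-3. Model and initialization: one sigmoid neuron, small random weights.
w = [random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)]
b = 0.0
learning_rate = 0.5

def forward(x):
    # 4. Forward propagation: weighted sum followed by the sigmoid.
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(dataset):
    # 5. Loss computation: mean binary cross-entropy (eps avoids log(0)).
    eps = 1e-12
    total = 0.0
    for x, y in dataset:
        p = forward(x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(dataset)

initial_loss = mean_loss(train)
for epoch in range(200):                 # 8. Repeat steps 4-7.
    random.shuffle(train)
    for x, y in train:
        p = forward(x)
        grad_z = p - y                   # 6. Gradient of the loss w.r.t. the weighted sum.
        w[0] -= learning_rate * grad_z * x[0]   # 7. Parameter update (plain SGD).
        w[1] -= learning_rate * grad_z * x[1]
        b -= learning_rate * grad_z
final_loss = mean_loss(train)

# 9. Evaluation on the held-out test set.
accuracy = sum((forward(x) > 0.5) == (y == 1.0) for x, y in test) / len(test)
print(initial_loss, final_loss, accuracy)
```

For a sigmoid output with cross-entropy loss, the gradient with respect to the weighted sum simplifies to `p - y`, which is why step 6 is a single line here; a multilayer network needs the full backward pass.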

Epochs

• One epoch is when the ENTIRE dataset is passed forward and backward through the neural network exactly once.
• Updating the weights with a single pass (one epoch) is usually not enough, so training runs for several epochs.
• Batch size: the number of training examples in a single batch.
• Iterations: the number of batches needed to complete one epoch.

Let's say we have 2000 training examples. If we divide the dataset into batches of 500, it takes 2000 / 500 = 4 iterations to complete 1 epoch.

So with a batch size of 500, 1 complete epoch takes 4 iterations.
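The relationship between dataset size, batch size and iterations is a single division. The sketch below rounds up, on the assumption that a final partial batch still counts as one iteration; the function name is illustrative.

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of batches needed to pass over the whole dataset once."""
    return math.ceil(num_examples / batch_size)

print(iterations_per_epoch(2000, 500))  # 4
```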

Gradient descent for the perceptron

• The gradient descent algorithm is a popular optimization technique used in machine learning to find the optimal values of the parameters in a model.
• In the case of the perceptron learning algorithm, gradient descent can be used to adjust the weights of the input features to minimize the classification error.

Here are the steps for gradient descent in perceptron learning:

1. Initialize the weights to small random values.
2. For each input vector, calculate the predicted output using the current weights and the activation function (usually the sign function).
3. Calculate the error between the actual output and the predicted output.
4. Update the weights according to the following formula:

weight[i] = weight[i] + learning_rate * error * input[i]

where learning_rate is a hyperparameter that determines the step size for updating the weights, error is the actual (target) output minus the predicted output, and input[i] is the ith input feature.

5. Repeat steps 2-4 for a number of iterations or until the classification error is below a certain threshold.
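A minimal sketch of this update rule, trained on the logical AND gate. Several choices here are assumptions made for reproducibility rather than part of the slides: the weights start at zero instead of small random values, the activation is a 0/1 step instead of the sign function, a bias is learned alongside the weights, and the error is taken as target minus prediction so that the `+` in the update formula moves the weights in the right direction.

```python
def predict(weights, bias, inputs):
    # Weighted sum plus bias, followed by a 0/1 step activation.
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0

def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            if error != 0:
                mistakes += 1
                # weight[i] = weight[i] + learning_rate * error * input[i]
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:   # every sample classified correctly: stop early
            break
    return weights, bias

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, x) for x, _ in and_gate]
print(weights, bias, predictions)
```

Because AND is linearly separable, the loop stops after a handful of epochs; on a problem such as XOR it would run until `max_epochs` without ever reaching zero mistakes.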
Gradient descent
Gradient descent is an algorithm that starts from a random point on the loss function and iterates down the slope toward a minimum of the function (the global minimum when the loss is convex). In batch gradient descent, the weights are updated once per pass, after every data point has been used to calculate the loss.

W_new = W_old - eta * (dL/dW)

After every iteration, the slope dL/dW is recalculated and the new weight is computed with the formula above, where:
eta: the learning rate
W_old: the previous weight
W_new: the updated weight
dL/dW: the gradient of the loss with respect to the weight, obtained in the current iteration

Gradient descent: worked example

Let's take x0 = 0.5 and learning rate (eta) = 1. The arithmetic below uses df/dx = 2x, which is the derivative of f(x) = x^2.

x1 = x0 - eta * (df/dx at x0)
x1 = 0.5 - 1 * (2 * 0.5)
x1 = 0.5 - 1
x1 = -0.5
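The same arithmetic can be repeated for several steps. The sketch below assumes f(x) = x^2, which matches the derivative 2 * 0.5 used in the example; the second call uses a smaller learning rate (0.25, chosen only for illustration) to show the contrast.

```python
def gradient_descent(x0, eta, steps):
    """Iterate x <- x - eta * f'(x) for f(x) = x**2, whose derivative is 2*x."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - eta * 2 * xs[-1])
    return xs

print(gradient_descent(0.5, 1.0, 3))   # [0.5, -0.5, 0.5, -0.5]
print(gradient_descent(0.5, 0.25, 3))  # [0.5, 0.25, 0.125, 0.0625]
```

With eta = 1 the iterate only flips sign and never approaches the minimum at x = 0, while eta = 0.25 halves the distance at every step. This is why the learning rate has to be chosen with care.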
Stochastic Gradient Descent

Let's say we have 20000 data points with 10 features. In batch gradient descent, the gradient is calculated with respect to all the features for all the data points in the set, so each iteration needs about 20000 * 10 = 200000 calculations.
It is common for the algorithm to need around 1000 iterations to approach the minimum, so the total number of computations performed is 200000 * 1000 = 200000000. This is a very large computation that consumes a lot of time, which is why gradient descent is slow on huge datasets.

In Stochastic Gradient Descent (SGD), only one randomly chosen training example is used to calculate the gradient and update the parameters at each iteration.

SGD: steps
The steps below describe the mini-batch variant of SGD, where each update uses a small batch of examples instead of a single one.
1. Initialize the model parameters randomly.
2. Define the cost function that you want to minimize.
3. Specify the hyperparameters, such as the learning rate, batch size, and momentum.
4. Repeat the following steps until convergence or a maximum number of iterations:
a. Shuffle the training data.
b. Divide the training data into mini-batches of the specified size.
c. For each mini-batch, compute the gradient of the cost function with respect to the model parameters.
d. Update the model parameters by subtracting the product of the gradient and the learning rate from the current parameter values.
e. Optionally, apply momentum to the update by adding a fraction of the previous update vector to the current update vector.
5. Evaluate the performance of the trained model on the validation set or test set.
6. Adjust the hyperparameters if necessary and repeat steps 4-5.
7. Stop when the performance of the model on the validation set stops improving.
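Steps 1 to 4 can be sketched for a one-feature linear model y = w*x + b with mean squared error as the cost. Everything specific here is an assumption for illustration: the synthetic data (generated from y = 3x + 2), the hyperparameter values, and the function name `sgd_momentum`. Steps 5 to 7 (validation and hyperparameter adjustment) are left out.

```python
import random

def sgd_momentum(data, learning_rate=0.1, momentum=0.9,
                 batch_size=8, epochs=200, seed=0):
    rng = random.Random(seed)
    w, b = rng.uniform(-0.1, 0.1), 0.0      # 1. initialise the parameters
    vw, vb = 0.0, 0.0                       # previous update vectors
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)                               # 4a. shuffle
        for start in range(0, len(data), batch_size):   # 4b. mini-batches
            batch = data[start:start + batch_size]
            # 4c. gradient of the mean squared error over the mini-batch
            gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # 4d-4e. gradient step plus a fraction of the previous update
            vw = momentum * vw - learning_rate * gw
            vb = momentum * vb - learning_rate * gb
            w += vw
            b += vb
    return w, b

# Noise-free synthetic data from y = 3x + 2 (an illustrative choice).
points = [(i / 40.0, 3.0 * (i / 40.0) + 2.0) for i in range(40)]
w, b = sgd_momentum(points)
print(w, b)
```

Setting `momentum=0.0` reduces the update to plain mini-batch SGD, and `batch_size=1` gives the single-example form of SGD.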
Local and Global minima
Local minimum:
A point on a curve that is lower than its neighbouring (preceding and succeeding) points is called a local minimum.

Global minimum:
A point on a curve that is lower than or equal to all other points on the curve is called the global minimum.
A curve can have more than one local minimum, but only one global minimum value.
Batch Gradient Descent | Stochastic Gradient Descent
Computes the gradient using the whole training set. | Computes the gradient using a single training sample.
Slow and computationally expensive per update. | Faster and less computationally expensive per update than Batch GD.
Not suggested for huge training sets. | Can be used for large training sets.
Deterministic in nature. | Stochastic in nature.
Gives the optimal solution given sufficient time to converge (for convex losses). | Gives a good solution, but not necessarily the optimal one.
No random shuffling of points is required. | The data samples should be in random order, which is why the training set is shuffled every epoch.
Can't escape shallow local minima easily. | Can escape shallow local minima more easily.
Convergence is slow. | Reaches convergence much faster.
Updates the model parameters only after processing the entire training set. | Updates the parameters after each individual data point.
Often run with a fixed learning rate (a schedule is possible but not required). | Usually needs a learning rate that decreases over time in order to settle at a minimum.
Typically converges to the global minimum for convex loss functions. | On non-convex losses it may settle in a local minimum or at a saddle point.
May overfit if the model is too complex for the dataset. | The noise in its frequent updates can act as a mild regularizer.

Backpropagation

Consider a neural network with three layers:

1. Input layer with two input neurons
2. One hidden layer with two neurons
3. Output layer with a single neuron

The main goal of training is to reduce the error, i.e. the difference between the prediction and the actual output. Since the actual output is constant, the only way to reduce the error is to change the prediction. The question now is: how do we change the prediction?

By decomposing the prediction into its basic elements, we find that the weights are the variable elements affecting it. In other words, to change the prediction we need to change the weight values.
The question now is how to change (update) the weight values so that the error is reduced.
The answer is backpropagation!

Backpropagation, short for "backward propagation of errors", is a mechanism used to update the weights using gradient descent. It calculates the gradient of the error function with respect to the neural network's weights. The calculation proceeds backwards through the network.

For example, to update w6, we take the current w6 and subtract the partial derivative of the error function with respect to w6. In practice, the derivative is first multiplied by a small positive number, called the learning rate, which controls the step size so that the update reduces the error function instead of overshooting.

So to update w6 we can apply the following formula:

w6_new = w6 - eta * (dE/dw6)

where E is the error function and eta is the learning rate. Similarly, we can derive the update formula for w5 and any other weights between the hidden layer and the output.

For a weight Wij that connects input Xi to an output unit, this gives the update rule:

New weight Wij = Wij + eta * (t - o) * Xi

eta: learning rate (hyperparameter)
t: target value
o: output
Xi: input
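A small numerical sketch of the output-layer update for the 2-2-1 network described above. The setup is an assumption made for illustration: the units are linear (no activation, no bias), the error is E = 0.5 * (prediction - target)^2, and the inputs, target, initial weights and learning rate are hypothetical values. Under these assumptions dE/dw6 = (prediction - target) * h2, which is the (t - o) * Xi rule with the sign folded into the subtraction.

```python
def forward(inputs, w):
    """w = [w1, ..., w6]: w1..w4 feed the hidden layer, w5 and w6 the output."""
    i1, i2 = inputs
    h1 = i1 * w[0] + i2 * w[1]
    h2 = i1 * w[2] + i2 * w[3]
    prediction = h1 * w[4] + h2 * w[5]
    return h1, h2, prediction

def error(inputs, target, w):
    # Squared error with a factor 1/2 so the derivative has no stray 2.
    return 0.5 * (forward(inputs, w)[2] - target) ** 2

def output_layer_gradients(inputs, target, w):
    h1, h2, prediction = forward(inputs, w)
    delta = prediction - target          # (o - t)
    return delta * h1, delta * h2        # dE/dw5, dE/dw6

inputs, target = (2.0, 3.0), 1.0                 # hypothetical sample
w = [0.11, 0.21, 0.12, 0.08, 0.14, 0.15]         # hypothetical weights
learning_rate = 0.05                             # hypothetical step size

grad_w5, grad_w6 = output_layer_gradients(inputs, target, w)
before = error(inputs, target, w)
w[4] -= learning_rate * grad_w5   # w5_new = w5 - eta * dE/dw5
w[5] -= learning_rate * grad_w6   # w6_new = w6 - eta * dE/dw6
after = error(inputs, target, w)
print(before, after)
```

The error after the update is smaller than before it. Updating w1 to w4 works the same way, except that the chain rule has to pass through w5 or w6 first.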
Vanishing gradients problem
As the backpropagation algorithm advances backward from the output layer towards the input layer, the gradients often get smaller and smaller and approach zero, which leaves the weights of the initial (lower) layers nearly unchanged. As a result, gradient descent makes little progress on those layers and may fail to converge to a good solution. This is known as the vanishing gradients problem.

The vanishing gradient problem depends on the choice of activation function. Many common activation functions (e.g. sigmoid or tanh) 'squash' their input into a very small output range in a very non-linear fashion. For example, sigmoid maps the real number line onto the "small" range (0, 1) and is very flat over most of the number line. As a result, there are large regions of the input space which are mapped to an extremely small range. In these regions of the input space, even a large change in the input will produce a small change in the output, hence the gradient is small.

This becomes much worse when we stack multiple layers of such non-linearities on top of each other. For instance, the first layer will map a large input region to a smaller output region, which will be mapped to an even smaller region by the second layer, which will be mapped to an even smaller region by the third layer, and so on. As a result, even a large change in the parameters of the first layer doesn't change the output much.

We can reduce this problem by using activation functions which don't have this property of 'squashing' the input space into a small region. A popular choice is the Rectified Linear Unit (ReLU), which maps x to max(0, x).
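A back-of-the-envelope sketch of why depth makes this worse. It multiplies the local derivative of the activation once per layer and ignores the weights entirely, which is a simplifying assumption; the function names are illustrative. The sigmoid derivative is at most 0.25 (at z = 0), so even in the best case the product shrinks quickly, whereas the ReLU derivative is 1 for any positive input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), with a maximum of 0.25 at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of max(0, z): 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

def chained_gradient(grad_fn, z, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return grad_fn(z) ** depth

for depth in (1, 5, 10, 20):
    print(depth,
          chained_gradient(sigmoid_grad, 0.0, depth),
          chained_gradient(relu_grad, 1.0, depth))
```

At depth 10 the sigmoid factor is already below one millionth, while the ReLU factor is still exactly 1.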

ReLU Activation Function

ReLU stands for Rectified Linear Unit. It is simple, yet in deep networks it often works better than earlier activation functions such as sigmoid or tanh.

ReLU formula: f(x) = max(0, x)

The ReLU function and its derivative are both monotonic. The function returns 0 for any negative input, and for any positive value x it returns that value back. Its output therefore ranges from 0 to infinity.

Let us define a ReLU function:

def ReLU(x):
    # Pass positive inputs through unchanged; clamp everything else to 0.
    if x > 0:
        return x
    else:
        return 0
Advantages of ReLU:
• All negative inputs are mapped to 0, so the output contains no negative values.
• The output has no upper bound and the gradient is 1 for every positive input, so the unit does not saturate there. This greatly reduces the vanishing gradient problem.
• It is fast to compute compared with other activation functions such as sigmoid or tanh.

Hyperparameter tuning
Hyperparameters are the parameters that cannot be directly learned from the regular training process. They are usually fixed before the actual training process begins. These parameters express important properties of the model, such as its complexity or how fast it should learn.

Some examples of model hyperparameters include:

• The penalty in a Logistic Regression classifier, i.e. L1 or L2 regularization.
• The learning rate for training a neural network.
• The C and sigma hyperparameters for support vector machines.
• The k in k-nearest neighbors.

Two widely used strategies for hyperparameter tuning are:

• GridSearchCV
• RandomizedSearchCV

GridSearchCV

• In the GridSearchCV approach, the machine learning model is evaluated for a range of hyperparameter values. The approach is called GridSearchCV because it searches for the best set of hyperparameters in a grid of hyperparameter values.

• For example, suppose we want to set two hyperparameters, C and Alpha, of a Logistic Regression classifier model, each with a set of candidate values. The grid search technique constructs one version of the model for every possible combination of hyperparameter values and returns the best one.

• For example, with C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the combination C=0.3 and Alpha=0.2 gives the highest performance score (0.726) and is therefore selected.
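The idea of trying every combination can be sketched with the standard library alone. This is not scikit-learn's `GridSearchCV`, which additionally cross-validates each combination; it is a bare exhaustive search. The function `toy_score` is a hypothetical stand-in for "train the model with these values and return its validation score", constructed so that it peaks at the combination quoted in the example.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Evaluate score_fn on every combination in the grid; keep the best."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(C, Alpha):
    # Hypothetical score surface, not a real model evaluation.
    return 0.726 - (C - 0.3) ** 2 - (Alpha - 0.2) ** 2

grid = {"C": [0.1, 0.2, 0.3, 0.4, 0.5], "Alpha": [0.1, 0.2, 0.3, 0.4]}
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)  # {'C': 0.3, 'Alpha': 0.2} 0.726
```

The grid here has 5 * 4 = 20 combinations. The cost grows multiplicatively with every extra hyperparameter, which is the main motivation for randomized search.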

Linear separability: A dataset is linearly separable if there is at least one line that clearly distinguishes the classes.
Non-linear separability: A dataset is said to be non-linearly separable if there isn't a single line that clearly distinguishes the classes.

Regularization

• One of the reasons for overfitting is large weights in the network.
• Large weights can be a sign of an unstable network, where small changes in the input lead to large changes in the output.
• A solution to this problem is to update the learning algorithm to encourage the network to keep the weights small. This is called regularization.

Techniques used for regularization include:

1. Batch Normalization
2. Drop-out layer

1. Batch Normalization
• Normalization is a data pre-processing tool used to bring numerical data to a common scale without distorting its shape.
• Batch normalization is a process that makes neural networks faster and more stable by adding extra layers to a deep neural network. The new layer standardizes and normalizes the input of a layer coming from the previous layer.
• A typical neural network is trained using a collected set of input data called a batch. Similarly, the normalizing process in batch normalization takes place over batches, not on a single input.
• Internal covariate shift is the change in the distribution of network activations due to the change in network parameters during training.

Batch Normalization: steps
Although our input X was normalized, over time the output will no longer be on the same scale. As the data passes through multiple layers of the neural network and activation functions are applied, an internal covariate shift arises in the data.

Step 1: Normalization of the input
Normalization is the process of transforming the data to have mean zero and standard deviation one. In this step we take the batch input from layer h and first calculate the mean of the hidden activations:

μ = (1/m) * Σ h_i

Here, m is the number of examples in the mini-batch; the statistics are computed separately for each neuron of layer h.

Step 2: Calculate the standard deviation of the hidden activations:

σ = sqrt( (1/m) * Σ (h_i - μ)^2 )

Step 3: Subtract the mean from each input and divide the result by the sum of the standard deviation and the smoothing term (ε):

h_norm_i = (h_i - μ) / (σ + ε)

The smoothing term (ε) ensures numerical stability by preventing a division by zero.

Step 4: Rescaling and offsetting
In the final operation, the re-scaling and offsetting of the input take place. Here two components of the BN algorithm come into the picture, γ (gamma) and β (beta). These parameters, which are learned during training, are used for re-scaling (γ) and shifting (β) the vector of values from the previous operations:

h_out_i = γ * h_norm_i + β
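The four steps can be sketched for a single list of m hidden activations. This follows the wording above and divides by (standard deviation + ε); many library implementations add ε to the variance under the square root instead. The default values of `gamma`, `beta` and `eps` and the sample activations are illustrative assumptions, and only the forward computation is shown, with no learning of γ and β.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    m = len(values)
    mean = sum(values) / m                                      # Step 1: mean
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / m)   # Step 2: std
    normalized = [(v - mean) / (std + eps) for v in values]     # Step 3: normalize
    return [gamma * n + beta for n in normalized]               # Step 4: scale, shift

activations = [2.0, 4.0, 6.0, 8.0]   # hypothetical hidden activations
out = batch_norm(activations)
print([round(v, 3) for v in out])
```

With `gamma=1.0` and `beta=0.0` the output has mean 0 and standard deviation very close to 1; other values of γ and β let the network restore whatever scale and offset suit the next layer.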

Advantages of Batch Normalization

• Reduces internal covariate shift.
• Reduces the dependence of gradients on the scale of the parameters or their initial values.
• Regularizes the model and reduces the need for dropout, photometric distortions, local response normalization and other regularization techniques.
