UNIT 4 - Perceptron and DL

The document discusses the fundamentals of perceptrons and deep learning, covering topics such as multilayer perceptrons, activation functions, and training algorithms like gradient descent and stochastic gradient descent. It explains the structure of perceptrons, including input layers, weights, biases, and activation functions, as well as the training process involving forward propagation, loss computation, and backpropagation. Additionally, it addresses challenges like the vanishing gradient problem and compares batch and stochastic gradient descent methods.
Multilayer perceptron, activation functions, network training: gradient descent optimization, stochastic gradient descent, error backpropagation; from shallow networks to deep networks: unit saturation (aka the vanishing gradient problem), ReLU (rectified linear unit), hyperparameter tuning, batch normalization, regularization, dropout.

ML ( V.S) 1
Perceptron in Machine Learning
• Introduced by Frank Rosenblatt.
• A perceptron is an artificial neuron.
• It is the simplest possible neural network.
• Neural networks are among the building blocks of machine learning.

ANN (Artificial Neural Network)
• An artificial neuron is a mathematical function modeled on biological neurons: it takes several inputs, weights each one separately, sums the weighted inputs, and passes the sum through a nonlinear function to produce its output.

Components of Perceptron

Input Layer: The input layer consists of one or more input neurons, which receive input signals from the external world or from other layers of the neural network.
Weights: Each input is associated with a weight, which represents the strength of the connection between that input and the output neuron.
Bias: A bias term is added to the weighted sum of the inputs. It shifts the decision threshold and gives the perceptron additional flexibility in fitting the data.
Activation Function: The activation function determines the output of the perceptron from the weighted sum of the inputs plus the bias. The classic perceptron uses a step function; the sigmoid and ReLU functions are common in multilayer networks.
Output: The output of the classic perceptron is a single binary value, either 0 or 1, which indicates the class to which the input belongs.
Training Algorithm: The perceptron is trained with a supervised learning algorithm, such as the perceptron learning algorithm (or backpropagation in multilayer networks). During training, the weights and the bias are adjusted to reduce the error between the predicted output and the true output over a set of training examples.
Types of Perceptron models
• Single-Layer Perceptron model: One of the simplest types of ANN (Artificial Neural Network), consisting of a feed-forward network with no hidden layers. A single-layer perceptron can learn only linearly separable patterns.

• Multi-Layered Perceptron model: Similar to the single-layer perceptron, but with one or more hidden layers between the input and the output.

Stages of Multi-Layered Perceptron model

o Forward Stage: Activations are computed layer by layer, starting at the input layer and ending at the output layer.
o Backward Stage: The error between the actual output and the desired output is propagated backward from the output layer to the input layer, and the weight and bias values are adjusted to reduce it.

Example
• Imagine a perceptron (in your brain).
• The perceptron tries to decide if you should go to a movie.
• Is the hero good? Is the weather good?

Criteria         | Input       | Weight
Artist is Good   | x1 = 0 or 1 | w1 = 0.7
Weather is Good  | x2 = 0 or 1 | w2 = 0.6
Friend will Come | x3 = 0 or 1 | w3 = 0.5
Food is Served   | x4 = 0 or 1 | w4 = 0.3
Coffee is Served | x5 = 0 or 1 | w5 = 0.4

The Perceptron Algorithm


1. Set a threshold value
2. Multiply each input by its weight
3. Sum all the results
4. Activate the output

Example…
• 1. Set a threshold value:
  Threshold = 1.5

• 2. Multiply each input by its weight:
  x1 * w1 = 1 * 0.7 = 0.7
  x2 * w2 = 0 * 0.6 = 0
  x3 * w3 = 1 * 0.5 = 0.5
  x4 * w4 = 0 * 0.3 = 0
  x5 * w5 = 1 * 0.4 = 0.4

• 3. Sum all the results:
  0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (the weighted sum)

• 4. Activate the output:
  Return true if the sum > 1.5 ("Yes, I will go to the movie"). Here 1.6 > 1.5, so the output is true.
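The four steps of the movie example can be reproduced in a few lines of Python. This is only a sketch: the function name `perceptron_output` is an illustrative choice, while the inputs, weights and threshold are the ones from the example.

```python
def perceptron_output(inputs, weights, threshold):
    """Return (weighted_sum, decision) for a simple threshold perceptron."""
    # Steps 2 and 3: multiply each input by its weight and sum the results.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Step 4: activate the output by comparing with the threshold.
    return weighted_sum, weighted_sum > threshold


inputs = [1, 0, 1, 0, 1]               # x1..x5 from the example
weights = [0.7, 0.6, 0.5, 0.3, 0.4]    # w1..w5 from the example
weighted_sum, go_to_movie = perceptron_output(inputs, weights, threshold=1.5)
print(round(weighted_sum, 2), go_to_movie)
```

Changing any input from 1 to 0 drops the weighted sum below the threshold, and the decision flips to false.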
Perceptron vs DNN

Activation function

• An activation function (transfer function) in a neural network defines how the weighted sum of the inputs is transformed into the output of a node, or of the nodes in a layer of the network.
• It determines the output of the network, for example a yes/no decision. Depending on the function, it maps the resulting values into a range such as 0 to 1 or -1 to 1.

Activation functions can be divided into two types:

• Linear activation function
• Non-linear activation functions

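As a rough illustration of how different activation functions map the same weighted sum, here is a minimal sketch of four functions mentioned in this unit (step, sigmoid, tanh, ReLU). The function names and the sample inputs are illustrative choices.

```python
import math


def step(z, threshold=0.0):
    # Binary output: 1 above the threshold, 0 otherwise.
    return 1 if z > threshold else 0


def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    # Squashes any real number into the range (-1, 1).
    return math.tanh(z)


def relu(z):
    # Passes positive values through and clamps negatives to 0.
    return max(0.0, z)


for z in (-2.0, 0.0, 2.0):
    print(z, step(z), round(sigmoid(z), 3), round(tanh(z), 3), relu(z))
```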
Activation functions ( Transfer Functions )

Neural network training
• Neural network training is the process of fitting a neural network model so that it learns patterns from input data and makes accurate predictions. The goal is to minimize the prediction error, which is achieved by adjusting the model's parameters (weights and biases) through an iterative optimization process.
• The training process involves the following steps:
1. Data preparation: The input data is preprocessed and split into training, validation, and testing sets.
2. Model architecture: The neural network model is designed with appropriate layers, activation functions, and loss functions.
3. Initialization: The model parameters (weights and biases) are initialized randomly.
4. Forward propagation: The input data is fed into the model, and the output is computed using the current parameter values.
5. Loss computation: The difference between the predicted output and the actual output is calculated using the chosen loss function.
6. Backpropagation: The error is propagated backwards through the network to compute the gradient of the loss function with respect to each parameter.
7. Parameter update: The parameters are updated using an optimization algorithm such as stochastic gradient descent (SGD) or Adam.
8. Repeat: Steps 4-7 are repeated until the model converges to a satisfactory level of performance.
9. Evaluation: The performance of the trained model is evaluated on a held-out testing set.
• Neural network training can be computationally intensive, especially for large datasets and complex models.
• GPUs are often used to accelerate the training process.

Epochs

• One epoch is one complete pass of the ENTIRE dataset forward and backward through the neural network.
• Updating the weights with a single pass (one epoch) is usually not enough, so training runs for several epochs.
• Batch size: the number of training examples in a single batch.
• Iterations: the number of batches needed to complete one epoch.

Let's say we have 2000 training examples.

If we divide the dataset into batches of 500, it takes 4 iterations to complete 1 epoch: the batch size is 500 and the number of iterations is 4.

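The relationship between dataset size, batch size and iterations is a ceiling division. The helper below is a small sketch (the name `iterations_per_epoch` is illustrative); it also covers the case where the last batch is smaller than the others.

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    # A final, smaller batch still counts as one iteration, hence the ceiling.
    return math.ceil(num_examples / batch_size)


print(iterations_per_epoch(2000, 500))  # 4, as in the example above
print(iterations_per_epoch(2001, 500))  # 5: one extra batch with a single example
```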
Gradient descent for the perceptron

• Gradient descent is a popular optimization technique used in machine learning to find good values for the parameters of a model.
• In perceptron learning, a gradient-descent-style update adjusts the weights of the input features to reduce the classification error.
Here are the steps:

1. Initialize the weights to small random values.

2. For each input vector, calculate the predicted output using the current weights and the activation function (usually the sign or step function).

3. Calculate the error as the actual (target) output minus the predicted output.

4. Update each weight according to the following formula:

   weight[i] = weight[i] + learning_rate * error * input[i]

where learning_rate is a hyperparameter that determines the step size of the update, error = actual output - predicted output, and input[i] is the ith input feature.

5. Repeat steps 2-4 for a number of iterations or until the classification error is below a certain threshold.
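The update rule above can be tried on a tiny, linearly separable problem. The sketch below trains a perceptron on the logical AND function. Several details are assumptions made for illustration: the weights start at zero instead of small random values (so that the run is reproducible), a bias is updated with the same rule, the activation is a step function at 0, and the names `predict` and `train_perceptron` are hypothetical.

```python
def predict(weights, bias, x):
    # Step activation on the weighted sum plus bias.
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > 0 else 0


def train_perceptron(samples, learning_rate=0.1, epochs=20):
    weights = [0.0] * len(samples[0][0])   # assumption: zero init for reproducibility
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            # error = actual output - predicted output
            error = target - predict(weights, bias, x)
            for i in range(len(weights)):
                weights[i] += learning_rate * error * x[i]
            bias += learning_rate * error
    return weights, bias


# Truth table of logical AND: (inputs, target)
and_gate = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, x) for x, _ in and_gate]
print(predictions)
```

Because AND is linearly separable, the weights stop changing once every sample is classified correctly; on a problem such as XOR the same loop would never settle.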
Gradient descent
Gradient descent starts from a random point on the loss function and moves step by step down the slope toward a minimum (the global minimum when the loss is convex). In the batch version described here, the weights are updated once the loss has been calculated over every data point.

Wnew = Wold - eta * (dL/dW)

After every iteration, the slope dL/dW is recalculated and the new weight is obtained from the formula above, where:
eta: the learning rate
Wold: the previous weight
Wnew: the updated weight
dL/dW: the gradient of the loss with respect to the weight, computed in that iteration

Gradient descent: worked example

Take a function whose derivative is df/dx = 2x (for example f(x) = x^2), with starting point x0 = 0.5 and learning rate (eta) = 1.
X1 = X0 - eta * [df/dx]
X1 = 0.5 - (2 * 0.5)
X1 = 0.5 - 1
X1 = -0.5
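Repeating the same update for a few more steps shows why the learning rate matters. The sketch below assumes the derivative df/dx = 2x used in the calculation above; the helper name `gradient_descent` and the second learning rate (0.1) are illustrative choices.

```python
def gradient_descent(grad, x0, eta, steps):
    """Return the list of points visited, starting from x0."""
    xs = [x0]
    for _ in range(steps):
        # x_new = x_old - eta * df/dx
        xs.append(xs[-1] - eta * grad(xs[-1]))
    return xs


def grad(x):
    # Derivative used in the worked example: df/dx = 2x
    return 2 * x


path_eta_1 = gradient_descent(grad, x0=0.5, eta=1.0, steps=3)
path_eta_small = gradient_descent(grad, x0=0.5, eta=0.1, steps=3)
print(path_eta_1)      # [0.5, -0.5, 0.5, -0.5]
print(path_eta_small)
```

With eta = 1 the iterate jumps back and forth between 0.5 and -0.5 and never reaches the minimum at x = 0, whereas the smaller learning rate moves steadily toward it.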
Stochastic Gradient Descent

Let's say we have 20000 data points with 10 features. In (batch) gradient descent, the gradient is calculated with respect to all the features for all the data points in the set, so each iteration needs 20000 * 10 = 200000 calculations.
It is common for the algorithm to need around 1000 iterations to approach the minimum, so the total number of computations is 200000 * 1000 = 200000000. This is a very large computation that consumes a lot of time, and hence gradient descent is slow on huge datasets.

In Stochastic Gradient Descent (SGD), only one random training example is used to calculate the gradient and update the parameters at each iteration.

SGD (with mini-batches)
1. Initialize the model parameters randomly.
2. Define the cost function that you want to minimize.
3. Specify the hyperparameters such as the learning rate, batch size, and momentum.
4. Repeat the following steps until convergence or a maximum number of iterations:
a. Shuffle the training data.
b. Divide the training data into mini-batches of the specified size.
c. For each mini-batch, compute the gradient of the cost function with respect to the model parameters.
d. Update the model parameters by subtracting the product of the gradient and the learning rate from the current parameter values.
e. Optionally, apply momentum to the update by adding a fraction of the previous update vector to the current update vector.
5. Evaluate the performance of the trained model on the validation set or test set.
6. Adjust the hyperparameters if necessary and repeat steps 4-5.
7. Stop when the performance of the model on the validation set stops improving.
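Steps 1 to 4 can be sketched on a deliberately small problem: fitting a line y = w*x + b by mini-batch SGD with momentum. Everything specific here is an assumption for illustration: the model, the mean squared error cost, the synthetic noise-free data (y = 3x + 1), the hyperparameter values, and the function name `sgd`.

```python
import random


def sgd(data, learning_rate=0.05, batch_size=4, momentum=0.9, epochs=200, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the parameters randomly.
    w, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)
    vw, vb = 0.0, 0.0                      # previous update vectors (momentum)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)                  # step 4a: shuffle the training data
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]   # step 4b: one mini-batch
            # Step 4c: gradient of the mean squared error over the mini-batch.
            gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # Steps 4d-4e: gradient step plus a fraction of the previous update.
            vw = momentum * vw - learning_rate * gw
            vb = momentum * vb - learning_rate * gb
            w, b = w + vw, b + vb
    return w, b


# Synthetic data generated from y = 3x + 1 (illustrative, no noise).
data = [(k / 10, 3 * (k / 10) + 1) for k in range(20)]
w, b = sgd(data)
print(round(w, 3), round(b, 3))
```

Setting `batch_size=1` gives the one-example-per-update form of SGD described earlier, and `batch_size=len(data)` gives batch gradient descent.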
Local and Global minima
Local minimum:
A point on a curve whose value is lower than that of its neighboring (preceding and succeeding) points is called a local minimum.

Global minimum:
The point on a curve whose value is lowest when compared to all other points on the curve is called the global minimum.
A curve can have more than one local minimum, but only one global minimum value.

Batch Gradient Descent | Stochastic Gradient Descent
Computes the gradient using the whole training set. | Computes the gradient using a single training sample.
Slow and computationally expensive algorithm. | Faster and less computationally expensive than Batch GD.
Not suggested for huge training sets. | Can be used for large training sets.
Deterministic in nature. | Stochastic in nature.
Gives an optimal solution given sufficient time to converge. | Gives a good solution, but not an optimal one.
No random shuffling of points is required. | The data samples should be in a random order, which is why the training set is shuffled for every epoch.
Can't escape shallow local minima easily. | Can escape shallow local minima more easily.
Convergence is slow. | Reaches convergence much faster.

Batch Gradient Descent | Stochastic Gradient Descent
Updates the model parameters only after processing the entire training set. | Updates the parameters after each individual data point.
Is typically run with a fixed learning rate. | Is often paired with a learning rate that is adjusted dynamically during training.
Typically converges to the global minimum for convex loss functions. | May converge to a local minimum or saddle point.
May suffer from overfitting if the model is too complex for the dataset. | The noise in its frequent updates can help reduce overfitting.

Backpropagation

Consider a neural network with three layers:

1. Input layer with two input neurons
2. One hidden layer with two neurons
3. Output layer with a single neuron
The main goal of training is to reduce the error, i.e. the difference between the prediction and the actual output. Since the actual output is constant, the only way to reduce the error is to change the prediction. The question now is: how do we change the prediction?

By decomposing the prediction into its basic elements, we find that the weights are the variable elements that affect its value. In other words, to change the prediction we need to change the weights.

The question now is how to change (update) the weight values so that the error is reduced. The answer is backpropagation!

Backpropagation, short for "backward propagation of errors", is a mechanism used to update the weights using gradient descent. It calculates the gradient of the error function with respect to the neural network's weights. The calculation proceeds backwards through the network.

For example, to update w6, we take the current w6 and subtract the partial derivative of the error function with respect to w6. In practice, the derivative is first multiplied by a small positive number, called the learning rate, which controls the step size so that the update does not overshoot the minimum of the error function.

So to update w6 we can apply the following formula:

w6 = w6 - n * (dE/dw6)

Similarly, we can derive the update formula for w5 and for any other weight between the hidden layer and the output.

For the squared error E = 1/2 * (t - o)^2 with a linear output, dE/dWij = -(t - o) * Xi, which gives:

New weight Wij = Wij + n * (t - o) * Xi

n: learning rate (a hyperparameter)
t: target value
o: output
Xi: input
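The whole backward pass for the 2-2-1 network can be sketched in plain Python. This is a simplified illustration, not the exact network from the slides: it assumes linear activations, no biases, the squared error E = 0.5 * (out - target)^2, a weight naming in which w1..w4 feed the hidden layer and w5, w6 feed the output, and made-up numeric values.

```python
def forward(w, i1, i2):
    # Hidden layer (linear activations assumed for simplicity).
    h1 = w["w1"] * i1 + w["w2"] * i2
    h2 = w["w3"] * i1 + w["w4"] * i2
    # Output neuron.
    out = w["w5"] * h1 + w["w6"] * h2
    return h1, h2, out


def backprop_step(w, i1, i2, target, lr):
    h1, h2, out = forward(w, i1, i2)
    delta = out - target   # dE/d(out) for E = 0.5 * (out - target)^2
    grads = {
        # Output-layer weights: dE/dw = delta * (hidden activation)
        "w5": delta * h1,
        "w6": delta * h2,
        # Hidden-layer weights: chain rule through w5 or w6 back to the input
        "w1": delta * w["w5"] * i1,
        "w2": delta * w["w5"] * i2,
        "w3": delta * w["w6"] * i1,
        "w4": delta * w["w6"] * i2,
    }
    # Gradient descent update: new weight = old weight - lr * gradient
    return {name: value - lr * grads[name] for name, value in w.items()}


# Illustrative starting weights, inputs and target.
weights = {"w1": 0.11, "w2": 0.21, "w3": 0.12, "w4": 0.08, "w5": 0.14, "w6": 0.15}
before = forward(weights, 2.0, 3.0)[2]
weights = backprop_step(weights, 2.0, 3.0, target=1.0, lr=0.05)
after = forward(weights, 2.0, 3.0)[2]
print(abs(1.0 - before), abs(1.0 - after))
```

After one update the prediction is closer to the target; repeating the step keeps shrinking the error. With non-linear activations, each gradient would gain an extra factor for the derivative of the activation function.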
Vanishing gradients problem
As the backpropagation algorithm advances backward from the output layer towards the input layer, the gradients often get smaller and smaller and approach zero, which leaves the weights of the initial (lower) layers nearly unchanged. As a result, gradient descent converges very slowly, or never reaches the optimum. This is known as the vanishing gradients problem.

The vanishing gradient problem depends on the choice of activation function. Many common activation functions (e.g. sigmoid or tanh) 'squash' their input into a very small output range in a very non-linear fashion. For example, sigmoid maps the real number line onto the "small" range (0, 1), and it is very flat over most of the number line. As a result, there are large regions of the input space which are mapped to an extremely small range. In these regions, even a large change in the input produces only a small change in the output, hence the gradient is small.

This becomes much worse when we stack multiple layers of such non-linearities on top of each other. For instance, the first layer maps a large input region to a smaller output region, which is mapped to an even smaller region by the second layer, then to an even smaller region by the third layer, and so on. As a result, even a large change in the parameters of the first layer doesn't change the output much.

We can avoid this problem by using activation functions which don't 'squash' the input space into a small region. A popular choice is the Rectified Linear Unit (ReLU), which maps x to max(0, x).

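The shrinking effect can be made concrete with a rough numerical sketch. The derivative of the sigmoid is at most 0.25 (at z = 0), so multiplying one such factor per layer, as the chain rule does, drives the product toward zero, while the ReLU derivative is 1 for positive inputs. The sketch ignores the weights and assumes every layer sees the same pre-activation value; the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0


def chained_gradient(grad_fn, z, num_layers):
    """Product of the same local derivative across num_layers layers."""
    return grad_fn(z) ** num_layers


for n in (1, 5, 10):
    print(n, chained_gradient(sigmoid_grad, 0.0, n), chained_gradient(relu_grad, 1.0, n))
```

Even in the sigmoid's best case, ten layers reduce the gradient factor to below one millionth, while the ReLU factor stays at 1.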
ReLU Activation Function

ReLU stands for Rectified Linear Unit. It is simple, yet it often works better than earlier activation functions such as sigmoid or tanh.

ReLU formula: f(x) = max(0, x)

The ReLU function and its derivative are both monotonic. The function returns 0 for any negative input, and for any positive value x it returns that value unchanged. Thus its output ranges from 0 to infinity.

Let us define a ReLU function:

def ReLU(x):
    # Pass positive inputs through unchanged; clamp everything else to 0.
    if x > 0:
        return x
    else:
        return 0
Advantages of ReLU:
• All negative inputs are mapped to 0, so the output is never negative.
• The output has no upper bound and the gradient is 1 for every positive input, so ReLU does not saturate for positive values. This mitigates the vanishing gradient problem and often improves training efficiency and prediction accuracy.
• It is fast to compute compared with other activation functions such as sigmoid or tanh.

Hyperparameter tuning
Hyperparameters are parameters that cannot be learned directly from the regular training process. They are usually fixed before training begins, and they express important properties of the model, such as its complexity or how fast it should learn.

Some examples of model hyperparameters include:

• The penalty in a Logistic Regression classifier, i.e. L1 or L2 regularization.
• The learning rate for training a neural network.
• The C and sigma hyperparameters for support vector machines.
• The k in k-nearest neighbors.
Two widely used strategies for hyperparameter tuning are:

• GridSearchCV
• RandomizedSearchCV

GridSearchCV

• In the GridSearchCV approach, the machine learning model is evaluated over a range of hyperparameter values. The approach is called GridSearchCV because it searches for the best set of hyperparameters in a grid of hyperparameter values, scoring each candidate with cross-validation (the "CV" in the name).

• For example, suppose we want to set two hyperparameters, C and Alpha, of the Logistic Regression classifier, each with a different set of candidate values. Grid search builds a version of the model for every possible combination of the values and returns the best one.

• For C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the combination C = 0.3 and Alpha = 0.2 gives the highest performance score (0.726) and is therefore selected.

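The core idea of grid search, trying every combination and keeping the best, fits in a few lines of standard-library Python. This sketch is not scikit-learn's GridSearchCV: the helper `grid_search` is hypothetical, and `toy_score` is a made-up stand-in for "train the model with these hyperparameters and measure its validation score", shaped so that it peaks at C = 0.3 and Alpha = 0.2 like the example above.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Evaluate score_fn on every combination in grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(C, alpha):
    # Hypothetical score surface; a real search would train and validate a model here.
    return 0.726 - ((C - 0.3) ** 2 + (alpha - 0.2) ** 2)


grid = {"C": [0.1, 0.2, 0.3, 0.4, 0.5], "alpha": [0.1, 0.2, 0.3, 0.4]}
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

The grid above has 5 * 4 = 20 combinations, and the cost grows multiplicatively with every added hyperparameter, which is the main motivation for randomized search.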
Linear separability: A dataset is linearly separable if there is at least one line that clearly distinguishes the classes.
Non-linear separability: A dataset is said to be non-linearly separable if there isn't a single line that clearly distinguishes the classes.

Regularization

• One of the reasons for overfitting is large weights in the network.
• Large weights can be a sign of an unstable network, where small changes in the input lead to large changes in the output.
• A solution is to modify the learning algorithm so that it encourages the network to keep the weights small. This is called regularization.
Techniques used for regularization include:

1. Batch Normalization
2. Drop-out layer

1. Batch Normalization
• Normalization is a data pre-processing tool used to bring numerical data to a common scale without distorting its shape.

• Batch normalization is a technique that makes neural networks faster and more stable by adding extra layers to a deep neural network. The new layer standardizes and normalizes the input it receives from the previous layer.
• A neural network is typically trained on subsets of the input data called batches. Likewise, the normalization in batch normalization is computed over a batch, not over a single input.
• Internal covariate shift is the change in the distribution of network activations caused by the change in network parameters during training.

Batch Normalization: steps
Although our input X was normalized, the outputs of later layers will no longer be on the same scale. As the data pass through multiple layers of the network and activation functions are applied, the distribution of the activations shifts; this is the internal covariate shift.

Step 1: Normalization of the input
Normalization is the process of transforming the data to have mean zero and standard deviation one. We take the batch input from layer h and first calculate the mean of these hidden activations:

mu = (1/m) * sum(h_i)

Here, m is the number of activations being averaged: the values of a given neuron of layer h across the mini-batch.

Step 2: Calculate the standard deviation of the hidden activations:

sigma = sqrt((1/m) * sum((h_i - mu)^2))

Step 3: Subtract the mean from each input and divide by the sum of the standard deviation and the smoothing term (ε):

h_i(norm) = (h_i - mu) / (sigma + ε)

The smoothing term (ε) ensures numerical stability by preventing division by zero.

Step 4: Rescaling and offsetting
In the final operation, the normalized values are re-scaled and shifted. Two learnable parameters of the BN algorithm come into the picture here, γ (gamma) and β (beta), used for re-scaling (γ) and shifting (β) the vector of values from the previous step:

h_i(out) = γ * h_i(norm) + β
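The four steps can be sketched for the activations of a single neuron across one mini-batch. The function name `batch_norm`, the default values of gamma, beta and eps, and the sample batch are illustrative assumptions; the division by (standard deviation + ε) follows the description in Step 3, whereas a common variant divides by the square root of (variance + ε).

```python
import math


def batch_norm(activations, gamma=1.0, beta=0.0, eps=1e-5):
    m = len(activations)
    # Step 1: mean of the activations in the batch.
    mean = sum(activations) / m
    # Step 2: standard deviation of the activations.
    std = math.sqrt(sum((h - mean) ** 2 for h in activations) / m)
    # Step 3: normalize; eps guards against division by zero.
    normalized = [(h - mean) / (std + eps) for h in activations]
    # Step 4: re-scale with gamma and shift with beta.
    return [gamma * h + beta for h in normalized]


batch = [1.0, 2.0, 3.0]
normalized = batch_norm(batch)
rescaled = batch_norm(batch, gamma=2.0, beta=0.5)
print(normalized)
print(rescaled)
```

With gamma = 1 and beta = 0 the output has mean zero and a standard deviation close to one; during training, gamma and beta are learned along with the weights so the network can undo the normalization where that helps.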

Advantages of Batch Normalization

• Reduces internal covariate shift.
• Reduces the dependence of gradients on the scale of the parameters or their initial values.
• Regularizes the model and reduces the need for dropout, photometric distortions, local response normalization and other regularization techniques.