UNIT 4 - Perceptron and DL
Perceptron in Machine Learning
• Proposed by Frank Rosenblatt (1957).
ANN ( Artificial Neural Network)
• An artificial neuron is a mathematical function based on a model of biological neurons: each neuron takes inputs, weighs them separately, sums them up, and passes this sum through a nonlinear function to produce an output.
Components of Perceptron
Input Layer: The input layer consists of one or more input neurons, which receive input
signals from the external world or from other layers of the neural network.
Weights: Each input neuron is associated with a weight, which represents the strength
of the connection between the input neuron and the output neuron.
Bias: A bias term is added to the input layer to provide the perceptron with additional
flexibility in modeling complex patterns in the input data.
Activation Function: The activation function determines the output of the perceptron
based on the weighted sum of the inputs and the bias term. Common activation
functions used in perceptrons include the step function, sigmoid function, and ReLU
function.
Output: The output of the perceptron is a single binary value, either 0 or 1, which
indicates the class or category to which the input data belongs.
Training Algorithm: The perceptron is typically trained using a supervised learning
algorithm such as the perceptron learning algorithm or backpropagation. During
training, the weights and biases of the perceptron are adjusted to minimize the error
between the predicted output and the true output for a given set of training examples.
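A minimal Python sketch tying these components together (class and variable names are illustrative, not from the slides):

import numpy as np

class Perceptron:
    # Minimal perceptron: weighted sum of inputs + bias passed through a step activation.
    def __init__(self, n_inputs):
        self.weights = np.zeros(n_inputs)   # one weight per input neuron
        self.bias = 0.0                     # bias term for extra flexibility

    def activation(self, z):
        # Step function: output 1 if the weighted sum exceeds 0, otherwise 0
        return 1 if z > 0 else 0

    def predict(self, x):
        z = np.dot(self.weights, x) + self.bias   # weighted sum of the inputs plus the bias
        return self.activation(z)                 # single binary output (0 or 1)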
Types of Perceptron models
• Single Layer Perceptron model: One of the simplest types of ANN (Artificial Neural Network), consisting of a feed-forward network. A single-layer perceptron can learn only linearly separable patterns.
• Multi-Layer Perceptron model: Similar to the single-layer perceptron model, but with one or more hidden layers.
• Imagine a perceptron (in your brain).
• The perceptron tries to decide if you should go to a movie.

Example criteria:
Criteria            Input          Weight
Artists is Good     x1 = 0 or 1    w1 = 0.7
Weather is Good     x2 = 0 or 1    w2 = 0.6
Friend will Come    x3 = 0 or 1    w3 = 0.5
Food is Served      x4 = 0 or 1    w4 = 0.3
Example…
• 1. Set a threshold value: Threshold = 1.5
• 2. Multiply each input by its weight:
x1 * w1 = 1 * 0.7 = 0.7
x2 * w2 = 0 * 0.6 = 0
x3 * w3 = 1 * 0.5 = 0.5
x4 * w4 = 0 * 0.3 = 0
x5 * w5 = 1 * 0.4 = 0.4
• 3. Sum the results: 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6
• 4. Return true if the sum > 1.5 ("Yes, I will go to the movie").
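A quick Python sketch of this decision rule (the fifth input x5 and weight w5 = 0.4 are taken from the calculation above; variable names are illustrative):

# Inputs and weights from the example (x5/w5 appear in the calculation above)
inputs  = [1, 0, 1, 0, 1]             # x1..x5, each 0 or 1
weights = [0.7, 0.6, 0.5, 0.3, 0.4]   # w1..w5
threshold = 1.5

weighted_sum = sum(x * w for x, w in zip(inputs, weights))   # 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6
print(weighted_sum > threshold)       # True -> "Yes, I will go to the movie"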
Perceptron vs DNN
Activation function
Activation functions ( Transfer Functions )
Neural network training
• Neural network training is the process of training a neural network model to
learn patterns and make predictions from input data. The goal of training a
neural network is to minimize its prediction error, which is achieved by
adjusting the model's parameters (weights and biases) through an iterative
optimization process.
• The training process involves the following steps:
1.Data preparation: The input data is preprocessed and split into training,
validation, and testing sets.
2.Model architecture: The neural network model is designed with appropriate
layers, activation functions, and loss functions.
3.Initialization: The initial values of the model parameters (weights and biases)
are randomly initialized.
4.Forward propagation: The input data is fed into the model, and the output is
computed using the current parameter values.
5.Loss computation: The difference between the predicted output and the actual
output is calculated using the chosen loss function.
6. Backpropagation: The error is propagated backwards through the network to compute the gradient of the loss function with respect to each parameter.
Neural network training ……
7. Parameter update: The parameters are updated using an
optimization algorithm such as stochastic gradient descent
(SGD) or Adam.
8. Repeat: Steps 4-7 are repeated until the model converges
to a satisfactory level of performance.
9. Evaluation: The performance of the trained model is
evaluated on a held-out testing set.
• Neural network training can be a computationally intensive
process, especially for large datasets and complex models.
• GPUs are often used to accelerate the training process.
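A hedged sketch of steps 1-9 for a tiny one-layer sigmoid model (NumPy, mean-squared-error loss; the data, shapes, and learning rate are illustrative assumptions, not from the slides):

import numpy as np

# 1. Data preparation (toy data; a real pipeline would also create validation/test splits)
X = np.random.rand(100, 3)
y = (X.sum(axis=1, keepdims=True) > 1.5).astype(float)

# 2-3. Model architecture and initialization: one sigmoid unit with random weights
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 1))
b = np.zeros((1,))
lr = 0.1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(100):                     # 8. repeat until performance is satisfactory
    z = X @ W + b                            # 4. forward propagation
    y_hat = sigmoid(z)
    loss = np.mean((y_hat - y) ** 2)         # 5. loss computation (MSE)

    # 6. backpropagation: gradient of the loss with respect to W and b
    dz = 2 * (y_hat - y) * y_hat * (1 - y_hat) / len(X)
    dW = X.T @ dz
    db = dz.sum(axis=0)

    W -= lr * dW                             # 7. parameter update (plain gradient descent)
    b -= lr * db

# 9. Evaluation (here just training accuracy; a held-out test set would be used in practice)
accuracy = np.mean((sigmoid(X @ W + b) > 0.5) == y)
print(f"loss={loss:.4f}  accuracy={accuracy:.2f}")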
Epochs
• One Epoch is when an ENTIRE dataset is passed forward and backward through the
neural network only once.
• Updating the weights with a single pass (one epoch) is usually not enough.
• Batch Size - Total number of training examples present in a single batch.
• Iteration is the number of batches needed to complete one epoch.
Let’s say we have 2000 training examples that we are going to use.
If we divide the dataset of 2000 examples into batches of 500, then it will take 4 iterations to complete 1 epoch.
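A tiny Python sketch of that arithmetic (math.ceil handles the case where the dataset size is not an exact multiple of the batch size):

import math

dataset_size = 2000
batch_size = 500
iterations_per_epoch = math.ceil(dataset_size / batch_size)
print(iterations_per_epoch)   # 4 iterations to complete 1 epoch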
Gradient descent for perceptron
• The gradient descent algorithm is a popular optimization technique used in machine learning
to find the optimal values of parameters in a model.
• In the case of the perceptron learning algorithm, gradient descent can be used to adjust the
weights of input features to minimize the classification error.
Here are the steps for gradient descent in perceptron learning:
1. Initialize the weights (e.g., to zeros or small random values).
2. For each input vector, calculate the predicted output using the current weights and the activation function (usually the sign function).
3. Calculate the error between the predicted output and the actual output.
4. Update each weight: weights[i] = weights[i] + learning_rate * error * input[i],
where learning_rate is a hyperparameter that determines the step size for updating the weights, error is the difference between the predicted output and the actual output, and input[i] is the i-th input feature.
5. Repeat steps 2-4 for a number of iterations or until the classification error is below a certain threshold.
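A minimal Python sketch of these steps, using a sign-style activation and the update rule above (the toy data and learning rate are illustrative):

import numpy as np

def train_perceptron(X, y, learning_rate=0.1, n_iters=100):
    # X: (n_samples, n_features); y: labels in {-1, +1}
    weights = np.zeros(X.shape[1])                 # step 1: initialize the weights
    bias = 0.0
    for _ in range(n_iters):                       # step 5: repeat for a number of iterations
        for xi, target in zip(X, y):
            z = weights @ xi + bias
            prediction = 1.0 if z > 0 else -1.0    # step 2: sign-style activation
            error = target - prediction            # step 3: error
            weights += learning_rate * error * xi  # step 4: update each weight
            bias += learning_rate * error
    return weights, bias

# Toy linearly separable data (logical AND with -1/+1 labels)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0])
w, b = train_perceptron(X, y)
print(w, b)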
Gradient descent
It is an algorithm that starts from a random point on the loss function and iterates down the slope to reach the global minimum of the function. In (batch) gradient descent the weights are updated once per pass, after the loss has been computed over all the data points.
After every iteration the slope, i.e. dL/dW, is calculated, and the new weight is computed using the formula:
W_new = W_old - eta * (dL/dW)
where:
eta --> learning rate
W_old --> previous (old) weight
W_new --> updated (new) weight
dL/dW --> the gradient value obtained in that iteration
Gradient descent
Let’s say we have 20000 data points with 10 features. The gradient will then be calculated with respect to all the features for all the data points in the set.
So the total number of calculations per iteration will be 20000 * 10 = 200000.
To attain the global minimum, it is common for the algorithm to run for around 1000 iterations, so the total number of computations performed will be 200000 * 1000 = 200000000. This is a very large amount of computation that consumes a lot of time, hence gradient descent is slow over huge datasets.
In Stochastic Gradient Descent (SGD), only one randomly chosen training example is used to calculate the gradient and update the parameters at each iteration.
SGD….
1.Initialize the model parameters randomly.
2.Define the cost function that you want to minimize.
3.Specify the hyperparameters such as the learning rate, batch size, and
momentum.
4.Repeat the following steps until convergence or a maximum number of
iterations:
a. Shuffle the training data.
b. Divide the training data into mini-batches of the specified size.
c. For each mini-batch, compute the gradient of the cost function with respect
to the model parameters.
d. Update the model parameters by subtracting the product of the gradient and
the learning rate from the current parameter values.
e. Optionally, apply momentum to the update by adding a fraction of the
previous update vector to the current update vector.
5. Evaluate the performance of the trained model on the validation set or test set.
6. Adjust the hyperparameters if necessary and repeat steps 4-5.
7. Stop when the performance of the model on the validation set stops improving.
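A hedged NumPy sketch of this loop for a linear model with a squared-error cost (the data, learning rate, batch size, and momentum value are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                 # toy training data
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = rng.normal(size=5)                         # 1. initialize parameters randomly
lr, batch_size, beta = 0.01, 32, 0.9           # 3. hyperparameters (learning rate, batch size, momentum)
velocity = np.zeros_like(w)

for epoch in range(50):                        # 4. repeat until convergence / max iterations
    order = rng.permutation(len(X))            # 4a. shuffle the training data
    for start in range(0, len(X), batch_size): # 4b. divide into mini-batches
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # 4c. gradient of the cost on the mini-batch
        velocity = beta * velocity - lr * grad        # 4e. optional momentum term
        w = w + velocity                              # 4d. parameter update

print(w)   # 5. evaluate: w should be close to true_w on this toy problem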
Local and Global minima
Local minimum:
The point on a curve that is lower than its preceding and succeeding points is called a local minimum.
Global minimum:
The point on a curve that is lower than all other points on the curve is called the global minimum.
A curve can have more than one local minimum, but it has only one global minimum.
Batch Gradient Descent | Stochastic Gradient Descent
Computes the gradient using the whole training set | Computes the gradient using a single training sample
Slow and computationally expensive | Faster and less computationally expensive than Batch GD
Not suggested for huge training sets | Can be used for large training sets
Deterministic in nature | Stochastic in nature
Gives the optimal solution, given sufficient time to converge | Gives a good solution, but not the optimal one
No random shuffling of points is required | The data should be in a random order, which is why the training set is shuffled for every epoch
Cannot escape shallow local minima easily | Can escape shallow local minima more easily
Convergence is slow | Reaches convergence much faster
Updates the model parameters only after processing the entire training set | Updates the parameters after each individual data point
The learning rate is fixed and cannot be changed during training | The learning rate can be adjusted dynamically
Backpropagation
By decomposing the prediction into its basic elements, we can see that the weights are the variable elements affecting the prediction value. In other words, in order to change the prediction value, we need to change the weight values.
Backpropagation ……
The question now is: how do we change/update the weight values so that the error is reduced?
The answer is Backpropagation!
Backpropagation…..
So to update w6 we can apply the gradient-descent update formula:
w6_new = w6_old - eta * (dE/dw6)
Similarly, we can derive the update formula for w5 and any other weight between the output and the hidden layer.
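As an illustrative numeric sketch (not taken from the slides), the gradient dE/dw6 for a single sigmoid output neuron with squared error can be computed with the chain rule and then used in the update above; all values below are made up:

import math

# Hypothetical values: h2 is a hidden activation feeding weight w6, 'target' is the
# desired output, and the squared error is E = 0.5 * (out - target)^2
h2, w6, target, eta = 0.6, 0.45, 1.0, 0.5

net_out = w6 * h2                                  # input to the output neuron (bias omitted)
out = 1 / (1 + math.exp(-net_out))                 # sigmoid activation

# Chain rule: dE/dw6 = dE/dout * dout/dnet * dnet/dw6
dE_dout = out - target
dout_dnet = out * (1 - out)
dnet_dw6 = h2
grad_w6 = dE_dout * dout_dnet * dnet_dw6

w6_new = w6 - eta * grad_w6                        # gradient-descent update of w6
print(grad_w6, w6_new)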
Vanishing gradient problem
Vanishing gradient problem depends on the choice of the activation function. Many common activation
functions (e.g. sigmoid or tanh) 'squash' their input into a very small output range in a very non-linear fashion.
For example, sigmoid maps the real number line onto a "small" range of [0, 1], especially with the function
being very flat on most of the number-line. As a result, there are large regions of the input space which are
mapped to an extremely small range. In these regions of the input space, even a large change in the input will
produce a small change in the output - hence the gradient is small.
This becomes much worse when we stack multiple layers of such non-linearities on top of each other. For
instance, first layer will map a large input region to a smaller output region, which will be mapped to an even
smaller region by the second layer, which will be mapped to an even smaller region by the third layer and so
on. As a result, even a large change in the parameters of the first layer doesn't change the output much.
We can avoid this problem by using activation functions which don't have this property of 'squashing' the input space into a small region. A popular choice is the Rectified Linear Unit (ReLU), which maps x to max(0, x).
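A small numerical illustration of this point, multiplying the local derivatives of ten stacked sigmoid layers the way backpropagation does (purely illustrative):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Push a value through 10 stacked sigmoid "layers" and multiply the local
# derivatives, as backpropagation does via the chain rule.
a = 2.0
grad_product = 1.0
for _ in range(10):
    a = sigmoid(a)
    grad_product *= a * (1 - a)   # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) <= 0.25

print(grad_product)   # a vanishingly small number after only 10 layers

# By contrast, ReLU(x) = max(0, x) has derivative 1 for every positive input,
# so stacking ReLU layers does not shrink the gradient in this way.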
ReLU Activation Function
ReLU stands for Rectified Linear Unit. It is a simple activation function, yet it often works better than its predecessors such as sigmoid and tanh.
def ReLU(x):
    if x > 0:
        return x
    else:
        return 0
Advantages of ReLU:
• All negative values are converted to 0, so no negative activations are passed on.
• The output is unbounded above (the maximum is infinity), so there is no vanishing gradient issue, which helps prediction accuracy and training efficiency.
• It is fast to compute compared to other activation functions.
Hyperparameter tuning
Hyperparameters are the parameters that cannot be directly learned from the regular training process. They are usually fixed before the actual training process begins. These parameters express important properties of the model, such as its complexity or how fast it should learn.
GridSearchCV
RandomizedSearchCV
GridSearchCV
• In GridSearchCV approach, the machine learning model is evaluated for a range of hyperparameter values. This
approach is called GridSearchCV, because it searches for the best set of hyperparameters from a grid of
hyperparameters values.
• For example, if we want to set two hyperparameters C and Alpha of the Logistic Regression Classifier model, with
different sets of values. The grid search technique will construct many versions of the model with all possible
combinations of hyperparameters and will return the best one.
• For example, with C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], grid search evaluates every combination; if the combination C = 0.3 and Alpha = 0.2 gives the highest performance score (0.726), it is selected.
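A hedged scikit-learn sketch of this idea (the dataset and grid values are illustrative; note that scikit-learn's LogisticRegression exposes the C hyperparameter but not "Alpha", so only C is searched here):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Grid of candidate hyperparameter values (every combination is tried with cross-validation)
param_grid = {"C": [0.1, 0.2, 0.3, 0.4, 0.5]}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=5000),
    param_grid=param_grid,
    cv=5,                 # 5-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)   # best combination and its CV score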
Linear separability: A dataset is linearly separable if there is at least one line that clearly distinguishes the classes.
Non-linear separability: A dataset is said to be non-linearly separable if there isn't a single line that clearly distinguishes the classes.
Regularization
1. Batch Normalization
2. Drop-out layer
1. Batch Normalization
• Normalization is a data pre-processing tool used to bring the numerical data to a common scale without
distorting its shape.
• Batch normalization is a process to make neural networks faster and more stable by adding extra layers to a deep neural network. The new layer performs the standardizing and normalizing operations on the input of a layer coming from a previous layer.
• A typical neural network is trained using a collected set of input data called a batch. Similarly, the normalizing process in batch normalization takes place in batches, not on a single input.
Internal Covariate Shift is defined as the change in the distribution of network activations due to the change in network parameters during training.
Batch Normalization…..
Although our input X was normalized, with time the output will no longer be on the same scale. As the data goes through the multiple layers of the neural network and activation functions are applied, it leads to an internal covariate shift in the data.
Step 1: Normalization of the input
Normalization is the process of transforming the data to have mean zero and standard deviation one. In this step we take our batch input from layer h and first calculate the mean of this hidden activation.
Step 2: Calculate the standard deviation of the hidden activations.
Batch Normalization…..
Step 3: Subtract the mean from each input and divide by the sum of the standard deviation and the smoothing term (ε).
The smoothing term (ε) assures numerical stability within the operation by preventing division by zero.
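A NumPy sketch of the normalization steps described above (the toy batch is illustrative; framework implementations typically also learn a scale and shift after this step):

import numpy as np

def batch_norm(h, eps=1e-5):
    # Normalize a batch of hidden activations h with shape (batch_size, features).
    mu = h.mean(axis=0)        # Step 1: mean of the hidden activations over the batch
    sigma = h.std(axis=0)      # Step 2: standard deviation over the batch
    # Step 3: subtract the mean and divide by (std + eps); eps prevents division by zero.
    # (Framework implementations usually divide by sqrt(variance + eps) instead.)
    return (h - mu) / (sigma + eps)

batch = np.random.rand(4, 3) * 10 + 5   # toy batch far from zero mean / unit std
normalized = batch_norm(batch)
print(normalized.mean(axis=0).round(6), normalized.std(axis=0).round(6))   # ~0 and ~1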