Unit 1
DEEP LEARNING
Deep learning is a subset of machine learning that involves algorithms called artificial neural
networks, which are inspired by the structure and function of the human brain. These algorithms
can learn to recognize patterns and make decisions in a way that is similar to how humans do,
but with the ability to process large amounts of data much more quickly and efficiently.
Deep learning has been particularly successful in tasks such as image and speech recognition,
natural language processing, and autonomous driving. It has enabled significant advancements in
areas like computer vision, medical diagnosis, and language understanding.
One of the key features of deep learning is its ability to automatically learn representations of
data, which means that it can discover patterns and features in the input data without the need for
explicit programming. This makes deep learning particularly powerful for tasks where the
underlying structure of the data is complex or not well understood.
Deep learning has been driven by advancements in hardware, such as GPUs (Graphics
Processing Units) and TPUs (Tensor Processing Units), as well as by the availability of large
datasets and improvements in algorithms and training techniques. These advancements have led
to the development of increasingly complex and powerful neural network architectures, leading
to state-of-the-art performance in many domains.
DEEP LEARNING DEFINITION AND DIFFERENCE FROM MACHINE LEARNING AND ARTIFICIAL INTELLIGENCE
Deep learning is a subset of machine learning that uses neural networks with many layers (hence
the term "deep") to learn from data. It is inspired by the structure and function of the human
brain, where each layer of neurons processes different aspects of the input data and passes the
processed information to the next layer. Deep learning algorithms can automatically learn to
recognize patterns and features in the input data without explicit programming, making them
particularly powerful for tasks such as image and speech recognition, natural language
processing, and autonomous driving.
Machine learning, on the other hand, is a broader field that encompasses techniques and
algorithms that enable computers to learn from and make predictions or decisions based on data.
It includes not only deep learning but also other approaches such as decision trees, support vector
machines, and clustering algorithms. Machine learning algorithms can be categorized into
supervised learning (where the model learns from labeled data), unsupervised learning (where
the model learns from unlabeled data), and reinforcement learning (where the model learns
through trial and error based on feedback from its environment).
Artificial intelligence (AI) is a broader concept that refers to the development of computer
systems that can perform tasks that typically require human intelligence, such as visual
perception, speech recognition, decision-making, and language translation. Machine learning and
deep learning are subsets of AI, representing specific approaches to achieving artificial
intelligence by enabling machines to learn from data.
In summary, deep learning is a subset of machine learning that uses neural networks with many
layers to learn from data, while machine learning is a broader field that includes various
techniques and algorithms for enabling computers to learn from data. Artificial intelligence is the
overarching concept that encompasses the development of intelligent computer systems,
including but not limited to machine learning and deep learning.
At its core, deep learning is a type of machine learning that involves training artificial neural
networks to recognize patterns in data. Here are some basic concepts to understand:
Neural Networks: Deep learning uses artificial neural networks, which are computational models
inspired by the structure and function of biological neural networks in the human brain. These
networks consist of interconnected nodes, called neurons, organized in layers. Each neuron takes
input, processes it using a mathematical function, and produces an output.
Deep Neural Networks: Deep learning involves deep neural networks, which are neural networks
with multiple hidden layers between the input and output layers. The "deep" in deep learning
refers to these multiple layers, which allow the network to learn increasingly complex
representations of the data.
Training Data: Deep learning models are trained on large amounts of labeled data. During the
training process, the model learns to associate input data with the correct output by adjusting the
weights of the connections between neurons. This process is often done using optimization
algorithms like gradient descent, which minimize the difference between the model's predictions
and the actual outputs.
Feature Extraction: One of the key advantages of deep learning is its ability to automatically
learn relevant features from raw data. In traditional machine learning, feature engineering (the
process of selecting and transforming input variables) is often done manually, but in deep
learning, the model can learn to extract useful features from the data on its own.
Activation Functions: Activation functions are used in neural networks to introduce non-
linearities into the model, allowing it to learn complex relationships in the data. Common
activation functions include the sigmoid function, tanh function, and rectified linear unit (ReLU).
Loss Functions: Loss functions measure the difference between the model's predictions and the
actual outputs during training. The goal of training is to minimize this loss, which is achieved by
adjusting the model's parameters (weights and biases) using optimization algorithms.
Applications: Deep learning has been successfully applied to a wide range of tasks, including
image and speech recognition, natural language processing, recommendation systems, and
autonomous vehicles. Its ability to learn from large amounts of data and automatically extract
relevant features has made it a powerful tool in many domains.
Overall, deep learning is a powerful approach to machine learning that has led to significant
advancements in AI applications, especially in tasks that involve complex patterns and large
datasets.
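As a minimal sketch of the concepts above, the following Python snippet implements a single artificial neuron, the three activation functions mentioned, and a mean squared error loss. The input values, weights, and bias are hypothetical numbers chosen only for illustration.

```python
import math

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # Squashes any real number into the open interval (-1, 1).
    return math.tanh(z)

def relu(z):
    # Passes positive values through unchanged and clips negatives to 0.
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def mse(predictions, targets):
    """Mean squared error: a common loss function for regression."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical inputs and parameters; the weighted sum here is about -0.4.
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, -0.2]
b = 0.1

out_relu = neuron(x, w, b, relu)        # negative weighted sum, so ReLU gives 0.0
out_sigmoid = neuron(x, w, b, sigmoid)  # roughly 0.40
out_tanh = neuron(x, w, b, tanh)        # roughly -0.38
```

The same weighted sum produces different outputs depending on the activation function, which is why the choice of activation matters for what a network can represent.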
LEARNING ALGORITHM
In deep learning, the learning algorithm refers to the method used to adjust the parameters of the
neural network (such as weights and biases) during the training process. The goal of the learning
algorithm is to minimize a loss function, which measures the difference between the model's
predictions and the actual outputs.
Back propagation: Back propagation is a technique used to calculate the gradients of the loss
function with respect to the parameters of the neural network. It is based on the chain rule of
calculus and allows the gradients to be efficiently calculated layer by layer, starting from the
output layer and moving backward through the network. Back propagation is essential for
training deep neural networks efficiently.
Stochastic Gradient Descent (SGD): SGD is a variant of gradient descent where the parameters
are updated using the gradients of the loss function calculated on a subset of the training data (a
mini-batch) rather than the entire dataset. This can lead to faster convergence and is commonly
used in practice due to its efficiency.
Adam: Adam is an adaptive learning rate optimization algorithm that combines the advantages
of both AdaGrad (adaptive gradient algorithm) and RMSProp (root mean square propagation). It
dynamically adjusts the learning rate for each parameter based on the first and second moments
of the gradients, which can lead to faster convergence and improved performance.
These are just a few examples of learning algorithms used in deep learning. There are many
other algorithms and variations that are used depending on the specific requirements of the
problem and the architecture of the neural network. Each algorithm has its advantages and trade-
offs in terms of convergence speed, stability, and performance.
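To make the update rules concrete, here is a small sketch that minimizes the one-dimensional function f(w) = (w - 3)^2 with plain gradient descent and with Adam. The objective, the starting point, and the hyperparameter values are assumptions chosen for illustration; the Adam update follows the commonly published form with bias-corrected first and second moment estimates.

```python
import math

def grad(w):
    # Derivative of the illustrative objective f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr=0.1, steps=100):
    for _ in range(steps):
        w -= lr * grad(w)          # step against the gradient
    return w

def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = 0.0                        # running mean of gradients (first moment)
    v = 0.0                        # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_gd = gradient_descent(0.0)
w_adam = adam(0.0)
# Both results should end up close to the minimizer w = 3.
```

On a problem this simple, plain gradient descent is entirely adequate. The per-parameter scaling in Adam tends to matter more when different parameters have gradients of very different magnitudes.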
Maximum Likelihood Estimation (MLE) is a method used in statistics and machine learning to
estimate the parameters of a statistical model. The goal of MLE is to find the values of the
model's parameters that maximize the likelihood of observing the given data.
Maximizing Likelihood: The goal of MLE is to find the values of the parameters θ that maximize
the likelihood function L(θ). In other words, we want to find the values of θ that make the
observed data most probable under the model.
Log-Likelihood: In practice it is usually easier to work with the logarithm of the likelihood,
log L(θ). Because the logarithm is monotonically increasing, it has the same maximizer as L(θ),
and it turns a product over independent observations into a sum.
Optimization: Once we have the log-likelihood function, we can use optimization algorithms
such as gradient descent, Newton's method, or other iterative methods to find the values of θ that
maximize the log-likelihood. These algorithms iteratively update the parameters to move towards
the maximum of the log-likelihood function.
Interpretation: The parameter estimates obtained through MLE are often considered the "best"
estimates in the sense that they maximize the likelihood of observing the given data under the
assumed statistical model.
Maximum Likelihood Estimation is widely used in statistics and machine learning for parameter
estimation in various models, including linear regression, logistic regression, neural networks,
and many others. It provides a principled way to estimate the parameters of a model based on the
observed data and is a fundamental concept in statistical inference.
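As an illustrative sketch, consider estimating the probability p of heads from a series of coin flips (a Bernoulli model). The data below are hypothetical. For this model the maximum likelihood estimate has a closed form, the fraction of heads, and the snippet checks that gradient ascent on the log-likelihood arrives at the same value.

```python
import math

# Hypothetical observations: 1 = heads, 0 = tails (7 heads out of 10 flips).
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def log_likelihood(p, xs):
    """log L(p) = sum over observations of log p (heads) or log(1 - p) (tails)."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in xs)

def fit_by_gradient_ascent(xs, p=0.5, lr=0.01, steps=2000):
    heads = sum(xs)
    tails = len(xs) - heads
    for _ in range(steps):
        # Derivative of the log-likelihood with respect to p.
        gradient = heads / p - tails / (1.0 - p)
        p += lr * gradient                      # ascend, since we maximize
        p = min(max(p, 1e-6), 1.0 - 1e-6)       # keep p a valid probability
    return p

p_closed_form = sum(data) / len(data)    # fraction of heads
p_numeric = fit_by_gradient_ascent(data)
```

Both estimates agree at p = 0.7, and the log-likelihood at that value is higher than at any other candidate, for example p = 0.5.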
Building a machine learning algorithm involves several key steps, which can vary depending on
the specific problem you're trying to solve and the type of algorithm you're using. Here's a
general outline of the process:
Define the Problem: Clearly define the problem you want to solve with machine learning. This
includes defining the input data (features) and the output (target variable) you want to predict.
Gather and Prepare Data: Collect and preprocess the data you'll use to train and evaluate your
algorithm. This may involve tasks like data cleaning, handling missing values, encoding
categorical variables, and splitting the data into training and test sets.
Select a Model: Choose a machine learning model that is suitable for your problem. This could
be a decision tree, random forest, support vector machine, neural network, or another type of
model depending on the nature of your data and the problem you're solving.
Train the Model: Use the training data to train the selected model. During training, the model
learns the patterns in the data by adjusting its internal parameters.
Evaluate the Model: Once the model is trained, evaluate its performance on a separate test
dataset that it hasn't seen before. Common evaluation metrics include accuracy, precision, recall,
F1 score, and others, depending on the nature of the problem (classification, regression, etc.).
Tune Hyperparameters: Many machine learning algorithms have hyperparameters that need
to be set before training. Hyperparameters are not learned from the data but are set before the
learning process begins. Tuning these hyperparameters can significantly impact the performance
of the model.
Deploy the Model (if applicable): If the model performs well and meets the requirements of
your problem, you may deploy it to make predictions on new, unseen data.
Monitor and Maintain the Model (if applicable): If the model is deployed in a production
environment, it's important to monitor its performance over time and update it as needed to
maintain its accuracy and relevance.
Throughout this process, it's crucial to iterate and refine your approach based on the performance
of the model and the specific requirements of your problem. Machine learning is often an
iterative process that involves experimentation and continuous improvement.
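The outline above can be sketched end to end in a few lines. The example below is only an illustration under stated assumptions: the data are synthetic, the model is a simple k-nearest-neighbours classifier written from scratch, and the candidate values of k are arbitrary. A separate validation split is used for tuning so that the test set stays unseen until the final evaluation.

```python
import random

random.seed(0)

# 1-2. Define the problem and prepare data: two synthetic 2-D clusters, labels 0 and 1.
def make_point(label):
    center = 2.0 if label == 1 else -2.0
    return ([random.gauss(center, 1.0), random.gauss(center, 1.0)], label)

data = [make_point(i % 2) for i in range(300)]
random.shuffle(data)
train, valid, test = data[:180], data[180:240], data[240:]

# 3-4. Select and "train" a model: k-nearest neighbours simply stores the training set.
def predict(train_set, x, k):
    by_distance = sorted(
        train_set,
        key=lambda item: (item[0][0] - x[0]) ** 2 + (item[0][1] - x[1]) ** 2,
    )
    votes = sum(label for _, label in by_distance[:k])
    return 1 if 2 * votes > k else 0   # majority vote among the k neighbours

# 5. Evaluate: fraction of correctly classified examples.
def accuracy(train_set, eval_set, k):
    correct = sum(predict(train_set, x, k) == y for x, y in eval_set)
    return correct / len(eval_set)

# 6. Tune the hyperparameter k on the validation split, then report on the test split.
best_k = max([1, 3, 5, 7], key=lambda k: accuracy(train, valid, k))
test_accuracy = accuracy(train, test, best_k)
```

Because the two clusters are well separated, the test accuracy should be high here. On real data, the same loop is typically repeated with different models and preprocessing choices.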
Neural network
A neural network is a computational model inspired by the structure and function of the human
brain. It is composed of interconnected nodes, called neurons, that work together to process and
learn from data. Neural networks are a fundamental component of deep learning, a subset of
machine learning that focuses on training algorithms to learn from data.
Neurons: Neurons are the basic computational units of a neural network. Each neuron receives
input signals, processes them using an activation function, and produces an output signal. In an
artificial neural network, the input signals are typically weighted before being processed by the
activation function.
Layers: Neurons in a neural network are organized into layers. The most basic type of neural
network is the feed forward neural network, which consists of an input layer, one or more hidden
layers, and an output layer. Each layer (except the input layer) receives inputs from the previous
layer and produces outputs that serve as inputs to the next layer.
Weights and Biases: The connections between neurons in adjacent layers are characterized by
weights, which represent the strength of the connection. Each neuron also has an associated bias,
which allows the network to learn more complex patterns by shifting the activation function.
Activation Functions: Activation functions introduce non-linearity into the output of a neuron.
Common activation functions include sigmoid, tanh, ReLU (Rectified Linear Unit), and softmax.
These functions are crucial for enabling the network to learn complex, non-linear relationships in
the data.
Training: Training a neural network involves adjusting the weights and biases of the network
based on the input data and the expected outputs (supervised learning). This is typically done
using optimization algorithms like gradient descent, which minimize a loss function that
measures the difference between the predicted outputs and the true outputs.
Back propagation: Back propagation is a key algorithm for training neural networks. It
calculates the gradient of the loss function with respect to the weights and biases of the network,
allowing for efficient adjustment of these parameters during training.
Types of Neural Networks: There are many types of neural networks, each designed for
specific tasks. Some common types include:
Feed forward Neural Networks (FNN): Basic neural networks where information flows in one
direction, from input to output.
Convolutional Neural Networks (CNN): Well-suited for image recognition tasks, with specialized
layers for feature extraction.
Recurrent Neural Networks (RNN): Designed for sequential data, such as time series or natural
language processing tasks, with connections that form cycles.
Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU): Specialized types of
RNNs with improved handling of long-range dependencies.
Neural networks have demonstrated remarkable success in various fields, including image and
speech recognition, natural language processing, robotics, and more. Their ability to learn
complex patterns from data makes them a powerful tool for solving a wide range of problems.
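To show how layers, weights, biases, and activation functions fit together, here is a minimal sketch of a forward pass through a small feed forward network with two inputs, three hidden neurons (ReLU), and two outputs (softmax). All parameter values are hypothetical and untrained; they only illustrate the flow of computation.

```python
import math

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    # Subtracting the maximum avoids overflow without changing the result.
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """One fully connected layer: weights[j] holds the input weights of neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Hypothetical parameters: 2 inputs -> 3 hidden neurons -> 2 outputs.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.3, -0.5, 0.2], [-0.4, 0.6, 0.1]]
b2 = [0.05, -0.05]

def forward(x):
    hidden = relu(dense(x, W1, b1))            # hidden layer with ReLU
    return softmax(dense(hidden, W2, b2))      # output layer as class probabilities

probs = forward([1.0, 2.0])
```

The softmax output can be read as a probability for each class: the entries are positive and sum to 1.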
Multilayer perceptron
A multilayer perceptron (MLP) is a type of feed forward artificial neural network with one or
more layers between the input and output layers. It is a classic and widely used architecture in
the field of neural networks and serves as the foundation for many modern deep learning models.
Here are some key characteristics of the multilayer perceptron:
Architecture: An MLP consists of an input layer, one or more hidden layers, and an output layer.
Each layer is composed of multiple neurons (also called units or nodes), and each neuron in one
layer is connected to every neuron in the subsequent layer. In a standard MLP, the connections
between neurons do not form cycles, making it a feed forward network.
Neurons and Activation Functions: Each neuron in an MLP applies a weighted sum of its inputs,
adds a bias term, and passes the result through an activation function to produce the neuron's
output. Common activation functions used in MLPs include the sigmoid function, hyperbolic
tangent (tanh) function, and rectified linear unit (ReLU).
Training: MLPs are typically trained using supervised learning techniques, where the network
learns to map input data to output data based on a set of labeled training examples. Training
involves adjusting the weights and biases of the network using algorithms such as gradient
descent and back propagation to minimize a loss function that measures the difference between
the predicted outputs and the true outputs.
Universal Approximators: MLPs with a single hidden layer containing a sufficient number of
neurons and an appropriate activation function have been shown to be universal function
approximators, meaning they can approximate any continuous function on a bounded domain to
arbitrary accuracy given enough hidden neurons.
Applications: MLPs have been successfully applied to a wide range of tasks, including
regression, classification, pattern recognition, and time series prediction. They have been used in
fields such as finance, healthcare, natural language processing, computer vision, and many
others.
Challenges: While powerful, MLPs can be prone to overfitting, especially when dealing with
high-dimensional data or limited training examples. Regularization techniques, such as dropout
and weight decay, are commonly used to mitigate overfitting in MLPs.
Extensions: MLPs can be extended and modified in various ways to address specific challenges
or requirements. For example, adding more layers (creating a deep neural network) can allow
MLPs to learn hierarchical representations of data.
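The two regularization techniques named under Challenges can be sketched briefly. The snippet below shows one common formulation of each: "inverted" dropout, which rescales the surviving activations during training, and weight decay written as an extra term in a gradient step. The function names and the numbers used are illustrative assumptions, not a specific library's API.

```python
import random

def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each activation with probability drop_prob and
    rescale the survivors so the expected value stays the same."""
    if drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

def sgd_step_with_weight_decay(weights, grads, lr, decay):
    """Gradient step on loss + (decay / 2) * sum(w^2): the extra term pulls
    every weight towards zero."""
    return [w - lr * (g + decay * w) for w, g in zip(weights, grads)]

rng = random.Random(0)
hidden = [1.0] * 10000
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)   # stays close to 1.0 on average

# With zero loss gradient, weight decay alone shrinks the weights slightly.
shrunk = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0], lr=0.1, decay=0.5)
```

Dropout is applied only during training. At prediction time the activations are used unchanged, which is why the rescaling is done up front.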
Rosenblatt’s perceptron machine relied on a basic unit of computation, the neuron. Just like in
previous models, each neuron has a cell that receives a series of pairs of inputs and weights.
The major difference in Rosenblatt’s model is that inputs are combined in a weighted sum and, if
the weighted sum exceeds a predefined threshold, the neuron fires and produces an output.
Figure: Perceptron neuron model (left) and threshold logic (right).
Threshold T represents the activation function. If the weighted sum of the inputs is greater than
zero the neuron outputs the value 1, otherwise the output value is zero.
Perceptron for Binary Classification
With this discrete output, controlled by the activation function, the perceptron can be used as a
binary classification model, defining a linear decision boundary. It finds the separating
hyperplane that minimizes the distance between misclassified points and the decision boundary.
To minimize this distance, the Perceptron uses Stochastic Gradient Descent as the optimization
method. If the data is linearly separable, it is guaranteed that Stochastic Gradient Descent will
converge in a finite number of steps. The last piece that the Perceptron needs is the activation
function, the function that determines if the neuron will fire or not. The original Perceptron used
a hard threshold (step) function. Later models often replaced it with the sigmoid function, a
smooth, non-linear function that maps any real input to a value between 0 and 1. The neuron can
receive negative numbers as input, and it will still produce an output between 0 and 1.
But if you look at Deep Learning papers and algorithms from the last decade, you’ll see that most
of them use the Rectified Linear Unit (ReLU) as the neuron’s activation function.
The reason ReLU became more widely adopted is that it allows better optimization using
Stochastic Gradient Descent, is cheaper to compute, and is scale-invariant, meaning that scaling
the input by a positive factor scales the output by the same factor.
Putting it all together
The neuron receives inputs and picks an initial set of weights at random. These are combined in a
weighted sum, and then ReLU, the activation function, determines the value of the output.
Figure: Perceptron neuron model (left) and activation function (right).
The Perceptron does learn. It uses Stochastic Gradient Descent to find, or you might say learn,
the set of weights that minimizes the distance between the misclassified points and the decision
boundary. Once Stochastic Gradient Descent converges, the dataset is separated into two regions
by a linear hyperplane. Although it was said the Perceptron could represent any circuit and logic,
the biggest criticism was that it couldn’t represent the XOR gate (exclusive OR), where the gate
only returns 1 if the inputs are different. This was proved almost a decade later by Minsky and
Papert, in 1969, and highlights the fact that the Perceptron, with only one neuron, can’t be
applied to data that is not linearly separable.
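The following sketch illustrates both points made above: a single threshold neuron trained with the classic perceptron update rule learns the linearly separable AND gate, but never reaches zero mistakes on XOR. The learning rate, the zero initialization, and the epoch limit are arbitrary choices for illustration.

```python
def step(z):
    # Threshold activation: fire (1) only if the weighted sum is positive.
    return 1 if z > 0 else 0

def train_perceptron(samples, lr=1.0, max_epochs=100):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), target in samples:
            prediction = step(w[0] * x1 + w[1] * x2 + b)
            error = target - prediction
            if error != 0:
                # Move the decision boundary towards the misclassified point.
                w[0] += lr * error * x1
                w[1] += lr * error * x2
                b += lr * error
                mistakes += 1
        if mistakes == 0:
            return w, b, True      # every training point is classified correctly
    return w, b, False             # no separating line found within the limit

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_w, and_b, and_converged = train_perceptron(AND)
xor_w, xor_b, xor_converged = train_perceptron(XOR)
```

Training on AND stops after a handful of passes, while training on XOR exhausts the epoch limit, because no single straight line separates the two XOR classes.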
Multilayer Perceptron
The Multilayer Perceptron was developed to tackle this limitation. It is a neural network where
the mapping between inputs and output is non-linear. A Multilayer Perceptron has input and
output layers and one or more hidden layers with many neurons stacked together. And while in
the Perceptron the neuron must have an activation function that imposes a threshold, like ReLU or
sigmoid, neurons in a Multilayer Perceptron can use any arbitrary activation function.
Figure: Multilayer Perceptron.
Multilayer Perceptron falls under the category of feed forward algorithms, because inputs are
combined with the initial weights in a weighted sum and subjected to the activation function, just
like in the Perceptron. But the difference is that each linear combination is propagated to the next
layer. Each layer is feeding the next one with the result of their computation, their internal
representation of the data. This goes all the way through the hidden layers to the output layer. But
it has more to it. If the algorithm only computed the weighted sums in each neuron, propagated
results to the output layer, and stopped there, it wouldn’t be able to learn the weights that
minimize the cost function. If the algorithm only computed one iteration, there would be no actual
learning.
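To see why a hidden layer removes the XOR limitation, the sketch below wires a tiny Multilayer Perceptron by hand: two hidden threshold neurons compute OR and AND, and the output neuron fires when OR is on but AND is off. The weights are hand-picked for illustration rather than learned, and the feed forward pass is all that is shown.

```python
def step(z):
    # Threshold activation: fire (1) only if the weighted sum is positive.
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer: each neuron is a weighted sum followed by a threshold.
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)    # fires if at least one input is 1
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)   # fires only if both inputs are 1
    # Output layer: combines the hidden representations ("OR but not AND").
    return step(1.0 * h_or - 1.0 * h_and - 0.5)

truth_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```

Each hidden neuron on its own still draws a straight line. It is the combination of their outputs in the next layer that produces the non-linear mapping.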
Back propagation
Back propagation is the learning mechanism that allows the Multilayer Perceptron to iteratively
adjust the weights in the network, with the goal of minimizing the cost function. There is one hard
requirement for back propagation to work properly. The function that combines inputs and
weights in a neuron, for instance the weighted sum, and the threshold function, for instance
ReLU, must be differentiable. These functions must have a bounded derivative, because Gradient
Descent is typically the optimization function used in Multilayer Perceptron.
Figure: Multilayer Perceptron, highlighting the feed forward and back propagation steps.
In each iteration, after the weighted sums are forwarded through all layers, the gradient of
the Mean Squared Error is computed across all input and output pairs. Then, to propagate it back,
the weights of each layer, from the output layer back to the first hidden layer, are updated using
the value of the gradient. That’s how the error is propagated back to the starting point of the
neural network!
Figure: One iteration of Gradient Descent.
This process keeps going until the gradient for each input-output pair has converged, meaning the
newly computed gradient hasn’t changed by more than a specified convergence threshold compared
to the previous iteration.
Your parents have a cozy bed and breakfast in the countryside with the traditional guestbook in
the lobby. Every guest is welcome to write a note before they leave and, so far, very few leave
without writing a short note or inspirational quote. Some even leave drawings of Molly, the
family dog. Summer season is coming to a close, which means cleaning time, before work starts
picking up again for the holidays. In the old storage room, you’ve stumbled upon a box full of
guestbooks your parents kept over the years. After reading a few pages, you had a much better
idea. Why not try to understand if guests left a positive or negative message? In Natural
Language Processing tasks, some of the text can be ambiguous, so usually you have a corpus of
text where the labels were agreed upon by 3 experts, to avoid ties.
Back propagation, short for "backward propagation of errors," is a fundamental algorithm used
for training artificial neural networks, particularly in the context of supervised learning. It plays a
critical role in the learning process by adjusting the weights of the network to minimize the
difference between the actual output and the desired output. Here's how it works:
Basic Concept
Forward Pass:
Input Processing: The input data is fed into the neural network.
Output Calculation: The network processes the input through its layers to produce an output.
Error Calculation: At the output layer, the algorithm calculates the error, which is the
difference between the network's output and the desired output (from the training dataset).
Back propagation
Backward Pass:
Error Propagation: The error is propagated back through the network, starting from the output
layer and moving towards the input layer. This step involves calculating the error contribution of
each neuron to the neurons in the layer before it.
Gradient Calculation: The algorithm computes the gradients of the error with respect to the
network's weights. This is typically done using the chain rule of calculus.
Weights Update: The calculated gradients are then used to update the weights of the network.
The goal is to adjust the weights in a way that minimizes the error.
Learning Rate: An important hyperparameter that controls how much the weights are adjusted
during training. Too high a learning rate can cause the algorithm to overshoot the minimum
error, while too low a learning rate can make the learning process very slow.
Activation Functions: The choice of activation function in the neurons is crucial since the back
propagation algorithm relies on derivatives. Functions like the sigmoid, tanh, and ReLU are
commonly used because they are differentiable.
Use of Derivatives: Back propagation essentially uses the derivatives of the activation functions
to understand how changes in weights affect the overall error.
Back propagation is a fundamental algorithm used in the training of artificial neural networks,
which are a type of machine learning model inspired by the human brain. The goal of back
propagation is to adjust the weights of the connections between neurons in the network in order
to minimize the difference between the actual output of the network and the desired output,
typically measured by a loss function.
1. Forward Pass: During the forward pass, the input data is fed into the neural network, and the
network computes an output based on its current weights and biases. This output is compared to
the true output (the target) using a loss function, which measures the difference between the
predicted output and the actual output.
2. Backward Pass: In the backward pass (back propagation), the gradients of the loss function
with respect to the weights and biases of the network are computed using the chain rule from
calculus. This involves calculating how small changes in the weights and biases affect the loss
function.
3. Gradient Descent: Once the gradients have been computed, the weights and biases of the
network are adjusted in the opposite direction of the gradients in order to minimize the loss
function. This process is typically done using an optimization algorithm such as gradient descent,
which iteratively updates the weights and biases in small steps to gradually reduce the loss.
4. Repeat: Steps 1-3 are repeated for multiple iterations (epochs) or until the model's performance
converges to a satisfactory level.
Back propagation is a key component of many machine learning algorithms, especially in the
context of deep learning, where neural networks have multiple layers and a large number of
parameters. It allows these models to learn complex patterns in data and make accurate
predictions.
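The forward pass, backward pass, and update described above can be sketched for a very small network: two inputs, two sigmoid hidden neurons, and one linear output with a squared-error loss. The parameter values and the training example are hypothetical. As a sanity check, one back-propagated gradient is compared against a numerical finite-difference estimate, a common way to test a hand-written backward pass.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y_hat = sum(w * hj for w, hj in zip(W2, h)) + b2
    return h, y_hat

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2

def backward(params, x, y):
    W1, b1, W2, b2 = params
    h, y_hat = forward(params, x)
    d_y = y_hat - y                                   # dLoss/dy_hat
    grad_W2 = [d_y * hj for hj in h]                  # output layer weights
    grad_b2 = d_y
    # Chain rule through the output weights and the sigmoid derivative h * (1 - h).
    d_z = [d_y * w * hj * (1.0 - hj) for w, hj in zip(W2, h)]
    grad_W1 = [[dz * xi for xi in x] for dz in d_z]   # hidden layer weights
    grad_b1 = d_z
    return grad_W1, grad_b1, grad_W2, grad_b2

# Hypothetical parameters and a single training example.
params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.2)
x, y = [1.0, 2.0], 1.0

grad_W1, grad_b1, grad_W2, grad_b2 = backward(params, x, y)

def numerical_grad_w1_01(eps=1e-6):
    # Central finite difference for the weight W1[0][1].
    W1, b1, W2, b2 = params
    plus = ([[W1[0][0], W1[0][1] + eps], W1[1]], b1, W2, b2)
    minus = ([[W1[0][0], W1[0][1] - eps], W1[1]], b1, W2, b2)
    return (loss(plus, x, y) - loss(minus, x, y)) / (2 * eps)

# One gradient descent step: move every parameter against its gradient.
lr = 0.1
new_params = (
    [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(params[0], grad_W1)],
    [b - lr * g for b, g in zip(params[1], grad_b1)],
    [w - lr * g for w, g in zip(params[2], grad_W2)],
    params[3] - lr * grad_b2,
)
```

After the update, the loss on this example is lower than before, which is exactly what one iteration of the loop is meant to achieve.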
Stochastic Gradient Descent (SGD) is a variant of the standard gradient descent algorithm
commonly used in training neural networks through back propagation. While standard gradient
descent calculates the average gradient over the entire training dataset to update the model
parameters, stochastic gradient descent updates the parameters using the gradient of the loss
function with respect to a single training example at a time.
Initialization: Initialize the model parameters (weights and biases) with small random values.
Iterative Optimization: For each training example in the dataset:
Compute the gradient of the loss function with respect to the model parameters using the current
example.
Update the model parameters in the opposite direction of the gradient, scaled by a learning rate,
which controls the size of the update step.
Repeat: Repeat the iterative optimization process for a fixed number of passes over the data
(epochs) or until convergence criteria are met.
Stochastic gradient descent is computationally cheaper per update than standard gradient descent
because it processes individual training examples rather than the entire dataset in each
iteration. However, its updates are noisier, and it may require careful tuning of the learning rate
to ensure convergence.
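The procedure above can be sketched for a simple linear model y = w * x + b fitted with a squared-error loss, updating the parameters after every single example. The synthetic data (a line with slope 2 and intercept 1 plus a little noise), the learning rate, and the number of epochs are assumptions chosen for illustration.

```python
import random

rng = random.Random(1)

# Hypothetical synthetic data: y = 2x + 1 plus a little Gaussian noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)) for x in xs]

def sgd(examples, lr=0.05, epochs=50):
    examples = list(examples)            # copy so the caller's list is not reordered
    # Initialization: small random values.
    w, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)
    for _ in range(epochs):
        rng.shuffle(examples)            # visit the examples in a new random order
        for x, y in examples:
            error = (w * x + b) - y      # gradient of 0.5 * error^2 w.r.t. the prediction
            w -= lr * error * x          # update from this single example only
            b -= lr * error
    return w, b

w, b = sgd(data)
# w and b should end up close to 2 and 1, up to the noise in the data.
```

Because every update is based on one noisy example, the parameters keep jittering slightly around the best values instead of settling exactly. A smaller or decaying learning rate reduces that jitter.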
Other variants of stochastic gradient descent include:
Mini-batch Gradient Descent: This variant combines the ideas of stochastic gradient descent and
batch gradient descent by updating the model parameters based on a small random subset
(mini-batch) of the training data at each iteration. This approach can provide a balance between
the efficiency of stochastic gradient descent and the stability of batch gradient descent.
Momentum: Momentum is a technique that accelerates SGD by adding a fraction of the update
vector of the past time step to the current update vector. This helps to dampen oscillations and
accelerate convergence, especially in the presence of high curvature, small but consistent
gradients, or noisy gradients.
Adaptive Learning Rate Methods (e.g., AdaGrad, RMSprop, Adam): These methods adapt the
learning rate for each parameter based on the history of gradients for that parameter. They aim to
overcome some of the challenges of choosing an appropriate global learning rate by adjusting the
learning rate dynamically for each parameter.
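As a small illustration of the momentum idea, the sketch below runs gradient descent with and without momentum on a two-dimensional quadratic bowl that is much steeper in one direction than the other. The objective, the learning rate, and the momentum coefficient are assumptions chosen for illustration. Exact gradients are used instead of noisy ones to keep the comparison deterministic.

```python
import math

# Illustrative objective: f(w) = 0.5 * (1 * w0^2 + 25 * w1^2), minimized at (0, 0).
CURVATURE = [1.0, 25.0]

def grad(w):
    return [c * wi for c, wi in zip(CURVATURE, w)]

def run(momentum, lr=0.03, steps=100):
    w = [1.0, 1.0]
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        # Keep a fraction of the previous update and add the new gradient step.
        velocity = [momentum * v - lr * gi for v, gi in zip(velocity, g)]
        w = [wi + v for wi, v in zip(w, velocity)]
    return math.hypot(w[0], w[1])        # remaining distance to the minimum

distance_plain = run(momentum=0.0)       # ordinary gradient descent
distance_momentum = run(momentum=0.9)    # gradient descent with momentum
```

With the same learning rate and number of steps, the run with momentum ends closer to the minimum, because the small but consistent gradients along the flat direction accumulate in the velocity.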
The curse of dimensionality refers to various phenomena that arise when analyzing and
organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions)
that do not occur in low-dimensional settings such as the three-dimensional physical space of
everyday experience. This concept is particularly relevant in fields such as machine learning,
data mining, and pattern recognition.
The term was first coined by Richard E. Bellman when considering problems in dynamic
optimization. However, it has broader implications in the field of data analysis. The curse of
dimensionality includes several issues:
Exponential Increase in Volume: As the number of dimensions increases, the volume of the
space increases so fast that the available data become sparse. This sparsity is problematic for any
method that requires statistical significance.
Distance Concentration: In high dimensions, the concept of distance becomes less useful. As the
number of dimensions grows, the distances between pairs of points tend to become nearly equal,
making it difficult to distinguish between data points based on their distances.
Increased Computation Time: The amount of time required to process data can increase
exponentially with the number of dimensions, which can make the analysis of high-dimensional
datasets computationally infeasible.
Overfitting in Machine Learning: In the context of machine learning, more features
(dimensions) can result in a model that fits the training data very well but fails to generalize to
new, unseen data. This is because in high-dimensional spaces, the model has more opportunities
to fit noise in the training data.
Sample Size Requirement: The number of training samples required to ensure that at least one
sample is close to a query point increases exponentially with the number of dimensions. This is
known as the "sample size curse."
Common strategies for mitigating the curse of dimensionality include:
Feature Selection: Selecting a subset of the most relevant features to reduce the dimensionality.
Ensemble Methods: Using ensemble methods that combine the predictions from multiple models
can sometimes mitigate the effects of the curse of dimensionality.
Understanding and mitigating the effects of the curse of dimensionality is crucial for effectively
analyzing and extracting meaningful information from high-dimensional data.
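The distance concentration effect described above can be observed with a small experiment. The
following is a minimal sketch, not a rigorous measurement: it assumes points drawn uniformly from
the unit hypercube, and the function name relative_contrast, the point count, and the seed are
illustrative choices that do not come from these notes. It compares the nearest and farthest
distances from a random query point as the number of dimensions grows.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (max_dist - min_dist) / min_dist from one random query point
    to n_points random points, all drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


# A smaller contrast means the nearest and farthest points are almost
# equally far away, so distance carries less information.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

In a typical run the printed contrast shrinks as the dimension increases, which is the sense in
which distances "concentrate" in high-dimensional spaces.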
Deep feed forward networks, also known as feed forward neural networks or multilayer
perceptrons (MLPs), are a type of artificial neural network commonly used in machine learning
and deep learning. They are called "feed forward" because the information flows through the
network in one direction, from the input nodes through the hidden layers (if any) to the output
nodes, without any feedback loops.
Architecture: A deep feed forward network consists of an input layer, one or more hidden
layers, and an output layer. Each layer is composed of nodes (also called neurons) that are
connected to nodes in adjacent layers by weighted connections.
Activation Function: Each node in the hidden layers and the output layer typically applies an
activation function to its input to introduce non-linearity into the network. Common activation
functions include the sigmoid, tanh, and ReLU (Rectified Linear Unit) functions.
Weighted Connections: The connections between nodes in adjacent layers are associated with
weights, which are learned during the training process. These weights determine the strength of
the connections and are adjusted using optimization algorithms like gradient descent during the
training phase to minimize a loss function.
Training: Deep feed forward networks are trained using supervised learning, where the network
is presented with input-output pairs (training examples) and adjusts its weights to minimize the
difference between its predictions and the true outputs. This is typically done using
backpropagation, which calculates the gradient of the loss function with respect to the weights.
The weights are then updated in the direction opposite to the gradient.
Universal Function Approximators: Feed forward networks have been shown to be universal
function approximators, meaning that they can approximate any continuous function on a closed
and bounded input domain to arbitrary accuracy, given enough hidden units (nodes) in the hidden
layers and appropriate non-linear activation functions.
Applications: Deep feed forward networks are used in a wide range of applications, including
image and speech recognition, natural language processing, and many other areas of machine
learning and artificial intelligence.
Deep feed forward networks are foundational models in the field of deep learning and have
paved the way for more complex architectures like convolutional neural networks (CNNs) and
recurrent neural networks (RNNs). They are powerful tools for learning from complex,
high-dimensional data and are widely used in modern machine learning applications.
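The architecture, activation functions, and weighted connections described above can be put
together in a few lines of code. The following is a minimal sketch of the forward direction only
(no training). The helper names dense and feed_forward, the 2-3-1 layer sizes, and all weight
values are hypothetical and chosen purely for illustration.

```python
import math


def sigmoid(z):
    """Squash a score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified Linear Unit: pass positive values, clamp negatives to 0."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each node applies the activation to the
    weighted sum of its inputs plus its bias."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def feed_forward(x, layers):
    """Pass x through every (weights, biases, activation) layer in order.
    Information only flows forward; there are no feedback loops."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# Hypothetical 2-3-1 network with hand-picked weights (illustrative only).
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1], relu),  # hidden
    ([[0.7, -0.5, 0.2]], [0.05], sigmoid),                             # output
]
output = feed_forward([1.0, 2.0], layers)
print(output)
```

Swapping relu for math.tanh or sigmoid in the hidden layer changes only the non-linearity. The
flow of data from layer to layer stays the same.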
We now know how combining lines with different weights and biases can produce non-linear
models. How does a neural network know which weight and bias values to use in each layer? It is
no different from what we did for the single perceptron model. We still use the gradient descent
optimization algorithm, which minimizes the error of our model by iteratively moving the
parameters in the direction of steepest descent, that is, the direction in which the error
decreases fastest. It updates the weights of every linear model in every layer. We will talk more
about optimization algorithms and backpropagation later. It is important to understand what the
trained neural network then does: it classifies our data samples by separating them with some
decision boundary. "The process of receiving an input to produce some kind of output to make
some kind of prediction is known as Feed Forward." The feed-forward neural network is the core
of many other important neural networks, such as the convolutional neural network.
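The gradient descent idea described above can be illustrated on the single perceptron case. The
following is a minimal sketch under stated assumptions: one sigmoid neuron, a cross-entropy loss,
a hand-made toy dataset, and a learning rate of 0.1. None of these specifics come from these
notes, and the names train and loss are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, data):
    """Mean cross-entropy error of a single sigmoid neuron on the data."""
    total = 0.0
    for (x1, x2), y in data:
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)


def train(data, lr=0.1, epochs=200):
    """Batch gradient descent: repeatedly step against the gradient."""
    w, b = [0.0, 0.0], 0.0
    history = []
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for (x1, x2), y in data:
            p = sigmoid(w[0] * x1 + w[1] * x2 + b)
            err = p - y  # derivative of the cross-entropy w.r.t. the score
            grad_w[0] += err * x1
            grad_w[1] += err * x2
            grad_b += err
        # Move each parameter in the direction opposite to its gradient.
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
        history.append(loss(w, b, data))
    return w, b, history


# Toy dataset (assumed): label 0 near the origin, label 1 further away.
data = [
    ((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 1.0), 0),
    ((3.0, 3.0), 1), ((4.0, 3.0), 1), ((3.0, 4.0), 1),
]
w, b, history = train(data)
print("first loss:", history[0], "last loss:", history[-1])
```

In a deeper network the update rule is the same. What changes is that backpropagation is needed
to obtain the gradient for the weights in the earlier layers.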
In a feed-forward neural network, there are no feedback loops or backward connections. In its
simplest form there is just an input layer, a hidden layer, and an output layer.
There can be multiple hidden layers, depending on the kind of data you are dealing with.
The number of hidden layers is known as the depth of the neural network. A deeper neural
network can represent more complex functions. The input layer first provides the neural network
with data, and the output layer then makes predictions on that data based on the series of
functions applied in the layers in between.
The ReLU function is the most commonly used activation function in deep neural networks. To
gain a solid understanding of the feed-forward process, let's look at it mathematically.
1) The input is first fed to the network. It is represented as the vector (x1, x2, 1), where the 1
is the bias input.
2) Each input is multiplied by the weights of the first and second linear models to obtain a
score for the point in each model. In practice, we multiply the inputs by a matrix of weights
using matrix multiplication.
3) After that, we take the sigmoid of each score, which gives the probability of the point
being in the positive region of each model.
4) We multiply the probabilities obtained in the previous step by the second set of weights. We
always include a bias input of one whenever taking a linear combination of inputs.
Finally, to obtain the probability of the point being in the positive region of the combined
model, we take the sigmoid of this result, which produces the final output of the feed-forward
process.
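The four steps above can be written compactly in matrix form. This is a sketch in notation that
does not appear in these notes: W^(1) is assumed to hold the weights of the two linear models in
the hidden layer (one row per model, with the bias weight in the last column), W^(2) holds the
second set of weights, and sigma is the sigmoid function.

```latex
\begin{aligned}
\mathbf{x} &= \begin{bmatrix} x_1 & x_2 & 1 \end{bmatrix}^{\top}
  && \text{(step 1: input with bias)} \\
\mathbf{h} &= \sigma\!\left(W^{(1)} \mathbf{x}\right),
  \quad W^{(1)} \in \mathbb{R}^{2 \times 3}
  && \text{(steps 2 and 3: scores, then probabilities)} \\
\hat{y} &= \sigma\!\left(W^{(2)} \begin{bmatrix} h_1 & h_2 & 1 \end{bmatrix}^{\top}\right),
  \quad W^{(2)} \in \mathbb{R}^{1 \times 3}
  && \text{(step 4 and final output)} \\
\sigma(z) &= \frac{1}{1 + e^{-z}}
\end{aligned}
```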
Let us take the neural network we had previously, in which the linear models in the hidden
layer combine to form the non-linear model in the output layer.
We will use our non-linear model to produce an output that describes the probability of a point
being in the positive region. The point has coordinates (2, 2). Along with the bias, we represent
the input as the vector (2, 2, 1).
Recall the first linear model in the hidden layer and the equation that defined it. In the first
layer, to obtain the linear combination, the inputs are multiplied by -4 and -1 and the bias value
is multiplied by twelve: (-4)(2) + (-1)(2) + (12)(1) = 2.
To obtain the linear combination of the same point in our second model, the inputs are
multiplied by -1/5 and 1 and the bias is multiplied by three: (-1/5)(2) + (1)(2) + (3)(1) = 4.6.
Now, to obtain the probability that the point is in the positive region relative to both models,
we apply the sigmoid to both scores: sigmoid(2) is approximately 0.88 and sigmoid(4.6) is
approximately 0.99.
The second layer contains the weights that dictate how the linear models in the first layer are
combined to obtain the non-linear model in the second layer. The weights are 1.5 and 1, with a
bias value of 0.5.
Now, we multiply our probabilities from the first layer by the second set of weights:
(1.5)(0.88) + (1)(0.99) + (0.5)(1) is approximately 2.81.
Finally, we take the sigmoid of this final score: sigmoid(2.81) is approximately 0.94.
This is the complete math behind the feed-forward process, in which the inputs traverse the
entire depth of the neural network. In this example, there is only one hidden layer. Whether there
is one hidden layer or twenty, the computational process is the same for every hidden layer.
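The worked example can be checked with a short script. This is a minimal sketch that takes the
numbers exactly as stated in the text above (point (2, 2), hidden-layer weights -4, -1, 12 and
-1/5, 1, 3, and second-layer weights 1.5, 1, 0.5). The variable and function names are
illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sum(weights, inputs):
    """Linear combination of inputs, with the bias handled as an extra input of 1."""
    return sum(w * x for w, x in zip(weights, inputs))


x = [2.0, 2.0, 1.0]  # the point (2, 2) plus the bias input of 1

# One row per linear model in the hidden layer: (w1, w2, bias weight).
hidden_weights = [
    [-4.0, -1.0, 12.0],  # first linear model
    [-0.2, 1.0, 3.0],    # second linear model (-1/5 written as -0.2)
]
output_weights = [1.5, 1.0, 0.5]  # second set of weights, bias weight last

hidden_scores = [weighted_sum(w, x) for w in hidden_weights]
hidden_probs = [sigmoid(s) for s in hidden_scores]

# Append the bias input of 1 before combining the two probabilities.
final_score = weighted_sum(output_weights, hidden_probs + [1.0])
final_prob = sigmoid(final_score)

print("hidden scores:", hidden_scores)
print("hidden probabilities:", hidden_probs)
print("final score:", final_score)
print("final probability:", final_prob)
```

With these weights the final probability comes out at roughly 0.94, so the network places the
point (2, 2) in the positive region.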