
UNIT I-PGI20C05J-Deep Neural Networks (1)

Deep Neural Networks (DNNs) are multi-layered artificial neural networks that excel in modeling complex relationships in data, particularly in tasks like image and speech recognition. They draw inspiration from biological neural systems, allowing them to learn from data and improve performance with larger datasets and computational power. The perceptron, a basic neural network model, serves as a foundation for understanding more complex architectures and their optimization through various activation and loss functions.

Uploaded by

Kejin Spam
Copyright
© © All Rights Reserved

UNIT – I

Definition for Deep Neural Networks

Deep neural networks (DNNs) are artificial neural networks with multiple layers between
the input and output layers. These layers include:

• Input Layer: Receives the data.

• Hidden Layers: Perform computations and extract complex features.

• Output Layer: Produces the final result, such as a prediction or classification.

The term "deep" refers to the presence of many hidden layers, allowing DNNs to model
complex relationships and patterns in data. They are widely used in tasks like image
recognition, natural language processing, and speech processing.

Session 1: SLO-1 - An Introduction to Neural Networks

A neural network (also called an artificial neural network) is an adaptive system that learns
by using interconnected nodes or neurons in a layered structure that resembles a human
brain.

Artificial neural networks are popular machine learning techniques that simulate the
mechanism of learning in biological organisms.

A neural network can learn from data so it can be trained to recognize patterns, classify data,
and forecast future events.

The human nervous system contains cells, which are referred to as neurons.

The neurons are connected to one another with the use of axons and dendrites, and the
connecting regions between axons and dendrites are referred to as synapses.
The strengths of synaptic connections often change in response to external stimuli.

This change is how learning takes place in living organisms.

This biological mechanism is simulated in artificial neural networks, which contain
computation units that are referred to as neurons.

The computational units are connected to one another through weights, which serve the same
role as the strengths of synaptic connections in biological organisms. Each input to a
neuron is scaled with a weight, which affects the function computed at that unit.

“An artificial neural network computes a function of the inputs by propagating the
computed values from the input neurons to the output neuron(s) and using the weights
as intermediate parameters.”

Weights are adjusted based on training data, which provide input-output pairs.

Training data offer feedback on prediction accuracy, guiding weight updates to reduce errors.

Errors in prediction act as feedback, similar to unpleasant stimuli in biological organisms,
leading to weight adjustments.

Through training on diverse examples, neural networks generalize their learning to make
accurate predictions on unseen data.

Basic units in neural networks draw inspiration from classical models like least-squares and
logistic regression.

By connecting basic units, neural networks can learn more complex functions than classical
methods.

Proper architecture design and insight from analysts enhance the power of the network.
Larger and more complex neural network architectures need sufficient training data to learn
the numerous parameters effectively.

The performance of neural networks improves significantly with the availability of large
datasets.

Deep learning models outperform conventional machine learning methods when sufficient
data and computational resources are available.

Increased data availability and computational power have fuelled rapid advancements in
deep learning, likened to a "Cambrian explosion" in technology.

A major advantage of DNNs is their ability to deal with unstructured and unlabeled data,
which makes up most of the world's data. In the medical field in particular, most of the
data is unstructured and unlabeled.

Session 1: SLO-2 - Humans Versus Computers: Stretching the Limits of Artificial Intelligence

Differences Between Humans and Computers:

● Computers excel at tasks like calculating cube roots, which are difficult for humans.
Humans, on the other hand, are naturally good at tasks like recognizing objects in
images, which have traditionally been hard for machines.
● Recent advancements in deep learning have allowed AI to exceed human performance
in some narrow tasks like image recognition, which was unimaginable a decade ago.

Deep Neural Networks and Biological Inspiration:

● Deep learning architectures are not random; their success is due to careful design,
often inspired by biological systems.
● Depth in artificial neural networks is key to their performance, similar to biological
neural networks.
● Convolutional Neural Networks (CNNs), used in image recognition, were inspired by
experiments on the visual cortex of cats. This design shows the power of drawing
from biological neuroscience.

Biological Learning and AI Potential:

● The human brain merges sensation and intuition in ways AI cannot replicate yet.
● As neuroscience advances, there is potential to create more biologically inspired
architectures like CNNs, leading to breakthroughs in AI.

Advantages of Neural Networks:

● Neural networks are better than traditional machine learning at capturing high-level
patterns in data through architecture design.
● Their complexity can be easily adjusted by adding or removing neurons, depending
on the data and computational power available.

Challenges with Traditional Machine Learning:

● Traditional machine learning often works better with smaller datasets because it
allows for hand-crafted features and greater interpretability.
● In the past, neural networks couldn't reach their full potential due to limited data and
computational power.

The Era of Big Data:

● Advances in data collection mean nearly all our actions—like purchases and clicks—
are recorded and stored, enabling "big data."
● The development of GPUs has made it faster and easier to process large datasets,
allowing more experimentation with neural networks.

Advances in Deep Learning:

● Many recent successes in deep learning are due to increased data, computational
power, and faster algorithm testing.
● Faster testing enables quick adjustments to neural network algorithms, a key factor in
their improvement.

Future Prospects:

● By the end of the century, computers might be able to train networks as large as the
human brain, though predicting their exact capabilities is difficult.
● The progress in computer vision shows that unexpected breakthroughs are likely in
AI's future.

Session 2: SLO-1 The Basic Architecture of Neural Networks

● Single-layer
● Multi-layer neural networks

In the single-layer network, a set of inputs is directly mapped to an output by using a
generalized variation of a linear function. This simple instantiation of a neural network is also
referred to as the perceptron.

In multi-layer neural networks, the neurons are arranged in layered fashion, in which the
input and output layers are separated by a group of hidden layers. This layer-wise architecture
of the neural network is also referred to as a feed-forward network.

A perceptron takes inputs, multiplies them by weights, and adds a bias. The result is passed
through an activation function (like a sign function) to produce an output. It adjusts weights
during training to minimize classification errors, working well for linearly separable data but
struggling with non-linear data.

Session 2: SLO-2 Single Computational Layer: The Perceptron

The simplest neural network is referred to as the perceptron. This neural network contains a
single input layer and an output node. The basic architecture of the perceptron is shown in
Figure 1.3(a).

The perceptron is the simplest form of artificial neural network. It is often used to
compute logic gates such as AND, OR, and NOR, which have binary inputs and a binary output.

The main functionality of the perceptron is as follows:

• It takes inputs from the input layer.

• It multiplies each input by its weight and sums the results.

• It passes the sum through a nonlinear function to produce the output.

The input layer contains d nodes that transmit the d features $X = [x_1 \ldots x_d]$ with edges of
weight $W = [w_1 \ldots w_d]$ to an output node. The input layer does not perform any computation
in its own right. The linear function $W \cdot X = \sum_{i=1}^{d} w_i x_i$ is computed at the output node.
Subsequently, the sign of this real value is used in order to predict the dependent variable of
X. Therefore, the prediction $\hat{y}$ is computed as follows:

$$\hat{y} = \text{sign}\{W \cdot X\} = \text{sign}\Big\{\sum_{i=1}^{d} w_i x_i\Big\}$$

The error of the prediction is therefore $E(X) = y - \hat{y}$, which is one of the values drawn from
the set $\{-2, 0, +2\}$. In cases where the error value E(X) is nonzero, the weights in the neural
network need to be updated in the (negative) direction of the error gradient.

The perceptron has an input layer that passes features to the output node. Each input is
multiplied by a weight, and the results are added together. A sign function is applied to this
sum to produce a class label. By changing the activation function, the perceptron can imitate
models like regression or SVMs. It has two layers, but only the computational layer is
counted, so it is called a single-layer network.

There is an invariant part of the prediction, which is referred to as the bias. An additional
bias variable b captures this invariant part of the prediction:

$$\hat{y} = \text{sign}\{W \cdot X + b\}$$

The bias can be incorporated as the weight of an edge by using a bias neuron. This is
achieved by adding a neuron that always transmits a value of 1 to the output node. The
weight of the edge connecting the bias neuron to the output node provides the bias variable.
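The bias-neuron construction can be checked numerically. The following is a minimal sketch in plain Python; the function names and the example numbers are illustrative assumptions, not part of the original notes. It shows that appending a constant input of 1 and treating b as one more weight gives the same value as adding b explicitly.

```python
def linear_with_bias(w, x, b):
    # Explicit form: W . X + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def linear_with_bias_neuron(w_aug, x):
    # Bias-neuron form: append a constant input of 1; the last weight plays the role of b.
    x_aug = list(x) + [1.0]
    return sum(wi * xi for wi, xi in zip(w_aug, x_aug))

# Hypothetical example values.
w, b, x = [0.5, -2.0], 0.25, [3.0, 1.0]
explicit = linear_with_bias(w, x, b)
absorbed = linear_with_bias_neuron(w + [b], x)
print(explicit, absorbed)
```

Both calls compute 0.5·3 − 2·1 + 0.25, so the two forms agree.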

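The perceptron algorithm aims to reduce misclassifications and has convergence proofs for
simple cases. Its goal can be written in least-squares form using all feature-label pairs (X, y)
in the dataset D:

$$\text{Minimize}_{W} \; L = \sum_{(X, y) \in D} (y - \hat{y})^2 = \sum_{(X, y) \in D} \big(y - \text{sign}\{W \cdot X\}\big)^2$$

This type of minimization objective function is also referred to as a loss function. The
perceptron algorithm (implicitly) uses a smooth approximation of the gradient of this
objective function with respect to each example. The resulting descent direction (the
negative of the smoothed gradient, up to scaling) is:

$$\sum_{(X, y) \in D} (y - \hat{y}) X$$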

This gradient does not come from the staircase-like heuristic objective, which lacks useful
gradients. Instead, the staircase is smoothed into the perceptron criterion. The perceptron
criterion, introduced after Rosenblatt’s work, explains the heuristic updates as gradient
descent. For simplicity, we assume the perceptron optimizes an unknown smooth function using
gradient descent.

The training algorithm processes each input X one at a time (or in small batches). A prediction
$\hat{y}$ is made for each input, and the weights are then updated with a learning rate $\alpha$:

$$W \Leftarrow W + \alpha (y - \hat{y}) X$$

A single training data point may be cycled through many times. Each such cycle is referred to
as an epoch. One can also write the gradient descent update in terms of the error
$E(X) = (y - \hat{y})$ as follows:

$$W \Leftarrow W + \alpha E(X) X$$

The basic perceptron algorithm can be considered a stochastic gradient-descent method,
which implicitly minimizes the squared error of prediction by performing gradient-descent
updates with respect to randomly chosen training points.
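The update procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names, the learning rate of 0.5, the choice to map a pre-activation of exactly 0 to +1, and the AND-gate dataset with labels in {−1, +1} are all assumptions made for the example.

```python
import random

def sign(v):
    # Sign activation. Mapping v == 0 to +1 is an arbitrary tie-breaking assumption.
    return 1 if v >= 0 else -1

def predict(w, b, x):
    # Prediction y_hat = sign(W . X + b).
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_perceptron(data, alpha=0.5, max_epochs=100, seed=0):
    rng = random.Random(seed)
    examples = list(data)
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(max_epochs):
        rng.shuffle(examples)              # one epoch: visit every point in random order
        mistakes = 0
        for x, y in examples:
            error = y - predict(w, b, x)   # E(X), one of -2, 0, +2
            if error != 0:
                mistakes += 1
                # W <= W + alpha * E(X) * X; the bias acts as a weight on a constant input 1.
                w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
                b += alpha * error
        if mistakes == 0:                  # every training point is classified correctly
            break
    return w, b

# AND gate with labels in {-1, +1}; this data is linearly separable.
AND_GATE = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(AND_GATE)
predictions = [predict(w, b, x) for x, _ in AND_GATE]
print(predictions)
```

Because the AND data is linearly separable, the loop stops once an epoch passes with no mistakes, and the learned weights reproduce the gate.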

Linearly Separable Data: Data is linearly separable when it can be divided into distinct
classes using a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions).
In this case, the equation $W \cdot X = 0$ defines the boundary, with W being the vector normal to the
hyperplane. Points on one side have $W \cdot X > 0$, and points on the other side have $W \cdot X < 0$.

The perceptron algorithm performs well on such data and achieves zero classification error
during training.

Not Linearly Separable Data: Data is not linearly separable when no straight line, plane, or
hyperplane can perfectly divide the classes. The perceptron struggles with this type of data
and may fail to converge or provide poor solutions compared to other learning algorithms.

What Objective Function Is the Perceptron Optimizing?

The general goal was to minimize the number of classification errors with a heuristic update
process (in hardware) that changed weights in the “correct” direction whenever errors were
made.

In modern deep neural networks, smooth approximations of the perceptron loss are often
used to facilitate gradient-based optimization techniques like stochastic gradient descent
(SGD). Some common smooth loss functions that serve as alternatives to the perceptron loss
include:

a. Hinge Loss (for Support Vector Machines)

The hinge loss is commonly used for binary classification problems with labels
$y \in \{-1, +1\}$. It is continuous and convex, and it is differentiable everywhere except at
the margin boundary, which makes it suitable for optimizing linear classifiers:

$$L = \max\{0, 1 - y (W \cdot X)\}$$

● This loss penalizes points that are misclassified or that lie within a margin of 1.
● It encourages a margin of separation between classes, which the classic perceptron
loss does not.

b. Logistic Loss (Cross-Entropy Loss)

The logistic loss, or cross-entropy loss, is used in logistic regression and neural networks.
For labels $y \in \{-1, +1\}$ it can be written as:

$$L = \log\big(1 + \exp(-y (W \cdot X))\big)$$

● This loss is smooth and differentiable.
● It ensures that even small deviations from the correct classification result in
meaningful gradient updates.

c. Mean Squared Error (MSE) Loss

Although less common for classification, the mean squared error (MSE) loss can
also be used. For a single example it is:

$$L = (y - \hat{y})^2$$

● This loss measures the squared difference between the predicted and true labels.
● It is smooth and provides gradients proportional to the error magnitude.
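The losses discussed in this section can be compared side by side on a single example. The sketch below assumes labels y in {−1, +1} and a real-valued score standing for W · X; the function names are illustrative. The perceptron criterion is included in its standard form max{0, −y(W · X)} for comparison.

```python
import math

def perceptron_criterion(y, score):
    # max{0, -y * score}: zero whenever the point is on the correct side.
    return max(0.0, -y * score)

def hinge_loss(y, score):
    # max{0, 1 - y * score}: also penalizes correct points inside the margin of 1.
    return max(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    # log(1 + exp(-y * score)): smooth and strictly positive.
    return math.log(1.0 + math.exp(-y * score))

def squared_loss(y, score):
    # (y - score)^2: grows on both sides of the target value.
    return (y - score) ** 2

# A correctly classified point that lies inside the margin (y = +1, score = 0.5).
for name, loss in [("perceptron", perceptron_criterion), ("hinge", hinge_loss),
                   ("logistic", logistic_loss), ("squared", squared_loss)]:
    print(name, loss(1, 0.5))
```

For this point the perceptron criterion is already zero, while the hinge and logistic losses still produce a nonzero penalty that pushes the score further from the boundary.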

Choice of Activation and Loss Functions

In the case of the perceptron, the choice of the sign activation function is motivated by the
fact that a binary class label needs to be predicted.

The importance of nonlinear activation functions becomes significant when one moves from
the single-layered perceptron to the multi-layered architectures discussed later in this chapter.
Different types of nonlinear functions such as the sign, sigmoid, or hyperbolic tangent may
be used in various layers. We use the notation Φ to denote the activation function:

$$\hat{y} = \Phi(W \cdot X)$$

A neuron really computes two functions within the node: the weighted sum of its inputs,
followed by the activation function.

The value computed before applying the activation function Φ(·) will be referred to as the
pre-activation value, whereas the value computed after applying the activation function is
referred to as the post-activation value. The output of a neuron is always the post-activation
value, although the pre-activation variables are often used in different types of analyses, such
as the computations of the backpropagation algorithm.

The classical activation functions that were used early in the development of neural networks
were the sign, sigmoid, and hyperbolic tangent functions:

$$\Phi(v) = \text{sign}(v) \qquad \Phi(v) = \frac{1}{1 + e^{-v}} \qquad \Phi(v) = \frac{e^{2v} - 1}{e^{2v} + 1}$$

Although the sign activation can be used to map to binary outputs at prediction time, its
non-differentiability prevents its use for creating the loss function at training time.

The sigmoid activation outputs a value in (0, 1), which is helpful in performing computations
that should be interpreted as probabilities.

The tanh function has a shape similar to that of the sigmoid function, except that it is
horizontally re-scaled and vertically translated and re-scaled to [−1, 1]:
$\tanh(v) = 2 \cdot \text{sigmoid}(2v) - 1$.

The ReLU and hard tanh activation functions have largely replaced the sigmoid and soft
tanh activation functions in modern neural networks because of the ease in training
multilayered neural networks with these activation functions.
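The activation functions named above are short enough to write out directly. This is a small sketch using only the Python standard library; the function names are illustrative, and returning 0 for sign(0) is an assumption. The last lines check the relation tanh(v) = 2 · sigmoid(2v) − 1 at one point.

```python
import math

def sign(v):
    # Returns -1, 0, or +1; the value at exactly 0 is a convention.
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def sigmoid(v):
    # Output lies in (0, 1).
    return 1.0 / (1.0 + math.exp(-v))

def tanh(v):
    # Output lies in (-1, 1).
    return math.tanh(v)

def relu(v):
    # max{v, 0}
    return max(v, 0.0)

def hard_tanh(v):
    # max{min(v, 1), -1}: linear inside [-1, 1], clipped outside.
    return max(min(v, 1.0), -1.0)

v = 0.7
rescaled_sigmoid = 2.0 * sigmoid(2.0 * v) - 1.0
print(tanh(v), rescaled_sigmoid)
```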

Choice and Number of Output Nodes

The choice and number of output nodes in a neural network are closely linked to the
activation function, which depends on the application:

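k-Way Classification:

● Uses k output nodes and a softmax activation function.
● The softmax function for the i-th output, given the k real-valued inputs $v_1 \ldots v_k$, is:

$$\Phi(v)_i = \frac{\exp(v_i)}{\sum_{j=1}^{k} \exp(v_j)}$$

● Converts the final layer’s outputs into probabilities for k classes.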
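A softmax layer can be sketched as follows. The function name and the example scores are assumptions for illustration. Subtracting the maximum score before exponentiating is a common numerical-stability trick that leaves the result unchanged, because the common factor cancels in the ratio.

```python
import math

def softmax(v):
    # Subtract the maximum for numerical stability; the probabilities are unchanged.
    m = max(v)
    exps = [math.exp(vi - m) for vi in v]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical real-valued outputs of the final layer for k = 3 classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The outputs are positive, sum to 1, and preserve the ordering of the input scores.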

Multinomial Logistic Regression:

● Implemented with a single hidden layer using linear activations followed by a
softmax layer.
● No weights are associated with the softmax layer.

Autoencoders:

● Use multiple output nodes to reconstruct the input data.
● Useful for tasks like matrix factorization (e.g., Singular Value Decomposition).

Choice of Loss Function

The choice of the loss function is critical in defining the outputs in a way that is sensitive
to the application at hand.

Least-Squares Regression:

● Loss function: $L = (y - \hat{y})^2$
● For numeric outputs with linear activation.

Hinge Loss (Support Vector Machine):

● Loss function: $L = \max\{0, 1 - y \cdot \hat{y}\}$
● For binary classification with $y \in \{-1, +1\}$ and linear activation.

For multiway predictions (e.g., classifying words or multiple categories), softmax is useful
for converting outputs into probabilities. Probabilistic predictions use two different
types of loss functions, depending on the target:

● Binary targets (logistic regression)
● Categorical targets

Binary targets (logistic regression)

Refers to outputs that can take only two possible values, typically represented as {0, 1} or
{−1, +1}.

Example: Predicting whether an email is spam (1) or not spam (0).

Loss function: $L = \log(1 + \exp(-y \cdot \hat{y}))$, for $y \in \{-1, +1\}$ and a real-valued
prediction $\hat{y}$.

For probabilistic binary classification.

Categorical targets

Refers to outputs that can take more than two possible classes (multi-class classification).

Example: Predicting the category of a news article (e.g., politics, sports, technology).

Loss function: $L = -\log(\hat{y}_r)$, where $\hat{y}_r$ is the predicted probability of the correct class r.

Used for multi-class classification.

Some Useful Derivatives of Activation Functions

Linear and Sign Activations:

● Linear Activation: The derivative is always 1.
● Sign Activation: The derivative is 0 for all values except at v = 0, where the function is
discontinuous and non-differentiable.
● Due to the zero gradient and non-differentiability, the sign function is rarely used in
loss functions, though it may be used during testing.

Sigmoid Activation:

● Function: $o = \frac{1}{1 + \exp(-v)}$
● Derivative: $\frac{\partial o}{\partial v} = o(1 - o)$
● The sigmoid function is useful for binary classification tasks, and its derivative
facilitates smooth gradient descent updates.

Tanh Activation:

● Function: $o = \frac{\exp(2v) - 1}{\exp(2v) + 1}$
● Derivative: $\frac{\partial o}{\partial v} = 1 - o^2$
● The tanh function outputs values between −1 and 1, making it useful for zero-centered data.

ReLU and Hard Tanh Activations:

● ReLU (Rectified Linear Unit): Derivative is 1 for $v \geq 0$ (by convention at v = 0) and 0 otherwise.
● Hard Tanh: Derivative is 1 for $v \in [-1, 1]$ and 0 outside this range.
● ReLU is widely used in deep networks for its simplicity and efficiency in mitigating
vanishing gradient issues.
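The closed-form derivatives listed above can be sanity-checked against a central finite difference. The following sketch uses only the standard library; the helper names and the test point 0.3 are assumptions for illustration, and the ReLU derivative at exactly 0 is set to 1 by convention.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def d_sigmoid(v):
    o = sigmoid(v)
    return o * (1.0 - o)             # o(1 - o)

def d_tanh(v):
    o = math.tanh(v)
    return 1.0 - o * o               # 1 - o^2

def d_relu(v):
    return 1.0 if v >= 0 else 0.0    # value at v == 0 is a convention

def d_hard_tanh(v):
    return 1.0 if -1.0 <= v <= 1.0 else 0.0

def numeric_derivative(f, v, eps=1e-6):
    # Central finite difference, used only as an independent check.
    return (f(v + eps) - f(v - eps)) / (2.0 * eps)

v = 0.3
print(d_sigmoid(v), numeric_derivative(sigmoid, v))
print(d_tanh(v), numeric_derivative(math.tanh, v))
```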
Session 3: SLO-1 Multilayer Neural Networks

Multilayer neural networks have more than one computational layer.

A perceptron has only an input and output layer, with computations in the output layer.

The input layer transmits the data to the output layer, and all computations are completely
visible to the user.

Multilayer neural networks contain multiple computational layers; the additional intermediate
layers (between input and output) are referred to as hidden layers because the computations
performed are not visible to the user.

In feed-forward networks, data flows from input to output through successive
layers. All nodes in one layer typically connect to those of the next layer.

The network’s structure is determined by the number of layers and the number and type of
nodes in each layer.

For categorical predictions, softmax activation with cross-entropy loss is used, while for
continuous predictions, linear activation with squared loss is applied.

As in the case of single-layer networks, bias neurons can be used both in the hidden layers
and in the output layers.

Examples of multilayer networks can be drawn with or without the bias neurons.

These example networks typically have three layers (excluding the input layer, which only
transmits data). If the k hidden layers contain p₁ ... pₖ units respectively, their outputs are
represented as vectors h₁ ... hₖ, with the number of units in each layer defining its dimensionality.

● Weights between layers are represented as matrices. The weights between the input layer and
the first hidden layer are contained in a matrix W₁ of size d × p₁.
● Weights between the r-th hidden layer and the (r + 1)-th hidden layer are denoted by
a matrix Wᵣ₊₁ of size pᵣ × pᵣ₊₁.
● If the output layer contains o nodes, the final weight matrix Wₖ₊₁ has a size of pₖ ×
o.

Outputs are computed recursively with activation functions applied element-wise.
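The recursive computation can be sketched with weight matrices stored as nested lists. In words, each layer computes Φ(Wᵀh) from the previous layer's output h, starting from the input vector. The function names, the choice of tanh for hidden layers with a linear output, and all numeric values below are assumptions for illustration.

```python
import math

def apply_weights(W, h):
    # Computes W^T h for a weight matrix W of shape (len(h), n_out).
    n_out = len(W[0])
    return [sum(W[i][j] * h[i] for i in range(len(h))) for j in range(n_out)]

def forward(x, weights, hidden_act=math.tanh, output_act=lambda v: v):
    # Hidden layers: h_{r+1} = Phi(W_{r+1}^T h_r), applied element-wise.
    h = list(x)
    for W in weights[:-1]:
        h = [hidden_act(v) for v in apply_weights(W, h)]
    # Output layer uses its own activation (linear here).
    return [output_act(v) for v in apply_weights(weights[-1], h)]

# d = 2 inputs, one hidden layer with p1 = 3 units, o = 1 output node.
W1 = [[0.1, -0.2, 0.3],
      [0.4, 0.5, -0.6]]          # d x p1
W2 = [[1.0], [-1.0], [0.5]]      # p1 x o
out = forward([1.0, 2.0], [W1, W2])
print(out)
```

The matrix shapes follow the sizes given in the list above: each matrix has one row per unit of the earlier layer and one column per unit of the later layer.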

Neural network layers can be represented as vectors (rectangles) for simplicity, with
connections as weight matrices. Most layers use the same activation function, except for the
output layer.
Networks can vary by adding multiple output nodes, depending on the application (e.g.,
classification or dimensionality reduction). Autoencoders reduce data dimensions by
reconstructing inputs, similar to singular value decomposition.

Fully connected networks perform well but risk overfitting, especially with too many
parameters. Convolutional neural networks reduce overfitting by using domain-specific
designs for image data.

Advanced methods like pre-training help improve learning quality and minimize overfitting,
which occurs when networks perform well on training data but poorly on unseen data.

SLO-2 The Multilayer Network as a Computational Graph

A neural network can be viewed as a computational graph composed of basic parametric
models. While the term "perceptron" is often used for the basic unit, other units like logistic
(sigmoid) and linear units are more commonly used as building blocks in different settings.

A multilayer network evaluates compositions of functions computed at individual nodes.

A path of length 2, where f() follows g(), represents the composition f(g()).

In layer m+1, if a node computes f() based on g1(), g2(), …, gk() from layer m, the composition
is f(g1(), …, gk()).

Nonlinear activation functions are essential for increasing the power of multiple layers. If
all layers use identity activation functions, the network simplifies to linear regression.
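The collapse to a linear model under identity activations can be verified directly: applying two weight matrices in sequence gives the same output as applying their product once. The helper names and the numbers in this sketch are assumptions for illustration.

```python
def matmul(A, B):
    # Plain matrix product of nested lists: (n x m) times (m x p).
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def apply_layer(W, x):
    # Identity activation: the layer outputs W^T x for W of shape (len(x), n_out).
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]

W1 = [[1.0, 2.0, 0.0],
      [0.0, -1.0, 3.0]]          # 2 inputs -> 3 hidden units
W2 = [[2.0], [1.0], [-1.0]]      # 3 hidden units -> 1 output
x = [4.0, 5.0]

two_layer = apply_layer(W2, apply_layer(W1, x))   # hidden layer, then output layer
one_layer = apply_layer(matmul(W1, W2), x)        # single layer with weights W1 W2
print(two_layer, one_layer)
```

Since the hidden layer adds no expressive power here, the nonlinearity is what lets additional layers represent functions a single linear layer cannot.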

A network with one hidden layer of nonlinear units and a linear output layer can approximate
almost any function. However, this often needs a lot of hidden units, which makes training
harder when there isn’t much data.

Deeper networks are often preferred because they reduce the number of hidden units per layer
and the total number of parameters, simplifying training.

The backpropagation algorithm automates the learning of weights by using dynamic
programming to handle complex parameter updates, allowing analysts to avoid manual
calculations.

Neural networks support modular design, where layers and units act as building blocks.
Off-the-shelf software makes it easy to design, experiment with, and implement various network
architectures.

As a result, building neural networks requires relatively simple code, making the process
more accessible to engineers rather than requiring advanced mathematical expertise.

Session 4

SLO-1 Training a Neural Network with Backpropagation

In a single-layer neural network, the error (or loss function) is a direct function of the
weights, making gradient computation straightforward.

In multi-layer neural networks, the loss function is a complex composition of weights from
earlier layers, complicating gradient computation.

Backpropagation, or backward propagation of errors, is an algorithm used to train neural
network models. It adjusts the network's weights to minimize the gaps, referred to as
errors, between the predicted outputs and the actual target outputs.

The backpropagation algorithm computes gradients in multi-layer networks by
applying the chain rule of differential calculus.

Backpropagation calculates error gradients as a summation of local-gradient products over
multiple paths from a node to the output.

The backpropagation algorithm has two phases: the forward phase and the backward
phase.

The forward phase computes output values and local derivatives at each node.

The backward phase accumulates the products of these local values over all paths from each
node to the output.

Forward Phase:

● The inputs for a training instance are fed into the neural network.
● These inputs trigger a cascade of computations across the layers using the current
weights.
● The network produces a predicted output based on the computations.
● The predicted output is compared to the actual output to calculate the loss.
● The derivative of the loss function with respect to the output is then computed.
● This derivative will be used in the backward phase to calculate gradients for the
weights in all layers.

Backward Phase:

● The goal of the backward phase is to compute the gradient of the loss function with
respect to the weights.
● The chain rule of differential calculus is used to compute these gradients.
● The computed gradients are then used to update the weights in the network.
● The gradients are calculated by propagating backward, starting from the output node.
● This process is called the backward phase because learning happens in the backward
direction.
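The two phases can be illustrated on the smallest possible multi-layer network: one input, one tanh hidden unit, and one linear output with squared loss. Everything in this sketch (the network shape, the names, and the numeric values) is an assumption chosen to keep the example readable; a finite-difference estimate is used as an independent check of the backward phase.

```python
import math

def forward(w1, w2, x, y):
    # Forward phase: compute the hidden value, the output, and the loss.
    h = math.tanh(w1 * x)        # hidden post-activation value
    o = w2 * h                   # linear output
    loss = (y - o) ** 2          # squared loss
    return h, o, loss

def backward(w1, w2, x, y):
    # Backward phase: apply the chain rule from the output back to each weight.
    h, o, _ = forward(w1, w2, x, y)
    dL_do = -2.0 * (y - o)                 # derivative of the loss w.r.t. the output
    dL_dw2 = dL_do * h                     # o = w2 * h
    dL_dh = dL_do * w2                     # propagate to the hidden unit
    dL_dw1 = dL_dh * (1.0 - h * h) * x     # tanh'(v) = 1 - h^2, v = w1 * x
    return dL_dw1, dL_dw2

def numeric_gradients(w1, w2, x, y, eps=1e-6):
    # Central finite differences on the loss, used only to check the result.
    g1 = (forward(w1 + eps, w2, x, y)[2] - forward(w1 - eps, w2, x, y)[2]) / (2 * eps)
    g2 = (forward(w1, w2 + eps, x, y)[2] - forward(w1, w2 - eps, x, y)[2]) / (2 * eps)
    return g1, g2

analytic = backward(0.5, -1.5, 2.0, 1.0)
numeric = numeric_gradients(0.5, -1.5, 2.0, 1.0)
print(analytic, numeric)
```

A gradient-descent step would then subtract a learning rate times each gradient from the corresponding weight.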

If there is only a single path from h1 to o, the gradient of the loss function for any edge
weight on that path can be calculated using the chain rule.

The multivariable chain rule computes the gradient in a computational graph where
multiple paths may exist. It does this by adding the contributions along each path from h1
to o.

For example, when a computational graph has two paths from a node to the output, the derivative
is the sum of two products of local derivatives, one product per path.
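A two-path case can be worked through numerically. In this sketch, which uses hypothetical functions chosen only for illustration, a node value a feeds two intermediate nodes p = a² and q = sin(a), and the output is o = p · q. The multivariable chain rule gives ∂o/∂a = (∂o/∂p)(∂p/∂a) + (∂o/∂q)(∂q/∂a).

```python
import math

def output(a):
    p = a * a            # first path:  a -> p -> o
    q = math.sin(a)      # second path: a -> q -> o
    return p * q

def chain_rule_derivative(a):
    p, q = a * a, math.sin(a)
    path_via_p = q * (2.0 * a)       # (do/dp) * (dp/da)
    path_via_q = p * math.cos(a)     # (do/dq) * (dq/da)
    return path_via_p + path_via_q   # contributions are summed over the two paths

def numeric_derivative(f, a, eps=1e-6):
    # Central finite difference, used only as an independent check.
    return (f(a + eps) - f(a - eps)) / (2.0 * eps)

a = 1.2
print(chain_rule_derivative(a), numeric_derivative(output, a))
```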


SLO-2 Practical Issues in Neural Network Training

Vanishing and Exploding Gradients

Vanishing and exploding gradients occur when gradients shrink or grow too quickly as they are
propagated through many layers. This makes it hard for the network to learn and to remain
stable.

Solution: Gradient clipping, careful weight initialization, and skip connections help the
network train accurately and consistently.
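Gradient clipping by norm is one of the simpler remedies to sketch. The function name and the threshold below are assumptions for illustration: when the Euclidean norm of the gradient exceeds a chosen maximum, the gradient is rescaled to that norm while keeping its direction.

```python
import math

def clip_by_norm(grad, max_norm):
    # Rescale the gradient if its Euclidean norm exceeds max_norm; otherwise return a copy.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

large = clip_by_norm([3.0, 4.0], max_norm=1.0)   # norm 5 is scaled down to norm 1
small = clip_by_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5 is left unchanged
print(large, small)
```

Clipping addresses exploding gradients only; vanishing gradients are usually handled through initialization, activation choice, or skip connections.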

Overfitting

Overfitting happens when a model fits the training data too closely, including its noise. As a
result, the model performs well on the training data but struggles to make accurate
predictions on new, unseen data. It is essential to address overfitting with techniques like
regularization, cross-validation, and more diverse datasets so that the model generalizes well
to unseen examples.

Solution: Regularization techniques help ensure that models do not merely memorize the
training data but use what they have learned to make good predictions on new data. Techniques
like dropout, L1/L2 regularization, and early stopping can help.
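Early stopping can be sketched as a small rule over a sequence of validation losses. The function name, the patience value, and the loss values below are hypothetical: training is halted once the validation loss has failed to improve for a given number of consecutive epochs, and the best epoch seen so far is kept.

```python
def early_stopping_epoch(val_losses, patience=2):
    # Returns the index of the best epoch seen before training is halted.
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:   # no improvement for `patience` epochs in a row
                break
    return best_epoch

# Hypothetical validation losses: improvement stalls after epoch 2.
losses = [0.90, 0.70, 0.60, 0.65, 0.70, 0.50]
best = early_stopping_epoch(losses, patience=2)
print(best)
```

With a patience of 2 the loop stops after epoch 4 and never reaches the final value, which shows the trade-off: a small patience saves computation but can stop before a later improvement.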

Data Augmentation and Preprocessing

Data augmentation and preprocessing are techniques used to provide better information to the
model during training, enabling it to learn more effectively and make accurate predictions.

Solution: Apply data augmentation techniques like rotation, translation, and flipping
alongside data normalization and proper handling of missing values.

Label Noise

Training labels are sometimes incorrect, which makes it hard for the model to learn well.

Solution: Using robust loss functions can reduce the model's sensitivity to labeling
mistakes.

Imbalanced Datasets

Datasets can contain many examples of one class and few of another. This can
cause models to perform poorly on the under-represented classes.

Solution: When classes are uneven, techniques like class weighting, oversampling, or data
synthesis can be used so that all classes contribute comparably to training.
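Class weighting can be sketched with a common inverse-frequency heuristic, in which each class receives the weight n / (k · count), with n the number of samples and k the number of classes. The function name and the 90/10 label split are assumptions for illustration; other weighting schemes are possible.

```python
from collections import Counter

def balanced_class_weights(labels):
    # Inverse-frequency heuristic: weight = n_samples / (n_classes * class_count).
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced dataset: 90 examples of class 0 and 10 of class 1.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

Multiplying each example's loss by the weight of its class makes the rare class count as much in total as the frequent one.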

Computational Resource Constraints

Training deep neural networks can demand a lot of computing power,
especially if the model is very large.

Solution: Distributing training across multiple machines or using accelerators such as GPUs
and TPUs can make training faster.

Hyperparameter Tuning

Deep neural networks have numerous hyperparameters that require careful tuning to achieve
optimal performance.

Solution: Use automated hyperparameter optimization methods, such as Bayesian optimization or
genetic algorithms, to find good hyperparameters efficiently.

Convergence Speed

It is important that training converges quickly, especially with large datasets and complex
architectures.

Solution: Adopt learning rate scheduling or adaptive algorithms like Adam or RMSprop to
expedite convergence.

Activation Function Selection

Choosing an appropriate activation function is important when building a neural network,
because it affects how well the model trains and the quality of its results.

Solution: ReLU and its variants (Leaky ReLU, Parametric ReLU) are popular choices due to
their ability to mitigate vanishing gradient issues.

Gradient Descent Optimization

Plain gradient descent can make slow or unstable progress on difficult loss surfaces.

Solution: More advanced optimization techniques can help. Examples are
stochastic gradient descent with momentum and Nesterov Accelerated Gradient.
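The momentum update can be sketched on a one-dimensional toy problem. This example is an assumption-laden illustration: it minimizes f(w) = w² using the exact gradient 2w rather than a stochastic estimate, and the learning rate, momentum coefficient, and step count are arbitrary choices. The velocity v accumulates past gradients, and the weight moves along the velocity.

```python
def gradient_descent_with_momentum(grad, w0, alpha=0.1, beta=0.9, steps=200):
    # v <= beta * v - alpha * grad(w);  w <= w + v
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v - alpha * grad(w)
        w = w + v
    return w

# Toy objective f(w) = w^2 with gradient 2w and minimum at w = 0.
w_final = gradient_descent_with_momentum(lambda w: 2.0 * w, w0=5.0)
print(w_final)
```

In stochastic gradient descent with momentum, the call to grad would instead return the gradient computed on a randomly chosen training point or mini-batch.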

Memory Constraints

Training large models on large datasets requires a lot of memory, and training can fail
if not enough is available.

Solution: Reduce memory usage by applying model quantization, using mixed-precision
training, or employing memory-efficient architectures like MobileNet or EfficientNet.

Transfer Learning and Domain Adaptation

Deep learning networks need a lot of data to work well. If too little data is available, or if
the data differs from what the model was trained on, performance degrades.

Solution: Leverage transfer learning or domain adaptation techniques to transfer knowledge
from pre-trained models or related domains.

Exploring Architecture Design Space

Designing neural network architectures is difficult because there are many possible designs.
Choosing the best architecture for a specific task can take considerable time and effort.

Solution: Use automated neural architecture search (NAS) algorithms to explore the design
space and discover architectures tailored to the task.

Adversarial Attacks

Deep neural networks are powerful models, but they can be fooled by
minimal changes to the input that humans cannot perceive. This can make them give wrong answers.

Solution: Employ adversarial training, defensive distillation, or certified robustness methods
to enhance the model's robustness against adversarial attacks.

Interpretability and Explainability

Understanding the decisions made by deep neural networks is crucial in critical applications
like healthcare and autonomous driving.

Solution: Adopt techniques such as LIME (Local Interpretable Model-Agnostic
Explanations) or SHAP (SHapley Additive exPlanations) to explain model predictions.

Handling Sequential Data

Training deep neural networks on sequential data, such as time series or natural language
sequences, presents unique challenges.

Solution: Utilize specialized architectures like recurrent neural networks (RNNs) or
transformers to handle sequential data effectively.

Limited Data

Training deep neural networks with limited labeled data is a common challenge, especially in
specialized domains.

Solution: Consider semi-supervised, transfer, or active learning to make the most of available
data.

Catastrophic Forgetting

When a model forgets previously learned knowledge after training on new data, it encounters
the issue of catastrophic forgetting.

Solution: Implement techniques like elastic weight consolidation (EWC) or knowledge
distillation to retain old knowledge during continual learning.

Hardware and Deployment Constraints

Deploying trained models on devices with limited computing power can be difficult.

Solution: Apply model compression techniques such as pruning, quantization, or knowledge
distillation so that models run efficiently on devices with limited resources.

Data Privacy and Security

When training models on sensitive data, it is essential to keep the data private and to ensure
the training systems are secure.

Solution: Employ federated learning, secure aggregation, or differential privacy techniques to
protect data and model privacy.

Long Training Times

Training deep neural networks is like assembling a large puzzle: the more pieces there are,
the longer it takes. Large models and datasets can require very long training times.

Solution: Use hardware accelerators such as GPUs or TPUs to speed up training, and
distribute training across multiple machines to shorten it further.

Exploding Memory Usage

Some models are so large that their memory requirements exceed what ordinary hardware
provides.

Solution: Explore memory-efficient architectures, use gradient checkpointing, or consider
model parallelism for training.

Learning Rate Scheduling

Setting an appropriate learning rate schedule can be challenging, and a poor choice harms
model convergence and performance.

Solution: Use a learning rate schedule such as step decay, exponential decay, or cosine
annealing, which lowers the learning rate as training progresses so that convergence is faster
and more stable.
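
As a rough illustration, the sketch below implements three widely used schedules (step decay, exponential decay, and cosine annealing). The function names and default constants (`drop`, `every`, `k`, `lr_min`) are assumptions chosen for the example, not values prescribed by these notes.

```python
import math


def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return lr0 * drop ** (epoch // every)


def exponential_decay(lr0, epoch, k=0.05):
    """Shrink the learning rate smoothly by a factor exp(-k) per epoch."""
    return lr0 * math.exp(-k * epoch)


def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    """Follow half a cosine wave from lr0 down to lr_min over total_epochs."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + cosine)


schedule = [cosine_annealing(0.1, epoch, total_epochs=100) for epoch in range(101)]
```

All three start at `lr0` and decrease over time; they differ in whether the decrease is abrupt, smooth, or slow at both ends.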

Avoiding Local Minima

Deep neural networks can get stuck in local minima during training, impacting the model's
final performance.

Solution: Strategies such as simulated annealing, momentum-based optimization, and
evolutionary algorithms can help the optimizer escape poor local minima.

Unstable Loss Surfaces

Finding good weights is hard when the loss surface is complicated and bumpy, because small
changes in the weights can cause large changes in the loss.

Solution: Utilize weight noise injection, curvature-based optimization, or geometric methods
to stabilize loss surfaces.

Session 5

SLO-1 The Problem of Overfitting

Overfitting occurs when a neural network learns to model not only the underlying patterns in
the data but also noise and random fluctuations. This leads to excellent performance on the
training data but poor generalization to unseen data.

Causes:

● High model complexity (too many layers or neurons).
● Insufficient training data relative to the model's capacity.

Solutions:

● Regularization: Add a penalty term (e.g., L1 or L2 regularization) to the loss
function to discourage overly complex models.
● Dropout: Randomly drop neurons during training to prevent reliance on specific
pathways.
● Data Augmentation: Increase the effective size of the training dataset by creating
transformed versions of existing data.
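
Two of the solutions above can be sketched in a few lines: an L2 penalty added to the loss, and dropout applied to a layer's activations. This is a minimal sketch; the function names, the penalty coefficient `lam`, and the "inverted dropout" rescaling (dividing kept activations by the keep probability) are assumptions for illustration.

```python
import random


def l2_penalty(weights, lam):
    """Penalty term lam * sum(w^2) that is added to the training loss."""
    return lam * sum(w * w for w in weights)


def l2_penalty_grad(weights, lam):
    """Gradient of the penalty; it pulls every weight toward zero."""
    return [2.0 * lam * w for w in weights]


def dropout(activations, p_drop, rng, training=True):
    """Zero each activation with probability p_drop during training only."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    # Kept activations are scaled by 1 / keep so the expected value is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
penalty = l2_penalty([1.0, -2.0], lam=0.1)
dropped = dropout([1.0] * 1000, p_drop=0.5, rng=rng)
unchanged = dropout([1.0, 2.0], p_drop=0.5, rng=rng, training=False)
```

At prediction time dropout is switched off (`training=False`), so all neurons contribute.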

SLO-2 The Vanishing and Exploding Gradient Problems

These problems arise during backpropagation:

● Vanishing Gradients: Gradients shrink to very small values as they propagate backward,
particularly in the early layers of deep networks, making it difficult for weights to update
effectively.
● Exploding Gradients: Gradients grow excessively large, leading to instability in
weight updates.

Causes:

● Use of activation functions like sigmoid or tanh, which saturate and yield near-zero
gradients for large-magnitude inputs.
● Poor weight initialization in deep networks.

Solutions:

● Use activation functions like ReLU, which do not saturate for positive inputs and so avoid
gradient squashing.
● Use advanced initialization techniques like Xavier Initialization or He Initialization.
● Apply Gradient Clipping to cap gradients at a maximum threshold.
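
The sketch below puts numbers on these points. It shows that the sigmoid derivative never exceeds 0.25 (so a product over many layers shrinks quickly), gives the standard deviations commonly used for Xavier and He initialization with normally distributed weights, and clips a gradient vector by its norm. Function names and the example values are assumptions for illustration.

```python
import math


def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """Standard deviation for He normal initialization (suited to ReLU)."""
    return math.sqrt(2.0 / fan_in)


def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]


# Best case for a 10-layer sigmoid chain: the gradient is scaled by 0.25 ** 10.
best_case_factor = sigmoid_grad(0.0) ** 10
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
```

Even in the best case the 10-layer factor is below one millionth, which is why early layers in deep sigmoid networks learn slowly.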

Session 6

SLO-1 Difficulties in Convergence

Convergence refers to the process of minimizing the loss function to reach the optimal
solution. A neural network may converge too slowly or fail to converge at all.

Causes:

● Improper learning rate (too high causes divergence; too low slows convergence).
● Noisy gradients due to small batch sizes in stochastic gradient descent.
● Poor scaling or normalization of input data.

Solutions:

● Learning Rate Adjustment: Use techniques like learning rate decay or adaptive
optimizers (e.g., Adam, RMSprop).
● Batch Normalization: Normalize activations across mini-batches to stabilize
training.
● Preprocess data by normalizing or standardizing features to improve gradient flow.
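
A toy example makes the learning-rate cause visible, and a short helper shows feature standardization. This is a sketch under stated assumptions: the loss f(w) = w^2, the three learning rates, and the function names are chosen only for illustration.

```python
import math


def gradient_descent(lr, w=1.0, steps=50):
    """Minimize f(w) = w**2, whose gradient is 2 * w, starting from w."""
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w


def standardize(column):
    """Rescale one feature column to zero mean and unit standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    std = std if std > 0 else 1.0  # constant columns are only centered
    return [(x - mean) / std for x in column]


w_good = gradient_descent(lr=0.1)      # converges toward the minimum at 0
w_slow = gradient_descent(lr=0.001)    # barely moves in 50 steps
w_diverged = gradient_descent(lr=1.1)  # overshoots further on every step
scaled = standardize([10.0, 20.0, 30.0])
```

With the same number of steps, only the moderate learning rate gets close to the minimum.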

SLO-2 Local and Spurious Optima

The loss functions of neural networks are nonconvex and have many local optima, which makes
good initialization essential in large parameter spaces. Pretraining, performed in a greedy,
layerwise manner, trains each layer individually to find better initial weights.

Unsupervised pre-training reduces overfitting and spurious optima by aligning the
initialization closer to "good" optima, improving generalization to test data. Unlike traditional
optimization, which focuses only on training loss, neural networks emphasize generalization,
where local optima are less problematic than training challenges like convergence failures.

Session 7

SLO-1 Computational Challenges

● Training neural networks, particularly in text and image domains, can take weeks.
● GPUs significantly accelerate neural network operations and have greatly benefited
training processes.
● Frameworks like Torch integrate GPU support to optimize performance.
● Recent progress in deep learning is largely due to applying existing algorithms on
modern hardware.
● Enhanced hardware allows for frequent testing of computationally intensive
algorithms, improving experimentation and development.
● Models like long short-term memory (LSTM), proposed in 1997, became practical
only recently due to advances in hardware and experimentation.
● Neural networks require heavy computation during training, but predictions are
efficient, involving fewer operations.
● Efficient predictions are vital for real-time tasks like image classification.
● Compression techniques enable trained networks to be deployed in mobile or
resource-constrained environments.

SLO-2 The Secrets to the Power of Function Composition

● Neural networks are computational structures that combine simpler functions
into complex ones.
● Their effectiveness comes from repeated application of nonlinear functions.
● A single hidden layer with many squashing functions can approximate any continuous
function but may require many parameters.
● Excessive parameters increase the risk of overfitting unless the dataset is very
large.
● Deep networks achieve similar expressiveness with fewer parameters, reducing
this risk.
● Nonlinear functions allow neural networks to model complex, non-linear
relationships.
● Using identity (linear) activations limits the network to a basic linear model,
reducing its power.
● Functions like ReLU and sigmoid are chosen for their ability to improve
representation and learning.
● These activations help enhance performance and control overfitting.
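
The point about identity activations can be checked directly. Assuming a two-layer network with weight matrices $W_1, W_2$ and bias vectors $b_1, b_2$ (symbols introduced here only for illustration), composing the two layers without a nonlinearity gives

```latex
\begin{aligned}
f(x) &= W_2 \,(W_1 x + b_1) + b_2 \\
     &= (W_2 W_1)\, x + (W_2 b_1 + b_2) \\
     &= W' x + b', \qquad W' = W_2 W_1, \quad b' = W_2 b_1 + b_2 .
\end{aligned}
```

The result is again a single linear layer, so stacking more identity-activation layers adds no expressive power. Inserting a nonlinear function between the layers is what prevents this collapse.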

1.5.1 The Importance of Nonlinear Activation

● Purpose of Nonlinearity: Linear models can't solve problems where data isn't linearly
separable. Nonlinear activation functions (e.g., ReLU, sigmoid, tanh) enable networks
to model complex functions.
● Power of Nonlinearity: Nonlinear activation functions allow neural networks to
approximate any continuous function (universal approximation theorem). This makes
them crucial for capturing intricate patterns in data.

1.5.2 Reducing Parameter Requirements with Depth

● Why Depth Helps: Adding layers increases the representational capacity without a
proportional increase in parameters. A deep model can compactly represent functions
that would require exponentially more parameters in a shallow network.
● Hierarchical Representation: Depth allows networks to learn hierarchical features,
with lower layers capturing simple patterns and higher layers capturing complex
abstractions.

1.5.3 Unconventional Neural Architectures

● Breaking the Mold: Traditional feedforward architectures have been challenged by
innovations like attention mechanisms, graph neural networks (GNNs), and capsule
networks.
● Examples:
○ Transformers: Leverage self-attention to capture global dependencies.
○ Recurrent Architectures: Capture temporal dynamics in sequential data.
○ Spiking Neural Networks: Mimic biological processes for energy-efficient
computation.
○ Neural Architecture Search (NAS): Automates the discovery of
unconventional architectures tailored for specific tasks.

Session 8

SLO-1 Common Neural Architectures

An overview of some common neural network architectures:

Feedforward Neural Networks (FNNs): These are the simplest type of neural networks,
where information flows in one direction from input to output, passing through hidden layers.

Convolutional Neural Networks (CNNs): Primarily used for image processing, CNNs use
convolutional layers to automatically learn spatial hierarchies of features in data, making
them effective for tasks like image recognition.

Recurrent Neural Networks (RNNs): Designed for sequential data, RNNs have loops that
allow information to persist, making them ideal for tasks like language modeling, speech
recognition, and time series prediction.

Long Short-Term Memory Networks (LSTMs): A type of RNN, LSTMs are specialized to
handle long-term dependencies by using gates to control the flow of information, mitigating
issues like vanishing gradients.

Generative Adversarial Networks (GANs): Comprising two networks (a generator and a
discriminator), GANs are used for generating new, synthetic data that resembles real data,
commonly applied in image generation.

Autoencoders: These networks learn to encode input data into a compressed representation
and then decode it back to its original form, typically used for dimensionality reduction,
denoising, and anomaly detection.

Transformers: These architectures use self-attention mechanisms to process sequences of
data in parallel, achieving state-of-the-art results in natural language processing tasks like
translation and text generation.

Radial Basis Function Networks (RBFNs): These are a type of neural network that uses
radial basis functions as activation functions, typically applied in classification and regression
problems.

Siamese Networks: These networks consist of two or more identical subnetworks that share
weights, commonly used for similarity learning tasks like face verification.

Capsule Networks (CapsNet): A newer architecture designed to improve CNNs by better
capturing spatial hierarchies and relationships between objects, addressing some of the
weaknesses in CNNs regarding viewpoint variations.

SLO-2 Simulating Basic Machine Learning with Shallow Models

● Basic machine learning models (e.g., linear regression, classification, SVMs, logistic
regression, matrix factorization) can be simulated using shallow neural networks with
one or two layers.
● Exploring these architectures highlights the power of neural networks, as many
traditional machine learning techniques can be replicated with simple models.
● Some basic neural network models, such as the Widrow-Hoff learning model, are
closely related to traditional models like Fisher's discriminant.
● Deeper architectures are often created by creatively stacking simpler models.
● The chapter will cover applications in text mining, graphs, and recommender systems.
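
As one example of a shallow model reproducing a classical technique, the sketch below trains a single linear neuron with the Widrow-Hoff (least mean squares) update, which amounts to fitting least-squares linear regression one sample at a time. The data, learning rate, and function name are assumptions chosen for illustration.

```python
def widrow_hoff_fit(xs, ys, lr=0.05, epochs=500):
    """Train one linear neuron y_hat = w * x + b with the Widrow-Hoff rule."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = y - (w * x + b)
            # Move the parameters in proportion to the error and the input.
            w += lr * error * x
            b += lr * error
    return w, b


# Noise-free samples of the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = widrow_hoff_fit(xs, ys)
```

After training, `w` and `b` should be close to the slope 2 and intercept 1 of the line that generated the data.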

Session 9

SLO-1 Radial Basis Function Networks

Radial Basis Function Networks (RBFNs) are a type of artificial neural network widely used
for pattern recognition, function approximation, and time-series prediction. They are
particularly valued for their simplicity and ability to model complex nonlinear relationships.

Key Components:

Input Layer: Accepts the input features.

Hidden Layer: Comprises neurons with radial basis functions, typically Gaussian functions.
Each neuron computes the similarity between the input and a prototype vector, producing an
activation based on the distance.

Output Layer: Provides the final output by combining the weighted activations from the
hidden layer.

Working Principle:

Each hidden neuron in an RBFN responds to the input based on its distance from a predefined
center. The response decreases as the input moves farther from the center, determined by a
radial basis function (e.g., Gaussian).

Training involves two steps:

● Determining the centers and widths of the radial basis functions.
● Optimizing the weights in the output layer, often using least squares or other
regression techniques.
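
The two training steps can be sketched on the XOR problem. In this minimal example the centers and the width parameter `gamma` are fixed by hand (step one), and the output weights are fitted by gradient descent on the squared error, used here as a simple stand-in for a least-squares solver (step two). The centers, `gamma`, learning rate, and function names are all assumptions for illustration.

```python
import math


def rbf_features(x, centers, gamma=1.0):
    """Gaussian activation of each hidden neuron for input x."""
    features = []
    for c in centers:
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        features.append(math.exp(-gamma * sq_dist))
    return features


def fit_output_layer(X, y, centers, gamma=1.0, lr=0.5, epochs=5000):
    """Fit output weights (plus a bias) by minimizing the mean squared error."""
    phi = [rbf_features(x, centers, gamma) + [1.0] for x in X]  # 1.0 = bias input
    w = [0.0] * (len(centers) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, target in zip(phi, y):
            error = sum(wi * pi for wi, pi in zip(w, row)) - target
            for j, pj in enumerate(row):
                grad[j] += error * pj / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def predict(x, centers, w, gamma=1.0):
    """Weighted sum of the hidden activations plus the bias."""
    row = rbf_features(x, centers, gamma) + [1.0]
    return sum(wi * pi for wi, pi in zip(w, row))


X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]                      # XOR is not linearly separable
centers = [(0, 0), (1, 1)]            # step 1: centers chosen by hand
w = fit_output_layer(X, y, centers)   # step 2: fit the output weights
predictions = [predict(x, centers, w) for x in X]
```

In the Gaussian feature space the four XOR points become linearly separable, so a purely linear output layer is enough to fit them.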

Advantages:

Fast Training: Once the centers and widths are fixed, only the output weights are trained, so
the training process is relatively quick compared to deep networks.

Interpretability: The localized nature of radial basis functions provides insights into how the
network makes decisions.

Applications:

RBFNs are used in classification, regression, and control systems. Their ability to
approximate any continuous function makes them particularly suited for engineering,
financial modeling, and medical diagnosis.

RBFNs are conceptually simple yet powerful tools in machine learning, offering a balance
between interpretability and computational efficiency.

SLO-2 Restricted Boltzmann Machines

Restricted Boltzmann Machines (RBMs) are probabilistic generative models capable of
learning a probability distribution over a set of inputs. RBMs are energy-based models and a
foundational building block for deep learning architectures, such as deep belief networks
(DBNs).

Key Characteristics:

1. Bipartite Architecture:
○ RBMs have two layers:
■ Visible Layer: Represents the observed input data.
■ Hidden Layer: Learns to capture latent features of the data.
○ There are no connections between nodes within the same layer, making the
structure "restricted."
2. Energy Function:
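○ The RBM defines an energy function that scores each joint configuration of the
visible and hidden layers, with lower energy corresponding to more probable
configurations:

$$E(v, h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_i \sum_j v_i W_{ij} h_j$$

where v and h are the visible and hidden units, W_{ij} is the weight between v_i
and h_j, and b_i, c_j are the visible and hidden biases.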

3. Training Using Contrastive Divergence (CD):
○ RBMs are trained by minimizing the divergence between the observed and
reconstructed data.
○ Aggarwal emphasizes the use of CD for efficient training, involving
alternating Gibbs sampling to approximate gradients.
4. Generative and Feature Learning:
○ RBMs learn to reconstruct inputs, making them effective as generative
models.
○ The hidden layer features can be used for tasks like dimensionality reduction
and classification.
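
The energy function and one contrastive divergence update can be sketched as follows. This is a minimal illustration under stated assumptions: binary visible and hidden units, the standard RBM energy E(v, h) = -sum_i b_i v_i - sum_j c_j h_j - sum_ij v_i W_ij h_j, and a single Gibbs step (often called CD-1). The function names, learning rate, and toy data are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def energy(v, h, W, b, c):
    """Energy of a joint visible/hidden configuration (lower is more probable)."""
    bias_terms = sum(bi * vi for bi, vi in zip(b, v)) + sum(cj * hj for cj, hj in zip(c, h))
    pair_terms = sum(v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h)))
    return -bias_terms - pair_terms


def hidden_probs(v, W, c):
    """P(h_j = 1 | v) for every hidden unit."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v)))) for j in range(len(c))]


def visible_probs(h, W, b):
    """P(v_i = 1 | h) for every visible unit."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h)))) for i in range(len(b))]


def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_update(v0, W, b, c, lr, rng):
    """One CD-1 step: data statistics minus one-step reconstruction statistics."""
    ph0 = hidden_probs(v0, W, c)
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, b), rng)  # reconstruction of the input
    ph1 = hidden_probs(v1, W, c)
    for i in range(len(v0)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])


rng = random.Random(0)
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3 visible units, 2 hidden units
b, c = [0.0, 0.0, 0.0], [0.0, 0.0]
for _ in range(100):
    cd1_update([1, 1, 0], W, b, c, lr=0.1, rng=rng)
features = hidden_probs([1, 1, 0], W, c)
```

The values in `features` are the hidden-layer activations that could be passed on as learned features for a downstream task.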

Applications:

Aggarwal highlights that RBMs are useful in:

● Collaborative filtering (e.g., recommendation systems).
● Feature extraction and pre-training in deep neural networks.
● Image recognition and dimensionality reduction.

RBMs are an important stepping stone in the evolution of deep learning, offering insight into
unsupervised learning and feature discovery in neural architectures.
