Advanced ML Slides Intro

Advanced ML course EE 782 by Amit Sethi, IIT Bombay, Autumn 2025. Covers convolutional networks, NLP, LSTMs, attention, GloVe, etc.


Introduction to Advanced Topics in Machine Learning
EE 782 Advanced Topics in Machine Learning
July-Nov 2025
Amit Sethi, EE, IITB
Contact: asethi@iitb, 3528, 7483

Why this course
• Rapid expansion of data and compute
  • Memory and data transfer have become cheaper
  • Compute has become cheaper
  • Multi-modal data being collected and shared
• Sudden explosion in the use of ML
  • Face recognition
  • Autonomous driving
  • Text and image generation
• Rapid pace of research and development

What is the relation between AI, ML, DL etc.?
• Nested fields: Artificial intelligence ⊃ Machine learning ⊃ Neural networks ⊃ Deep learning

AML will be synonymous with deep learning
• Advanced machine learning is very broad
  • Deep learning
  • Probabilistic methods
  • Theoretical ML
  • Reinforcement learning
  • ...
• Most practical and exciting applications are coming from DL
• More than enough to be covered in deep learning alone

Learning outcomes
• Formulate advanced machine learning problems
  • Inputs, outputs, labels, annotations, data quantity, data quality, prior knowledge
• Analyze and propose neural architectures
  • How are data dimensions related
  • Changes in architectures to improve outcomes
• Analyze and formulate training methods
  • Amount and quality of data
  • Using prior knowledge

What will be covered
• ML problems
  • Lack of labels
  • Mislabeled data
  • Use of prior knowledge
• Neural network architectures
  • For different types of input data
  • For different formats of output
  • For different levels of compute
• Training methods
  • Loss functions
  • Speed of convergence
  • Pre-training and fine-tuning
  • Robustness

Course material
• Textbooks:
  • “Dive into Deep Learning” by Aston Zhang, Zachary C. Lipton, Mu Li, and Alex Smola
  • “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
• Papers:
  • Check out the reading list on Moodle

Pre-requisites
• Introduction to ML
  • Supervised learning: regression, classification
  • Unsupervised learning: clustering, dimension reduction
• Linear algebra
  • Matrix and vector arithmetic
  • Subspaces, eigen decomposition, etc.
• Probability
  • Joint distributions: marginals, conditionals
  • KL divergence, etc.

Evaluation structure
• Daily notes upload (20 x 0.5 = 10 marks)
• Programming assignments (8 x 3 = 24 marks)
• Mid-sem exam (20 marks)
• End-sem exam (30 marks)
• Project proposal (2 marks)
• Project (14 marks)
  • (IEEE style self-published paper on ArXiv, GitHub repo, video demo)
• Absolute grading (90+ AA, 80+ BB, etc.)
• Audit: 30+
Agenda
• Overview of machine learning
• Revision of neural networks
• Tentative list of topics in AML

ML is…
• The practice of automating the use of related data to estimate models that make useful predictions about new data, where the model is too complex for standard statistical analysis, e.g.
  • Improve accuracy of classification of images using labeled images
  • Improve win percentage in AlphaGo using several simulated game move sequences and their results
  • Improve the Turing-test confusion between human and machine for NLP Q&A using a large sample of text including Q&A

When not to use ML
• Possible inputs are countable and few
  • Use look-up tables
• Algorithm is well-known and efficient
  • E.g. sorting, Dijkstra's shortest path
• Model is well-known and tractable
  • Use statistical estimation
• There is no notion of contiguity
  • Use discrete variable methods or give up
• Lack of data
  • Use transfer learning or few-shot learning, or give up

When to use ML
• Possible inputs are many or continuous
• No well-known or efficient algorithm
• Model is not well-known or tractable
• Strong notion of contiguity
• Good amount of data
• Desired output known
• Well-defined inputs

Sweet spot for ML
• Lots of structured data
• Explainability is not critical
• Prediction accuracy is the primary goal
• Underlying model is complex but stationary

ML model training and deployment
[Figure: training on past data, followed by prediction on future data]

Type of ML problems
● Supervised learning: uses labeled data
  ○ Classification: labels are discrete
  ○ Regression: labels are continuous
  ○ Ranking: labels are ordinal
● Unsupervised learning: uses unlabeled data
  ○ Clustering: divide data into discrete groups
  ○ Dimension reduction: represent data with fewer numbers
● Somewhere in between: fewer labels than one per example
  ○ Semi-supervised learning: some examples are labeled
  ○ Weakly supervised learning: groups of examples are labeled
  ○ Reinforcement learning: label (reward) is available after a sequence of steps

Supervised Learning
● Predictor variables/features and a target variable (label)
● Aim: predict the target variable (label), given the predictor variables
  ○ Classification: target variable (y) consists of categories
  ○ Regression: target variable (label) is continuous

Broad types of ML problems
• Output type: Categorical | Ordinal | Continuous
• Supervised: Classification | Ranking | Regression
• (Examples): {Cats, dogs} | {Low, Med, High} | [-20, +10)
• Unsupervised: Clustering, dimension reduction
Recipe for ML training
• Decide on the type of the ML problem
• Prepare data
• Shortlist ML frameworks
• Prepare training, validation, and test sets
• Train, validate, repeat
• Use test data only once

Some popular ML frameworks (by data type and problem)
• Vector data
  • Classification: Logistic regression, SVM, RF, NN
  • Regression: Linear regression (and regularization)
  • Clustering: K-means, Fuzzy C-means, DB-SCAN
  • Dimension reduction: PCA, k-PCA, LLE, ISOMAP
• Series, text: RNN, LSTM, Transformer, 1-D CNN, HMM
• Images: 2-D CNN, MRF
• Video, MRI: 3-D CNN, CNN+LSTM, MRF

ML gives a model (Supervised Machine Learning System - Training)
• Elements of a model:
  • Input xi
  • Function fθ(xi), with hyper-parameters and parameters θ
  • Output fθ(xi)
  • Target output ti
  • Loss function and learning algorithm
• Utility of the model:
  • Bring fθ(xi) close to ti
  • Minimize loss L(ti, fθ(xi), θ)

Components of a Trained ML System (Supervised Machine Learning System - Testing)
• Input xi → Model (hyper-parameters, parameters θ) → Output fθ(xi)

Mathematically speaking…
§ Determine f such that ti = f(xi) and g(T, X) is minimized for an unseen set of T and X pairs, where T is the ground truth that cannot be used
§ The form of f is fixed, but some parameters can be tuned
§ So, y = fθ(x), where x is observed, and y needs to be inferred
  § E.g. y = 1 if mx > c, and y = 0 otherwise, so θ = (m, c)
§ Machine learning is concerned with designing algorithms that learn “better” values of θ given “more” x (and t) for a given problem

Preparing data
• Remove useless data
  • No variance
  • Falsely assumed to be available
• Handle missing data
  • Impute, if sporadic
  • Drop, if too frequent
• Reduce redundancy
  • Correlated (Pearson and Spearman)
• Transform variables (see the sketch below)
  • Convert discrete variables to one-hot bits
  • Normalize continuous variables
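As a concrete illustration of the last two bullets, here is a minimal NumPy sketch of one-hot encoding a discrete variable and normalizing a continuous one; the column names and values are made up for illustration.

```python
import numpy as np

# Toy columns: one discrete variable (city id) and one continuous variable (income)
city = np.array([0, 2, 1, 0])                  # discrete variable with 3 categories
income = np.array([30e3, 75e3, 52e3, 41e3])    # continuous variable

# Convert the discrete variable to one-hot bits
one_hot = np.eye(3)[city]                      # shape (4, 3)

# Normalize the continuous variable to zero mean and unit variance
income_norm = (income - income.mean()) / income.std()

# Final feature matrix: one-hot columns followed by the normalized column
X = np.hstack([one_hot, income_norm[:, None]])
print(X)
```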

Models must exploit structure of data
• Records, e.g.:
    Product SKU   Price   Margin   Volume
    A123ajkhdf    $120    30%      1,000,000
    B456ddsjh     $200    10%      2,000,000
• Temporal order
• Spatial order
• Web of relationships
Images courtesy: Pixabay.com

Loss and accuracy
• Training accuracy saturates to a maximum
• Training loss saturates to a minimum
• Loss is a measure of error

Loss versus performance metric
● Loss is a convenient expression used for guiding the learning (optimization)
● Loss is related to the performance metric, but it is not the same
● Loss also includes regularization
● The performance metric is what is used to judge the model
● The performance metric makes sense only on held-out (validation or test) data
Loss function tells how bad the model is
• Loss trends opposite of accuracy
  • Loss is low when accuracy is high
  • Loss is zero for perfect accuracy (by convention)
  • Loss is high when accuracy is low
• Loss is a function of the actual and desired output
• Minimizing the loss function with respect to the parameters leads to good parameters
• Note: low loss on training does not guarantee low loss on validation or testing

Properties of a good loss function
• Minimum value for perfect accuracy
  • Usually zero
• Varies smoothly with input
• Varies smoothly with parameters
• Good to be convex in parameters (but is usually not)
  • Like a paraboloid

Convex vs. non-convex loss
[Figure: a convex (bowl-shaped) loss surface versus a non-convex loss surface]

Non-convex loss can have multiple minima
[Figure: error versus parameter curve with several local minima; original image source unknown]

Examples of loss functions
• Regression with continuous output
  • Mean square error (MSE), log MSE, mean absolute error
• Classification with probabilistic output
  • Cross entropy (negative log likelihood), hinge loss
• Similarity between vectors or clustering
  • Euclidean distance, cosine

MSE loss for regression
• Model: f(x) = wx + b
• Error: y_i − f(x_i)
• Square error: (y_i − f(x_i))²
• MSE: (1/N) Σ_{i=1}^{N} (y_i − f(x_i))²
[Figure: fitted line through (x, y) points; original image source unknown]

Is MSE always appropriate?
[Figure: a regression fit being pulled away from most points by an outlier]

MAE loss is less affected by outliers than MSE
• Error: y_i − f(x_i)
• Absolute error: |y_i − f(x_i)|
• MAE: (1/N) Σ_{i=1}^{N} |y_i − f(x_i)|

Is MSE appropriate for classification?
• Model: f(x_i) = 1 / (1 + e^−(w x_i + b)), which squashes the output between 0 and 1
[Figures: MSE fit versus MAE fit on data with an outlier, and a sigmoid curve fit to binary labels; original image source unknown]
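A small NumPy sketch of the MSE and MAE formulas above (toy numbers made up for illustration), showing how a single outlier inflates MSE much more than MAE:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 15.0])   # last prediction is an outlier

mse = np.mean((y_true - y_pred) ** 2)    # (1/N) * sum of squared errors
mae = np.mean(np.abs(y_true - y_pred))   # (1/N) * sum of absolute errors

print(f"MSE = {mse:.2f}, MAE = {mae:.2f}")   # MSE is dominated by the outlier
```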


Cross entropy loss is preferred for classification
• How much does one (estimated) probability distribution q(x) deviate from another (real) p(x)?
• KL-divergence of q(x) from p(x)
• For binary classification: −{y log f(x) + (1 − y) log(1 − f(x))} (see the sketch below)
[Figure: loss as a function of f(x) ∈ [0, 1] when y = 1; original image source unknown]

Overfitting and underfitting
• Compare training and validation loss
[Figure: underfitting versus overfitting]

Regularization is a key concept in ML
● Regularization means constraining the model
● More constraints may reduce model fit on training data
● However, it may improve fit on validation and test data
● Training performance of more constrained models is more likely to reflect test performance
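Returning to the binary cross-entropy formula above, here is a minimal NumPy sketch, assuming f(x) already outputs a probability; the clipping constant is only there to avoid log(0), and the labels and probabilities are made up.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """-{y log f(x) + (1 - y) log(1 - f(x))}, averaged over samples."""
    p = np.clip(p, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])            # true labels
p = np.array([0.9, 0.2, 0.6, 0.99])   # predicted probabilities f(x)
print(binary_cross_entropy(y, p))
```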

Parameters and Hyperparameters
● Parameters: variables whose values are updated during the training process of the model
  ○ Feature coefficients in a regression model
  ○ Weights of a neural network
● Hyperparameters: variables whose values are fixed by the model developer before the learning process begins
  ○ Number of variables in a tree node
  ○ Height of a tree
  ○ Number of layers of a neural network
Image source: Wikipedia

Examples of hyper-parameters and parameters
• f(x) = wᵀx + b
  • No hyperparameter
  • Parameters are w and b
• f(x) = w2 x² + w1 x + w0
  • Hyper-parameter is the degree, 2
  • Parameters are w2, w1, and w0

Under-constrained models lead to overfitting
• An n-degree polynomial can fit n points perfectly
• But, is it overfitting?
• Is it being swayed by outliers?
• “Models should be as simple as possible, but not simplistic”
• To make the model simpler:
  • Restrict the number of parameters, or
  • Restrict the set of values that they can take
• Always check validation performance
Original image source unknown

Regularization is constraining a model
• How to regularize?
  • Reduce the number of parameters
  • Share weights in structure
  • Constrain parameters to be small
  • Encourage sparsity of output in loss
• Most commonly Tikhonov (or L2, or ridge) regularization (a.k.a. weight decay)
  • Penalty on sums of squares of individual weights

L2-regularization visualized
• J = (1/N) Σ_{i=1}^{N} (y_i − f(x_i))² + (λ/2) Σ_{j=1}^{n} w_j²,  where f(x_i) = Σ_{j=0}^{n} w_j x_i^j
Original image source unknown
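A hedged NumPy sketch of the objective J above: ridge-regularized polynomial regression solved in closed form. The degree, λ, and toy data are arbitrary illustrative choices, and the penalty here is applied to all coefficients (including the constant term) for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 12)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(x.size)   # noisy toy targets

degree, lam = 9, 1e-2
X = np.vander(x, degree + 1, increasing=True)   # columns are x^0, x^1, ..., x^degree

# Minimize J(w) = (1/N)||y - Xw||^2 + (lam/2)||w||^2  (closed-form ridge solution)
N = x.size
w = np.linalg.solve(X.T @ X / N + (lam / 2) * np.eye(degree + 1), X.T @ y / N)

print(np.round(w, 3))   # a larger lam shrinks the high-degree coefficients toward zero
```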
Other forms of regularization
• Convolutional filter structure in CNN neurons
• Max-pooling
• Dropout
• L1-regularization (sparsity-inducing norm)
  • Penalty on sums of absolute values of weights
Original image source unknown

Preparing data for training and validation
• Data splits:
  • Training → used to optimize the parameters (e.g. random 70%)
  • Validation → used to compare models (e.g. random 15%)
  • Testing → one final check after multiple rounds of validation (e.g. random 15%)
• Cross-validation:
  • K folds: one fold for validation, K−1 folds for training
  • Rotate folds K times
  • Select the framework (hyperparameters) with the best average performance
  • Re-train the best framework on the entire data
  • Test one final time on held-out data that was not a part of any fold

Cross-validation
● Model performance measurement is dependent on the way the data is split
● A single split may not be representative of the model’s ability to generalize
● Solution: cross-validation, especially when data is scarce (see the sketch below)
● Con: more computations
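A minimal NumPy sketch of the K-fold recipe above (rotate each fold into the validation role and average the validation score); the model is a plain least-squares fit and the data is synthetic, just to keep the example self-contained.

```python
import numpy as np

def k_fold_scores(X, y, K=5, seed=0):
    """Rotate each fold as the validation set; return per-fold validation MSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    scores = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)   # simple linear model
        scores.append(np.mean((X[val] @ w - y[val]) ** 2))        # validation MSE
    return np.array(scores)

rng = np.random.default_rng(1)
X = np.hstack([rng.standard_normal((100, 3)), np.ones((100, 1))])   # features + bias column
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(100)
print(k_fold_scores(X, y).mean())   # average validation MSE across the folds
```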

ML can fail to perform in deployment
• Lack of training diversity: data had limited confounders
  • Single speaker, author, camera, background, accent, ethnicity, etc.
• Data imbalance between high-value rare and more common examples
• Proxy label leak during training:
  • E.g. only speakers A and B provide the emotion “anger,” so ML confuses their voice characteristics with “anger”
• Too much manual cleansing of training data
• Too little training data, and very complex models
• Concept drift: the assumptions behind training are no longer valid

ML life stages
[Figure: ML life stages]

Try to ask critical questions in response to the following statements
• “We need to store every keystroke and mouse movement during use of company computers so that we can run ML on it later.”
• “Our product can give biometric access based on face recognition. No need for cumbersome finger, ID, or iris scans.”
• “We can detect pneumonia in chest x-rays with 99% accuracy.”
• “We can recognize people’s emotions with 91% accuracy from their faces as they watch videos on our website.”
• “Our video call plug-in can tell when people are lying.”
• “Our autonomous vehicle is 20 times safer than an average driver.”
Relation of ML to other fields
[Figure: Machine Learning in relation to Probability and Statistics, Optimization, Linear Algebra, Programming, and Data Science; the part that most ML courses cover]
Image courtesy: Pixabay.com

Overlooked questions
• Problem
  • Right business or societal need
  • Realistic; not too optimistic
• Data and provenance
  • Relevant and enough
  • Diverse and representative
  • Ethically gathered and stored
  • Meticulously recorded
• Model
  • Exploits structure in data
  • Sufficiently complex
  • Not too complex
  • Meets deployment constraints
• Validation
  • Covers diversity of use cases
  • Meaningful performance metrics
• User guidance
  • Declaration of intended use case
  • Description of data used
• Monitoring in-field performance
  • Data differences, concept drift
  • Unethical and unauthorized use

Is AI / ML just a fad?
• Other hot buzzwords for VCs, businesses, and colleges over the last 30 years:
  • Programming, data structures, databases, IT
  • Computer networks, wireless communication
  • Web 2.0
  • Nano technology
  • Crypto
• Some of these have deeply affected our economies and society, and have reached maturity from the PoV of ongoing innovation
• Others have made a limited impact as challenges to their promises became better understood
Technology life cycle has to be understood critically
[Figure: value versus time curves showing innovation, maturity, decline or commodification, and a new wave of innovation]
Image courtesy: Pixabay.com

ML is being deeply embedded in our economy and society
• Automated pattern recognition is already here
  • Images and videos: find people, objects, diseases, …
  • Voice: convert to text, …
  • Text: queries, chat bots, translation, …
  • Time series: stocks, power consumption, …
• Automated decision-making is coming
  • Economic decisions: customer targeting, credit approval, …
  • Autonomous machines: driverless cars, drones, robots, …
  • Critical decisions: medical diagnosis, criminal forensics, …

Waves of AI
• Discriminative AI
  • CNN, UNet, YOLO
  • LSTM, GRU
  • Transformer
• Generative AI
  • GAN, VAE
  • Diffusion
  • State space models
• Agentic AI
  • Multi-objective
  • Multi-modal input
  • Goal oriented

Discriminative AI distinguishes between inputs
• Vision
  • Image classification: Does the image have a cat or a dog?
  • Image segmentation: Which pixels represent a cat?
  • Object detection: Put a box on the cat.
• NLP
  • Text classification: Is this a positive or a negative review?
  • Text annotation: Where is the positive clause in this paragraph?
• Audio
• Video

Where would basic ML struggle? Large number of dimensions on a grid
• What features should we extract from an image?
• Should pixels be features?
• Is Euclidean distance a good metric?
Image source: Wikipedia

Where would basic ML struggle? Learning from similar domains
• If we have lots of labeled digits
• But only a few labeled examples of an ancient script
• Can we transfer learning?
Image source: Wikipedia

Where would basic ML struggle? Scaling feature extraction with input
• More samples
• Higher dimensional samples
Image source: Wikipedia

Where would basic ML struggle? Depth vs. width
[Figure: a feed-forward network with inputs x1 … xd, hidden units h11 … h1n1, and outputs y1 … yn]

Where would basic ML struggle? Data dimensions form a graph
• What if variables are related as graphs?
• What if different variables are missing in different samples?


Where would basic ML struggle? How to use unlabeled data
[Figure]

Where would basic ML struggle? What if labels are given to bags of samples
[Figure]

Advances in AI - image recognition
[Figure]
Source: devopedia.org/imagenet

Where would basic ML struggle? How to generate more samples
[Figure: discriminative versus generative models]

How does generative AI work?
We pit “counterfeiter” and “police” models to compete with each other (see the sketch below)
• The counterfeiter generates “artificial” output
• The police learns to distinguish between “artificial” and “natural”
• The goal of the counterfeiter is to generate more believable “natural” outputs
• The goal of the police is to get better at spotting minute differences

Generative AI - “Gandhi in Monet style”
[Figure]
Source: openai.com/product/dall-e-2
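The counterfeiter/police description above is the idea behind a GAN. Here is a heavily simplified PyTorch sketch of that two-player loop on 1-D toy data; the network sizes, learning rates, and data distribution are arbitrary illustrative choices, not the course's reference implementation.

```python
import torch
from torch import nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))   # counterfeiter (generator)
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))   # police (discriminator, logit output)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0   # "natural" samples from a toy distribution
    fake = G(torch.randn(64, 8))            # "artificial" samples made from noise

    # Police: label real as 1 and fake as 0, and get better at telling them apart
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Counterfeiter: try to make the police call its output "natural" (label 1)
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```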

Generative AI created realistic output
• Vision:
  • Image enhancement
  • Super resolution
  • Image inpainting
• Text:
  • Q&A, chatbots, RAG
  • Article and creative writing
• Multi-modal:
  • Vision-language models
  • Creating images in certain styles, image editing

“Advantages of brushing teeth at night instead of the morning” (ChatGPT)
Brushing teeth at night has several advantages over brushing in the morning:
• Removes plaque and bacteria: Brushing at night removes plaque and bacteria that have accumulated throughout the day, preventing them from causing damage to your teeth and gums overnight.
• Prevents bad breath: Bad breath in the morning is often caused by bacteria that have been allowed to multiply in the mouth overnight. Brushing at night helps to remove these bacteria, preventing bad breath in the morning.
• Reduces the risk of cavities: When you brush at night, you remove food particles and sugar from your teeth, reducing the risk of cavities. If these particles are left on your teeth overnight, they can cause tooth decay.
• Promotes better sleep: Brushing your teeth before bed can promote a sense of cleanliness and relaxation, helping you to fall asleep more easily.
• Gives fluoride more time to work: If you use fluoride toothpaste, brushing at night gives the fluoride more time to work on your teeth, protecting them against decay.
Overall, brushing your teeth at night is an essential part of maintaining good oral hygiene and preventing dental problems.
Source: chat.openai.com

“Literature review on domain adaptation in machine learning”
[Figure: the same query answered by ChatGPT and by Google Scholar]
“How to show risk table but change the font of axis titles of risk tables in R ggsurvplot”
[Figure: ChatGPT answering a coding question]
Source: chat.openai.com

“The age of average” (by Alex Murrell)
“So, this is your call to arms. Whether you’re in film or fashion, media or marketing, architecture, automotive or advertising, it doesn’t matter. Our visual culture is flatlining and the only cure is creativity.
It’s time to cast aside conformity. It’s time to exorcise the expected. It’s time to decline the indistinguishable.
For years the world has been moving in the same stylistic direction. And it’s time we reintroduced some originality.”

Agentic AI will orchestrate various tools for specific purposes
• Agentic AI can make decisions, take autonomous actions, and continually learn from interactions to achieve specific objectives
• Characteristics
  • Autonomy
  • Goal-oriented
  • Adaptability
  • Proactive

Example utilization of different AI components
[Figure: Sensors → Discriminative AI → Agentic AI → Generative AI → Actuators]

Other predictions for AI in 2025
• Generative AI will plateau and hype will die down
• Realism will set in, but more use cases will be discovered
• Businesses will become savvy about evaluating AI
• Agentic AI will become the buzzword
• Human feedback will make a comeback
• Debate about ethics and regulations will pick up pace

Other considerations
• Cost: dedicated hardware or cloud services, and manpower
• Power and impact on the environment
• Concentration of AI talent and innovations
• Conformity
• Loss of nuance
• Tailoring to in-house data and knowledge

Agenda
• Overview of machine learning
• Revision of neural networks
• Tentative list of topics in AML

Activation function is the secret sauce of neural networks
• Neural network training is all about tuning weights and biases
[Figure: a single neuron with inputs x1, x2, x3, weights w1, w2, w3, bias b, a summation Σ, and an activation g]
• If there were no activation function g, the output of the entire neural network would be a linear function of the inputs
• The earliest models used a step function

Types of activation functions
• Step: the original concept behind classification and region bifurcation. Not used anymore
• Sigmoid and tanh: trainable approximations of the step function
• ReLU: currently preferred due to fast convergence
• Softmax: currently preferred for the output of a classification net. A generalized sigmoid
• Linear: good for modeling a range in the output of a regression net
Formulas for activation functions (see the sketch below)
• Step: g(x) = (sign(x) + 1) / 2
• Sigmoid: g(x) = 1 / (1 + e^−x)
• Tanh: g(x) = tanh(x)
• ReLU: g(x) = max(0, x)
• Softmax: g(x_j) = e^{x_j} / Σ_i e^{x_i}
• Linear: g(x) = x

Step function divides the input space into two halves: 0 and 1
• In a single neuron, the step function is a linear binary classifier
• The weights and biases determine where the step will be in n dimensions
• But, as we shall see later, it gives little information about how to change the weights if we make a mistake
• So, we need a smoother version of a step function
• Enter: the sigmoid function

The sigmoid function is a smoother step function
• Smoothness ensures that there is more information about the direction in which to change the weights if there are errors
• The sigmoid function is also mathematically linked to logistic regression, which is a theoretically well-backed linear classifier
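A direct NumPy transcription of the activation formulas above (a minimal sketch; the softmax subtracts the maximum only for numerical stability, which does not change its value):

```python
import numpy as np

def step(x):    return (np.sign(x) + 1) / 2
def sigmoid(x): return 1 / (1 + np.exp(-x))
def tanh(x):    return np.tanh(x)
def relu(x):    return np.maximum(0, x)
def linear(x):  return x

def softmax(x):
    e = np.exp(x - np.max(x))   # shift by the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(step(z), sigmoid(z), relu(z), softmax(z), sep="\n")
```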

The problem with sigmoid is (near) zero gradient on both extremes
• For both large positive and large negative input values, the sigmoid output doesn’t change much with a change of input
• ReLU has a constant gradient for almost half of the inputs
• But, ReLU cannot give a meaningful final output

Output activation functions can only be of the following kinds
• Sigmoid gives binary classification output
• Tanh can also do that, provided the desired output is in {−1, +1}
• Softmax generalizes sigmoid to n-ary classification
• Linear is used for regression
• ReLU is only used in internal (non-output) nodes

Basic structure of a neural network
[Figure: layers of nodes, from inputs x1 … xd through hidden units h11 … h1n1 to outputs y1 … yn]
• It is feed forward
  • Connections go from inputs towards outputs
  • No connection comes backwards
• It consists of layers
  • The current layer’s input is the previous layer’s output
  • No lateral (intra-layer) connections
• That’s it!

Basic structure of a neural network
• Input layer
  • Represents the dimensions of the input vector (one node for each dimension)
  • These usually form an input layer, and usually there is only one such layer
• Hidden layer(s)
  • Represent the intermediary nodes that divide the input space into regions with (soft) boundaries
  • Usually, there is only one such layer
  • Given enough hidden nodes, we can model an arbitrary input-output relation
• Output layer
  • Represents the output of the neural network
  • For a two-class problem or regression with a 1-d output, we need only one output node
  • Usually, there is only one such layer

Importance of hidden layers
[Figure: a single sigmoid neuron separates + and − regions with one boundary; sigmoid hidden layers carve out more complex regions]
• The first hidden layer extracts features
• The second hidden layer extracts features of features
• …
• The output layer and sigmoid give the desired output

Overall function of a neural network
• f(x) = g_L(W_L · g_{L−1}(W_{L−1} · … g_1(W_1 · x) …))
• Weights form a matrix
• The output of the previous layer forms a vector
• The activation (nonlinear) function is applied point-wise to the weight matrix times the input
• Design questions (hyper-parameters):
  • Number of layers
  • Number of neurons in each layer (rows of the weight matrices)
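A minimal NumPy sketch of the formula above for one hidden layer, f(x) = g2(W2 · g1(W1 x + b1) + b2), with randomly initialized weights; the layer sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 4, 8, 3                        # input dim, hidden width, number of outputs
W1, b1 = rng.standard_normal((h, d)) * 0.1, np.zeros(h)
W2, b2 = rng.standard_normal((n, h)) * 0.1, np.zeros(n)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x):
    """f(x) = g2(W2 @ g1(W1 @ x + b1) + b2): matrices times vectors, point-wise activations."""
    a1 = relu(W1 @ x + b1)         # hidden layer
    return softmax(W2 @ a1 + b2)   # output layer

x = rng.standard_normal(d)
print(forward(x))                  # a probability vector over n classes
```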
Training the neural network
• Given x_i and y_i
• Think of what hyper-parameters and neural network design might work
• Form a neural network: f(x_i) = g_L(W_L · g_{L−1}(… g_1(W_1 · x_i) …))
• Compute f_w(x_i) as an estimate of y_i for all samples
• Compute loss: (1/N) Σ_{i=1}^{N} L(f_w(x_i), y_i) = (1/N) Σ_{i=1}^{N} l_i(w)
• Tweak w to reduce the loss (optimization algorithm)
• Repeat the last three steps

Loss function choice
• There are positive and negative errors, for which MSE is the most common loss function
• In classification there is a probability of the correct class, for which cross entropy is the most common loss function
[Figures: loss versus error for MSE and cross entropy]

Some loss functions and their derivatives
• Terminology
  • y is the output
  • t is the target output
• Mean square error
  • Loss: (y − t)²
  • Derivative of the loss: 2(y − t)
• Cross entropy
  • Loss: −Σ_{c=1}^{C} t_c log y_c
  • Derivative of the loss: −t_c / y_c

Computational graph of a single hidden layer NN
[Figure: x →(·W1, +b1)→ Z1 → ReLU → A1 →(·W2, +b2)→ Z2 → SoftMax → A2 → cross-entropy loss against the target]

Overall function of a neural network (recap)
• f(x) = g_L(W_L · g_{L−1}(… g_1(W_1 · x) …))
• Weights form a matrix; the output of the previous layer forms a vector
• The activation (nonlinear) function is applied point-wise to the weight matrix times the input
• Design questions (hyper-parameters): number of layers, and number of neurons in each layer (rows of the weight matrices)

Training the neural network (recap)
• Given x_i and y_i, think of what hyper-parameters and neural network design might work
• Form a neural network: f(x_i) = g_L(W_L · g_{L−1}(… g_1(W_1 · x_i) …))
• Compute f_w(x_i) as an estimate of y_i for all samples
• Compute loss: (1/N) Σ_{i=1}^{N} L(f_w(x_i), y_i) = (1/N) Σ_{i=1}^{N} l_i(w)
• Tweak w to reduce the loss (optimization algorithm), and repeat the last three steps

Gradient ascent
• If you didn’t know the shape of a mountain
• But at every step you knew the slope
• Can you reach the top of the mountain?

Derivative of a function of a scalar
• E.g. f(x) = ax² + bx + c, f′(x) = 2ax + b, f″(x) = 2a
• The derivative f′(x) = d f(x)/dx is the rate of change of f(x) with x
• It is zero when the function is flat (horizontal), such as at the minimum or maximum of f(x)
• It is positive when f(x) is sloping up, and negative when f(x) is sloping down
• To move towards the maxima, take a small step in the direction of the derivative

Gradient descent minimizes the loss function
• At every point, compute
  • Loss (scalar): l_i(w)
  • Gradient of the loss with respect to the weights (vector): ∇_w l_i(w)
• Take a step towards the negative gradient (see the sketch below):
  • w ← w − η (1/N) Σ_{i=1}^{N} ∇_w l_i(w)
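A tiny NumPy sketch of the update rule w ← w − η ∇l(w), run on the scalar example f(x) = ax² + bx + c from this slide; a, b, c, the learning rate, and the starting point are arbitrary illustrative choices.

```python
import numpy as np

a, b, c = 2.0, -4.0, 1.0           # f(x) = 2x^2 - 4x + 1, minimum at x = 1
grad = lambda x: 2 * a * x + b     # f'(x) = 2ax + b

x, eta = 5.0, 0.1                  # starting point and learning rate
for _ in range(50):
    x = x - eta * grad(x)          # step towards the negative gradient
print(x)                           # converges close to -b / (2a) = 1.0
```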
Gradient of a function of a vector
• Derivative with respect to each dimension, holding the other dimensions constant
• ∇f(x) = ∇f(x1, x2) = [∂f/∂x1, ∂f/∂x2]ᵀ
• At a minimum or a maximum the gradient is a zero vector: the function is flat in every direction
[Figure: surface f(x1, x2); original image source unknown]

Gradient of a function of a vector
• The gradient gives a direction for moving towards the minima
• Take a small step towards the negative of the gradient
• At a minimum or a maximum the gradient is a zero vector
[Figure: surface f(x1, x2); original image source unknown]

Example of gradient
• Let f(x) = f(x1, x2) = 5x1² + 3x2²
• Then ∇f(x) = ∇f(x1, x2) = [∂f/∂x1, ∂f/∂x2]ᵀ = [10x1, 6x2]ᵀ
• At the location (2, 1), a step in the [20, 6]ᵀ direction, or the normalized [0.958, 0.287]ᵀ direction, will lead to the maximal increase in the function
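A quick numerical check of the worked example above: at (2, 1) the analytic gradient of f(x1, x2) = 5x1² + 3x2² is [20, 6], and the normalized steepest-ascent direction is approximately [0.958, 0.287].

```python
import numpy as np

f = lambda x: 5 * x[0] ** 2 + 3 * x[1] ** 2
grad = lambda x: np.array([10 * x[0], 6 * x[1]])   # analytic gradient

x0 = np.array([2.0, 1.0])
g = grad(x0)
print(g, g / np.linalg.norm(g))   # [20. 6.] and approximately [0.958 0.287]

# Finite-difference check of the same gradient
eps = 1e-6
num = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(2)])
print(num)   # also approximately [20. 6.]
```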

This story is unfolding in multiple dimensions
[Figure: a multi-layer network from inputs x1 … xd through hidden units h11 … h1n1 to outputs y1 … yn; original image source unknown]

Backpropagation
• Backpropagation is an efficient method to do gradient descent
• It saves the gradient w.r.t. the upper layer’s output to compute the gradient w.r.t. the weights immediately below
• It is linked to the chain rule of derivatives
• All intermediary functions must be differentiable, including the activation functions

Chain rule of differentiation
§ Very handy for complicated functions
§ Especially functions of functions
§ E.g. NN outputs are functions of previous layers
§ For example: let f(x) = g(h(x))
§ Let y = h(x), z = g(y) = g(h(x))
§ Then f′(x) = dz/dx = (dz/dy)(dy/dx) = g′(y) h′(x)
§ For example: d sin(x²)/dx = 2x cos(x²)

Backpropagation makes use of the chain rule of derivatives
• Chain rule: ∂f(g(x))/∂x = (∂f(g(x))/∂g(x)) · (∂g(x)/∂x)
[Figure: the single-hidden-layer computational graph x →(·W1, +b1)→ Z1 → ReLU → A1 →(·W2, +b2)→ Z2 → SoftMax → A2 → cross-entropy loss against the target]

Vector valued functions and Jacobians
• We often deal with functions that give multiple outputs
• Let f(x) = [f1(x), f2(x)]ᵀ = [f1(x1, x2, x3), f2(x1, x2, x3)]ᵀ
• Thinking in terms of a vector of functions can make the representation less cumbersome and the computations more efficient
• Then the Jacobian is J(f) = ∂f/∂x = [[∂f1/∂x1, ∂f1/∂x2, ∂f1/∂x3], [∂f2/∂x1, ∂f2/∂x2, ∂f2/∂x3]]

Jacobian of each layer
• Compute the derivatives of a higher layer’s output with respect to those of the lower layer
• What if we scale all the weights by a factor R?
• What happens a few layers down?
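A minimal NumPy sketch of backpropagation through the single-hidden-layer graph shown above (x → W1, b1 → ReLU → W2, b2 → SoftMax → cross-entropy). It uses the standard shortcut that the gradient of softmax + cross-entropy with respect to Z2 is A2 − target; the sizes and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 4, 5, 3
x = rng.standard_normal(d)
target = np.array([0.0, 1.0, 0.0])                  # one-hot label
W1, b1 = rng.standard_normal((h, d)) * 0.1, np.zeros(h)
W2, b2 = rng.standard_normal((n, h)) * 0.1, np.zeros(n)

# Forward pass (save intermediate activations for the backward pass)
Z1 = W1 @ x + b1
A1 = np.maximum(0, Z1)                              # ReLU
Z2 = W2 @ A1 + b2
A2 = np.exp(Z2 - Z2.max()); A2 /= A2.sum()          # softmax
loss = -np.sum(target * np.log(A2))                 # cross entropy

# Backward pass: chain rule, reusing the gradient w.r.t. each layer's output
dZ2 = A2 - target                                   # d(loss)/dZ2 for softmax + cross entropy
dW2, db2 = np.outer(dZ2, A1), dZ2
dA1 = W2.T @ dZ2                                    # gradient passed down to the hidden layer
dZ1 = dA1 * (Z1 > 0)                                # through the ReLU
dW1, db1 = np.outer(dZ1, x), dZ1

print(loss, dW1.shape, dW2.shape)                   # gradients ready for a descent step
```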
Role of step size and learning rate
• A tale of two loss functions with
  • the same value, and
  • the same gradient (first derivative), but
  • different Hessian (second derivative)
• Different step sizes needed
• Success not guaranteed
• The step size is decided by the learning rate η and the gradient

The perfect step size is impossible to guess
• Goldilocks finds the perfect balance only in a fairy tale

Double derivative
• E.g. f(x) = ax² + bx + c, f′(x) = 2ax + b, f″(x) = 2a
• The double derivative f″(x) = d²f(x)/dx² is the derivative of the derivative of f(x)
• The double derivative is positive for convex functions (which have a single minimum), and negative for concave functions (which have a single maximum)
Double derivative
• f(x) = ax² + bx + c, f′(x) = 2ax + b, f″(x) = 2a
• The double derivative tells how far the minimum might be from a given point
• From x = 0 the minimum is closer for the red dashed curve than for the blue solid curve, because the former has a larger second derivative (its slope reverses faster)
Original image source unknown

Perfect step size for a paraboloid
• Let f(x) = ax² + bx + c, assuming a > 0
• The minimum is at x* = −b / (2a)
• For any x, the perfect step would be x − x* = x + b/(2a) = (2ax + b) / (2a) = f′(x) / f″(x)
• So, the perfect learning rate is η* = 1 / f″(x)
• In multiple dimensions, x ← x − H(f(x))⁻¹ ∇f(x)  (see the sketch below)
• Practically, we do not want to compute the inverse of a Hessian matrix, so we approximate the Hessian inverse

Hessian of a function of a vector
• The double derivative with respect to each pair of dimensions forms the Hessian matrix: H(f) = [∂²f / ∂x_i ∂x_j]
• If all eigenvalues of a Hessian matrix are positive, then the function is convex
[Figure: a convex surface f(x1, x2)]
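A small NumPy sketch of the “perfect step” idea above on a quadratic bowl: one Newton step x ← x − H⁻¹ ∇f(x) lands exactly at the minimum, while a plain gradient step with a guessed learning rate does not. The quadratic used is f(x1, x2) = 5x1² + 3x2² + 4x1x2 from the next slide, whose Hessian is [[10, 4], [4, 6]].

```python
import numpy as np

# f(x1, x2) = 5*x1^2 + 3*x2^2 + 4*x1*x2, minimum at (0, 0)
grad = lambda x: np.array([10 * x[0] + 4 * x[1], 6 * x[1] + 4 * x[0]])
H = np.array([[10.0, 4.0], [4.0, 6.0]])            # constant Hessian of a quadratic

x0 = np.array([2.0, 1.0])

newton = x0 - np.linalg.solve(H, grad(x0))         # x <- x - H^{-1} grad: exact for a quadratic
gd = x0 - 0.05 * grad(x0)                          # gradient step with a guessed learning rate

print(newton)   # [0. 0.]  (reaches the minimum in one step)
print(gd)       # still away from the minimum
```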

Example of Hessian
• Let f(x) = f(x1, x2) = 5x1² + 3x2² + 4x1x2
• Then ∇f(x) = ∇f(x1, x2) = [∂f/∂x1, ∂f/∂x2]ᵀ = [10x1 + 4x2, 6x2 + 4x1]ᵀ
• And H(f(x)) = [[∂²f/∂x1², ∂²f/∂x1∂x2], [∂²f/∂x2∂x1, ∂²f/∂x2²]] = [[10, 4], [4, 6]]
Image source: Wikipedia

Saddle points, Hessian, and long local furrows
• Some variables may have reached a local minimum while others have not
• Some weights may have almost zero gradient
• At least some eigenvalues may not be negative
Original image source unknown

Complicated loss functions
[Figure: a realistic picture of a loss landscape, with a global minimum (?), saddle points, local minima, and local maxima]
Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/
Tentative list of topics
• Neural architectures - I
  • Vision
  • Audio
  • NLP
  • Graphs
• Training methods - I
  • Convex optimization
  • Layers to help training
  • LR scheduling
• Training methods - II
  • Lack of labels
  • Lack of training samples
  • Generative
  • Multi-modal
  • Pre-training and fine-tuning
  • Self-supervised learning
  • Semi and weak supervision
  • Robustness

Next week
• Neural architectures for vision
  • LeNet
  • ResNet
  • UNet
  • YOLO
• Cover later
  • ViT
  • Swin Transformer
