Advanced ML Slides Intro

Advanced ML course EE 782 by Amit Sethi, IIT Bombay, Autumn 2025. Covers convolutional networks, NLP, LSTMs, attention, GloVe, etc.


Introduction to Advanced Topics in Machine Learning
EE 782 Advanced Topics in Machine Learning
July-Nov 2025
Amit Sethi, EE, IITB
Contact: asethi@iitb, 3528, 7483

Why this course
• Rapid expansion of data and compute
  • Memory and data transfer have become cheaper
  • Compute has become cheaper
  • Multi-modal data being collected and shared
• Sudden explosion in the use of ML
  • Face recognition
  • Autonomous driving
  • Text and image generation
• Rapid pace of research and development

What is the relation between AI, ML, DL etc.?
• Nested fields: Artificial intelligence ⊃ Machine learning ⊃ Neural networks ⊃ Deep learning

AML will be synonymous with deep learning
• Advanced machine learning is very broad
  • Deep learning
  • Probabilistic methods
  • Theoretical ML
  • Reinforcement learning
  • ...
• Most practical and exciting applications are coming from DL
• More than enough to be covered in deep learning alone

Learning outcomes
• Formulate advanced machine learning problems
  • Inputs, outputs, labels, annotations, data quantity, data quality, prior knowledge
• Analyze and propose neural architectures
  • How are data dimensions related
  • Changes in architectures to improve outcomes
• Analyze and formulate training methods
  • Amount and quality of data
  • Using prior knowledge

What will be covered
• ML problems
  • Lack of labels
  • Mislabeled data
  • Use of prior knowledge
• Neural network architectures
  • For different types of input data
  • For different formats of output
  • For different levels of compute
• Training methods
  • Loss functions
  • Speed of convergence
  • Pre-training and fine-tuning
  • Robustness

Course material
• Textbooks:
  • “Dive into Deep Learning” by Aston Zhang, Zachary C. Lipton, Mu Li, and Alex Smola
  • “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
• Papers:
  • Check out the reading list on Moodle

Pre-requisites
• Introduction to ML
  • Supervised learning: regression, classification
  • Unsupervised learning: clustering, dimension reduction
• Linear algebra
  • Matrix and vector arithmetic
  • Subspaces, eigen decomposition, etc.
• Probability
  • Joint distributions: marginals, conditionals
  • KL divergence, etc.

Evaluation structure
• Daily notes upload (20 x 0.5 = 10 marks)
• Programming assignments (8 x 3 = 24 marks)
• Mid-sem exam (20 marks)
• End-sem exam (30 marks)
• Project proposal (2 marks)
• Project (14 marks)
  • (IEEE style self-published paper on ArXiv, GitHub repo, video demo)
• Absolute grading (90+ AA, 80+ BB, etc.)
• Audit: 30+
Agenda
• Overview of machine learning
• Revision of neural networks
• Tentative list of topics in AML

ML is…
• The practice of automating the use of related data to estimate models that make useful predictions about new data, where the model is too complex for standard statistical analysis, e.g.
  • Improve accuracy of classification of images using labeled images
  • Improve win percentage in AlphaGo using several simulated game move sequences and their results
  • Improve the Turing-test confusion between human and machine for NLP Q&A using a large sample of text including Q&A

When not to use ML
• Possible inputs are countable and few
  • Use look-up tables
• Algorithm is well-known and efficient
  • E.g. sorting, Dijkstra's shortest path
• Model is well-known and tractable
  • Use statistical estimation
• There is no notion of contiguity
  • Use discrete variable methods or give up
• Lack of data
  • Use transfer learning or few-shot learning, or give up

When to use ML
• Possible inputs are many or continuous
• No well-known or efficient algorithm
• Model is not well-known or tractable
• Strong notion of contiguity
• Good amount of data
• Desired output known
• Well-defined inputs

Sweet spot for ML
• Lots of structured data
• Explainability is not critical
• Prediction accuracy is the primary goal
• Underlying model is complex but stationary

ML model training and deployment
[Figure: training on past data, followed by prediction on future data]

Type of ML problems
● Supervised learning: uses labeled data
  ○ Classification: labels are discrete
  ○ Regression: labels are continuous
  ○ Ranking: labels are ordinal
● Unsupervised learning: uses unlabeled data
  ○ Clustering: divide data into discrete groups
  ○ Dimension reduction: represent data with fewer numbers
● Somewhere in between: fewer labels than one per example
  ○ Semi-supervised learning: some examples are labeled
  ○ Weakly supervised learning: groups of examples are labeled
  ○ Reinforcement learning: label (reward) is available after a sequence of steps

Supervised Learning
● Predictor variables/features and a target variable (label)
● Aim: predict the target variable (label), given the predictor variables
  ○ Classification: target variable (y) consists of categories
  ○ Regression: target variable (label) is continuous

Broad types of ML problems
• Output type: Categorical | Ordinal | Continuous
• Supervised: Classification | Ranking | Regression
• (Examples): {Cats, dogs} | {Low, Med, High} | [-20, +10)
• Unsupervised: Clustering, dimension reduction
Recipe for ML training
• Decide on the type of the ML problem
• Prepare data
• Shortlist ML frameworks
• Prepare training, validation, and test sets
• Train, validate, repeat
• Use test data only once

Some popular ML frameworks (by data type and problem)
• Vector data
  • Classification: Logistic regression, SVM, RF, NN
  • Regression: Linear regression (and regularization)
  • Clustering: K-means, Fuzzy C-means, DB-SCAN
  • Dimension reduction: PCA, k-PCA, LLE, ISOMAP
• Series, text: RNN, LSTM, Transformer, 1-D CNN, HMM
• Images: 2-D CNN, MRF
• Video, MRI: 3-D CNN, CNN+LSTM, MRF

ML gives a model (Supervised Machine Learning System - Training)
• Elements of a model:
  • Input xi
  • Function fθ(xi), with hyper-parameters and parameters θ
  • Output fθ(xi)
  • Target output ti
  • Loss function and learning algorithm
• Utility of the model:
  • Bring fθ(xi) close to ti
  • Minimize loss L(ti, fθ(xi), θ)

Components of a Trained ML System (Supervised Machine Learning System - Testing)
• Input xi → Model (hyper-parameters, parameters θ) → Output fθ(xi)

Mathematically speaking…
§ Determine f such that ti = f(xi) and g(T, X) is minimized for an unseen set of T and X pairs, where T is the ground truth that cannot be used
§ The form of f is fixed, but some parameters can be tuned
§ So, y = fθ(x), where x is observed, and y needs to be inferred
  § E.g. y = 1 if mx > c, and y = 0 otherwise, so θ = (m, c)
§ Machine learning is concerned with designing algorithms that learn “better” values of θ given “more” x (and t) for a given problem

Preparing data
• Remove useless data
  • No variance
  • Falsely assumed to be available
• Handle missing data
  • Impute, if sporadic
  • Drop, if too frequent
• Reduce redundancy
  • Correlated (Pearson and Spearman)
• Transform variables (see the sketch below)
  • Convert discrete variables to one-hot bits
  • Normalize continuous variables
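As a concrete illustration of the last two bullets, here is a minimal NumPy sketch of one-hot encoding a discrete variable and normalizing a continuous one; the column names and values are made up for illustration.

```python
import numpy as np

# Toy columns: one discrete variable (city id) and one continuous variable (income)
city = np.array([0, 2, 1, 0])                  # discrete variable with 3 categories
income = np.array([30e3, 75e3, 52e3, 41e3])    # continuous variable

# Convert the discrete variable to one-hot bits
one_hot = np.eye(3)[city]                      # shape (4, 3)

# Normalize the continuous variable to zero mean and unit variance
income_norm = (income - income.mean()) / income.std()

# Final feature matrix: one-hot columns followed by the normalized column
X = np.hstack([one_hot, income_norm[:, None]])
print(X)
```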

Models must exploit structure of data
• Records, e.g.:
    Product SKU   Price   Margin   Volume
    A123ajkhdf    $120    30%      1,000,000
    B456ddsjh     $200    10%      2,000,000
• Temporal order
• Spatial order
• Web of relationships
Images courtesy: Pixabay.com

Loss and accuracy
• Training accuracy saturates to a maximum
• Training loss saturates to a minimum
• Loss is a measure of error

Loss versus performance metric
● Loss is a convenient expression used for guiding the learning (optimization)
● Loss is related to the performance metric, but it is not the same
● Loss also includes regularization
● The performance metric is what is used to judge the model
● The performance metric makes sense only on held-out (validation or test) data
Loss function tells how bad the model is
• Loss trends opposite of accuracy
  • Loss is low when accuracy is high
  • Loss is zero for perfect accuracy (by convention)
  • Loss is high when accuracy is low
• Loss is a function of the actual and desired output
• Minimizing the loss function with respect to the parameters leads to good parameters
• Note: low loss on training does not guarantee low loss on validation or testing

Properties of a good loss function
• Minimum value for perfect accuracy
  • Usually zero
• Varies smoothly with input
• Varies smoothly with parameters
• Good to be convex in parameters (but is usually not)
  • Like a paraboloid

Convex vs. non-convex loss
[Figure: a convex (bowl-shaped) loss surface versus a non-convex loss surface]

Non-convex loss can have multiple minima
[Figure: error versus parameter curve with several local minima; original image source unknown]

Examples of loss functions
• Regression with continuous output
  • Mean square error (MSE), log MSE, mean absolute error
• Classification with probabilistic output
  • Cross entropy (negative log likelihood), hinge loss
• Similarity between vectors or clustering
  • Euclidean distance, cosine

MSE loss for regression
• Model: f(x) = wx + b
• Error: y_i − f(x_i)
• Square error: (y_i − f(x_i))²
• MSE: (1/N) Σ_{i=1}^{N} (y_i − f(x_i))²
[Figure: fitted line through (x, y) points; original image source unknown]

Is MSE always appropriate?
[Figure: a regression fit being pulled away from most points by an outlier]

MAE loss is less affected by outliers than MSE
• Error: y_i − f(x_i)
• Absolute error: |y_i − f(x_i)|
• MAE: (1/N) Σ_{i=1}^{N} |y_i − f(x_i)|

Is MSE appropriate for classification?
• Model: f(x_i) = 1 / (1 + e^−(w x_i + b)), which squashes the output between 0 and 1
[Figures: MSE fit versus MAE fit on data with an outlier, and a sigmoid curve fit to binary labels; original image source unknown]
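A small NumPy sketch of the MSE and MAE formulas above (toy numbers made up for illustration), showing how a single outlier inflates MSE much more than MAE:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 15.0])   # last prediction is an outlier

mse = np.mean((y_true - y_pred) ** 2)    # (1/N) * sum of squared errors
mae = np.mean(np.abs(y_true - y_pred))   # (1/N) * sum of absolute errors

print(f"MSE = {mse:.2f}, MAE = {mae:.2f}")   # MSE is dominated by the outlier
```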


Cross entropy loss is preferred for classification
• How much does one (estimated) probability distribution q(x) deviate from another (real) p(x)?
• KL-divergence of q(x) from p(x)
• For binary classification: −{y log f(x) + (1 − y) log(1 − f(x))} (see the sketch below)
[Figure: loss as a function of f(x) ∈ [0, 1] when y = 1; original image source unknown]

Overfitting and underfitting
• Compare training and validation loss
[Figure: underfitting versus overfitting]

Regularization is a key concept in ML
● Regularization means constraining the model
● More constraints may reduce model fit on training data
● However, it may improve fit on validation and test data
● Training performance of more constrained models is more likely to reflect test performance
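Returning to the binary cross-entropy formula above, here is a minimal NumPy sketch, assuming f(x) already outputs a probability; the clipping constant is only there to avoid log(0), and the labels and probabilities are made up.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """-{y log f(x) + (1 - y) log(1 - f(x))}, averaged over samples."""
    p = np.clip(p, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])            # true labels
p = np.array([0.9, 0.2, 0.6, 0.99])   # predicted probabilities f(x)
print(binary_cross_entropy(y, p))
```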

Parameters and Hyperparameters
● Parameters: variables whose values are updated during the training process of the model
  ○ Feature coefficients in a regression model
  ○ Weights of a neural network
● Hyperparameters: variables whose values are fixed by the model developer before the learning process begins
  ○ Number of variables in a tree node
  ○ Height of a tree
  ○ Number of layers of a neural network
Image source: Wikipedia

Examples of hyper-parameters and parameters
• f(x) = wᵀx + b
  • No hyperparameter
  • Parameters are w and b
• f(x) = w2 x² + w1 x + w0
  • Hyper-parameter is the degree, 2
  • Parameters are w2, w1, and w0

Under-constrained models lead to overfitting
• An n-degree polynomial can fit n points perfectly
• But, is it overfitting?
• Is it being swayed by outliers?
• “Models should be as simple as possible, but not simplistic”
• To make the model simpler:
  • Restrict the number of parameters, or
  • Restrict the set of values that they can take
• Always check validation performance
Original image source unknown

Regularization is constraining a model
• How to regularize?
  • Reduce the number of parameters
  • Share weights in structure
  • Constrain parameters to be small
  • Encourage sparsity of output in loss
• Most commonly Tikhonov (or L2, or ridge) regularization (a.k.a. weight decay)
  • Penalty on sums of squares of individual weights

L2-regularization visualized
• J = (1/N) Σ_{i=1}^{N} (y_i − f(x_i))² + (λ/2) Σ_{j=1}^{n} w_j²,  where f(x_i) = Σ_{j=0}^{n} w_j x_i^j
Original image source unknown
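A hedged NumPy sketch of the objective J above: ridge-regularized polynomial regression solved in closed form. The degree, λ, and toy data are arbitrary illustrative choices, and the penalty here is applied to all coefficients (including the constant term) for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 12)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(x.size)   # noisy toy targets

degree, lam = 9, 1e-2
X = np.vander(x, degree + 1, increasing=True)   # columns are x^0, x^1, ..., x^degree

# Minimize J(w) = (1/N)||y - Xw||^2 + (lam/2)||w||^2  (closed-form ridge solution)
N = x.size
w = np.linalg.solve(X.T @ X / N + (lam / 2) * np.eye(degree + 1), X.T @ y / N)

print(np.round(w, 3))   # a larger lam shrinks the high-degree coefficients toward zero
```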
Other forms of regularization
• Convolutional filter structure in CNN neurons
• Max-pooling
• Dropout
• L1-regularization (sparsity-inducing norm)
  • Penalty on sums of absolute values of weights
Original image source unknown

Preparing data for training and validation
• Data splits:
  • Training → used to optimize the parameters (e.g. random 70%)
  • Validation → used to compare models (e.g. random 15%)
  • Testing → one final check after multiple rounds of validation (e.g. random 15%)
• Cross-validation:
  • K folds: one fold for validation, K−1 folds for training
  • Rotate folds K times
  • Select the framework (hyperparameters) with the best average performance
  • Re-train the best framework on the entire data
  • Test one final time on held-out data that was not a part of any fold

Cross-validation
● Model performance measurement is dependent on the way the data is split
● A single split may not be representative of the model’s ability to generalize
● Solution: cross-validation, especially when data is scarce (see the sketch below)
● Con: more computations
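A minimal NumPy sketch of the K-fold recipe above (rotate each fold into the validation role and average the validation score); the model is a plain least-squares fit and the data is synthetic, just to keep the example self-contained.

```python
import numpy as np

def k_fold_scores(X, y, K=5, seed=0):
    """Rotate each fold as the validation set; return per-fold validation MSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    scores = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)   # simple linear model
        scores.append(np.mean((X[val] @ w - y[val]) ** 2))        # validation MSE
    return np.array(scores)

rng = np.random.default_rng(1)
X = np.hstack([rng.standard_normal((100, 3)), np.ones((100, 1))])   # features + bias column
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(100)
print(k_fold_scores(X, y).mean())   # average validation MSE across the folds
```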

ML can fail to perform in deployment
• Lack of training diversity: data had limited confounders
  • Single speaker, author, camera, background, accent, ethnicity, etc.
• Data imbalance between high-value rare and more common examples
• Proxy label leak during training:
  • E.g. only speakers A and B provide the emotion “anger,” so ML confuses their voice characteristics with “anger”
• Too much manual cleansing of training data
• Too little training data, and very complex models
• Concept drift: the assumptions behind training are no longer valid

ML life stages
[Figure: ML life stages]

Try to ask critical questions in response to the following statements
• “We need to store every keystroke and mouse movement during use of company computers so that we can run ML on it later.”
• “Our product can give biometric access based on face recognition. No need for cumbersome finger, ID, or iris scans.”
• “We can detect pneumonia in chest x-rays with 99% accuracy.”
• “We can recognize people’s emotions with 91% accuracy from their faces as they watch videos on our website.”
• “Our video call plug-in can tell when people are lying.”
• “Our autonomous vehicle is 20 times safer than an average driver.”
Relation of ML to other fields
[Figure: Machine Learning in relation to Probability and Statistics, Optimization, Linear Algebra, Programming, and Data Science; the part that most ML courses cover]
Image courtesy: Pixabay.com

Overlooked questions
• Problem
  • Right business or societal need
  • Realistic; not too optimistic
• Data and provenance
  • Relevant and enough
  • Diverse and representative
  • Ethically gathered and stored
  • Meticulously recorded
• Model
  • Exploits structure in data
  • Sufficiently complex
  • Not too complex
  • Meets deployment constraints
• Validation
  • Covers diversity of use cases
  • Meaningful performance metrics
• User guidance
  • Declaration of intended use case
  • Description of data used
• Monitoring in-field performance
  • Data differences, concept drift
  • Unethical and unauthorized use

Is AI / ML just a fad?
• Other hot buzzwords for VCs, businesses, and colleges over the last 30 years:
  • Programming, data structures, databases, IT
  • Computer networks, wireless communication
  • Web 2.0
  • Nano technology
  • Crypto
• Some of these have deeply affected our economies and society, and have reached maturity from the PoV of ongoing innovation
• Others have made a limited impact as challenges to their promises became better understood
Technology life cycle has to be understood critically
[Figure: value versus time curves showing innovation, maturity, decline or commodification, and a new wave of innovation]
Image courtesy: Pixabay.com

ML is being deeply embedded in our economy and society
• Automated pattern recognition is already here
  • Images and videos: find people, objects, diseases, …
  • Voice: convert to text, …
  • Text: queries, chat bots, translation, …
  • Time series: stocks, power consumption, …
• Automated decision-making is coming
  • Economic decisions: customer targeting, credit approval, …
  • Autonomous machines: driverless cars, drones, robots, …
  • Critical decisions: medical diagnosis, criminal forensics, …

Waves of AI
• Discriminative AI
  • CNN, UNet, YOLO
  • LSTM, GRU
  • Transformer
• Generative AI
  • GAN, VAE
  • Diffusion
  • State space models
• Agentic AI
  • Multi-objective
  • Multi-modal input
  • Goal oriented

Discriminative AI distinguishes between inputs
• Vision
  • Image classification: Does the image have a cat or a dog?
  • Image segmentation: Which pixels represent a cat?
  • Object detection: Put a box on the cat.
• NLP
  • Text classification: Is this a positive or a negative review?
  • Text annotation: Where is the positive clause in this paragraph?
• Audio
• Video

Where would basic ML struggle? Large number of dimensions on a grid
• What features should we extract from an image?
• Should pixels be features?
• Is Euclidean distance a good metric?
Image source: Wikipedia

Where would basic ML struggle? Learning from similar domains
• If we have lots of labeled digits
• But only a few labeled examples of an ancient script
• Can we transfer learning?
Image source: Wikipedia

Where would basic ML struggle? Scaling feature extraction with input
• More samples
• Higher dimensional samples
Image source: Wikipedia

Where would basic ML struggle? Depth vs. width
[Figure: a feed-forward network with inputs x1 … xd, hidden units h11 … h1n1, and outputs y1 … yn]

Where would basic ML struggle? Data dimensions form a graph
• What if variables are related as graphs?
• What if different variables are missing in different samples?


Where would basic ML struggle? How to use unlabeled data
[Figure]

Where would basic ML struggle? What if labels are given to bags of samples
[Figure]

Advances in AI - image recognition
[Figure]
Source: devopedia.org/imagenet

Where would basic ML struggle? How to generate more samples
[Figure: discriminative versus generative models]

How does generative AI work?
We pit “counterfeiter” and “police” models to compete with each other (see the sketch below)
• The counterfeiter generates “artificial” output
• The police learns to distinguish between “artificial” and “natural”
• The goal of the counterfeiter is to generate more believable “natural” outputs
• The goal of the police is to get better at spotting minute differences

Generative AI - “Gandhi in Monet style”
[Figure]
Source: openai.com/product/dall-e-2
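The counterfeiter/police description above is the idea behind a GAN. Here is a heavily simplified PyTorch sketch of that two-player loop on 1-D toy data; the network sizes, learning rates, and data distribution are arbitrary illustrative choices, not the course's reference implementation.

```python
import torch
from torch import nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))   # counterfeiter (generator)
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))   # police (discriminator, logit output)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0   # "natural" samples from a toy distribution
    fake = G(torch.randn(64, 8))            # "artificial" samples made from noise

    # Police: label real as 1 and fake as 0, and get better at telling them apart
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Counterfeiter: try to make the police call its output "natural" (label 1)
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```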

Generative AI created realistic output
• Vision:
  • Image enhancement
  • Super resolution
  • Image inpainting
• Text:
  • Q&A, chatbots, RAG
  • Article and creative writing
• Multi-modal:
  • Vision-language models
  • Creating images in certain styles, image editing

“Advantages of brushing teeth at night instead of the morning” (ChatGPT)
Brushing teeth at night has several advantages over brushing in the morning:
• Removes plaque and bacteria: Brushing at night removes plaque and bacteria that have accumulated throughout the day, preventing them from causing damage to your teeth and gums overnight.
• Prevents bad breath: Bad breath in the morning is often caused by bacteria that have been allowed to multiply in the mouth overnight. Brushing at night helps to remove these bacteria, preventing bad breath in the morning.
• Reduces the risk of cavities: When you brush at night, you remove food particles and sugar from your teeth, reducing the risk of cavities. If these particles are left on your teeth overnight, they can cause tooth decay.
• Promotes better sleep: Brushing your teeth before bed can promote a sense of cleanliness and relaxation, helping you to fall asleep more easily.
• Gives fluoride more time to work: If you use fluoride toothpaste, brushing at night gives the fluoride more time to work on your teeth, protecting them against decay.
Overall, brushing your teeth at night is an essential part of maintaining good oral hygiene and preventing dental problems.
Source: chat.openai.com

“Literature review on domain adaptation in machine learning”
[Figure: the same query answered by ChatGPT and by Google Scholar]
“How to show risk table but change the font of axis titles of risk tables in R ggsurvplot”
[Figure: ChatGPT answering a coding question]
Source: chat.openai.com

“The age of average” (by Alex Murrell)
“So, this is your call to arms. Whether you’re in film or fashion, media or marketing, architecture, automotive or advertising, it doesn’t matter. Our visual culture is flatlining and the only cure is creativity.
It’s time to cast aside conformity. It’s time to exorcise the expected. It’s time to decline the indistinguishable.
For years the world has been moving in the same stylistic direction. And it’s time we reintroduced some originality.”

Agentic AI will orchestrate various tools for specific purposes
• Agentic AI can make decisions, take autonomous actions, and continually learn from interactions to achieve specific objectives
• Characteristics
  • Autonomy
  • Goal-oriented
  • Adaptability
  • Proactive

Example utilization of different AI components
[Figure: Sensors → Discriminative AI → Agentic AI → Generative AI → Actuators]

Other predictions for AI in 2025
• Generative AI will plateau and hype will die down
• Realism will set in, but more use cases will be discovered
• Businesses will become savvy about evaluating AI
• Agentic AI will become the buzzword
• Human feedback will make a comeback
• Debate about ethics and regulations will pick up pace

Other considerations
• Cost: dedicated hardware or cloud services, and manpower
• Power and impact on the environment
• Concentration of AI talent and innovations
• Conformity
• Loss of nuance
• Tailoring to in-house data and knowledge

Agenda
• Overview of machine learning
• Revision of neural networks
• Tentative list of topics in AML

Activation function is the secret sauce of neural networks
• Neural network training is all about tuning weights and biases
[Figure: a single neuron with inputs x1, x2, x3, weights w1, w2, w3, bias b, a summation Σ, and an activation g]
• If there were no activation function g, the output of the entire neural network would be a linear function of the inputs
• The earliest models used a step function

Types of activation functions
• Step: the original concept behind classification and region bifurcation. Not used anymore
• Sigmoid and tanh: trainable approximations of the step function
• ReLU: currently preferred due to fast convergence
• Softmax: currently preferred for the output of a classification net. A generalized sigmoid
• Linear: good for modeling a range in the output of a regression net
Formulas for activation functions (see the sketch below)
• Step: g(x) = (sign(x) + 1) / 2
• Sigmoid: g(x) = 1 / (1 + e^−x)
• Tanh: g(x) = tanh(x)
• ReLU: g(x) = max(0, x)
• Softmax: g(x_j) = e^{x_j} / Σ_i e^{x_i}
• Linear: g(x) = x

Step function divides the input space into two halves: 0 and 1
• In a single neuron, the step function is a linear binary classifier
• The weights and biases determine where the step will be in n dimensions
• But, as we shall see later, it gives little information about how to change the weights if we make a mistake
• So, we need a smoother version of a step function
• Enter: the sigmoid function

The sigmoid function is a smoother step function
• Smoothness ensures that there is more information about the direction in which to change the weights if there are errors
• The sigmoid function is also mathematically linked to logistic regression, which is a theoretically well-backed linear classifier
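A direct NumPy transcription of the activation formulas above (a minimal sketch; the softmax subtracts the maximum only for numerical stability, which does not change its value):

```python
import numpy as np

def step(x):    return (np.sign(x) + 1) / 2
def sigmoid(x): return 1 / (1 + np.exp(-x))
def tanh(x):    return np.tanh(x)
def relu(x):    return np.maximum(0, x)
def linear(x):  return x

def softmax(x):
    e = np.exp(x - np.max(x))   # shift by the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(step(z), sigmoid(z), relu(z), softmax(z), sep="\n")
```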

The problem with sigmoid is (near) zero gradient on both extremes
• For both large positive and large negative input values, the sigmoid output doesn’t change much with a change of input
• ReLU has a constant gradient for almost half of the inputs
• But, ReLU cannot give a meaningful final output

Output activation functions can only be of the following kinds
• Sigmoid gives binary classification output
• Tanh can also do that, provided the desired output is in {−1, +1}
• Softmax generalizes sigmoid to n-ary classification
• Linear is used for regression
• ReLU is only used in internal (non-output) nodes

Basic structure of a neural network
[Figure: layers of nodes, from inputs x1 … xd through hidden units h11 … h1n1 to outputs y1 … yn]
• It is feed forward
  • Connections go from inputs towards outputs
  • No connection comes backwards
• It consists of layers
  • The current layer’s input is the previous layer’s output
  • No lateral (intra-layer) connections
• That’s it!

Basic structure of a neural network
• Input layer
  • Represents the dimensions of the input vector (one node for each dimension)
  • These usually form an input layer, and usually there is only one such layer
• Hidden layer(s)
  • Represent the intermediary nodes that divide the input space into regions with (soft) boundaries
  • Usually, there is only one such layer
  • Given enough hidden nodes, we can model an arbitrary input-output relation
• Output layer
  • Represents the output of the neural network
  • For a two-class problem or regression with a 1-d output, we need only one output node
  • Usually, there is only one such layer

Importance of hidden layers
[Figure: a single sigmoid neuron separates + and − regions with one boundary; sigmoid hidden layers carve out more complex regions]
• The first hidden layer extracts features
• The second hidden layer extracts features of features
• …
• The output layer and sigmoid give the desired output

Overall function of a neural network
• f(x) = g_L(W_L · g_{L−1}(W_{L−1} · … g_1(W_1 · x) …))
• Weights form a matrix
• The output of the previous layer forms a vector
• The activation (nonlinear) function is applied point-wise to the weight matrix times the input
• Design questions (hyper-parameters):
  • Number of layers
  • Number of neurons in each layer (rows of the weight matrices)
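A minimal NumPy sketch of the formula above for one hidden layer, f(x) = g2(W2 · g1(W1 x + b1) + b2), with randomly initialized weights; the layer sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 4, 8, 3                        # input dim, hidden width, number of outputs
W1, b1 = rng.standard_normal((h, d)) * 0.1, np.zeros(h)
W2, b2 = rng.standard_normal((n, h)) * 0.1, np.zeros(n)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x):
    """f(x) = g2(W2 @ g1(W1 @ x + b1) + b2): matrices times vectors, point-wise activations."""
    a1 = relu(W1 @ x + b1)         # hidden layer
    return softmax(W2 @ a1 + b2)   # output layer

x = rng.standard_normal(d)
print(forward(x))                  # a probability vector over n classes
```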
Training the neural network
• Given x_i and y_i
• Think of what hyper-parameters and neural network design might work
• Form a neural network: f(x_i) = g_L(W_L · g_{L−1}(… g_1(W_1 · x_i) …))
• Compute f_w(x_i) as an estimate of y_i for all samples
• Compute loss: (1/N) Σ_{i=1}^{N} L(f_w(x_i), y_i) = (1/N) Σ_{i=1}^{N} l_i(w)
• Tweak w to reduce the loss (optimization algorithm)
• Repeat the last three steps

Loss function choice
• There are positive and negative errors, for which MSE is the most common loss function
• In classification there is a probability of the correct class, for which cross entropy is the most common loss function
[Figures: loss versus error for MSE and cross entropy]

Some loss functions and their derivatives
• Terminology
  • y is the output
  • t is the target output
• Mean square error
  • Loss: (y − t)²
  • Derivative of the loss: 2(y − t)
• Cross entropy
  • Loss: −Σ_{c=1}^{C} t_c log y_c
  • Derivative of the loss: −t_c / y_c

Computational graph of a single hidden layer NN
[Figure: x →(·W1, +b1)→ Z1 → ReLU → A1 →(·W2, +b2)→ Z2 → SoftMax → A2 → cross-entropy loss against the target]

Overall function of a neural network (recap)
• f(x) = g_L(W_L · g_{L−1}(… g_1(W_1 · x) …))
• Weights form a matrix; the output of the previous layer forms a vector
• The activation (nonlinear) function is applied point-wise to the weight matrix times the input
• Design questions (hyper-parameters): number of layers, and number of neurons in each layer (rows of the weight matrices)

Training the neural network (recap)
• Given x_i and y_i, think of what hyper-parameters and neural network design might work
• Form a neural network: f(x_i) = g_L(W_L · g_{L−1}(… g_1(W_1 · x_i) …))
• Compute f_w(x_i) as an estimate of y_i for all samples
• Compute loss: (1/N) Σ_{i=1}^{N} L(f_w(x_i), y_i) = (1/N) Σ_{i=1}^{N} l_i(w)
• Tweak w to reduce the loss (optimization algorithm), and repeat the last three steps

Gradient ascent
• If you didn’t know the shape of a mountain
• But at every step you knew the slope
• Can you reach the top of the mountain?

Derivative of a function of a scalar
• E.g. f(x) = ax² + bx + c, f′(x) = 2ax + b, f″(x) = 2a
• The derivative f′(x) = d f(x)/dx is the rate of change of f(x) with x
• It is zero when the function is flat (horizontal), such as at the minimum or maximum of f(x)
• It is positive when f(x) is sloping up, and negative when f(x) is sloping down
• To move towards the maxima, take a small step in the direction of the derivative

Gradient descent minimizes the loss function
• At every point, compute
  • Loss (scalar): l_i(w)
  • Gradient of the loss with respect to the weights (vector): ∇_w l_i(w)
• Take a step towards the negative gradient (see the sketch below):
  • w ← w − η (1/N) Σ_{i=1}^{N} ∇_w l_i(w)
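A tiny NumPy sketch of the update rule w ← w − η ∇l(w), run on the scalar example f(x) = ax² + bx + c from this slide; a, b, c, the learning rate, and the starting point are arbitrary illustrative choices.

```python
import numpy as np

a, b, c = 2.0, -4.0, 1.0           # f(x) = 2x^2 - 4x + 1, minimum at x = 1
grad = lambda x: 2 * a * x + b     # f'(x) = 2ax + b

x, eta = 5.0, 0.1                  # starting point and learning rate
for _ in range(50):
    x = x - eta * grad(x)          # step towards the negative gradient
print(x)                           # converges close to -b / (2a) = 1.0
```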
Gradient of a function of a vector
• Derivative with respect to each dimension, holding the other dimensions constant
• ∇f(x) = ∇f(x1, x2) = [∂f/∂x1, ∂f/∂x2]ᵀ
• At a minimum or a maximum the gradient is a zero vector: the function is flat in every direction
[Figure: surface f(x1, x2); original image source unknown]

Gradient of a function of a vector
• The gradient gives a direction for moving towards the minima
• Take a small step towards the negative of the gradient
• At a minimum or a maximum the gradient is a zero vector
[Figure: surface f(x1, x2); original image source unknown]

Example of gradient
• Let f(x) = f(x1, x2) = 5x1² + 3x2²
• Then ∇f(x) = ∇f(x1, x2) = [∂f/∂x1, ∂f/∂x2]ᵀ = [10x1, 6x2]ᵀ
• At the location (2, 1), a step in the [20, 6]ᵀ direction, or the normalized [0.958, 0.287]ᵀ direction, will lead to the maximal increase in the function
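A quick numerical check of the worked example above: at (2, 1) the analytic gradient of f(x1, x2) = 5x1² + 3x2² is [20, 6], and the normalized steepest-ascent direction is approximately [0.958, 0.287].

```python
import numpy as np

f = lambda x: 5 * x[0] ** 2 + 3 * x[1] ** 2
grad = lambda x: np.array([10 * x[0], 6 * x[1]])   # analytic gradient

x0 = np.array([2.0, 1.0])
g = grad(x0)
print(g, g / np.linalg.norm(g))   # [20. 6.] and approximately [0.958 0.287]

# Finite-difference check of the same gradient
eps = 1e-6
num = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(2)])
print(num)   # also approximately [20. 6.]
```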

This story is unfolding in multiple dimensions
[Figure: a multi-layer network from inputs x1 … xd through hidden units h11 … h1n1 to outputs y1 … yn; original image source unknown]

Backpropagation
• Backpropagation is an efficient method to do gradient descent
• It saves the gradient w.r.t. the upper layer’s output to compute the gradient w.r.t. the weights immediately below
• It is linked to the chain rule of derivatives
• All intermediary functions must be differentiable, including the activation functions

Chain rule of differentiation
§ Very handy for complicated functions
§ Especially functions of functions
§ E.g. NN outputs are functions of previous layers
§ For example: let f(x) = g(h(x))
§ Let y = h(x), z = g(y) = g(h(x))
§ Then f′(x) = dz/dx = (dz/dy)(dy/dx) = g′(y) h′(x)
§ For example: d sin(x²)/dx = 2x cos(x²)

Backpropagation makes use of the chain rule of derivatives
• Chain rule: ∂f(g(x))/∂x = (∂f(g(x))/∂g(x)) · (∂g(x)/∂x)
[Figure: the single-hidden-layer computational graph x →(·W1, +b1)→ Z1 → ReLU → A1 →(·W2, +b2)→ Z2 → SoftMax → A2 → cross-entropy loss against the target]

Vector valued functions and Jacobians
• We often deal with functions that give multiple outputs
• Let f(x) = [f1(x), f2(x)]ᵀ = [f1(x1, x2, x3), f2(x1, x2, x3)]ᵀ
• Thinking in terms of a vector of functions can make the representation less cumbersome and the computations more efficient
• Then the Jacobian is J(f) = ∂f/∂x = [[∂f1/∂x1, ∂f1/∂x2, ∂f1/∂x3], [∂f2/∂x1, ∂f2/∂x2, ∂f2/∂x3]]

Jacobian of each layer
• Compute the derivatives of a higher layer’s output with respect to those of the lower layer
• What if we scale all the weights by a factor R?
• What happens a few layers down?
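A minimal NumPy sketch of backpropagation through the single-hidden-layer graph shown above (x → W1, b1 → ReLU → W2, b2 → SoftMax → cross-entropy). It uses the standard shortcut that the gradient of softmax + cross-entropy with respect to Z2 is A2 − target; the sizes and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n = 4, 5, 3
x = rng.standard_normal(d)
target = np.array([0.0, 1.0, 0.0])                  # one-hot label
W1, b1 = rng.standard_normal((h, d)) * 0.1, np.zeros(h)
W2, b2 = rng.standard_normal((n, h)) * 0.1, np.zeros(n)

# Forward pass (save intermediate activations for the backward pass)
Z1 = W1 @ x + b1
A1 = np.maximum(0, Z1)                              # ReLU
Z2 = W2 @ A1 + b2
A2 = np.exp(Z2 - Z2.max()); A2 /= A2.sum()          # softmax
loss = -np.sum(target * np.log(A2))                 # cross entropy

# Backward pass: chain rule, reusing the gradient w.r.t. each layer's output
dZ2 = A2 - target                                   # d(loss)/dZ2 for softmax + cross entropy
dW2, db2 = np.outer(dZ2, A1), dZ2
dA1 = W2.T @ dZ2                                    # gradient passed down to the hidden layer
dZ1 = dA1 * (Z1 > 0)                                # through the ReLU
dW1, db1 = np.outer(dZ1, x), dZ1

print(loss, dW1.shape, dW2.shape)                   # gradients ready for a descent step
```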
Role of step size and learning rate
• A tale of two loss functions with
  • the same value, and
  • the same gradient (first derivative), but
  • different Hessian (second derivative)
• Different step sizes needed
• Success not guaranteed
• The step size is decided by the learning rate η and the gradient

The perfect step size is impossible to guess
• Goldilocks finds the perfect balance only in a fairy tale

Double derivative
• E.g. f(x) = ax² + bx + c, f′(x) = 2ax + b, f″(x) = 2a
• The double derivative f″(x) = d²f(x)/dx² is the derivative of the derivative of f(x)
• The double derivative is positive for convex functions (which have a single minimum), and negative for concave functions (which have a single maximum)
Double derivative
• f(x) = ax² + bx + c, f′(x) = 2ax + b, f″(x) = 2a
• The double derivative tells how far the minimum might be from a given point
• From x = 0 the minimum is closer for the red dashed curve than for the blue solid curve, because the former has a larger second derivative (its slope reverses faster)
Original image source unknown

Perfect step size for a paraboloid
• Let f(x) = ax² + bx + c, assuming a > 0
• The minimum is at x* = −b / (2a)
• For any x, the perfect step would be x − x* = x + b/(2a) = (2ax + b) / (2a) = f′(x) / f″(x)
• So, the perfect learning rate is η* = 1 / f″(x)
• In multiple dimensions, x ← x − H(f(x))⁻¹ ∇f(x)  (see the sketch below)
• Practically, we do not want to compute the inverse of a Hessian matrix, so we approximate the Hessian inverse

Hessian of a function of a vector
• The double derivative with respect to each pair of dimensions forms the Hessian matrix: H(f) = [∂²f / ∂x_i ∂x_j]
• If all eigenvalues of a Hessian matrix are positive, then the function is convex
[Figure: a convex surface f(x1, x2)]
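A small NumPy sketch of the “perfect step” idea above on a quadratic bowl: one Newton step x ← x − H⁻¹ ∇f(x) lands exactly at the minimum, while a plain gradient step with a guessed learning rate does not. The quadratic used is f(x1, x2) = 5x1² + 3x2² + 4x1x2 from the next slide, whose Hessian is [[10, 4], [4, 6]].

```python
import numpy as np

# f(x1, x2) = 5*x1^2 + 3*x2^2 + 4*x1*x2, minimum at (0, 0)
grad = lambda x: np.array([10 * x[0] + 4 * x[1], 6 * x[1] + 4 * x[0]])
H = np.array([[10.0, 4.0], [4.0, 6.0]])            # constant Hessian of a quadratic

x0 = np.array([2.0, 1.0])

newton = x0 - np.linalg.solve(H, grad(x0))         # x <- x - H^{-1} grad: exact for a quadratic
gd = x0 - 0.05 * grad(x0)                          # gradient step with a guessed learning rate

print(newton)   # [0. 0.]  (reaches the minimum in one step)
print(gd)       # still away from the minimum
```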

Example of Hessian
• Let f(x) = f(x1, x2) = 5x1² + 3x2² + 4x1x2
• Then ∇f(x) = ∇f(x1, x2) = [∂f/∂x1, ∂f/∂x2]ᵀ = [10x1 + 4x2, 6x2 + 4x1]ᵀ
• And H(f(x)) = [[∂²f/∂x1², ∂²f/∂x1∂x2], [∂²f/∂x2∂x1, ∂²f/∂x2²]] = [[10, 4], [4, 6]]
Image source: Wikipedia

Saddle points, Hessian, and long local furrows
• Some variables may have reached a local minimum while others have not
• Some weights may have almost zero gradient
• At least some eigenvalues may not be negative
Original image source unknown

Complicated loss functions
[Figure: a realistic picture of a loss landscape, with a global minimum (?), saddle points, local minima, and local maxima]
Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/
Tentative list of topics
• Neural architectures - I
  • Vision
  • Audio
  • NLP
  • Graphs
• Training methods - I
  • Convex optimization
  • Layers to help training
  • LR scheduling
• Training methods - II
  • Lack of labels
  • Lack of training samples
  • Generative
  • Multi-modal
  • Pre-training and fine-tuning
  • Self-supervised learning
  • Semi and weak supervision
  • Robustness

Next week
• Neural architectures for vision
  • LeNet
  • ResNet
  • UNet
  • YOLO
• Cover later
  • ViT
  • Swin Transformer
