Advanced ML Slides Intro
AML will be synonymous with deep learning
• Advanced machine learning is very broad
  • Deep learning
  • Probabilistic methods
  • Theoretical ML
  • Reinforcement learning
  • ...
• Most practical and exciting applications are coming from DL
• There is more than enough to cover in deep learning alone

Learning outcomes
• Formulate advanced machine learning problems
  • Inputs, outputs, labels, annotations, data quantity, data quality, prior knowledge
• Analyze and propose neural architectures
  • How are data dimensions related
  • Changes in architectures to improve outcomes
• Analyze and formulate training methods
  • Amount and quality of data
  • Using prior knowledge

What will be covered
• ML problems
  • Lack of labels
  • Mislabeled data
  • Use of prior knowledge
  • Robustness
• Training methods
  • Loss functions
  • Speed of convergence
  • Pre-training and fine-tuning
• Neural network architectures
  • For different types of input data
  • For different formats of output
  • For different levels of compute
● Somewhere in between: fewer labels than one per example
  ○ Semi-supervised learning: some examples are labeled
  ○ Weakly supervised learning: groups of examples are labeled
  ○ Reinforcement learning: the label (reward) is available after a sequence of steps
Some popular ML frameworks
• Vector: supervised – regression, SVM, RF, NN; unsupervised clustering – K-means, Fuzzy C-means, DB-SCAN; dimension reduction – PCA, k-PCA, LLE, ISOMAP
• Series, text: RNN, LSTM, Transformer, 1-D CNN, HMM
• Images: 2-D CNN, MRF
• Video, MRI: 3-D CNN, CNN+LSTM, MRF
• Web of relationships
Images courtesy: Pixabay.com

Recipe for ML training
• Shortlist ML frameworks
• Prepare training, validation, and test sets
• Train, validate, repeat
• Use test data only once

ML gives a model: Supervised Machine Learning System – Training
• Function fθ(xi), with parameters θ and hyper-parameters (and regularization)
• Target output ti
• Utility of the model: bring fθ(xi) close to ti
• Loss function: minimize L(ti, fθ(xi), θ)
• Learning algorithm (a minimal training-loop sketch follows below)
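A minimal sketch of this recipe for a supervised model fθ(x) trained by gradient descent on L(t, fθ(x), θ). This assumes NumPy; the linear model, toy data, and learning rate are illustrative choices, not prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
t = 3.0 * x + 0.5 + 0.1 * rng.standard_normal(100)   # targets t_i (toy data)

theta = np.zeros(2)                                   # parameters: [w, b]
lr = 0.1                                              # hyper-parameter

def f(theta, x):
    return theta[0] * x + theta[1]                    # model f_theta(x_i)

def loss(theta):
    return np.mean((t - f(theta, x)) ** 2)            # L(t_i, f_theta(x_i)): MSE

# Learning algorithm: gradient descent brings f_theta(x_i) close to t_i
for step in range(200):
    err = f(theta, x) - t
    grad = np.array([np.mean(2 * err * x), np.mean(2 * err)])
    theta -= lr * grad

print("theta =", theta.round(3), " final loss =", round(loss(theta), 4))
```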
Loss function tells how bad the model is
• Loss trends opposite of accuracy
  • Loss is low when accuracy is high
  • Loss is high when accuracy is low
  • Loss is zero for perfect accuracy (by convention)
• Loss is a function of actual and desired output
• Minimizing the loss function with respect to parameters leads to good parameters
• Note: low loss on training does not guarantee low loss on validation or testing

Properties of a good loss function
• Minimum value for perfect accuracy (usually zero)
• Varies smoothly with input
• Varies smoothly with parameters
• Good to be convex in parameters (but usually is not)
  • Like a paraboloid

Convex vs. non-convex loss
Non-convex loss can have multiple minima
(figure: error surface with multiple local minima)

Examples of loss functions
• Regression with continuous output
  • Mean square error (MSE), log MSE, mean absolute error
• Classification with probabilistic output
  • Cross entropy (negative log likelihood), hinge loss
• Similarity between vectors or clustering
  • Euclidean distance, cosine

MSE loss for regression
• Model: $f(x_i) = w x_i + b$
• Error: $y_i - f(x_i)$
• Square error: $(y_i - f(x_i))^2$
• MSE: $\frac{1}{N}\sum_{i=1}^{N} (y_i - f(x_i))^2$
(figure: data points and a fitted line, error shown on the y-axis)
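A minimal sketch (assuming NumPy) of the error, squared error, and MSE above for a linear model f(x) = wx + b; the data values are made up for illustration.

```python
import numpy as np

# Toy data, roughly y = 2x + 1 with noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def mse(w, b):
    """Mean square error of the linear model f(x) = w*x + b."""
    pred = w * x + b
    err = y - pred                    # error y_i - f(x_i)
    return np.mean(err ** 2)          # mean of the squared errors

print(mse(2.0, 1.0))                  # close to the data-generating line -> small loss
print(mse(0.0, 0.0))                  # a poor fit -> large loss
```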
Is MSE always appropriate?
(figure: a linear fit pulled away from the trend by a single outlier)

MAE loss is less affected by outliers than MSE (comparison sketched below)
• Error: $y_i - f(x_i)$
• Absolute error: $|y_i - f(x_i)|$
• MAE: $\frac{1}{N}\sum_{i=1}^{N} |y_i - f(x_i)|$

Is MSE appropriate for classification?
• Model: $f(x_i) = \frac{1}{1 + e^{-(w x_i + b)}}$ (a sigmoid, with output between 0 and 1)
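To illustrate the MSE-vs-MAE comparison above, a minimal NumPy sketch showing how much a single outlier changes each loss; the data values are made up for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_clean = 2.0 * x + 1.0                      # data exactly on the line f(x) = 2x + 1
y_outlier = y_clean.copy()
y_outlier[-1] += 20.0                        # corrupt one point with a large outlier

pred = 2.0 * x + 1.0                         # predictions from the "true" line

for name, y in [("clean", y_clean), ("with outlier", y_outlier)]:
    mse = np.mean((y - pred) ** 2)
    mae = np.mean(np.abs(y - pred))
    print(f"{name:13s}  MSE = {mse:6.2f}   MAE = {mae:5.2f}")
# The outlier inflates MSE quadratically (by 20^2 / N) but MAE only linearly (by 20 / N).
```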
• KL-divergence of q(x) from p(x)
(figure: two distributions p(x) and q(x) over f(x), between 0 and 1)

Underfitting vs. overfitting
(figure: model fits of increasing complexity; performance on training vs. validation data)

Regularization
• Regularization may worsen the fit on training data
  ● However, it may improve fit on validation and test data
• Regularized cost for polynomial regression:
  $J = \sum_{i=1}^{N} \big(y_i - f(x_i)\big)^2 + \lambda \sum_{j=1}^{d} w_j^2, \quad f(x_i) = \sum_{j=0}^{d} w_j x_i^j$
Original image source unknown
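A minimal sketch of minimizing the regularized cost J above in closed form (ridge regression on polynomial features). It assumes NumPy; the degree, λ values, and toy data are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(20)    # toy data

d = 9                                       # polynomial degree, deliberately high (prone to overfitting)
X = np.vander(x, d + 1, increasing=True)    # columns: x^0, x^1, ..., x^d

def fit_ridge(lam):
    """Minimize J = sum_i (y_i - X w)^2 + lam * sum_{j>=1} w_j^2 in closed form."""
    penalty = lam * np.eye(d + 1)
    penalty[0, 0] = 0.0                     # conventionally the bias w_0 is not penalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

for lam in [0.0, 1e-3, 1.0]:
    w = fit_ridge(lam)
    train_mse = np.mean((y - X @ w) ** 2)
    print(f"lambda={lam:6g}  train MSE={train_mse:.4f}  ||w||={np.linalg.norm(w):.2f}")
# Larger lambda increases training error slightly but shrinks the weights,
# which typically improves the fit on held-out data for noisy problems.
```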
Other forms of regularization
• Convolutional filter structure in CNN neurons
• Max-pooling
• Dropout
• L1-regularization (sparsity-inducing norm)
  • Penalty on the sum of absolute values of the weights

Preparing data for training and validation
• Data splits:
  • Training: used to optimize the parameters (e.g. random 70%)
  • Validation: used to compare models (e.g. random 15%)
  • Testing: one final check after multiple rounds of validation (e.g. random 15%)
• Cross-validation:
  • K folds: one fold for validation, K−1 folds for training
  • Rotate the folds K times
  • Select the framework (hyper-parameters) with the best average performance
  • Re-train the best framework on the entire data
  • Test one final time on held-out data that was not a part of any fold
  • (See the sketch below.)

Cross-validation
● Model performance measurement is dependent on the way the data is split
● A single split is not representative of the model's ability to generalize
● Solution: cross-validation, especially when data is limited
● Con: more computation
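A minimal K-fold cross-validation sketch following the steps above. It assumes scikit-learn and uses ridge regression as a stand-in model; the slides do not prescribe a specific library or model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
# Hold out final test data that is not part of any fold.
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
candidates = [0.01, 1.0, 100.0]            # hyper-parameter (regularization strength) to compare

best_alpha, best_score = None, -np.inf
for alpha in candidates:
    scores = []
    for train_idx, val_idx in kf.split(X_cv):
        model = Ridge(alpha=alpha).fit(X_cv[train_idx], y_cv[train_idx])
        scores.append(model.score(X_cv[val_idx], y_cv[val_idx]))   # R^2 on the validation fold
    avg = np.mean(scores)
    if avg > best_score:
        best_alpha, best_score = alpha, avg

# Re-train the best framework on all cross-validation data; test once on held-out data.
final_model = Ridge(alpha=best_alpha).fit(X_cv, y_cv)
print("best alpha:", best_alpha, " test R^2:", round(final_model.score(X_test, y_test), 3))
```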
• Time series: stocks, power consumption, …
• Generative AI
  • GAN, VAE
  • Diffusion
• Critical decisions: medical diagnosis, criminal forensics, …
(figures: example time-series plots of value over time)
Image courtesy: Pixabay.com
Discriminative AI distinguishes between inputs
• Vision
  • Image classification: Does the image have a cat or a dog?
  • Image segmentation: Which pixels represent a cat?
  • Object detection: Put a box on the cat.
• NLP
  • Text classification: Is this a positive or a negative review?
  • Text annotation: Where is the positive clause in this paragraph?
• Audio
• Video

Large number of dimensions on a grid
• What features should we extract from an image?
• Should pixels be features?
• Is Euclidean distance a good metric?

Learning from similar domains
• If we have lots of labeled digits
• But only a few labeled examples of an ancient script
• Can we transfer learning?
Where would basic ML struggle?

Scaling feature extraction with input
• More samples
• Higher dimensional samples

Depth vs. width
(figure: a network with input nodes x1 … xd, hidden nodes h11 … h1n1, and output nodes y1 … yn)

Data dimensions form a graph
• What if variables are related as graphs?
• What if different variables are missing in different samples?
How to use unlabeled data

What if labels are given to bags of samples

Advances in AI – image recognition
Source: devopedia.org/imagenet
How to generate more samples
(figure: discriminative vs. generative modeling)

How does generative AI work?
• We pit "counterfeiter" and "police" models to compete with each other (see the sketch below)

Generative AI – "Gandhi in Monet style"
Source: openai.com/product/dall-e-2

"It's time to cast aside conformity. It's time to exorcise the expected. It's time to decline the indistinguishable. For years the world has been moving in the same stylistic direction. And it's time we reintroduced some originality."
Source: chat.openai.com

• Characteristics
  • Autonomy
  • Goal-oriented
  • Adaptability
  • Proactive
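To make the counterfeiter-vs.-police idea concrete, here is a minimal GAN-style training loop. This is only a sketch under stated assumptions (PyTorch, a 1-D toy data distribution, arbitrary layer sizes and learning rates); it is not taken from the slides.

```python
import torch
import torch.nn as nn

# "Counterfeiter" (generator) and "police" (discriminator) networks
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0          # "real" data drawn from N(3, 0.5)
    fake = G(torch.randn(64, 8))                   # counterfeits from random noise

    # Police step: label real samples 1, fake samples 0
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_d.step()

    # Counterfeiter step: try to make the police call fakes real
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(64, 1))
    g_loss.backward()
    opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())       # should drift toward 3.0
```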
Predictions
• Generative AI will plateau and the hype will die down
• Realism will set in, but more use cases will be discovered
• Businesses will become savvy about evaluating AI
• Agentic AI will become the buzzword
• Human feedback will make a comeback
• Debate about ethics and regulations will gather pace

Costs and concerns
• Cost – dedicated hardware or cloud services, and manpower
• Power consumption and impact on the environment
• Concentration of AI talent and innovations
• Conformity
• Loss of nuance
• Tailoring to in-house data and knowledge

(diagram: Discriminative AI, Generative AI, Actuators)
The problem with sigmoid is (near) zero gradient on both extremes
• For both large positive and large negative input values, sigmoid doesn't change much with a change of input
• ReLU has a constant gradient for almost half of the inputs
• But ReLU cannot give a meaningful final output

Output activation functions can only be of the following kinds
• Sigmoid gives binary classification output
• Tanh can also do that, provided the desired output is in {−1, +1}
• Softmax generalizes sigmoid to n-ary classification
• Linear is used for regression
• ReLU is only used in internal (non-output) nodes
(A sketch of these activations and their gradients follows below.)

Basic structure of a neural network
• It is feed forward
  • Connections go from inputs towards outputs
  • No connection comes backwards
• It consists of layers
  • The current layer's input is the previous layer's output
  • No lateral (intra-layer) connections
• That's it!
(figure: layers of nodes x1 … xd → h11 … h1n1 → … → y1 … yn)
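A minimal NumPy sketch of the activation functions discussed above and their gradients, illustrating the near-zero sigmoid gradient at both extremes; the specific input values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # ~0 for large |z|: the vanishing-gradient problem

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)    # constant gradient of 1 for all positive inputs

def softmax(z):
    e = np.exp(z - np.max(z))       # shift for numerical stability
    return e / e.sum()

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("sigmoid grad:", np.round(sigmoid_grad(z), 4))                     # tiny at -10 and +10
print("relu grad:   ", relu_grad(z))                                     # 0 or 1
print("softmax:     ", np.round(softmax(np.array([1.0, 2.0, 3.0])), 3))  # sums to 1
```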
Basic structure of a neural network
• Output layer
  • Represents the output of the neural network
  • For a two-class problem or regression with a 1-d output, we need only one output node
  • Usually there is only one such layer
• Hidden layer(s)
  • Represent the intermediary nodes that divide the input space into regions with (soft) boundaries
  • These usually form a hidden layer
  • Given enough hidden nodes, we can model an arbitrary input-output relation
• Input layer
  • Represents the dimensions of the input vector (one node for each dimension)
  • These usually form an input layer, and usually there is only one such layer
(figure: layers of nodes x1 … xd → h11 … h1n1 → … → y1 … yn)

Importance of hidden layers
• The first hidden layer extracts features (a single sigmoid gives one soft boundary between + and − regions)
• The second hidden layer extracts features of features
• …
• The output layer and its sigmoid give the desired output
(figure: + and − classes separated by boundaries formed by sigmoid hidden layers)

Overall function of a neural network
• $f(\mathbf{x}_i) = g_L(\mathbf{W}_L \, g_{L-1}(\mathbf{W}_{L-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i)\cdots))$
• The weights form a matrix
• The outputs of the previous layer form a vector
• The activation (nonlinear) function is applied point-wise to the weight matrix times the input
• Design questions (hyper-parameters):
  • Number of layers
  • Number of neurons in each layer (rows of the weight matrices)
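A minimal NumPy sketch of the overall function $f(\mathbf{x}) = g_L(\mathbf{W}_L \, g_{L-1}(\cdots g_1(\mathbf{W}_1 \mathbf{x})\cdots))$ with point-wise activations; the layer sizes and random weights are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Weight matrices: rows = number of neurons in a layer, columns = size of the previous layer.
sizes = [4, 8, 8, 3]                        # input dim 4, two hidden layers of 8, output dim 3
Ws = [rng.standard_normal((sizes[k + 1], sizes[k])) * 0.1 for k in range(len(sizes) - 1)]
activations = [relu, relu, softmax]         # g_1, g_2, g_L

def forward(x):
    """f(x) = g_L(W_L * g_{L-1}(... g_1(W_1 * x) ...)), applied point-wise."""
    a = x
    for W, g in zip(Ws, activations):
        a = g(W @ a)                        # the previous layer's output is this layer's input
    return a

x = rng.standard_normal(4)
print(forward(x))                           # a length-3 probability vector (sums to 1)
```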
Training the neural network
• Given $\mathbf{x}_i$ and $y_i$
• Think of what hyper-parameters and neural network design might work
• Form a neural network: $f(\mathbf{x}_i) = g_L(\mathbf{W}_L \, g_{L-1}(\mathbf{W}_{L-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i)\cdots))$
• Compute $f_{\mathbf{w}}(\mathbf{x}_i)$ as an estimate of $y_i$ for all samples
• Compute the loss: $\frac{1}{N}\sum_{i=1}^{N} L(f_{\mathbf{w}}(\mathbf{x}_i), y_i) = \frac{1}{N}\sum_{i=1}^{N} l_i(\mathbf{w})$
• Tweak $\mathbf{w}$ to reduce the loss (optimization algorithm)
• Repeat the last three steps

Loss function choice
• There are positive and negative errors in classification and regression, for which MSE is the most common loss function
• There is a probability of the correct class in classification, for which cross entropy is the most common loss function

Some loss functions and their derivatives
• Terminology
  • $y$ is the output
  • $t$ is the target output
• Mean square error
  • Loss: $(y - t)^2$
  • Derivative of the loss: $2(y - t)$
• Cross entropy
  • Loss: $-\sum_{c=1}^{C} t_c \log y_c$
  • Derivative of the loss: $-\frac{t_c}{y_c}$ (i.e. $-\frac{1}{y_c}$ for the true class)
(figures: loss vs. error for MSE and cross entropy)
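A minimal NumPy sketch of the two losses above and their derivatives; the example outputs and targets are made up for illustration.

```python
import numpy as np

def mse_loss(y, t):
    return (y - t) ** 2

def mse_grad(y, t):
    return 2.0 * (y - t)

def cross_entropy_loss(y, t):
    """y: predicted class probabilities, t: one-hot target."""
    return -np.sum(t * np.log(y))

def cross_entropy_grad(y, t):
    return -t / y                         # -1/y_c for the true class, 0 elsewhere

# Regression example
print(mse_loss(2.5, 3.0), mse_grad(2.5, 3.0))      # 0.25, -1.0

# Classification example (3 classes, true class is the second one)
y = np.array([0.2, 0.7, 0.1])
t = np.array([0.0, 1.0, 0.0])
print(round(cross_entropy_loss(y, t), 3))          # -log(0.7) ≈ 0.357
print(np.round(cross_entropy_grad(y, t), 3))       # [0, -1/0.7, 0]
```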
Computational graph of a single hidden layer NN
x → (× W1, + b1) → Z1 → ReLU → A1 → (× W2, + b2) → Z2 → SoftMax → A2 → CE Loss ← target
(A worked sketch with one gradient step follows below.)

Overall function of a neural network (recap)
• $f(\mathbf{x}_i) = g_L(\mathbf{W}_L \, g_{L-1}(\mathbf{W}_{L-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i)\cdots))$: the weights form a matrix, the previous layer's outputs form a vector, and the activation is applied point-wise
• Design questions (hyper-parameters): number of layers, number of neurons in each layer (rows of the weight matrices)

Training the neural network (recap)
• Given $\mathbf{x}_i$ and $y_i$: pick hyper-parameters and a network design, compute $f_{\mathbf{w}}(\mathbf{x}_i)$ for all samples, compute the loss $\frac{1}{N}\sum_{i=1}^{N} L(f_{\mathbf{w}}(\mathbf{x}_i), y_i)$, tweak $\mathbf{w}$ to reduce the loss, and repeat
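A minimal NumPy sketch of the computational graph above: a single-hidden-layer network with ReLU, softmax, cross-entropy loss, and one manual gradient-descent step. The sizes, random data, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                 # one input sample (4 dimensions)
target = np.array([0.0, 1.0, 0.0])         # one-hot target over 3 classes

W1, b1 = rng.standard_normal((8, 4)) * 0.1, np.zeros(8)
W2, b2 = rng.standard_normal((3, 8)) * 0.1, np.zeros(3)
lr = 0.1

# Forward pass: x -> (W1, b1) -> Z1 -> ReLU -> A1 -> (W2, b2) -> Z2 -> SoftMax -> A2 -> CE loss
Z1 = W1 @ x + b1
A1 = np.maximum(0.0, Z1)
Z2 = W2 @ A1 + b2
A2 = np.exp(Z2 - Z2.max()); A2 /= A2.sum()
loss = -np.sum(target * np.log(A2))

# Backward pass (chain rule through the graph)
dZ2 = A2 - target                          # softmax + cross-entropy combine to this simple form
dW2 = np.outer(dZ2, A1); db2 = dZ2
dA1 = W2.T @ dZ2
dZ1 = dA1 * (Z1 > 0)                       # ReLU gradient
dW1 = np.outer(dZ1, x); db1 = dZ1

# One gradient-descent step ("tweak w to reduce the loss")
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print("loss before the step:", round(loss, 4))
```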
Second derivative, the perfect learning rate, and Newton's method
• For a quadratic $f(x) = ax^2 + bx + c$, the step from $x$ to the minimum is
  $x^* - x = -\frac{b}{2a} - x = -\frac{2ax + b}{2a} = -\frac{f'(x)}{f''(x)}$
• So, the perfect learning rate is $\eta^* = \frac{1}{f''(x)}$
• The double derivative tells how far the minimum might be from a given point
• From $x = 0$ the minimum is closer for the red dashed curve than for the blue solid curve, because the former has a larger second derivative (its slope reverses faster)
• In multiple dimensions, $\mathbf{x} \leftarrow \mathbf{x} - H(f(\mathbf{x}))^{-1} \nabla f(\mathbf{x})$
• If all eigenvalues of a Hessian matrix are positive, then the function is convex
• Practically, we do not want to compute the inverse of a Hessian matrix, so we approximate the Hessian inverse
(figures: two quadratic curves of differing curvature; a convex surface f(x1, x2))
Original image source unknown
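A minimal NumPy sketch of the "perfect learning rate" $\eta^* = 1/f''(x)$ for a 1-D quadratic and of the multi-dimensional Newton step $\mathbf{x} \leftarrow \mathbf{x} - H^{-1}\nabla f$; the 1-D coefficients and starting points are illustrative, and the 2-D function is the Hessian example from the next slide.

```python
import numpy as np

# 1-D quadratic f(x) = a x^2 + b x + c: one step with eta* = 1/f''(x) lands on the minimum.
a, b, c = 2.0, -4.0, 1.0
fprime = lambda x: 2 * a * x + b
fsecond = 2 * a
x = 5.0
x_new = x - (1.0 / fsecond) * fprime(x)
print(x_new, -b / (2 * a))                 # both 1.0: the exact minimizer

# Multi-dimensional Newton step for f(x1, x2) = 5 x1^2 + 3 x2^2 + 4 x1 x2
H = np.array([[10.0, 4.0], [4.0, 6.0]])    # Hessian (constant for a quadratic)
grad = lambda v: np.array([10 * v[0] + 4 * v[1], 6 * v[1] + 4 * v[0]])
v = np.array([3.0, -2.0])
v = v - np.linalg.solve(H, grad(v))        # solve a linear system instead of inverting H
print(v)                                   # [0, 0], the minimum of this convex quadratic
```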
Example of Hessian
• Let $f(\mathbf{x}) = f(x_1, x_2) = 5x_1^2 + 3x_2^2 + 4x_1 x_2$
• Then $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \begin{bmatrix} \partial f/\partial x_1 \\ \partial f/\partial x_2 \end{bmatrix} = \begin{bmatrix} 10x_1 + 4x_2 \\ 6x_2 + 4x_1 \end{bmatrix}$
• And $H(f(\mathbf{x})) = \begin{bmatrix} f_{x_1 x_1} & f_{x_1 x_2} \\ f_{x_2 x_1} & f_{x_2 x_2} \end{bmatrix} = \begin{bmatrix} 10 & 4 \\ 4 & 6 \end{bmatrix}$

Saddle points, Hessian and long local furrows
• Some variables may have reached a local minimum while others have not
• Some weights may have almost zero gradient
• At least some eigenvalues may not be negative (eigenvalue check sketched below)
(figure: a saddle point)

Complicated loss functions
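A minimal NumPy sketch of checking Hessian eigenvalues, using the example above plus two illustrative (not from the slides) Hessians for a saddle point and a long, nearly flat furrow.

```python
import numpy as np

# Example from the slide: f(x1, x2) = 5 x1^2 + 3 x2^2 + 4 x1 x2
H_convex = np.array([[10.0, 4.0], [4.0, 6.0]])
print(np.linalg.eigvalsh(H_convex))        # both positive -> convex, a single minimum

# A saddle-shaped function (illustrative): g(x1, x2) = x1^2 - x2^2
H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])
print(np.linalg.eigvalsh(H_saddle))        # mixed signs -> saddle point at the origin

# A long, nearly flat furrow (illustrative): one eigenvalue is almost zero,
# so the gradient is almost zero along that direction and progress is slow.
H_furrow = np.array([[2.0, 0.0], [0.0, 1e-4]])
print(np.linalg.eigvalsh(H_furrow))
```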
Tentative list of topics
• Neural architectures - I
  • Vision
• Training methods - II
  • Lack of labels

Next week
• Neural architectures for vision
  • LeNet

A realistic picture