Convolutional neural networks
Outline
▪ Brief recap of neural networks
• Application: autoencoders
▪ Convolutional neural networks
[Course map: Introduction, Linear regression, Logistic regression, Feature engineering, Data statistics, Naive Bayes, KNN, Clustering, Dimensionality reduction, Neural networks, Convolutional neural networks, Decision trees]
Last time
“3-layer Neural Net”, or
“2-hidden-layer Neural Net”
Additional resource: https://www.deeplearningbook.org/
Example application of Neural Networks
Autoencoder
Recall - dimensionality reduction
PCA
Autoencoder
Autoencoder vs. PCA
Top: Some examples of the original MNIST test samples
Middle: Reconstructed output from an autoencoder with a latent space of 8 dimensions; this autoencoder uses convolutional layers and was trained on the MNIST training set
Bottom: Reconstructed output from PCA with 8 latent dimensions
Image credit: F. Fleuret, Deep Learning (EPFL)
Autoencoder
Autoencoder vs. PCA
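To make the idea concrete, here is a minimal PyTorch sketch of an autoencoder with an 8-dimensional latent space, assuming flattened 28x28 MNIST inputs (the autoencoder in the comparison above uses convolutional layers instead; the layer sizes here are illustrative):

import torch.nn as nn

latent_dim = 8  # size of the compressed representation

encoder = nn.Sequential(
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, latent_dim),             # compress to 8 latent dimensions
)
decoder = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, 28 * 28), nn.Sigmoid(),  # reconstruct pixel intensities in [0, 1]
)
autoencoder = nn.Sequential(encoder, decoder)
# Trained by minimizing a reconstruction loss, e.g. nn.MSELoss()(autoencoder(x), x)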
Training for NN
Goal of optimization in ML:
Minimize cost over batch: ∑_{i=1}^{N} L^(i), where L^(i) = L(y^(i), ŷ^(i)) is the loss of the i-th training example of the batch
Want the optimization to:
• Converge quickly
• Find a good local minimum (or even the global minimum)
Gradient descent (and variants) is the preferred way to
optimize neural networks
The choice of optimizer and hyper-parameters affects the speed of convergence and the kind of local minimum found
A. Amini et al. Spatial Uncertainty Sampling for
End-to-End Control, 2019
Gradient descent variants
(Vanilla / Batch) Gradient descent (GD):
▪ J = (1/N) ∑_{i=1}^{N} L^(i)
▪ Weights updated after calculating the gradient over the entire dataset
• slow
• requires large memory
Stochastic gradient descent (SGD):
▪ J = L^(i)
▪ Weights updated after calculating the gradient of a single example
• requires much less memory than GD
• high variance in parameter updates
Mini-batch gradient descent:
▪ J = (1/N_b) ∑_{i=1}^{N_b} L^(i)
▪ Weights updated after calculating the gradient over a mini-batch of N_b examples
• Faster than SGD (vectorized computation over the mini-batch)
• Reduces variance of the gradient estimation compared to SGD
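In PyTorch-style code, the three variants differ only in how many examples contribute to each gradient step. A minimal sketch, assuming a generic model, loss_fn, and dataset (all hypothetical placeholders):

import torch
from torch.utils.data import DataLoader

def train_one_epoch(model, loss_fn, dataset, batch_size, lr=0.01):
    # batch_size = len(dataset) -> batch GD, 1 -> SGD, anything in between -> mini-batch GD
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in DataLoader(dataset, batch_size=batch_size, shuffle=True):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)   # J averaged over the current batch
        loss.backward()               # gradient of J w.r.t. the weights
        optimizer.step()              # weight update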
Optimization
Learning rate
Image credit: Jeremy Jordan (https://www.jeremyjordan.me/nn-learning-rate/)
Optimization
Optimizers
Variants of gradient descent are commonly used in practice to speed up and improve convergence:
▪ Momentum update
▪ Nesterov Accelerated Gradient (NAG)
▪ Adam
▪ and more…
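In PyTorch, switching between these optimizers is a one-line change; a small sketch (the model and the learning rates are illustrative only):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model

sgd      = torch.optim.SGD(model.parameters(), lr=0.01)                               # vanilla SGD
momentum = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)                 # momentum update
nag      = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)  # NAG
adam     = torch.optim.Adam(model.parameters(), lr=1e-3)                              # Adam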
Convolutional Neural Networks
Real-World Problem
Detecting and Classifying Pavement Distress
Why? On-time preventive maintenance
Lack of on-time maintenance leads to:
• 3x the maintenance cost
• traffic delay
• more fuel consumption
• accidents
• …
Automatic pavement distress monitoring
[Figure: pavement distress]
Convolutional Neural Networks (CNN)
Intro - Handling images with fully-connected NN
A 3x32x32 image (height 32, width 32, depth 3 for the R, G, B color channels) is flattened into a 3072x1 input vector.
By flattening, spatial structure gets lost!
Convolutional Neural Networks
Intro - Handling images with fully-connected NN
A fully-connected neural net:
▪ Requires flattening the image → spatial structure gets lost
▪ Doesn't scale well to large images
• e.g. a 1024x1024x3 image results in 3'145'728 weights for each neuron of the first hidden layer
How to efficiently model correlation between neighboring pixels?
=> Convolutional Neural Networks
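To make the scaling issue concrete, a small back-of-the-envelope check (the 1000-neuron hidden layer is a hypothetical choice):

# Weight count for one fully-connected hidden layer on a flattened image.
height, width, channels = 1024, 1024, 3
in_features = height * width * channels     # 3,145,728 inputs after flattening

print(in_features)                          # 3145728 weights for each neuron
print(in_features * 1000 + 1000)            # ~3.1 billion parameters for a 1000-neuron layer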
Convolution definition
Convolution of a signal with a filter
Convolution - what to do at boundaries
Convolution to get features
Filters can approximate what happens in a neighborhood with a few numbers:
• give the average value of the signal in a neighborhood
• sharpen the signal
• blur the signal
• approximate the derivative of the signal in a neighborhood
• approximate the second derivative of the signal in a neighborhood
Convolution with a filter is a linear function
Convolution extensions
Example of image convolution
-1 -1 -1
-1 8 -1
-1 -1 -1
https://muthu.co/basics-of-image-convolution/
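The kernel above responds strongly wherever a pixel differs from its neighbors, so it acts as an edge detector. A minimal sketch using SciPy (the random image is just a stand-in):

import numpy as np
from scipy.signal import convolve2d

kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])               # the edge-detection kernel shown above

image = np.random.rand(32, 32)                  # stand-in grayscale image
edges = convolve2d(image, kernel, mode='same')  # keeps the 32x32 size
print(edges.shape)                              # (32, 32)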
Convolutional Neural Networks
2D Convolution computation example
To show how the convolution operation is computed, let's use a simpler example:
a 5x5 input and a 3x3 filter.

Input (5x5):
1  0  3  0  2
0  3  4  0  2
1  0  2  0  1
8 12  0  1  0
0  6  3  2  0

Filter (3x3):
 1 -1  0
 0  2 -2
-1  0  2

Bias: b = 0
Convolutional Neural Networks
Convolution computation example
Sliding the 3x3 filter over the 5x5 input with a stride of 1 produces a 3x3 output. Each output entry is the dot product between the filter and the corresponding 3x3 window of the input, plus the bias b = 0:

(1,1): 1x1 + 0x(-1) + 3x0 + 0x0 + 3x2 + 4x(-2) + 1x(-1) + 0x0 + 2x2 + 0 = 2
(1,2): 0x1 + 3x(-1) + 0x0 + 3x0 + 4x2 + 0x(-2) + 0x(-1) + 2x0 + 0x2 + 0 = 5
(1,3): 3x1 + 0x(-1) + 2x0 + 4x0 + 0x2 + 2x(-2) + 2x(-1) + 0x0 + 1x2 + 0 = -1
(2,1): 0x1 + 3x(-1) + 4x0 + 1x0 + 0x2 + 2x(-2) + 8x(-1) + 12x0 + 0x2 + 0 = -15
(2,2): 3x1 + 4x(-1) + 0x0 + 0x0 + 2x2 + 0x(-2) + 12x(-1) + 0x0 + 1x2 + 0 = -7
(2,3): 4x1 + 0x(-1) + 2x0 + 2x0 + 0x2 + 1x(-2) + 0x(-1) + 1x0 + 0x2 + 0 = 2
(3,1): 1x1 + 0x(-1) + 2x0 + 8x0 + 12x2 + 0x(-2) + 0x(-1) + 6x0 + 3x2 + 0 = 31
(3,2): 0x1 + 2x(-1) + 0x0 + 12x0 + 0x2 + 1x(-2) + 6x(-1) + 3x0 + 2x2 + 0 = -6
(3,3): 2x1 + 0x(-1) + 1x0 + 0x0 + 1x2 + 0x(-2) + 3x(-1) + 2x0 + 0x2 + 0 = 1

Output (3x3):
  2   5  -1
-15  -7   2
 31  -6   1
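A quick way to check these numbers is to run the same input and filter through PyTorch; note that conv2d computes the sliding dot product without flipping the filter, which is exactly the operation used above (a sketch):

import torch
import torch.nn.functional as F

x = torch.tensor([[1., 0., 3., 0., 2.],
                  [0., 3., 4., 0., 2.],
                  [1., 0., 2., 0., 1.],
                  [8., 12., 0., 1., 0.],
                  [0., 6., 3., 2., 0.]]).view(1, 1, 5, 5)   # (batch, channels, H, W)
w = torch.tensor([[1., -1., 0.],
                  [0., 2., -2.],
                  [-1., 0., 2.]]).view(1, 1, 3, 3)          # (out_ch, in_ch, F, F)

out = F.conv2d(x, w, bias=torch.tensor([0.]), stride=1)
print(out.squeeze())
# tensor([[  2.,   5.,  -1.],
#         [-15.,  -7.,   2.],
#         [ 31.,  -6.,   1.]])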
Convolution example
Example from ML4Engineers book.
Convolutional Neural Networks (CNN)
Convolution Layer
Consider a 3x32x32 image: height 32, width 32, depth 3 for the 3 color channels R, G, B.
A pixel can be represented by a vector of 3 color (R, G, B) intensities.
In PyTorch, images are represented as (CxHxW), indexed as I(c, h, w):
• C: number of channels (depth)
• H: height
• W: width
Convolutional Neural Networks
Convolution Layer
Convolve a 3x5x5 filter with the 3x32x32 image, i.e. “slide over the image spatially, computing dot products”.
Filters always extend the full depth of the input volume.
Note: filters are sometimes referred to as kernels.
Convolutional Neural Networks
Convolution Layer
At each position of the 3x5x5 filter (w) on the 3x32x32 image, we get 1 number:
▪ The result of taking a dot product between the filter and a small 3x5x5 chunk of the image
• (i.e. a 5x5x3 = 75-dimensional dot product + bias)
Convolutional Neural Networks
Convolution Layer
Convolving (sliding) the 3x5x5 filter (w) over all spatial locations of the 3x32x32 image produces a 1x28x28 activation map.
Convolutional Neural Networks
Convolution Layer
Consider a second 3x5x5 filter: performing the same convolution operation with this filter gives a second 1x28x28 activation map.
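In PyTorch, this layer corresponds to nn.Conv2d; a minimal sketch with two 3x5x5 filters (the random image is just a placeholder):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                # one 3x32x32 image

conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=5)  # two 3x5x5 filters
print(conv(x).shape)                         # torch.Size([1, 2, 28, 28]) -> two 28x28 activation maps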
Pooling
Pooling example
Max-pooling with a stride of 2:

Input (1x6x6):
3  0  1  0  2  4
0  1  8 12  0  0
4  0  0  3  2  2
2  0  1  0  1  1
3  2  0  6  0  5
1  0  6  0  0  9

Output (1x3x3):
3 12  4
4  3  2
3  6  9
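The same result can be reproduced with PyTorch's nn.MaxPool2d (a sketch; the 2x2 window size is implied by the 6x6 → 3x3 shapes above):

import torch
import torch.nn as nn

x = torch.tensor([[3., 0., 1., 0., 2., 4.],
                  [0., 1., 8., 12., 0., 0.],
                  [4., 0., 0., 3., 2., 2.],
                  [2., 0., 1., 0., 1., 1.],
                  [3., 2., 0., 6., 0., 5.],
                  [1., 0., 6., 0., 0., 9.]]).view(1, 1, 6, 6)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).squeeze())
# tensor([[ 3., 12.,  4.],
#         [ 4.,  3.,  2.],
#         [ 3.,  6.,  9.]])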
Pooling example
Example from ML4Engineers book: max pooling with a stride of 2 applied to the image (after a Laplacian kernel has been applied to the image).
Convolutional Neural Networks
Pooling layer
CNNs may include pooling layers to reduce the spatial size of the representation
Pooling layers require two hyper-parameters: their spatial extent F and their stride S
▪ The most common pooling layer uses 2x2 filters with a stride of 2 (F = 2, S = 2)
Convolution applied with a stride
We can also apply convolution with a stride of 2.

Input (1x5x5):
1  0  3  0  2
0  3  4  0  2
1  0  2  0  1
8 12  0  1  0
0  6  3  2  0

Filter (1x3x3):
 1 -1  0
 0  2 -2
-1  0  2

Bias: b = 0
Convolutional Neural Networks
Changing the stride
Back to our simple example, but with a stride of 2: the 3x3 filter now moves 2 positions at a time, so the output is only 2x2.

(1,1): 1x1 + 0x(-1) + 3x0 + 0x0 + 3x2 + 4x(-2) + 1x(-1) + 0x0 + 2x2 + 0 = 2
(1,2): 3x1 + 0x(-1) + 2x0 + 4x0 + 0x2 + 2x(-2) + 2x(-1) + 0x0 + 1x2 + 0 = -1
(2,1): 1x1 + 0x(-1) + 2x0 + 8x0 + 12x2 + 0x(-2) + 0x(-1) + 6x0 + 3x2 + 0 = 31
(2,2): 2x1 + 0x(-1) + 1x0 + 0x0 + 1x2 + 0x(-2) + 3x(-1) + 2x0 + 0x2 + 0 = 1

Output (1x2x2):
 2 -1
31  1
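The same stride-2 result can be checked with PyTorch (a sketch, reusing the input and filter from the stride-1 example):

import torch
import torch.nn.functional as F

x = torch.tensor([[1., 0., 3., 0., 2.],
                  [0., 3., 4., 0., 2.],
                  [1., 0., 2., 0., 1.],
                  [8., 12., 0., 1., 0.],
                  [0., 6., 3., 2., 0.]]).view(1, 1, 5, 5)
w = torch.tensor([[1., -1., 0.],
                  [0., 2., -2.],
                  [-1., 0., 2.]]).view(1, 1, 3, 3)

out = F.conv2d(x, w, stride=2)               # bias omitted since b = 0
print(out.squeeze())
# tensor([[ 2., -1.],
#         [31.,  1.]])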
Convolutional Neural Networks
Zero-padding
Height and width shrink quite quickly due to the repeated convolutions
To avoid this, we can add zero-padding:
Zero-padded input (1x7x7)
Input (1x5x5) 0 0 0 0 0 0 0
1 0 3 0 2 0 1 0 3 0 2 0
0 3 4 0 2 0 0 3 4 0 2 0
1 0 2 0 1 Zero-padding = 1 0 1 0 2 0 1 0
8 12 0 1 0 0 8 12 0 1 0 0
0 6 3 2 0 0 0 6 3 2 0 0
0 0 0 0 0 0 0
If we use a 3x3 filter with a stride of 1 on the padded input, we get a 5x5 output
→ same size as input
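In PyTorch this is the padding argument of conv2d; a minimal shape check (the all-ones filter is an arbitrary choice just to verify the output size):

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 5, 5)                  # any 1x5x5 input
w = torch.ones(1, 1, 3, 3)                   # arbitrary 3x3 filter

out = F.conv2d(x, w, stride=1, padding=1)    # zero-padding of 1 on each side
print(out.shape)                             # torch.Size([1, 1, 5, 5]) -> same size as the input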
Convolutional Neural Networks
Convolution layer summary
The convolution layer:
▪ Accepts a volume of size Cin × H1 × W1
▪ Requires four hyper-parameters:
• Number of filters K
• Spatial extent of the filters F
• Stride S
• Amount of zero padding P
▪ Produces a volume of size Cout × H2 × W2 where:
• Cout = K
• H2 = (H1 − F + 2P)/S + 1
• W2 = (W1 − F + 2P)/S + 1
Note: there are F ⋅ F ⋅ Cin weights per filter, for a total of (F ⋅ F ⋅ Cin) ⋅ K weights and K biases per layer
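These formulas translate directly into a couple of helper functions; a small sketch (the function names are my own):

def conv_output_size(h1, w1, f, s=1, p=0):
    """Spatial output size: H2 = (H1 - F + 2P)/S + 1, W2 = (W1 - F + 2P)/S + 1."""
    return (h1 - f + 2 * p) // s + 1, (w1 - f + 2 * p) // s + 1

def conv_num_params(c_in, k, f):
    """F*F*C_in weights per filter, K filters, plus K biases."""
    return f * f * c_in * k + k

print(conv_output_size(32, 32, f=5))         # (28, 28): the 3x32x32 -> 28x28 example
print(conv_output_size(5, 5, f=3, s=2))      # (2, 2): the stride-2 example
print(conv_num_params(c_in=3, k=2, f=5))     # 152 parameters for two 3x5x5 filters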
Convolutional neural net
From: Machine Learning for Engineers book
Convolutional Neural Networks
Perception tasks
Convolutional Neural Networks
Optional
Popular architectures
LeNet-5
LeCun et al., 1998
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 6, 28, 28] 156
ReLU-2 [-1, 6, 28, 28] 0
MaxPool2d-3 [-1, 6, 14, 14] 0
Conv2d-4 [-1, 16, 10, 10] 2,416
ReLU-5 [-1, 16, 10, 10] 0
MaxPool2d-6 [-1, 16, 5, 5] 0
Linear-7 [-1, 120] 48,120
ReLU-8 [-1, 120] 0
Linear-9 [-1, 84] 10,164
ReLU-10 [-1, 84] 0
Linear-11 [-1, 10] 850
Softmax-12 [-1, 10] 0
================================================================
Total params: 61,706
Note: -1 in the output shape represents the mini-batch dimension
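The table above corresponds to the following PyTorch sketch (a plausible reconstruction from the layer shapes; it assumes 1x32x32 inputs and 10 output classes):

import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # -> 6 x 28 x 28 (156 params)
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 6 x 14 x 14
    nn.Conv2d(6, 16, kernel_size=5),   # -> 16 x 10 x 10 (2,416 params)
    nn.ReLU(),
    nn.MaxPool2d(2),                   # -> 16 x 5 x 5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),        # 48,120 params
    nn.ReLU(),
    nn.Linear(120, 84),                # 10,164 params
    nn.ReLU(),
    nn.Linear(84, 10),                 # 850 params
    nn.Softmax(dim=1),
)

x = torch.randn(1, 1, 32, 32)
print(lenet5(x).shape)                                # torch.Size([1, 10])
print(sum(p.numel() for p in lenet5.parameters()))    # 61706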
Convolutional Neural Networks
Optional
Popular architectures
AlexNet
Krizhevsky et al., 2012
Winner of ImageNet Competition 2012
Convolutional Neural Networks
Optional
Popular architectures
VGG16
Simonyan & Zisserman, 2014
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 64, 224, 224] 1,792
ReLU-2 [-1, 64, 224, 224] 0
Conv2d-3 [-1, 64, 224, 224] 36,928
ReLU-4 [-1, 64, 224, 224] 0
MaxPool2d-5 [-1, 64, 112, 112] 0
Conv2d-6 [-1, 128, 112, 112] 73,856
ReLU-7 [-1, 128, 112, 112] 0
Conv2d-8 [-1, 128, 112, 112] 147,584
ReLU-9 [-1, 128, 112, 112] 0
MaxPool2d-10 [-1, 128, 56, 56] 0
Conv2d-11 [-1, 256, 56, 56] 295,168
ReLU-12 [-1, 256, 56, 56] 0
Conv2d-13 [-1, 256, 56, 56] 590,080
ReLU-14 [-1, 256, 56, 56] 0
Conv2d-15 [-1, 256, 56, 56] 590,080
ReLU-16 [-1, 256, 56, 56] 0
MaxPool2d-17 [-1, 256, 28, 28] 0
Conv2d-18 [-1, 512, 28, 28] 1,180,160
ReLU-19 [-1, 512, 28, 28] 0
Conv2d-20 [-1, 512, 28, 28] 2,359,808
ReLU-21 [-1, 512, 28, 28] 0
Conv2d-22 [-1, 512, 28, 28] 2,359,808
ReLU-23 [-1, 512, 28, 28] 0
MaxPool2d-24 [-1, 512, 14, 14] 0
Conv2d-25 [-1, 512, 14, 14] 2,359,808
ReLU-26 [-1, 512, 14, 14] 0
Conv2d-27 [-1, 512, 14, 14] 2,359,808
ReLU-28 [-1, 512, 14, 14] 0
Conv2d-29 [-1, 512, 14, 14] 2,359,808
ReLU-30 [-1, 512, 14, 14] 0
MaxPool2d-31 [-1, 512, 7, 7] 0
Linear-32 [-1, 4096] 102,764,544
ReLU-33 [-1, 4096] 0
Dropout-34 [-1, 4096] 0
Linear-35 [-1, 4096] 16,781,312
ReLU-36 [-1, 4096] 0
Dropout-37 [-1, 4096] 0
Linear-38 [-1, 1000] 4,097,000
Softmax-39 [-1, 1000] 0
================================================================
Total params: 138,357,544
Convolutional Neural Networks
Optional
Popular architectures
GoogLeNet (Inception v1)
Szegedy et al., 2014
ConvNets have been getting deeper and deeper
e.g. ResNet-152 (He et al., 2015) → 152 layers
Example application: transfer learning
Train a network for a task
Example: image classification
Requires a large amount of training data and substantial training resources
Modify the trained network for a different task (transfer learning)
Why? Can address limited data and limited time/resources for training
Case study 6.5 from the Machine Learning for Engineers book: “Finding volcanos on
Venus with pre-fit models”
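A minimal transfer-learning sketch with torchvision, assuming torchvision >= 0.13 for the weights API (the 2-class output is a hypothetical choice for a volcano / no-volcano task):

import torch.nn as nn
from torchvision import models

# Start from a network pre-trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pre-trained weights so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer for the new task (here: 2 classes).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only model.fc.parameters() are then passed to the optimizer and trained
# on the (small) dataset for the new task.

This is why transfer learning can work with limited data: most parameters stay fixed and only the small new output layer is fit.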
Summary - exercises