Deep Learning Tutorial
Brains, Minds, and Machines Summer Course 2018
TA: Eugenio Piasini & Yen-Ling Kuo
Roadmap
● Supervised Learning with Neural Nets
● Convolutional Neural Networks for Object Recognition
● Recurrent Neural Networks
● Other Deep Learning Models
Supervised Learning
with Neural Nets
General references:
Hertz, Krogh, Palmer 1991
Goodfellow, Bengio, Courville 2016
Supervised learning
Given example input-output pairs (X,Y),
learn to predict output Y from input X
Logistic regression, support vector machines, decision trees, neural networks...
Binary classification:
simple perceptron
(McCulloch & Pitts 1943)
Perceptron learning rule
(Rosenblatt 1962)
The output is y = g(w·x + b), where g is a nonlinear activation function, in this case a step (threshold) function.
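As an illustrative sketch only (NumPy; the AND-function example, learning rate, and variable names are our own, not from the slides), the perceptron learning rule updates the weights only on misclassified examples:

import numpy as np

def step(z):
    # Heaviside step activation: g(z) = 1 if z >= 0 else 0
    return (z >= 0).astype(float)

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Rosenblatt's perceptron learning rule on inputs X (n_samples, n_features)
    and binary targets y in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = step(w @ xi + b)
            # weights move only when the prediction is wrong
            w += lr * (yi - y_hat) * xi
            b += lr * (yi - y_hat)
    return w, b

# Example: learn the (linearly separable) AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
w, b = train_perceptron(X, y)
print(step(X @ w + b))   # -> [0. 0. 0. 1.]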
Linear separability
Simple perceptrons can only learn to solve linearly separable problems (Minsky and Papert 1969).
We can solve more complex problems by composing many units in multiple layers.
Multilayer perceptron (MLP)
(“forward propagation”)
MLPs are universal function approximators (Cybenko 1989; Hornik 1989).
(under some assumptions… exercise: show that if g is linear, this architecture reduces to a simple perceptron)
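A minimal NumPy sketch of forward propagation through a one-hidden-layer MLP (the layer sizes, tanh hidden nonlinearity, and sigmoid output are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2, g=np.tanh):
    """Forward propagation: input x -> hidden activation h -> output y."""
    h = g(W1 @ x + b1)          # hidden layer: nonlinearity g applied elementwise
    y = sigmoid(W2 @ h + b2)    # output layer (here a single sigmoid unit)
    return y, h

# Example with random weights: 2 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
y, h = mlp_forward(np.array([0.5, -1.0]), W1, b1, W2, b2)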
Deep vs shallow
Universality: “shallow” MLPs with one hidden layer can represent any continuous function to arbitrary
precision, given a large enough number of units. But:
● No guarantee that the number of required units is reasonably small (expressivity).
● No guarantee that the desired MLP can actually be found with our chosen learning method
(learnability).
Two motivations for using deep nets instead (see Goodfellow et al 2016, section 6.4.1):
● Statistical: deep nets are compositional, and naturally well suited to representing hierarchical
structures where simpler patterns are composed and reused to form more complex ones
recursively. It can be argued that many interesting structures in real world data are like this.
● Computational: under certain conditions, it can be proved that deep architectures are more
expressive than shallow ones, i.e. they can learn more patterns for a given total size of the network.
Backpropagation
(Rumelhart, Hinton, Williams 1986)
Problem: compute all the derivatives of the loss with respect to the weights.
Key insights: the loss depends
● on the weights w of a unit only through that unit's activation h
● on a unit's activation h only through the activations of those units that are downstream from h.
The "errors" being backpropagated give the gradient of the loss with respect to the weights, which you can then use with your favorite gradient descent method.
Backpropagation - example
(exercise: derive gradient wrt bias terms b)
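A NumPy sketch of backpropagation for a one-hidden-layer MLP with squared-error loss, including the bias gradients asked for in the exercise (the tanh hidden layer and linear output are our own illustrative choices):

import numpy as np

def forward_backward(x, t, W1, b1, W2, b2):
    """One forward/backward pass for a tanh hidden layer, linear output,
    and loss L = 0.5 * ||y - t||^2."""
    # forward propagation
    h = np.tanh(W1 @ x + b1)            # hidden activations
    y = W2 @ h + b2                     # linear output
    loss = 0.5 * np.sum((y - t) ** 2)

    # backward pass: errors flow from the output back through h
    delta_y = y - t                               # dL/dy
    dW2 = np.outer(delta_y, h)                    # dL/dW2
    db2 = delta_y                                 # dL/db2  (exercise)
    delta_h = (W2.T @ delta_y) * (1 - h ** 2)     # error at hidden layer; tanh' = 1 - h^2
    dW1 = np.outer(delta_h, x)                    # dL/dW1
    db1 = delta_h                                 # dL/db1  (exercise)
    return loss, (dW1, db1, dW2, db2)

# one gradient-descent step with learning rate eta
eta = 0.1
x, t = np.array([0.5, -1.0]), np.array([1.0])
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
loss, (dW1, db1, dW2, db2) = forward_backward(x, t, W1, b1, W2, b2)
W1 -= eta * dW1; b1 -= eta * db1; W2 -= eta * dW2; b2 -= eta * db2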
“The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence […] Dr. Frank Rosenblatt, a research psychologist at the Cornell Aeronautical Laboratory, Buffalo, said Perceptrons might be fired to the planets as mechanical space explorers.”
The New York Times, July 8th, 1958

“The perceptron has shown itself worthy of study despite (and even because of!) its severe limitations. It has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension to multilayer systems is sterile.”
Minsky and Papert 1969 (section 13.2)
Convolutional Neural Networks
for Object Recognition
General (excellent!) reference:
“Convolutional Networks for Visual Recognition”, Stanford University
http://cs231n.stanford.edu/
Traditional Object Detection/Recognition Idea
● Match low-level vision features (e.g. edges, HOG, SIFT, etc.)
● Parts-based models
(Lowe 2004)
Learning the features - inspiration from neuroscience
Hubel and Wiesel:
● Topographic organization of
connections
● Hierarchical organization of
simple/complex cells
(Hubel and Wiesel 1962)
(Fukushima 1980)
“Canonical” CNN structure
INPUT -> [[CONV -> RELU]*K -> POOL?]*L -> [FC -> RELU]*M -> FC
Credit: cs231n.github.io
Four basic operations:
1. Convolution
2. Nonlinearity (ReLU)
3. Pooling
4. Fully connected layers
(LeCun et al 1998)
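Purely as an illustration, the same template written out in PyTorch with K=2, L=2, M=1 (the framework choice and all layer sizes are assumptions, not from the slides):

import torch
import torch.nn as nn

# INPUT -> [[CONV -> RELU]*K -> POOL?]*L -> [FC -> RELU]*M -> FC, with K=2, L=2, M=1
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

logits = model(torch.randn(1, 3, 32, 32))   # a 32x32 RGB image -> 10 class scores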
2D Convolution
Example: blurring an image
Replacing each pixel with an
average of its neighbors
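A quick sketch of this blurring example with SciPy (the 3×3 averaging kernel and toy image are our own):

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(8, 8)               # toy grayscale image
kernel = np.ones((3, 3)) / 9.0             # 3x3 box filter: average of the neighborhood
blurred = convolve2d(image, kernel, mode='valid')   # each output pixel = local average
print(image.shape, blurred.shape)          # (8, 8) (6, 6)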
2D Convolution
If N = input size, K = filter size, S = stride (the stride is the size of the step you take on the input every time you move by one on the output):
Output size = (N - K)/S + 1
Example: N = 32, K = 5, S = 1 → (N - K)/S + 1 = 28
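The sizing rule as a tiny helper function (assuming valid convolution with no zero-padding, as on this slide):

def conv_output_size(n, k, s=1):
    # N = input size, K = filter size, S = stride (no zero-padding)
    return (n - k) // s + 1

print(conv_output_size(32, 5, 1))   # -> 28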
More on convolution sizing
Example: convolving a 32×32×3 input with a 5×5×3 filter (stride 1) gives a 28×28×1 feature map; several filters give a stack of feature maps.
Input depth = # of channels in the previous layer (often 3 for the input layer (RGB); can be arbitrary for deeper layers).
Output depth = # of filters (feature maps).
Convolve with Different Filters
Convolution (with learned filters)
● Dependencies are local
● Each filter has few parameters to learn
○ The same parameters are shared across different locations of the input
● Multiple filters produce multiple feature maps
Fully Connected vs. Locally Connected
Credit: Ranzato’s CVPR 2014 tutorial
Non-linearity
● Rectified linear function (ReLU)
○ Applied per-pixel: output = max(0, input)
(input feature map → output feature map)
Pooling
● Reduce size of representation in following layers
● Introduce some invariance to small translation
Image credit: http://cs231n.github.io/convolutional-networks/
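A NumPy sketch of max pooling (the 2×2 window and stride 2 are illustrative assumptions):

import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling on a 2D feature map x."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()   # keep only the strongest response in each window
    return out

fmap = np.arange(16).reshape(4, 4)
print(max_pool(fmap))   # [[ 5.  7.] [13. 15.]]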
Key evolutionary steps
● Neocognitron - Fukushima 1980
○ Inspired by Hubel and Wiesel; “convolutional” structure, alternating “pooling” layers
● LeNet - LeCun et al 1998
○ Learning: backpropagation, gradient descent
● AlexNet - Krizhevsky et al 2012
○ Larger, deeper network (~10^7 params), much more data (ImageNet, ~10^6 images), more compute (incl. GPUs), better regularization (Dropout)
Applications: image classification, image retrieval (e.g. rank images by their Euclidean distance to the test image).
But also object detection, image segmentation, captioning...
Recurrent Neural Networks
Handling Sequential Information
● Natural language processing: sentences, translations
● Speech / Audio: signal processing, speech recognition
● Video: action recognition, captioning
● Sequential decision making / Planning
● Time-series data
● Biology / Chemistry: protein sequences, molecule structures
● ...
Dynamic System / Hidden Markov Model
● Classical form of a dynamic system: s_t = f(s_{t-1}; θ)
● With an external signal x: s_t = f(s_{t-1}, x_t; θ)
● Hidden Markov Model: the same recurrent structure on hidden states, with probabilistic transitions and emissions
Recurrent Network / RNN
● A general form to process a sequence: apply a recurrence formula at each time step,
h_t = f_W(h_{t-1}, x_t)
(new state = a function f with parameters W of the old state and the input at time t).
● The state consists of a vector h; it summarizes the input up to time t, and the prediction y is read out from it.
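A minimal NumPy sketch of this recurrence for a vanilla RNN (the tanh nonlinearity and the weight names W_xh, W_hh, W_hy are conventional choices of ours, not from the slides):

import numpy as np

def rnn_step(h_prev, x_t, W_xh, W_hh, W_hy, b_h, b_y):
    """One application of the recurrence: new state from old state and current input."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # h_t = f_W(h_{t-1}, x_t)
    y_t = W_hy @ h_t + b_y                            # prediction from the current state
    return h_t, y_t

# process a whole sequence by carrying h forward
rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 8, 3
W_xh = rng.normal(size=(d_h, d_in))
W_hh = rng.normal(size=(d_h, d_h))
W_hy = rng.normal(size=(d_out, d_h))
b_h, b_y = np.zeros(d_h), np.zeros(d_out)
h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):    # a sequence of 5 input vectors
    h, y = rnn_step(h, x_t, W_xh, W_hh, W_hy, b_h, b_y)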
Processing a Sequence: Unrolling in Time
[Figure: the RNN is unrolled over the input sequence "I like this course"; the same RNN cell is applied at every time step and a prediction is made from each hidden state.]
Training: Backpropagation Through Time
[Figure: the unrolled RNN over "I like this course"; each step's prediction is compared with its target POS tag (PRP, VBP, DT, NN) to give a per-step loss; the total loss is the sum of the per-step losses.]
Parameter Sharing Across Time
● The parameters are shared across time steps, and their derivatives are accumulated over the unrolled network.
● This makes it possible to generalize to sequences of different lengths.
Vanishing Gradient
[Figure: x → f → f → f → Loss, the same function f applied at every step]
● Backpropagating through many time steps multiplies the same factor over and over, so the gradient grows or shrinks quickly:
○ if |·| > 1, the gradient explodes → clip the gradients
○ if |·| < 1, the gradient vanishes → introduce memory via LSTMs, GRUs
● This makes it hard to learn long-term dependencies.
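A sketch of the gradient-clipping remedy mentioned above, rescaling by the global norm (the threshold value is arbitrary):

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their combined norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# e.g. clipped = clip_by_global_norm([dW_xh, dW_hh, dW_hy])  # hypothetical gradient arrays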
Long Short Term Memory (LSTM)
● Introducing gates to
optionally let information
flow through.
○ An LSTM cell has three
gates to protect and
control the cell state.
○ Forget the irrelevant part of the previous state; selectively update the cell state values; output certain parts of the cell state.
Image credit: http://harinisuresh.com/2016/10/09/lstms/
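A NumPy sketch of a single LSTM step with the three gates (the packed weight layout and sizes are our own conventions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*d_h, d_in + d_h); b has shape (4*d_h,)."""
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0*d_h:1*d_h])        # forget gate: what to drop from the cell state
    i = sigmoid(z[1*d_h:2*d_h])        # input gate: which new values to write
    o = sigmoid(z[2*d_h:3*d_h])        # output gate: what to expose as h_t
    g = np.tanh(z[3*d_h:4*d_h])        # candidate values
    c_t = f * c_prev + i * g           # update the cell state
    h_t = o * np.tanh(c_t)             # output selected parts of the cell state
    return h_t, c_t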
Flexibility of RNNs
Image captioning, sentiment classification, machine translation, POS tagging, ...
Image credit: Andrej Karpathy
Other Deep Learning Models
Auto-encoder
● Learning representations
○ A good representation should keep the information well
○ → objective: minimize the reconstruction error
● Encoder: original input → learned representation; Decoder: learned representation → reconstructed image
[LeCun, 1987]
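A minimal PyTorch sketch of this objective (the fully connected encoder/decoder and their sizes are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))

x = torch.rand(32, 784)        # a batch of flattened 28x28 images
z = encoder(x)                 # learned representation
x_hat = decoder(z)             # reconstructed image
loss = F.mse_loss(x_hat, x)    # minimize reconstruction error
loss.backward()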
Generative Models
● What are the learned representations?
○ One view: latent variables (e.g. color, shape, position, ...) that generate the observed data
● Goal of learning a generative model: to recover p(x) from data
● Desirable properties: sampling new data, evaluating the likelihood of data, extracting latent features
● Problem: directly computing p(x) = ∫ p(x|z) p(z) dz is intractable!
Adapted from the IJCAI 2018 deep generative model tutorial
Variational Autoencoder (VAE)
● Decoder: generate x from a latent z via p(z) and p(x|z)
● Idea: approximate the intractable posterior p(z|x) with a simpler, tractable q(z|x) (the encoder)
● Learning objective: reconstruction error + a term measuring how close q(z|x) is to p(z)
[Kingma et al., 2013]
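A sketch of the VAE objective in PyTorch under the usual modeling assumptions (Gaussian q(z|x), standard normal prior, Bernoulli decoder); the network sizes are our own:

import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(784, 2 * 16)            # outputs mean and log-variance of q(z|x)
dec = nn.Linear(16, 784)                # maps z back to pixel logits

x = torch.rand(32, 784)
mu, logvar = enc(x).chunk(2, dim=1)
z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterization trick
x_hat = dec(z)

recon = F.binary_cross_entropy_with_logits(x_hat, x, reduction='sum')   # reconstruction error
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())            # how close q(z|x) is to p(z)
loss = recon + kl                        # negative ELBO
loss.backward()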
Generative Adversarial Network (GAN)
● An implicit generative model, formulated as a minimax game.
○ The discriminator is trying to distinguish real and fake samples.
○ The generator is trying to generate fake samples to fool the discriminator.
[Goodfellow et al., 2014]
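A sketch of one discriminator update and one generator update of this minimax game in PyTorch, using the common non-saturating generator loss (the tiny MLPs and hyperparameters are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))    # generator
D = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 1))     # discriminator (outputs a logit)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.rand(32, 784)               # stand-in for a batch of real data
fake = G(torch.randn(32, 16))            # fake samples from random noise

# discriminator step: distinguish real (label 1) from fake (label 0)
d_loss = F.binary_cross_entropy_with_logits(D(real), torch.ones(32, 1)) + \
         F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# generator step: fool the discriminator into labeling fakes as real
g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()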
Thanks & Questions?
● Link to the slides: https://goo.gl/pUXdc1
● Hands-on session on Monday!
Yen-Ling Kuo ([email protected])