Basic Design Approaches To Accelerating Deep Neural Networks
Rangharajan Venkatesan
NVIDIA Corporation
Contact Info
email: [email protected]
Research Interests
Machine Learning Accelerators
High-Level Synthesis
Low Power VLSI design
SoC Design methodologies
Focus on inference; most of the techniques are generic and apply to training as well.
Topics covered:
Key metrics
Design considerations
Hardware optimizations
Hardware/software co-design techniques
[Figures: efficient compute and large amounts of data as the drivers of deep learning. Bianco et al., IEEE Access, 2018. Ack: Bill Dally, GTC China, 2020; Anand Raghunathan, Purdue University. Ref: "Showdown", The Economist, 19 Nov. 2016.]
Different platforms span a wide range of performance:
Programmable processors
Reconfigurable FPGAs: leverage the reconfigurability of the FPGA to accelerate a specific neural network
Fixed-function accelerators
Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/
[Figure: deep learning is a subset of machine learning, which in turn is a subset of artificial intelligence.]
Input layer
Hidden layers
Activation layer: ReLU, tanh, sigmoid, Leaky ReLU, Clipped ReLU, Swish
Pooling layer: max pooling, average pooling, unpooling
Fully-connected layer
Convolution: element-wise multiplication and partial-sum (psum) accumulation
for e=[0:E)
  for f=[0:F)
    for r=[0:R)
      for s=[0:S)
        Out[e][f] += Weight[r][s] * Input[e+r][f+s]
Sliding-window processing: the same loop nest slides the R x S filter window across the W x H input fmap to produce the F x E output fmap.
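A minimal NumPy sketch of this 2D loop nest (assuming stride 1 and no padding; the function name conv2d is illustrative):

import numpy as np

def conv2d(inp, weight):
    # Out[e][f] += Weight[r][s] * Input[e+r][f+s]
    H, W = inp.shape             # input fmap: height x width
    R, S = weight.shape          # filter: height x width
    E, F = H - R + 1, W - S + 1  # output fmap size (stride 1, no padding)
    out = np.zeros((E, F))
    for e in range(E):
        for f in range(F):
            for r in range(R):
                for s in range(S):
                    out[e, f] += weight[r, s] * inp[e + r, f + s]
    return out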
Many input channels (C): accumulate across all input channels into one output fmap.
for e=[0:E)
  for f=[0:F)
    for c=[0:C)
      for r=[0:R)
        for s=[0:S)
          Out[e][f] += Weight[r][s][c] * Input[e+r][f+s][c]
Many output channels (M): one filter set per output channel.
for m=[0:M)
  for e=[0:E)
    for f=[0:F)
      for c=[0:C)
        for r=[0:R)
          for s=[0:S)
            Out[e][f][m] += Weight[r][s][c][m] * Input[e+r][f+s][c]
Many input fmaps and output fmaps (batch size N):
for n=[0:N)
  for m=[0:M)
    for e=[0:E)
      for f=[0:F)
        for c=[0:C)
          for r=[0:R)
            for s=[0:S)
              Out[e][f][m][n] += Weight[r][s][c][m] * Input[e+r][f+s][c][n]
Sze et al., Synthesis Lectures on Computer Architecture, 2020
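A NumPy reference for this full seven-loop nest (a sketch assuming stride 1 and no padding; the function name and the tensor layouts, which follow the slide's index order, are otherwise illustrative):

import numpy as np

def conv_layer(inp, weight):
    # Out[e][f][m][n] += Weight[r][s][c][m] * Input[e+r][f+s][c][n]
    H, W, C, N = inp.shape       # input fmaps: height x width x channels x batch
    R, S, _, M = weight.shape    # filters: height x width x channels x output channels
    E, F = H - R + 1, W - S + 1  # output fmap size (stride 1, no padding)
    out = np.zeros((E, F, M, N))
    for n in range(N):
        for m in range(M):
            for e in range(E):
                for f in range(F):
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                out[e, f, m, n] += (weight[r, s, c, m]
                                                    * inp[e + r, f + s, c, n])
    return out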
Activation Layer
Introduces non-linearity into the network
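For reference, NumPy sketches of several of these activations (the 0.01 leaky slope and the clipping cap of 6 are common defaults assumed here, not values from the talk):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):   # assumed slope
    return np.where(x > 0, x, slope * x)

def clipped_relu(x, cap=6.0):    # assumed cap
    return np.minimum(np.maximum(0.0, x), cap)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)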
Max pooling example: 2x2 max pooling with stride = 2
Input fmap:
1 2 2 3
4 6 5 8
3 1 4 4
2 1 3 3
Output fmap (e.g., the top-left output is Max(1, 2, 4, 6) = 6):
6 8
3 4
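A small NumPy sketch that reproduces this example (the function name max_pool is illustrative):

import numpy as np

def max_pool(x, k=2, stride=2):
    # One output per k x k window, stepped by the stride
    E = (x.shape[0] - k) // stride + 1
    F = (x.shape[1] - k) // stride + 1
    out = np.zeros((E, F))
    for e in range(E):
        for f in range(F):
            out[e, f] = x[e*stride:e*stride+k, f*stride:f*stride+k].max()
    return out

x = np.array([[1, 2, 2, 3],
              [4, 6, 5, 8],
              [3, 1, 4, 4],
              [2, 1, 3, 3]])
print(max_pool(x))  # [[6. 8.]
                    #  [3. 4.]]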
Fully-Connected Layer
for m=[0:M)
  for c=[0:C)
    Out[m] += Weight[m][c] * Input[c]
Matrix-vector multiplication: the M x C weight matrix times the C-element input vector yields the M-element output.
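The same loop nest in NumPy; the whole layer reduces to one matrix-vector product:

import numpy as np

def fully_connected(weight, inp):
    # Out[m] += Weight[m][c] * Input[c]
    M, C = weight.shape
    out = np.zeros(M)
    for m in range(M):
        for c in range(C):
            out[m] += weight[m, c] * inp[c]
    return out  # equivalent to weight @ inp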
Neural networks: ResNet, MobileNet, BERT
DL frameworks: PyTorch, TensorFlow, Caffe
Compilers: TVM, TimeLoop, ZigZag
Co-design across different levels for efficient hardware
Energy efficiency: energy/inference, TOPS/W
Area efficiency: inferences/sec/mm^2, TOPS/mm^2
Flexibility: support for different types of neural networks and layers
System: array of PEs, global buffer, controller, DRAM
Datapath granularity spans scalar units through matrix-vector (MVA) and matrix-matrix (MMA) units; efficiency grows with granularity:
Scalar datapaths: no spatial reuse, high control overheads
Matrix datapaths: high spatial reuse, low control overheads, but high effort to achieve good utilization for some layer types
Memory hierarchy: scratchpads in the PEs, a shared global buffer, and DRAM (large capacity, high latency, high energy)
Examples: Venkatesan et al., HotChips 2019; Zimmer et al., JSSC 2020; Google TPU; NVDLA
Dataflows: Temporal Data Reuse
Output-Stationary (OS) Dataflow
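A loop-order sketch of the output-stationary idea for the 2D convolution above (a hypothetical single-PE view, not the exact hardware evaluated in the talk): each output's partial sum stays in a local accumulator until it is complete.

import numpy as np

def conv2d_output_stationary(inp, weight):
    H, W = inp.shape
    R, S = weight.shape
    E, F = H - R + 1, W - S + 1
    out = np.zeros((E, F))
    for e in range(E):
        for f in range(F):
            psum = 0.0                 # held in a PE register
            for r in range(R):         # temporal reuse of the accumulator
                for s in range(S):
                    psum += weight[r, s] * inp[e + r, f + s]
            out[e, f] = psum           # written to memory exactly once
    return out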
Evaluation setup: ResNet-50 on ImageNet, 16ff technology
[Figure: a datapath fed by two buffers backed by lower-level memory.]
Buffet achieves a 2.3X reduction in energy-delay product (EDP) and 2.1X area efficiency gains over DMA with double buffering.
[Figure: a row of PEs connected through an interconnect to memory, exchanging weights, input activations, partial sums, and output activations.]
Pattern-1: unicast weights, multicast input activations, unicast output activations
Pattern-2: multicast weights, unicast input activations, unicast output activations
Pattern-3: unicast weights, unicast input activations, unicast partial sums, unicast output activations
Pattern-4: unicast weights, unicast input activations, unicast partial sums, unicast output activations, with PEs operating on input activations from different input channels (C)
Interconnect hierarchy: network-on-chip within a die and network-on-package across dies (Venkatesan et al., HotChips 2019; Zimmer et al., JSSC 2020)
Opportunities:
Data reuse
Parallelism
Pipelining
Example: an architecture implementing a weight-stationary dataflow (see the sketch after this list)
Tile the weights and distribute them to different PEs (PE1, PE2, PE3, ...)
Compute different output activations by streaming in the input activations
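A minimal sketch of this example for a fully-connected layer (the function name, PE count, and row-wise tiling are assumptions for illustration):

import numpy as np

def weight_stationary_fc(weight, inp, num_pes=4):
    # Rows of the weight matrix are tiled across PEs and stay resident
    # while the input activations are streamed past every PE.
    M, C = weight.shape
    tiles = np.array_split(np.arange(M), num_pes)  # one row tile per PE
    out = np.zeros(M)
    for rows in tiles:                  # in hardware, PEs run in parallel
        w_tile = weight[rows, :]        # loaded once, stationary in the PE
        for c in range(C):              # inputs stream in one element at a time
            out[rows] += w_tile[:, c] * inp[c]
    return out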
[Figure: a 4x8 array of PEs.]
There are a large number of possible tilings for a given layer and hardware configuration, with >10x difference in performance and energy between them. The tiling space must be explored to achieve the best energy and performance; a toy search is sketched below.
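As a toy illustration of the search, a brute-force sweep over tile sizes for a fully-connected layer under a made-up cost model (the buffer capacity and traffic terms are assumptions, not numbers from the talk):

def best_tile(M=256, C=1024, buf_words=4096):
    # Toy DRAM-traffic model: weights fetched once, inputs re-read once
    # per output tile, partial sums spilled/refilled per extra input tile.
    best = None
    for tm in range(1, M + 1):
        for tc in range(1, C + 1):
            if tm * tc > buf_words:              # weight tile must fit on-chip
                continue
            m_tiles = -(-M // tm)                # ceil(M / tm)
            c_tiles = -(-C // tc)                # ceil(C / tc)
            traffic = (M * C                     # each weight fetched once
                       + m_tiles * C             # inputs re-read per m-tile
                       + 2 * M * (c_tiles - 1))  # psum spill + refill
            if best is None or traffic < best[0]:
                best = (traffic, tm, tc)
    return best

print(best_tile())  # least-traffic (traffic, tile_m, tile_c) under this model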
Quantization trades a small accuracy loss for a large efficiency gain (lower hardware cost).
Post-training quantization: Pre-trained Model → Quantization → Quantized Model
Quantization with re-training: Pre-trained Model → Quantization → Re-Training → Quantized Model
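A minimal post-training quantization sketch (symmetric per-tensor int8 is one common scheme, assumed here for illustration):

import numpy as np

def quantize_int8(w):
    # Map the largest |w| to 127 and round everything onto the int8 grid
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
print("max abs quantization error:", np.abs(w - dequantize(q, s)).max())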
[Figure: matrix-multiply illustration. Venkatesan et al., ICCAD 2019]
Summary
Deep neural networks are increasingly used across a wide range of applications
Large amounts of data
High computation demand