Basic Design Approaches To Accelerating Deep Neural Networks

This document provides an overview of basic design approaches for accelerating deep neural networks. It begins with the motivation for hardware acceleration of neural networks, driven by increasing model complexity and performance/energy challenges. It then covers the basics of deep neural networks, including common layer types such as convolution, activation, and pooling. The remainder outlines key aspects of neural network accelerator design: compute units, memory hierarchy, interconnects, compiler mapping strategies, and neural network optimizations such as quantization and sparsity.


Basic Design Approaches to Accelerating Deep Neural Networks

Rangharajan Venkatesan
NVIDIA Corporation

Live Q&A Session: Feb. 13, 2021, 7:00-7:20am, PST

Contact Info
email: [email protected]

Rangharajan Venkatesan ISSCC 2021 1 of 93


Self-Introduction
 B.Tech in Electronics and Communication
Engineering from IIT Roorkee in 2009
 Ph.D. in Electrical and Computer Engineering
from Purdue University in 2014
 Joined NVIDIA
 Research Scientist: 2014-2016
 Senior Research Scientist: 2016-Present

 Research Interests
 Machine Learning Accelerators
 High-Level Synthesis
 Low Power VLSI design
 SoC Design methodologies

Rangharajan Venkatesan ISSCC 2021 2 of 93


About this Tutorial
 Overview of design approaches to improve hardware efficiency

 Focus on inference
 Most of the techniques are generic and applicable to training as well

 Does cover ..
 Key metrics
 Design considerations
 Hardware optimizations
 Hardware/software co-design techniques

 Does not cover ..


 Algorithms and model design
 Detailed software stack
Rangharajan Venkatesan ISSCC 2021 3 of 93
Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 4 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 5 of 93


Vast World of DNN Applications

DATA CENTER EMBEDDED COMPUTERS EMBEDDED DEVICES

Rangharajan Venkatesan ISSCC 2021 6 of 93


Hardware-enabled AI Revolution

Efficient Compute

Data

Rangharajan Venkatesan ISSCC 2021 7 of 93


Growth in Application Complexity

Bianco et al., IEEE Access, 2018. Ack: Bill Dally, GTC China, 2020.

Rangharajan Venkatesan ISSCC 2021 8 of 93


Hardware Performance Challenges

End of Moore’s Law (John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e, 2018)

Increasing cost (S. Naffziger et al., ISSCC 2020)
Rangharajan Venkatesan ISSCC 2021 9 of 93
Energy Efficiency Challenges
Super-human performance at low energy efficiency
1920 CPUs, 280 GPUs, Power: ~300,000 W

Google AlphaGo vs. Lee Sedol

Ack: Anand Raghunathan, Purdue University Ref: "Showdown". The Economist, 19 Nov. 2016

Rangharajan Venkatesan ISSCC 2021 10 of 93


Solution: Hardware Specialization

Reconfigurable FPGAs
Leverage reconfigurability of FPGAs to accelerate a specific neural network

Accelerators
Programmable Processors: programmable hardware with support for scalar and vector math functions
Fixed Function Accelerators: customized hardware accelerators to support a class of neural networks

Rangharajan Venkatesan ISSCC 2021 11 of 93


Many DNN Accelerators Exist!!

Different platforms
Wide range of performance
Different power targets

Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/

Rangharajan Venkatesan ISSCC 2021 12 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 13 of 93


Deep Learning (aka DNN)

[Figure: nested scope — Deep Learning is a subset of Machine Learning, which is a subset of Artificial Intelligence]

Rangharajan Venkatesan ISSCC 2021 14 of 93


DNN: A Simple Example
 3 
Y j = activation  Wij × Xi 
W11  i=1 
Y1
X1
Y2
X2
Y3
X3
W34 Y4 Output Layer

Input Layer
Hidden Layer

Sze et al., Synthesis Lectures on Computer Architecture, 2020
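A minimal NumPy sketch of the layer equation above, assuming ReLU as the activation function and random weights (both are illustrative choices, not specified on the slide):

import numpy as np

def fc_layer(X, W):
    # Y[j] = activation( sum_i W[i][j] * X[i] ), with ReLU as the activation
    return np.maximum(0.0, W.T @ X)

X = np.array([0.5, -1.0, 2.0])   # input activations X1..X3
W = np.random.randn(3, 4)        # weights W[i][j], i.e. W11..W34
Y = fc_layer(X, W)               # output activations Y1..Y4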


Rangharajan Venkatesan ISSCC 2021 15 of 93
Practical DNNs

Input Layer 1 Layer 2 Layer N Output

 10s to 100s of layers


 Different types of layers
 Convolution, Fully-connected, Depthwise-Separable, Attention etc.
 Process several Gigabytes of data
 Perform billions to trillions of operations
 Wide range of applications
 Image processing, language processing, etc.

Rangharajan Venkatesan ISSCC 2021 16 of 93


Examples of DNN Applications
Image Classification Object Detection Recommendation Systems

Dog Cat Sheep

Text Summarization Automatic Speech Recognition Image Captioning

Rangharajan Venkatesan ISSCC 2021 17 of 93


Several Layer Types and Kernels
 Convolution layer
 Many different variants also
 Strided convolution, Dilated convolution, Groupwise convolution, Depthwise convolution,
Pointwise convolution, Depthwise-Separable convolution

 Activation layer
 ReLU, tanh, sigmoid, Leaky ReLU, Clipped ReLU, Swish

 Pooling layer
 Max. pooling, Average pooling, Unpooling

 Fully-connected layer

 And many more .. Attention, Deconvolution, etc.


Rangharajan Venkatesan ISSCC 2021 18 of 93
Convolution Layer
[Figure: a filter (weights, R×S) applied to the input feature map (fmap, H×W) produces an output activation of the output feature map (E×F); element-wise multiplication followed by partial sum (psum) accumulation]

for e=[0:E)
  for f=[0:F)
    for r=[0:R)
      for s=[0:S)
        Out[e][f] += Weight[r][s] * Input[e+r][f+s]

Sze et al., Synthesis Lectures on Computer Architecture, 2020


Rangharajan Venkatesan ISSCC 2021 19 of 93
Convolution Layer

[Figure: sliding-window processing — the R×S filter slides over the H×W input fmap to produce each output activation of the E×F output fmap]

for e=[0:E)
  for f=[0:F)
    for r=[0:R)
      for s=[0:S)
        Out[e][f] += Weight[r][s] * Input[e+r][f+s]

Sze et al., Synthesis Lectures on Computer Architecture, 2020


Rangharajan Venkatesan ISSCC 2021 20 of 93
Convolution Layer

[Figure: many input channels (C) — the filter (R×S×C) and the input fmap (H×W×C) are reduced across channels to produce the E×F output fmap]

for e=[0:E)
  for f=[0:F)
    for c=[0:C)
      for r=[0:R)
        for s=[0:S)
          Out[e][f] += Weight[r][s][c] * Input[e+r][f+s][c]

Sze et al., Synthesis Lectures on Computer Architecture, 2020


Rangharajan Venkatesan ISSCC 2021 21 of 93
Convolution Layer

[Figure: many output channels (M) — M filters (each R×S×C) are applied to the input fmap (H×W×C) to produce an output fmap with M channels (E×F×M)]

for m=[0:M)
  for e=[0:E)
    for f=[0:F)
      for c=[0:C)
        for r=[0:R)
          for s=[0:S)
            Out[e][f][m] += Weight[r][s][c][m] * Input[e+r][f+s][c]

Sze et al., Synthesis Lectures on Computer Architecture, 2020


Rangharajan Venkatesan ISSCC 2021 22 of 93
Convolution Layer

[Figure: many input fmaps (N) — a batch of N input fmaps (each H×W×C) is convolved with M filters (each R×S×C) to produce N output fmaps (each E×F×M)]

for n=[0:N)
  for m=[0:M)
    for e=[0:E)
      for f=[0:F)
        for c=[0:C)
          for r=[0:R)
            for s=[0:S)
              Out[e][f][m][n] += Weight[r][s][c][m] * Input[e+r][f+s][c][n]
Sze et al., Synthesis Lectures on Computer Architecture, 2020
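A direct (unoptimized) NumPy reference for the seven-deep loop nest above; unit stride, no padding, and the tensor layouts below are assumptions made for this sketch:

import numpy as np

def conv_layer(inp, wgt):
    # inp: [H][W][C][N], wgt: [R][S][C][M] -> out: [E][F][M][N]
    H, W, C, N = inp.shape
    R, S, _, M = wgt.shape
    E, F = H - R + 1, W - S + 1
    out = np.zeros((E, F, M, N))
    for n in range(N):
        for m in range(M):
            for e in range(E):
                for f in range(F):
                    # reduce over the R*S*C receptive field
                    out[e, f, m, n] = np.sum(wgt[:, :, :, m] * inp[e:e+R, f:f+S, :, n])
    return out

out = conv_layer(np.random.randn(8, 8, 3, 2), np.random.randn(3, 3, 3, 4))
print(out.shape)   # (6, 6, 4, 2)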
Rangharajan Venkatesan ISSCC 2021 23 of 93
Activation Layer
 Introduce non-linearity in the network

Common Activation Layers in DNNs

Rangharajan Venkatesan ISSCC 2021 24 of 93


Pooling Layer
 Down-sampling operation to provide invariance against minor changes in the
image such as shifting, rotation, etc.
 Can be different types
 Max. pooling, Average pooling

Max. Pooling Example — 2x2 max pooling with stride = 2, e.g. Max(1,2,4,6) = 6

Input (4x4):      Output (2x2):
1 2 2 3           6 8
4 6 5 8           3 4
3 1 4 4
2 1 3 3
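A small NumPy sketch of the 2x2, stride-2 max pooling shown above (window size and stride are parameters here purely for illustration):

import numpy as np

def max_pool2d(x, k=2, stride=2):
    H, W = x.shape
    E, F = (H - k) // stride + 1, (W - k) // stride + 1
    out = np.empty((E, F))
    for e in range(E):
        for f in range(F):
            out[e, f] = x[e*stride:e*stride+k, f*stride:f*stride+k].max()
    return out

x = np.array([[1, 2, 2, 3],
              [4, 6, 5, 8],
              [3, 1, 4, 4],
              [2, 1, 3, 3]])
print(max_pool2d(x))   # [[6. 8.] [3. 4.]]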
Rangharajan Venkatesan ISSCC 2021 25 of 93
Fully-Connected Layer

[Figure: an M×C weight matrix multiplies a C×1 input vector to produce an M×1 output vector]

for m=[0:M)
  for c=[0:C)
    Out[m] += Weight[m][c] * Input[c]

Matrix-Vector Multiplication

Rangharajan Venkatesan ISSCC 2021 26 of 93


Deep Learning Stack

Neural
ResNet MobileNet BERT
Networks

DL
PyTorch TensorFlow Caffe
Framework

Compiler TVM TimeLoop ZigZag

Hardware CPU GPU ASIC FPGA

TimeLoop, ISPASS 2019. TVM, OSDI 2018. ZigZag, arXiv 2020


Rangharajan Venkatesan ISSCC 2021 27 of 93
Deep Learning Stack

Neural
ResNet MobileNet BERT
Networks

DL
PyTorch TensorFlow Caffe
Framework

Co-design across
Compiler TVM TimeLoop ZigZag different levels for
efficient hardware

Hardware CPU GPU ASIC FPGA

TimeLoop, ISPASS 2019. TVM, OSDI 2018. ZigZag, arXiv 2020


Rangharajan Venkatesan ISSCC 2021 28 of 93
Evaluation Metrics
 Accuracy
 % predicted correctly

 Performance (Throughput, Latency)


 Inferences/sec, TOPS, delay

 Energy efficiency
 Energy/inference, TOPS/W

 Area efficiency
 Inference/sec/mm2, TOPS/mm2

 Flexibility
 Support different types of neural networks and layers

Rangharajan Venkatesan ISSCC 2021 29 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 30 of 93


Hierarchical View of DL Accelerator
 Multiply & Accumulate (MAC) unit
 Multiply-Accumulate Datapath
 Registers

 Processing Element (PE)


 Arrays of MAC units
 Scratchpad
 PE Control Finite State Machine (FSM)

 System
 Array of PEs
 Global buffer
 Controller
 DRAM

Rangharajan Venkatesan ISSCC 2021 31 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 32 of 93


MAC Unit Implementations
 Scalar MAC unit
 One Multiply and Accumulate operation per cycle
 Registers are used to store values and achieve temporal reuse across cycles
 Registers are optional
 All operands cannot be reused across cycles
 Typically, only one (sometimes two) operands are stored in registers
 Examples
 Google TPU, Eyeriss ISSCC 2016

[Figure: scalar MAC datapath — a weight register and an input register feed a multiplier (X); the product is accumulated (+) into an accumulation register]

Rangharajan Venkatesan ISSCC 2021 33 of 93


MAC Unit Implementations
 Vector MAC unit
 Vector-wide multiply-accumulate operations every cycle
 Performs vector dot-product
 Takes a weight vector, an input activation vector, and a partial sum scalar as inputs
 Computes a partial sum as output
 Achieves spatial reuse of partial sum values by reducing them during dot-product computation
 Registers can be used to store one or more operands for temporal reuse

[Figure: vector MAC datapath — weight and input registers of width "Vector Size" feed parallel multipliers (X) whose products are reduced (+) into an accumulation register]

Rangharajan Venkatesan ISSCC 2021 34 of 93


MAC Unit Implementations
 Matrix-Vector Accumulate (MVA) unit
 Multiple vector dot-products every cycle
 Example:
 Takes a weight matrix, an input activation vector, and a partial sum vector as inputs
 Computes a partial sum vector as output
 Achieves spatial reuse of activation as well as partial sum values
 Registers can be used to store one or more operands for temporal reuse
 Examples
 NVDLA, Venkatesan et al. HotChips 2019, Zimmer et al. JSSC 2020

[Figure: MVA datapath — a weight register array and an input register of width "Vector Size" feed a 2-D grid of multipliers (X) reduced (+) into accumulation registers]
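A behavioral sketch (not RTL) of the work each MAC style performs per cycle; the vector size V and the number of matrix rows are illustrative assumptions:

import numpy as np

def scalar_mac(acc, w, x):
    # one multiply-accumulate per cycle
    return acc + w * x

def vector_mac(acc, w_vec, x_vec):
    # V products reduced into a single partial sum per cycle (spatial psum reuse)
    return acc + np.dot(w_vec, x_vec)

def mva(acc_vec, w_mat, x_vec):
    # several dot-products per cycle; the input vector is reused across the
    # rows of the weight matrix (spatial activation and psum reuse)
    return acc_vec + w_mat @ x_vec

V = 8
w_mat, x_vec = np.random.randn(4, V), np.random.randn(V)
psums = mva(np.zeros(4), w_mat, x_vec)   # four partial sums in one "cycle"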

Rangharajan Venkatesan ISSCC 2021 35 of 93


MAC Unit Implementations
 Matrix-Matrix Accumulate (MMA) unit

 Examples: NVIDIA GPU Tensor cores, Alibaba Hanguang ISSCC 2020


Rangharajan Venkatesan ISSCC 2021 36 of 93
MAC Implementation Tradeoffs
[Figure: flexibility vs. efficiency tradeoff — Scalar MAC → Vector MAC → MVA → MMA, with hybrid MAC implementations in between]

• Scalar MAC: easy to support different layer types, but no spatial reuse and high control overheads
• MMA: high spatial reuse and low control overheads, but high effort to achieve good utilization for some layer types

Rangharajan Venkatesan ISSCC 2021 37 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 38 of 93


Memory Hierarchy
[Figure: memory hierarchy — Registers → Scratchpads → Global Buffer → DRAM; small capacity, low latency, low energy at the top; large capacity, high latency, high energy at the bottom]

Horowitz, ISSCC 2014


Rangharajan Venkatesan ISSCC 2021 39 of 93
Memory Hierarchy
 Efficient reuse of data to achieve high performance and energy efficiency
 Optimal number of levels and sizing
 Buffer management to overlap communication and computation

[Figure: memory hierarchy — Registers → Scratchpads → Global Buffer → DRAM; capacity, latency, and energy increase down the hierarchy]

Rangharajan Venkatesan ISSCC 2021 40 of 93


Dataflows: Temporal Data Reuse
 Weight-Stationary (WS) Dataflow

[Figure: weight-stationary dataflow — weights are held in MAC registers (less-frequent access) while inputs and partial sums stream through (more-frequent access)]

 Examples: Venkatesan et al. HotChips 2019, Zimmer et al. JSSC 2020, Google TPU,
NVDLA
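An illustrative loop ordering for a weight-stationary dataflow (single input/output channel shown for brevity): each weight is fetched once into a register and reused across all output positions before the next weight is loaded.

def conv_weight_stationary(inp, wgt, out, E, F, R, S):
    for r in range(R):
        for s in range(S):
            w_reg = wgt[r][s]              # infrequent access: weight stays in a register
            for e in range(E):
                for f in range(F):
                    out[e][f] += w_reg * inp[e + r][f + s]   # frequent input/psum accesses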
Rangharajan Venkatesan ISSCC 2021 41 of 93
Dataflows: Temporal Data Reuse
 Output-Stationary (OS) Dataflow

[Figure: output-stationary dataflow — partial sums are held in the accumulation registers (less-frequent access) while weights and inputs stream through (more-frequent access)]

 Examples: Moons et al. VLSI 2016, Thinker et al. VLSI 2017


Rangharajan Venkatesan ISSCC 2021 42 of 93
Dataflows: Temporal Data Reuse
 Input-Stationary (IS) Dataflow

[Figure: input-stationary dataflow — input activations are held in registers (less-frequent access) while weights and partial sums stream through (more-frequent access)]

 Example: SCNN, ISCA 2017


Rangharajan Venkatesan ISSCC 2021 43 of 93
Dataflows: Temporal Data Reuse
 Drawback of single data reuse
 Operands with low or no reuse start to dominate energy consumption

Venkatesan et al., ICCAD 2019


Rangharajan Venkatesan ISSCC 2021 44 of 93
Dataflows: Temporal Data Reuse
 Multi-level Dataflows
 Output Stationary – Local Weight Stationary (OS-LWS) Dataflow

[Figure: OS-LWS dataflow — outputs are held stationary and weights are held in local registers (less-frequent access), while the remaining operands stream through (more-frequent access)]

 Example: MAGNet, ICCAD 2019
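A rough sketch of one possible OS-LWS ordering (the tile size F0 and the single-channel view are assumptions for illustration, not the MAGNet mapping itself): partial sums stay in local accumulation storage for the whole reduction, while each weight is held in a register and reused across a small tile of outputs.

def conv_os_lws(inp, wgt, out, E, F, R, S, F0=4):
    # assumes F is divisible by F0
    for e in range(E):
        for f1 in range(0, F, F0):
            psum = [0.0] * F0                    # outputs stay local (output stationary)
            for r in range(R):
                for s in range(S):
                    w_reg = wgt[r][s]            # weight reused across the tile (local weight stationary)
                    for f0 in range(F0):
                        psum[f0] += w_reg * inp[e + r][f1 + f0 + s]
            for f0 in range(F0):
                out[e][f1 + f0] = psum[f0]       # each output written once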


Rangharajan Venkatesan ISSCC 2021 45 of 93
Dataflows: Temporal Data Reuse
 Multi-level Dataflows
 Weight Stationary – Local Output Stationary (WS-LOS) Dataflow

[Figure: WS-LOS dataflow — weights are held stationary and outputs are held in local registers (less-frequent access), while the remaining operands stream through (more-frequent access)]

 Example: MAGNet, ICCAD 2019


Rangharajan Venkatesan ISSCC 2021 46 of 93
Comparison of Dataflows

Network: ResNet-50
Dataset: ImageNet
Technology: 16ff

• OS-LWS dataflow  Temporal reuse in both weights and outputs


• Larger vector size  Spatial input/output reuse, amortize control overheads
Venkatesan et al., ICCAD 2019
Rangharajan Venkatesan ISSCC 2021 47 of 93
Compute-Communication Overlap
 Coarse-grained: DMA with double buffering

[Figure: the datapath consumes data from Buffer 1 while the DMA fills Buffer 2 from lower-level memory; the two buffers then swap roles]
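A hedged sketch of coarse-grained double buffering, using a background thread as a stand-in for the DMA engine; load_tile and compute_tile are illustrative placeholders rather than a real accelerator API.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def load_tile(i):                   # stand-in for a DMA transfer from lower-level memory
    return np.full((64, 64), float(i))

def compute_tile(tile):             # stand-in for the datapath consuming one buffer
    return tile.sum()

def run_layer(num_tiles):
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(load_tile, 0)            # prologue: fill the first buffer
        for i in range(num_tiles):
            current = pending.result()                # tile for this iteration is ready
            if i + 1 < num_tiles:
                pending = dma.submit(load_tile, i+1)  # fill the other buffer in parallel
            results.append(compute_tile(current))     # compute overlaps the next load
    return results

print(run_layer(4))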

Rangharajan Venkatesan ISSCC 2021 48 of 93


Buffet Buffer Manager
 Fine-grained compute-communication
overlap
 Operations
 Fill
 Sequentially write data read from DRAM
and set valid state
 Read
 Perform read if the address is valid,
otherwise stall until address becomes valid
 Update
 Perform write operation to address and set
valid state
 Shrink
 Invalidate addresses that are not in use
and request data from lower-level memory

Pellauer et al., ASPLOS 2019
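A simplified software model of the four operations listed above (fill, read, update, shrink); this only sketches the semantics and is not the hardware design from the paper.

class Buffet:
    def __init__(self, size):
        self.data = [None] * size
        self.valid = [False] * size

    def fill(self, addr, value):      # sequential write of data arriving from DRAM
        self.data[addr] = value
        self.valid[addr] = True

    def read(self, addr):             # hardware would stall until the entry is valid
        assert self.valid[addr], "read stalls until the fill arrives"
        return self.data[addr]

    def update(self, addr, value):    # in-place write, e.g. for partial sums
        self.data[addr] = value
        self.valid[addr] = True

    def shrink(self, addrs):          # invalidate entries no longer in use
        for a in addrs:
            self.valid[a] = False     # freed space can be refilled from lower-level memory

buf = Buffet(4)
buf.fill(0, 3.0)
print(buf.read(0))
buf.shrink([0])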


Rangharajan Venkatesan ISSCC 2021 49 of 93
Buffet Buffer Manager

 Buffet achieves 2.3X reduction in energy-delay product (EDP) and 2.1X area
efficiency gains over DMA with double buffering

Pellauer et al., ASPLOS 2019


Rangharajan Venkatesan ISSCC 2021 50 of 93
Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 51 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect

Memory

Rangharajan Venkatesan ISSCC 2021 52 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect
Pattern-1
• Unicast Weights
Memory

Rangharajan Venkatesan ISSCC 2021 53 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect
Pattern-1
• Unicast Weights
• Multicast Input activations
Memory

Rangharajan Venkatesan ISSCC 2021 54 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect
Pattern-1
• Unicast Weights
• Multicast Input activations
Memory • Unicast Output activations

Rangharajan Venkatesan ISSCC 2021 55 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect
Pattern-2
• Multicast Weights
• Unicast Input activations
Memory • Unicast Output activations

Rangharajan Venkatesan ISSCC 2021 56 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect Pattern-3
• Unicast Weights
• Unicast Input activations
Memory • Unicast Partial sums
• Unicast Output activations

Rangharajan Venkatesan ISSCC 2021 57 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect Pattern-4
• Unicast Weights
• Unicast Input activations
Memory • Unicast Partial sums
• Unicast Output activations

Rangharajan Venkatesan ISSCC 2021 58 of 93


Interconnect Design
 In practice, we observe combinations of different patterns depending on the
architecture

 Customized interconnects support specific patterns


 e.g. Systolic Arrays

 Programmable interconnects support flexible patterns


 e.g. Mesh network on-chip (NoC)

Rangharajan Venkatesan ISSCC 2021 59 of 93


Systolic Array
[Figure: systolic array of PEs — input activations from different input channels (C) stream across the array, while partial sums from different output channels (M) are passed between neighboring PEs and accumulated]


Sze et al., Synthesis Lectures on Computer Architecture, 2020
Rangharajan Venkatesan ISSCC 2021 60 of 93
Mesh NoC
 Different packet sizes for different data
types
 Unicast and Multicast support
 Flexible routing protocols

Rangharajan Venkatesan ISSCC 2021 61 of 93


Hierarchical Network
 Efficient communication for large scale designs
 Reduces number of hops
 Reduces congestion
 Example:
 Multi-Chip Module (MCM) based DL accelerator

Network On-Package

Network On-Chip
Venkatesan et al., HotChips 2019
Zimmer et al. JSSC 2020

Rangharajan Venkatesan ISSCC 2021 62 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory hierarchy
 Communication: Interconnect topologies

 Compiler Mapping flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 63 of 93


Compiler Mapping Flow
 Goal:
 Efficiently execute the neural network
in the target hardware
 Performance
 Energy efficiency

 Opportunities
 Data reuse
 Parallelism
 Pipelining

Rangharajan Venkatesan ISSCC 2021 64 of 93


Data Reuse Opportunities

Data type            Reuse
Input activations*   R*S*M
Weights              E*F*N
Output activations   R*S*C
*except halo

Sze et al., Synthesis Lectures on Computer Architecture, 2020
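The reuse factors in the table are simple products of the layer dimensions; a quick calculation for an illustrative layer (shapes chosen arbitrarily, halo effects ignored):

R, S, C, M, E, F, N = 3, 3, 64, 128, 56, 56, 16   # example layer shape

input_reuse  = R * S * M    # each input activation feeds R*S*M MACs
weight_reuse = E * F * N    # each weight is used at every output position and batch element
output_reuse = R * S * C    # each output accumulates R*S*C partial sums

print(input_reuse, weight_reuse, output_reuse)    # 1152 50176 576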


Rangharajan Venkatesan ISSCC 2021 65 of 93
Model Parallelism
 Exploit parallelism across weights in a layer

 Example
 An architecture implementing weight-stationary dataflow
 Tile weights and distribute them to different PEs
 Compute different output activations by streaming in the input activations

[Figure: weights tiled across PE0–PE3; input activations are streamed to all PEs, and each PE produces a different portion of the output activations]

Rangharajan Venkatesan ISSCC 2021 66 of 93


Data Parallelism
 Exploiting parallelism across activations in a layer

 Example
 An architecture implementing input-stationary dataflow
 Tile input activations and map to different PEs
 Each PE computes different output activations by streaming in the weights

[Figure: input activations tiled across PE0–PE3; weights are streamed to all PEs, and each PE produces a different portion of the output activations]

Rangharajan Venkatesan ISSCC 2021 67 of 93


Mapping Tool: TimeLoop
 Loop-nest representation of a convolution layer:

for n=[0:N)
 for m=[0:M)
  for e=[0:E)
   for f=[0:F)
    for c=[0:C)
     for r=[0:R)
      for s=[0:S)
       Out[e][f][m][n] += Weight[r][s][c][m] * Input[e+r][f+s][c][n]

 Loop-nest representation of a DNN accelerator:

for n2=[0:N2)             System
 for e2=[0:E2)
  for f2=[0:F2)
   for m2=[0:M2)
    for c2=[0:C2)
     for n1=[0:N1)         PE
      for e1=[0:E1)
       for f1=[0:F1)
        for m1=[0:M1)
         for c1=[0:C1)
          for r1=[0:R)
           for s1=[0:S)
            for c0=[0:C0)  MAC unit
             for m0=[0:M0)
              for e0=[0:E0)
               for f0=[0:F0)
                MACs

Parashar et al. ISPASS, 2019.

Rangharajan Venkatesan ISSCC 2021 68 of 93


Pipelining
 Parallelism across layers of the network
 Execute one or more layers across different processing elements

[Figure: layers 1–7 of a network partitioned into four pipeline stages (Pipe-1 … Pipe-4); each stage runs one or more layers on its own group of PEs, with activations flowing from Input to Output]

Rangharajan Venkatesan ISSCC 2021 69 of 93


Impact of Tiling on Hardware Efficiency

 Large number of possible tilings for a given layer and hardware configuration
 >10x difference in performance and energy
 Need to explore optimized tiling to achieve best energy and performance

Venkatesan et al., ICCAD 2019


Rangharajan Venkatesan ISSCC 2021 70 of 93
Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory hierarchy
 Communication: Interconnect topologies

 Compiler Mapping flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 71 of 93


Quantization
 Exploit application-level resiliency to perform compute at reduced precision
 Improves energy efficiency, area efficiency and performance
 Lower storage
 Cheaper compute

[Figure: accuracy vs. hardware cost — a small accuracy loss can be traded for a large efficiency gain]

Accuracy vs. Efficiency tradeoff


Chippa et al., DAC 2013
Rangharajan Venkatesan ISSCC 2021 72 of 93
Quantization
 Post-training Quantization

Pre-trained Model → Quantization → Quantized Model

 Pretrained model is quantized to improve efficiency

 No need for training sets during model deployment

Wu et al. arXiv, 2020
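A minimal sketch of symmetric, per-tensor INT8 post-training quantization; max-abs calibration is just one common scale choice (see Wu et al. for the broader design space):

import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0                       # per-tensor calibration
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())                 # quantization error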

Rangharajan Venkatesan ISSCC 2021 73 of 93


Quantization
 Quantization-Aware Training

Pre-trained Model → Quantization → Quantized Model (with Re-Training in the loop)

 Quantization to improve efficiency


 Retraining to recover accuracy

Wu et al. arXiv, 2020

Rangharajan Venkatesan ISSCC 2021 74 of 93


Benefits of Quantization

Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/

Rangharajan Venkatesan ISSCC 2021 75 of 93


Multiple Precision Support
 NVIDIA GPUs: Volta V100 and Ampere A100

Ref: NVIDIA A100 Tensor Core GPU Architecture whitepaper

Rangharajan Venkatesan ISSCC 2021 76 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Network: Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory hierarchy
 Communication: Interconnect topologies

 Compiler Mapping flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 77 of 93


Sparsity
 Neural networks exhibit a high degree of sparsity
 Weights/connections are sparse
 Activations at intermediate stages are sparse

Han et al., NeurIPS 2015

Rangharajan Venkatesan ISSCC 2021 78 of 93


Types of Sparsity

Mao et al. NeurIPS 2017


Rangharajan Venkatesan ISSCC 2021 79 of 93
Structured Sparsity
 NVIDIA Ampere A100

Ref: NVIDIA A100 Tensor Core GPU Architecture whitepaper
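A100's fine-grained structured sparsity keeps 2 non-zero values in every group of 4 consecutive weights; a minimal pruning sketch, assuming magnitude-based selection (the selection criterion here is an illustrative choice):

import numpy as np

def prune_2_of_4(w):
    flat = w.reshape(-1, 4).copy()
    for g in flat:
        g[np.argsort(np.abs(g))[:2]] = 0.0   # zero the two smallest-magnitude weights
    return flat.reshape(w.shape)

w = np.random.randn(4, 8)
ws = prune_2_of_4(w)
print((ws.reshape(-1, 4) != 0).sum(axis=1))  # at most 2 non-zeros per group of 4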

Rangharajan Venkatesan ISSCC 2021 80 of 93


NVIDIA A100 Performance

Ref: NVIDIA A100 Tensor Core GPU Architecture whitepaper

Rangharajan Venkatesan ISSCC 2021 81 of 93


Unstructured Sparsity
 SCNN
 Only compute partial products where both operands are non-zero
 Get rid of the idea of sliding convolution: doesn’t make sense when most of the
operands are 0
 Vector ops are questionable: most elements of your vector are 0, don’t know a
priori which ones or how many


Parashar et al. ISCA 2017
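A simplified 1-D illustration of the idea above — compress out zeros, multiply only non-zero weight/activation pairs, and scatter each partial product to its output coordinate (the real SCNN dataflow is tiled and two-dimensional):

import numpy as np

def sparse_conv1d(inp, wgt):
    E = len(inp) - len(wgt) + 1
    out = np.zeros(E)
    nz_in  = [(i, v) for i, v in enumerate(inp) if v != 0]
    nz_wgt = [(r, v) for r, v in enumerate(wgt) if v != 0]
    for i, x in nz_in:                  # Cartesian product of non-zero operands
        for r, w in nz_wgt:
            e = i - r                   # output coordinate for this pair
            if 0 <= e < E:
                out[e] += w * x         # scatter-accumulate the partial product
    return out

inp = np.array([0, 2.0, 0, 0, 3.0, 0, 0, 1.0])
wgt = np.array([1.0, 0, -1.0])
print(sparse_conv1d(inp, wgt))
print(np.convolve(inp, wgt[::-1], mode='valid'))   # dense reference, matches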


Rangharajan Venkatesan ISSCC 2021 82 of 93
Unstructured Sparsity
 SCNN

Parashar et al. ISCA 2017


Rangharajan Venkatesan ISSCC 2021 83 of 93
Unstructured Sparsity
 SCNN

 SCNN achieves a 2.3X improvement in energy efficiency over a Dense CNN (DCNN) accelerator
Parashar et al. ISCA 2017
Rangharajan Venkatesan ISSCC 2021 84 of 93
An End-to-End Optimization Flow

Venkatesan et al.,
ICCAD 2019
Rangharajan Venkatesan ISSCC 2021 85 of 93
Summary
 Deep neural networks are increasingly used across a wide range of applications
 Large amounts of data
 High computation demand

 Hardware acceleration is key for continued growth

 Co-design across algorithm-compiler-hardware can greatly improve efficiency

Rangharajan Venkatesan ISSCC 2021 86 of 93


Papers to watch @ ISSCC 2021
 Session 9: ML Processors From Cloud to Edge
 9.1 A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-
Aware Throttling
 9.2 A 28nm 12.1TOPS/W Dual-Mode CNN Processor Using Effective-Weight-Based Convolution and Error-
Compensation-Based Prediction
 9.3 A 40nm 4.81TFLOPS/W 8b Floating-Point Training Processor for Non-Sparse Neural Networks Using Shared
Exponent Bias and 24-Way Fused Multiply-Add Tree
 9.4 PIU: A 248GOPS/W Stream-Based Processor for Irregular Probabilistic Inference Networks Using Precision-
Scalable Posit Arithmetic in 28nm
 9.5 A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile
 9.6 A 1/2.3inch 12.3Mpixel with On-Chip 4.97TOPS/W CNN Processor Back-Illuminated Stacked CMOS Image
 9.7 A 184μW Real-Time Hand-Gesture Recognition System with Hybrid Tiny Classifiers for Smart Wearable
Devices
 9.8 A 25mm2 SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech
Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm
 9.9 A Background-Noise and Process-Variation-Tolerant 109nW Acoustic Feature Extractor Based on Spike-
Domain Divisive-Energy Normalization for an Always-On Keyword Spotting Device

Rangharajan Venkatesan ISSCC 2021 87 of 93


Papers to watch @ ISSCC 2021
 Session 15: Compute-in-Memory Processors for Deep Neural Networks
 15.1 A Programmable Neural-Network Inference Accelerator Based on Scalable In-Memory Computing
 15.2 A 2.75-to-75.9TOPS/W Computing-in-Memory NN Processor Supporting Set-Associate Block-Wise Zero
Skipping and Ping-Pong CIM with Simultaneous Computation and Weight Updating
 15.3 A 65nm 3T Dynamic Analog RAM-Based Computing-in-Memory Macro and CNN Accelerator with Retention
Enhancement, Adaptive Analog Sparsity and 44TOPS/W System Energy Efficiency
 15.4 A 5.99-to-691.1TOPS/W Tensor-Train In-Memory-Computing Processor Using Bit-Level-Sparsity- Based
Optimization and Variable-Precision Quantization
 Session 16: Compute-in-Memory
 16.1 A 22nm 4Mb 8b-Precision ReRAM Computing-in-Memory Macro with 11.91 to 195.7TOPS/W for Tiny AI
Edge Devices
 16.2 eDRAM-CIM: Compute-In-Memory Design with Reconfigurable Embedded-Dynamic-Memory Array
Realizing Adaptive Data Converters and Charge-Domain Computing
 16.3 A 28nm 384kb 6T-SRAM Computation-in-Memory Macro with 8b of Precision for AI Edge Chips
 16.4 An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In Memory Macro in
22nm for Machine-Learning Edge Applications

Rangharajan Venkatesan ISSCC 2021 88 of 93


References
 Overview and Benchmarking
 V. Sze et al., “Efficient Processing of Deep Neural Networks: A Tutorial and
Survey,” Proceedings of the IEEE 2017
 V. Sze et al., “Efficient Processing of Deep Neural Networks,” Synthesis Lectures
on Computer Architecture 2020
 K. Guo et al., “Neural network accelerator comparison,”
[online] https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/.
 V. J. Reddi et al., “MLPerf Inference Benchmark,” arXiv 2019
 V. Camus et al., “Survey of precision-scalable multiply-accumulate units for
neural-network processing,” AICAS 2019
 H. Wu et al. “Integer quantization for deep learning inference: Principles and
empirical evaluation” arXiv 2020

Rangharajan Venkatesan ISSCC 2021 89 of 93


References
 Deep Learning Hardware
 F. Sijstermans, “The NVIDIA deep learning accelerator,” in Hot Chips 2018
 NVIDIA A100 Tensor Core GPU Architecture whitepaper
 B. Zimmer et al., “A 0.11pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-
based Deep Neural Network Accelerator with Ground-Reference Signaling in
16nm,” VLSI 2019
 N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing
unit,” in ISCA 2017
 A. Parashar et al., “SCNN: An accelerator for compressed-sparse convolutional
neural networks,” in ISCA 2017
 Y. Chen et al., “Eyeriss: An energy-efficient reconfigurable accelerator for deep
convolutional neural networks,” ISSCC 2016
 S. Han et al., “EIE: Efficient inference engine on compressed deep neural
network,” in ISCA 2016

Rangharajan Venkatesan ISSCC 2021 90 of 93


References
 Deep Learning Hardware
 E. H. Lee et al., “LogNet: Energy-efficient neural networks using logarithmic
computation,” ICASSP 2017
 H. Wu, “Low precision inference on GPUs” GTC 2019
 J. R. Stevens et al. “Manna: An Accelerator for Memory-Augmented Neural
Networks” MICRO 2019.
 S. Han et al. “Deep Compression: Compressing Deep Neural Networks with
Pruning, Trained Quantization and Huffman Coding” NeurIPS 2015
 S. Venkataramani et al. “AxNN: energy-efficient neuromorphic systems using
approximate computing” ISLPED 2014
 Y. LeCun et al., “Optimal Brain Damage,” NeurIPS 1990

Rangharajan Venkatesan ISSCC 2021 91 of 93


References
 Deep Neural Networks
 A. Krizhevsky et al. “Imagenet classification with deep convolutional neural
networks,” NeurIPS 2012
 K. He et al. “Deep residual learning for image recognition,” CVPR 2016
 M. Tan et al., “EfficientNet: Rethinking Model Scaling for Convolutional Neural
Networks,” ICML 2019
 A. Vaswani et al., “Attention is all you need,” NeurIPS 2017
 M. Shoeybi et al., “Megatron-LM: Training MultiBillion Parameter Language Models
Using Model Parallelism,” arXiv 2019
 J. Choi et al. “PACT: Parameterized Clipping Activation for Quantized Neural
Networks”, arxiv 2018
 R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient
inference: A whitepaper”, arxiv 2018
 A. Mishra et al. “Apprentice: Using knowledge distillation techniques to improve
low-precision network accuracy” ICLR 2018

Rangharajan Venkatesan ISSCC 2021 92 of 93


References
 Modeling, Mapping and Exploration Tools
 A. Parashar et al., “Timeloop: A Systematic Approach to DNN Accelerator
Evaluation,” ISPASS 2019
 R. Venkatesan et al., “MAGNet: A modular accelerator generator for neural
networks,” ICCAD 2019
 Y. N. Wu et al., “Accelergy: An Architecture-Level Energy Estimation Methodology
for Accelerator Designs,” ICCAD 2019
 X. Yang et al., “Interstellar: using Halide’s scheduling language to analyze DNN
accelerators,” ASPLOS 2020
 S. Jain et al., “RxNN: a framework for evaluating deep neural networks on resistive
crossbars,” TCAD 2020
 T. Chen et al., “TVM: an automated end-to-end optimizing compiler for deep
learning,” OSDI 2018
 L. Mei et al., “ZigZag: A memory-centric rapid DNN accelerator design space
exploration framework,” arXiv 2020

Rangharajan Venkatesan ISSCC 2021 93 of 93
