Basic Design Approaches To Accelerating Deep Neural Networks

This document provides an overview of basic design approaches for accelerating deep neural networks. It begins with the motivation for hardware acceleration of neural networks, driven by increasing model complexity and performance/energy challenges. It then covers the basics of deep neural networks, including common layer types such as convolution, activation, and pooling. The remainder outlines key aspects of neural network accelerator design: compute units, memory hierarchy, interconnects, compiler mapping strategies, and neural network optimizations such as quantization and sparsity.


Basic Design Approaches to Accelerating Deep Neural Networks

Rangharajan Venkatesan
NVIDIA Corporation

Live Q&A Session: Feb. 13, 2021, 7:00-7:20am, PST

Contact Info
email: [email protected]

Rangharajan Venkatesan ISSCC 2021 1 of 93


Self-Introduction
 B.Tech in Electronics and Communication
Engineering from IIT Roorkee in 2009
 Ph.D. in Electrical and Computer Engineering
from Purdue University in 2014
 Joined NVIDIA
 Research Scientist: 2014-2016
 Senior Research Scientist: 2016-Present

 Research Interests
 Machine Learning Accelerators
 High-Level Synthesis
 Low Power VLSI design
 SoC Design methodologies

Rangharajan Venkatesan ISSCC 2021 2 of 93


About this Tutorial
 Overview of design approaches to improve hardware efficiency

 Focus on inference
 Most of the techniques are generic and applicable to training as well

 Does cover ..
 Key metrics
 Design considerations
 Hardware optimizations
 Hardware/software co-design techniques

 Does not cover ..


 Algorithms and model design
 Detailed software stack
Rangharajan Venkatesan ISSCC 2021 3 of 93
Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 4 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 5 of 93


Vast World of DNN Applications

DATA CENTER EMBEDDED COMPUTERS EMBEDDED DEVICES

Rangharajan Venkatesan ISSCC 2021 6 of 93


Hardware-enabled AI Revolution

Efficient Compute

Data

Rangharajan Venkatesan ISSCC 2021 7 of 93


Growth in Application Complexity

Bianco et al., IEEE Access, 2018. Ack: Bill Dally, GTC China, 2020.

Rangharajan Venkatesan ISSCC 2021 8 of 93


Hardware Performance Challenges

End of Moore’s Law (John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e, 2018)

Increasing cost (S. Naffziger et al., ISSCC 2020)
Rangharajan Venkatesan ISSCC 2021 9 of 93
Energy Efficiency Challenges
Super-human performance at low energy efficiency
1920 CPUs, 280 GPUs, Power: ~300,000 W

Google AlphaGo vs. Lee Sedol

Ack: Anand Raghunathan, Purdue University Ref: "Showdown". The Economist, 19 Nov. 2016

Rangharajan Venkatesan ISSCC 2021 10 of 93


Solution: Hardware Specialization

Reconfigurable FPGAs
Leverage reconfigurability of FPGAs to accelerate a specific neural network

Accelerators
Programmable Processors: programmable hardware with support for scalar and vector math functions
Fixed Function Accelerators: customized hardware accelerators to support a class of neural networks

Rangharajan Venkatesan ISSCC 2021 11 of 93


Many DNN Accelerators Exist!!

Different platforms
Wide range of performance
Different power targets

Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/

Rangharajan Venkatesan ISSCC 2021 12 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 13 of 93


Deep Learning (aka DNN)

[Figure: nested scope — Deep Learning is a subset of Machine Learning, which is a subset of Artificial Intelligence]

Rangharajan Venkatesan ISSCC 2021 14 of 93


DNN: A Simple Example
 3 
Y j = activation  Wij × Xi 
W11  i=1 
Y1
X1
Y2
X2
Y3
X3
W34 Y4 Output Layer

Input Layer
Hidden Layer

Sze et al., Synthesis Lectures on Computer Architecture, 2020
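A minimal NumPy sketch of the layer equation above, assuming ReLU as the activation function and random weights (both are illustrative choices, not specified on the slide):

import numpy as np

def fc_layer(X, W):
    # Y[j] = activation( sum_i W[i][j] * X[i] ), with ReLU as the activation
    return np.maximum(0.0, W.T @ X)

X = np.array([0.5, -1.0, 2.0])   # input activations X1..X3
W = np.random.randn(3, 4)        # weights W[i][j], i.e. W11..W34
Y = fc_layer(X, W)               # output activations Y1..Y4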


Rangharajan Venkatesan ISSCC 2021 15 of 93
Practical DNNs

Input Layer 1 Layer 2 Layer N Output

 10s to 100s of layers


 Different types of layers
 Convolution, Fully-connected, Depthwise-Separable, Attention etc.
 Process several Gigabytes of data
 Perform billions to trillions of operations
 Wide range of applications
 Image processing, language processing, etc.

Rangharajan Venkatesan ISSCC 2021 16 of 93


Examples of DNN Applications
Image Classification Object Detection Recommendation Systems

Dog Cat Sheep

Text Summarization Automatic Speech Recognition Image Captioning

Rangharajan Venkatesan ISSCC 2021 17 of 93


Several Layer Types and Kernels
 Convolution layer
 Many different variants also
 Strided convolution, Dilated convolution, Groupwise convolution, Depthwise convolution,
Pointwise convolution, Depthwise-Separable convolution

 Activation layer
 ReLU, tanh, sigmoid, Leaky ReLU, Clipped ReLU, Swish

 Pooling layer
 Max. pooling, Average pooling, Unpooling

 Fully-connected layer

 And many more .. Attention, Deconvolution, etc.


Rangharajan Venkatesan ISSCC 2021 18 of 93
Convolution Layer
[Figure: a filter (weights, R×S) applied to the input feature map (fmap, H×W) produces an output activation of the output feature map (E×F); element-wise multiplication followed by partial sum (psum) accumulation]

for e=[0:E)
  for f=[0:F)
    for r=[0:R)
      for s=[0:S)
        Out[e][f] += Weight[r][s] * Input[e+r][f+s]

Sze et al., Synthesis Lectures on Computer Architecture, 2020


Rangharajan Venkatesan ISSCC 2021 19 of 93
Convolution Layer

[Figure: sliding-window processing — the R×S filter slides over the H×W input fmap to produce each output activation of the E×F output fmap]

for e=[0:E)
  for f=[0:F)
    for r=[0:R)
      for s=[0:S)
        Out[e][f] += Weight[r][s] * Input[e+r][f+s]

Sze et al., Synthesis Lectures on Computer Architecture, 2020


Rangharajan Venkatesan ISSCC 2021 20 of 93
Convolution Layer

[Figure: many input channels (C) — the filter (R×S×C) and the input fmap (H×W×C) are reduced across channels to produce the E×F output fmap]

for e=[0:E)
  for f=[0:F)
    for c=[0:C)
      for r=[0:R)
        for s=[0:S)
          Out[e][f] += Weight[r][s][c] * Input[e+r][f+s][c]

Sze et al., Synthesis Lectures on Computer Architecture, 2020


Rangharajan Venkatesan ISSCC 2021 21 of 93
Convolution Layer

[Figure: many output channels (M) — M filters (each R×S×C) are applied to the input fmap (H×W×C) to produce an output fmap with M channels (E×F×M)]

for m=[0:M)
  for e=[0:E)
    for f=[0:F)
      for c=[0:C)
        for r=[0:R)
          for s=[0:S)
            Out[e][f][m] += Weight[r][s][c][m] * Input[e+r][f+s][c]

Sze et al., Synthesis Lectures on Computer Architecture, 2020


Rangharajan Venkatesan ISSCC 2021 22 of 93
Convolution Layer

[Figure: many input fmaps (N) — a batch of N input fmaps (each H×W×C) is convolved with M filters (each R×S×C) to produce N output fmaps (each E×F×M)]

for n=[0:N)
  for m=[0:M)
    for e=[0:E)
      for f=[0:F)
        for c=[0:C)
          for r=[0:R)
            for s=[0:S)
              Out[e][f][m][n] += Weight[r][s][c][m] * Input[e+r][f+s][c][n]
Sze et al., Synthesis Lectures on Computer Architecture, 2020
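A direct (unoptimized) NumPy reference for the seven-deep loop nest above; unit stride, no padding, and the tensor layouts below are assumptions made for this sketch:

import numpy as np

def conv_layer(inp, wgt):
    # inp: [H][W][C][N], wgt: [R][S][C][M] -> out: [E][F][M][N]
    H, W, C, N = inp.shape
    R, S, _, M = wgt.shape
    E, F = H - R + 1, W - S + 1
    out = np.zeros((E, F, M, N))
    for n in range(N):
        for m in range(M):
            for e in range(E):
                for f in range(F):
                    # reduce over the R*S*C receptive field
                    out[e, f, m, n] = np.sum(wgt[:, :, :, m] * inp[e:e+R, f:f+S, :, n])
    return out

out = conv_layer(np.random.randn(8, 8, 3, 2), np.random.randn(3, 3, 3, 4))
print(out.shape)   # (6, 6, 4, 2)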
Rangharajan Venkatesan ISSCC 2021 23 of 93
Activation Layer
 Introduce non-linearity in the network

Common Activation Layers in DNNs

Rangharajan Venkatesan ISSCC 2021 24 of 93


Pooling Layer
 Down-sampling operation to provide invariance against minor changes in the
image such as shifting, rotation, etc.
 Can be different types
 Max. pooling, Average pooling

Max. Pooling Example — 2x2 max pooling with stride = 2, e.g. Max(1,2,4,6) = 6

Input (4x4):      Output (2x2):
1 2 2 3           6 8
4 6 5 8           3 4
3 1 4 4
2 1 3 3
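A small NumPy sketch of the 2x2, stride-2 max pooling shown above (window size and stride are parameters here purely for illustration):

import numpy as np

def max_pool2d(x, k=2, stride=2):
    H, W = x.shape
    E, F = (H - k) // stride + 1, (W - k) // stride + 1
    out = np.empty((E, F))
    for e in range(E):
        for f in range(F):
            out[e, f] = x[e*stride:e*stride+k, f*stride:f*stride+k].max()
    return out

x = np.array([[1, 2, 2, 3],
              [4, 6, 5, 8],
              [3, 1, 4, 4],
              [2, 1, 3, 3]])
print(max_pool2d(x))   # [[6. 8.] [3. 4.]]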
Rangharajan Venkatesan ISSCC 2021 25 of 93
Fully-Connected Layer

[Figure: an M×C weight matrix multiplies a C×1 input vector to produce an M×1 output vector]

for m=[0:M)
  for c=[0:C)
    Out[m] += Weight[m][c] * Input[c]

Matrix-Vector Multiplication

Rangharajan Venkatesan ISSCC 2021 26 of 93


Deep Learning Stack

Neural
ResNet MobileNet BERT
Networks

DL
PyTorch TensorFlow Caffe
Framework

Compiler TVM TimeLoop ZigZag

Hardware CPU GPU ASIC FPGA

TimeLoop, ISPASS 2019. TVM, OSDI 2018. ZigZag, arXiv 2020


Rangharajan Venkatesan ISSCC 2021 27 of 93
Deep Learning Stack

Neural
ResNet MobileNet BERT
Networks

DL
PyTorch TensorFlow Caffe
Framework

Co-design across
Compiler TVM TimeLoop ZigZag different levels for
efficient hardware

Hardware CPU GPU ASIC FPGA

TimeLoop, ISPASS 2019. TVM, OSDI 2018. ZigZag, arXiv 2020


Rangharajan Venkatesan ISSCC 2021 28 of 93
Evaluation Metrics
 Accuracy
 % predicted correctly

 Performance (Throughput, Latency)


 Inferences/sec, TOPS, delay

 Energy efficiency
 Energy/inference, TOPS/W

 Area efficiency
 Inference/sec/mm2, TOPS/mm2

 Flexibility
 Support different types of neural networks and layers

Rangharajan Venkatesan ISSCC 2021 29 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 30 of 93


Hierarchical View of DL Accelerator
 Multiply & Accumulate (MAC) unit
 Multiply-Accumulate Datapath
 Registers

 Processing Element (PE)


 Arrays of MAC units
 Scratchpad
 PE Control Finite State Machine (FSM)

 System
 Array of PEs
 Global buffer
 Controller
 DRAM

Rangharajan Venkatesan ISSCC 2021 31 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 32 of 93


MAC Unit Implementations
 Scalar MAC unit
 One Multiply and Accumulate operation per cycle
 Registers are used to store values and achieve temporal reuse across cycles
 Registers are optional
 All operands cannot be reused across cycles
 Typically, only one (sometimes two) operands are stored in registers
 Examples
 Google TPU, Eyeriss ISSCC 2016

[Figure: scalar MAC datapath — a weight register and an input register feed a multiplier (X); the product is accumulated (+) into an accumulation register]

Rangharajan Venkatesan ISSCC 2021 33 of 93


MAC Unit Implementations
 Vector MAC unit
 Vector-wide multiply-accumulate operations every cycle
 Performs vector dot-product
 Takes a weight vector, an input activation vector, and a partial sum scalar as inputs
 Computes a partial sum as output
 Achieves spatial reuse of partial sum values by reducing them during dot-product computation
 Registers can be used to store one or more operands for temporal reuse

[Figure: vector MAC datapath — weight and input registers of width "Vector Size" feed parallel multipliers (X) whose products are reduced (+) into an accumulation register]

Rangharajan Venkatesan ISSCC 2021 34 of 93


MAC Unit Implementations
 Matrix-Vector Accumulate (MVA) unit
 Multiple vector dot-products every cycle
 Example:
 Takes a weight matrix, an input activation vector, and a partial sum vector as inputs
 Computes a partial sum vector as output
 Achieves spatial reuse of activation as well as partial sum values
 Registers can be used to store one or more operands for temporal reuse
 Examples
 NVDLA, Venkatesan et al. HotChips 2019, Zimmer et al. JSSC 2020

[Figure: MVA datapath — a weight register array and an input register of width "Vector Size" feed a 2-D grid of multipliers (X) reduced (+) into accumulation registers]
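A behavioral sketch (not RTL) of the work each MAC style performs per cycle; the vector size V and the number of matrix rows are illustrative assumptions:

import numpy as np

def scalar_mac(acc, w, x):
    # one multiply-accumulate per cycle
    return acc + w * x

def vector_mac(acc, w_vec, x_vec):
    # V products reduced into a single partial sum per cycle (spatial psum reuse)
    return acc + np.dot(w_vec, x_vec)

def mva(acc_vec, w_mat, x_vec):
    # several dot-products per cycle; the input vector is reused across the
    # rows of the weight matrix (spatial activation and psum reuse)
    return acc_vec + w_mat @ x_vec

V = 8
w_mat, x_vec = np.random.randn(4, V), np.random.randn(V)
psums = mva(np.zeros(4), w_mat, x_vec)   # four partial sums in one "cycle"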

Rangharajan Venkatesan ISSCC 2021 35 of 93


MAC Unit Implementations
 Matrix-Matrix Accumulate (MMA) unit

 Examples: NVIDIA GPU Tensor cores, Alibaba Hanguang ISSCC 2020


Rangharajan Venkatesan ISSCC 2021 36 of 93
MAC Implementation Tradeoffs
[Figure: flexibility vs. efficiency tradeoff — Scalar MAC → Vector MAC → MVA → MMA, with hybrid MAC implementations in between]

• Scalar MAC: easy to support different layer types, but no spatial reuse and high control overheads
• MMA: high spatial reuse and low control overheads, but high effort to achieve good utilization for some layer types

Rangharajan Venkatesan ISSCC 2021 37 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 38 of 93


Memory Hierarchy
[Figure: memory hierarchy — Registers → Scratchpads → Global Buffer → DRAM; small capacity, low latency, low energy at the top; large capacity, high latency, high energy at the bottom]

Horowitz, ISSCC 2014


Rangharajan Venkatesan ISSCC 2021 39 of 93
Memory Hierarchy
 Efficient reuse of data to achieve high performance and energy efficiency
 Optimal number of levels and sizing
 Buffer management to overlap communication and computation

[Figure: memory hierarchy — Registers → Scratchpads → Global Buffer → DRAM; capacity, latency, and energy increase down the hierarchy]

Rangharajan Venkatesan ISSCC 2021 40 of 93


Dataflows: Temporal Data Reuse
 Weight-Stationary (WS) Dataflow

[Figure: weight-stationary dataflow — weights are held in MAC registers (less-frequent access) while inputs and partial sums stream through (more-frequent access)]

 Examples: Venkatesan et al. HotChips 2019, Zimmer et al. JSSC 2020, Google TPU,
NVDLA
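An illustrative loop ordering for a weight-stationary dataflow (single input/output channel shown for brevity): each weight is fetched once into a register and reused across all output positions before the next weight is loaded.

def conv_weight_stationary(inp, wgt, out, E, F, R, S):
    for r in range(R):
        for s in range(S):
            w_reg = wgt[r][s]              # infrequent access: weight stays in a register
            for e in range(E):
                for f in range(F):
                    out[e][f] += w_reg * inp[e + r][f + s]   # frequent input/psum accesses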
Rangharajan Venkatesan ISSCC 2021 41 of 93
Dataflows: Temporal Data Reuse
 Output-Stationary (OS) Dataflow

[Figure: output-stationary dataflow — partial sums are held in the accumulation registers (less-frequent access) while weights and inputs stream through (more-frequent access)]

 Examples: Moons et al. VLSI 2016, Thinker et al. VLSI 2017


Rangharajan Venkatesan ISSCC 2021 42 of 93
Dataflows: Temporal Data Reuse
 Input-Stationary (IS) Dataflow

[Figure: input-stationary dataflow — input activations are held in registers (less-frequent access) while weights and partial sums stream through (more-frequent access)]

 Example: SCNN, ISCA 2017


Rangharajan Venkatesan ISSCC 2021 43 of 93
Dataflows: Temporal Data Reuse
 Drawback of single data reuse
 Operands with low or no reuse start to dominate energy consumption

Venkatesan et al., ICCAD 2019


Rangharajan Venkatesan ISSCC 2021 44 of 93
Dataflows: Temporal Data Reuse
 Multi-level Dataflows
 Output Stationary – Local Weight Stationary (OS-LWS) Dataflow

[Figure: OS-LWS dataflow — outputs are held stationary and weights are held in local registers (less-frequent access), while the remaining operands stream through (more-frequent access)]

 Example: MAGNet, ICCAD 2019
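A rough sketch of one possible OS-LWS ordering (the tile size F0 and the single-channel view are assumptions for illustration, not the MAGNet mapping itself): partial sums stay in local accumulation storage for the whole reduction, while each weight is held in a register and reused across a small tile of outputs.

def conv_os_lws(inp, wgt, out, E, F, R, S, F0=4):
    # assumes F is divisible by F0
    for e in range(E):
        for f1 in range(0, F, F0):
            psum = [0.0] * F0                    # outputs stay local (output stationary)
            for r in range(R):
                for s in range(S):
                    w_reg = wgt[r][s]            # weight reused across the tile (local weight stationary)
                    for f0 in range(F0):
                        psum[f0] += w_reg * inp[e + r][f1 + f0 + s]
            for f0 in range(F0):
                out[e][f1 + f0] = psum[f0]       # each output written once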


Rangharajan Venkatesan ISSCC 2021 45 of 93
Dataflows: Temporal Data Reuse
 Multi-level Dataflows
 Weight Stationary – Local Output Stationary (WS-LOS) Dataflow

[Figure: WS-LOS dataflow — weights are held stationary and outputs are held in local registers (less-frequent access), while the remaining operands stream through (more-frequent access)]

 Example: MAGNet, ICCAD 2019


Rangharajan Venkatesan ISSCC 2021 46 of 93
Comparison of Dataflows

Network: ResNet-50
Dataset: ImageNet
Technology: 16ff

• OS-LWS dataflow  Temporal reuse in both weights and outputs


• Larger vector size  Spatial input/output reuse, amortize control overheads
Venkatesan et al., ICCAD 2019
Rangharajan Venkatesan ISSCC 2021 47 of 93
Compute-Communication Overlap
 Coarse-grained: DMA with double buffering

[Figure: the datapath consumes data from Buffer 1 while the DMA fills Buffer 2 from lower-level memory; the two buffers then swap roles]
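A hedged sketch of coarse-grained double buffering, using a background thread as a stand-in for the DMA engine; load_tile and compute_tile are illustrative placeholders rather than a real accelerator API.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def load_tile(i):                   # stand-in for a DMA transfer from lower-level memory
    return np.full((64, 64), float(i))

def compute_tile(tile):             # stand-in for the datapath consuming one buffer
    return tile.sum()

def run_layer(num_tiles):
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(load_tile, 0)            # prologue: fill the first buffer
        for i in range(num_tiles):
            current = pending.result()                # tile for this iteration is ready
            if i + 1 < num_tiles:
                pending = dma.submit(load_tile, i+1)  # fill the other buffer in parallel
            results.append(compute_tile(current))     # compute overlaps the next load
    return results

print(run_layer(4))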

Rangharajan Venkatesan ISSCC 2021 48 of 93


Buffet Buffer Manager
 Fine-grained compute-communication
overlap
 Operations
 Fill
 Sequentially write data read from DRAM
and set valid state
 Read
 Perform read if the address is valid,
otherwise stall until address becomes valid
 Update
 Perform write operation to address and set
valid state
 Shrink
 Invalidate addresses that are not in use
and request data from lower-level memory

Pellauer et al., ASPLOS 2019
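A simplified software model of the four operations listed above (fill, read, update, shrink); this only sketches the semantics and is not the hardware design from the paper.

class Buffet:
    def __init__(self, size):
        self.data = [None] * size
        self.valid = [False] * size

    def fill(self, addr, value):      # sequential write of data arriving from DRAM
        self.data[addr] = value
        self.valid[addr] = True

    def read(self, addr):             # hardware would stall until the entry is valid
        assert self.valid[addr], "read stalls until the fill arrives"
        return self.data[addr]

    def update(self, addr, value):    # in-place write, e.g. for partial sums
        self.data[addr] = value
        self.valid[addr] = True

    def shrink(self, addrs):          # invalidate entries no longer in use
        for a in addrs:
            self.valid[a] = False     # freed space can be refilled from lower-level memory

buf = Buffet(4)
buf.fill(0, 3.0)
print(buf.read(0))
buf.shrink([0])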


Rangharajan Venkatesan ISSCC 2021 49 of 93
Buffet Buffer Manager

 Buffet achieves 2.3X reduction in energy-delay product (EDP) and 2.1X area
efficiency gains over DMA with double buffering

Pellauer et al., ASPLOS 2019


Rangharajan Venkatesan ISSCC 2021 50 of 93
Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory Hierarchy
 Communication: Interconnect Topologies

 Compiler Mapping Flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 51 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect

Memory

Rangharajan Venkatesan ISSCC 2021 52 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect
Pattern-1
• Unicast Weights
Memory

Rangharajan Venkatesan ISSCC 2021 53 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect
Pattern-1
• Unicast Weights
• Multicast Input activations
Memory

Rangharajan Venkatesan ISSCC 2021 54 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect
Pattern-1
• Unicast Weights
• Multicast Input activations
Memory • Unicast Output activations

Rangharajan Venkatesan ISSCC 2021 55 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect
Pattern-2
• Multicast Weights
• Unicast Input activations
Memory • Unicast Output activations

Rangharajan Venkatesan ISSCC 2021 56 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect Pattern-3
• Unicast Weights
• Unicast Input activations
Memory • Unicast Partial sums
• Unicast Output activations

Rangharajan Venkatesan ISSCC 2021 57 of 93


Communication Patterns

Weights
Input Activations

PE PE PE PE Output Activations
Partial Sums
Interconnect Pattern-4
• Unicast Weights
• Unicast Input activations
Memory • Unicast Partial sums
• Unicast Output activations

Rangharajan Venkatesan ISSCC 2021 58 of 93


Interconnect Design
 In practice, we observe combinations of different patterns depending on the
architecture

 Customized interconnects support specific patterns


 e.g. Systolic Arrays

 Programmable interconnects support flexible patterns


 e.g. Mesh network on-chip (NoC)

Rangharajan Venkatesan ISSCC 2021 59 of 93


Systolic Array
[Figure: systolic array of PEs — input activations from different input channels (C) stream across the array, while partial sums from different output channels (M) are passed between neighboring PEs and accumulated]


Sze et al., Synthesis Lectures on Computer Architecture, 2020
Rangharajan Venkatesan ISSCC 2021 60 of 93
Mesh NoC
 Different packet sizes for different data
types
 Unicast and Multicast support
 Flexible routing protocols

Rangharajan Venkatesan ISSCC 2021 61 of 93


Hierarchical Network
 Efficient communication for large scale designs
 Reduces number of hops
 Reduces congestion
 Example:
 Multi-Chip Module (MCM) based DL accelerator

Network On-Package

Network On-Chip
Venkatesan et al., HotChips 2019
Zimmer et al. JSSC 2020

Rangharajan Venkatesan ISSCC 2021 62 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory hierarchy
 Communication: Interconnect topologies

 Compiler Mapping flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 63 of 93


Compiler Mapping Flow
 Goal:
 Efficiently execute the neural network
in the target hardware
 Performance
 Energy efficiency

 Opportunities
 Data reuse
 Parallelism
 Pipelining

Rangharajan Venkatesan ISSCC 2021 64 of 93


Data Reuse Opportunities

Data type            Reuse
Input activations*   R*S*M
Weights              E*F*N
Output activations   R*S*C
*except halo

Sze et al., Synthesis Lectures on Computer Architecture, 2020
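The reuse factors in the table are simple products of the layer dimensions; a quick calculation for an illustrative layer (shapes chosen arbitrarily, halo effects ignored):

R, S, C, M, E, F, N = 3, 3, 64, 128, 56, 56, 16   # example layer shape

input_reuse  = R * S * M    # each input activation feeds R*S*M MACs
weight_reuse = E * F * N    # each weight is used at every output position and batch element
output_reuse = R * S * C    # each output accumulates R*S*C partial sums

print(input_reuse, weight_reuse, output_reuse)    # 1152 50176 576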


Rangharajan Venkatesan ISSCC 2021 65 of 93
Model Parallelism
 Exploit parallelism across weights in a layer

 Example
 An architecture implementing weight-stationary dataflow
 Tile weights and distribute them to different PEs
 Compute different output activations by streaming in the input activations

[Figure: weights tiled across PE0–PE3; input activations are streamed to all PEs, and each PE produces a different portion of the output activations]

Rangharajan Venkatesan ISSCC 2021 66 of 93


Data Parallelism
 Exploiting parallelism across activations in a layer

 Example
 An architecture implementing input-stationary dataflow
 Tile input activations and map to different PEs
 Each PE computes different output activations by streaming in the weights

[Figure: input activations tiled across PE0–PE3; weights are streamed to all PEs, and each PE produces a different portion of the output activations]

Rangharajan Venkatesan ISSCC 2021 67 of 93


Mapping Tool: TimeLoop
 Loop-nest representation of a convolution layer:

for n=[0:N)
 for m=[0:M)
  for e=[0:E)
   for f=[0:F)
    for c=[0:C)
     for r=[0:R)
      for s=[0:S)
       Out[e][f][m][n] += Weight[r][s][c][m] * Input[e+r][f+s][c][n]

 Loop-nest representation of a DNN accelerator:

for n2=[0:N2)             System
 for e2=[0:E2)
  for f2=[0:F2)
   for m2=[0:M2)
    for c2=[0:C2)
     for n1=[0:N1)         PE
      for e1=[0:E1)
       for f1=[0:F1)
        for m1=[0:M1)
         for c1=[0:C1)
          for r1=[0:R)
           for s1=[0:S)
            for c0=[0:C0)  MAC unit
             for m0=[0:M0)
              for e0=[0:E0)
               for f0=[0:F0)
                MACs

Parashar et al. ISPASS, 2019.

Rangharajan Venkatesan ISSCC 2021 68 of 93


Pipelining
 Parallelism across layers of the network
 Execute one or more layers across different processing elements

[Figure: layers 1–7 of a network partitioned into four pipeline stages (Pipe-1 … Pipe-4); each stage runs one or more layers on its own group of PEs, with activations flowing from Input to Output]

Rangharajan Venkatesan ISSCC 2021 69 of 93


Impact of Tiling on Hardware Efficiency

 Large number of possible tilings for a given layer and hardware configuration
 >10x difference in performance and energy
 Need to explore optimized tiling to achieve best energy and performance

Venkatesan et al., ICCAD 2019


Rangharajan Venkatesan ISSCC 2021 70 of 93
Outline
 Motivation for Hardware Acceleration

 Deep Neural Networks (DNN): Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory hierarchy
 Communication: Interconnect topologies

 Compiler Mapping flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 71 of 93


Quantization
 Exploit application-level resiliency to perform compute at reduced precision
 Improves energy efficiency, area efficiency and performance
 Lower storage
 Cheaper compute

[Figure: accuracy vs. hardware cost — a small accuracy loss can be traded for a large efficiency gain]

Accuracy vs. Efficiency tradeoff


Chippa et al., DAC 2013
Rangharajan Venkatesan ISSCC 2021 72 of 93
Quantization
 Post-training Quantization

Pre-trained Model → Quantization → Quantized Model

 Pretrained model is quantized to improve efficiency

 No need for training sets during model deployment

Wu et al. arXiv, 2020
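A minimal sketch of symmetric, per-tensor INT8 post-training quantization; max-abs calibration is just one common scale choice (see Wu et al. for the broader design space):

import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0                       # per-tensor calibration
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())                 # quantization error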

Rangharajan Venkatesan ISSCC 2021 73 of 93


Quantization
 Quantization-Aware Training

Pre-trained Model → Quantization → Quantized Model (with Re-Training in the loop)

 Quantization to improve efficiency


 Retraining to recover accuracy

Wu et al. arXiv, 2020

Rangharajan Venkatesan ISSCC 2021 74 of 93


Benefits of Quantization

Source: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/

Rangharajan Venkatesan ISSCC 2021 75 of 93


Multiple Precision Support
 NVIDIA GPUs: Volta V100 and Ampere A100

Ref: NVIDIA A100 Tensor Core GPU Architecture whitepaper

Rangharajan Venkatesan ISSCC 2021 76 of 93


Outline
 Motivation for Hardware Acceleration

 Deep Neural Network: Basics

 Accelerator Architecture Design


 Compute: Efficient MAC Datapath
 Storage: Memory hierarchy
 Communication: Interconnect topologies

 Compiler Mapping flow


 Data reuse, Parallelism, Pipelining

 Neural Network Optimizations


 Quantization
 Sparsity

Rangharajan Venkatesan ISSCC 2021 77 of 93


Sparsity
 Neural networks exhibit a high degree of sparsity
 Weights/connections are sparse
 Activations at intermediate stages are sparse

Han et al., NeurIPS 2015

Rangharajan Venkatesan ISSCC 2021 78 of 93


Types of Sparsity

Mao et al. NeurIPS 2017


Rangharajan Venkatesan ISSCC 2021 79 of 93
Structured Sparsity
 NVIDIA Ampere A100

Ref: NVIDIA A100 Tensor Core GPU Architecture whitepaper
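A100's fine-grained structured sparsity keeps 2 non-zero values in every group of 4 consecutive weights; a minimal pruning sketch, assuming magnitude-based selection (the selection criterion here is an illustrative choice):

import numpy as np

def prune_2_of_4(w):
    flat = w.reshape(-1, 4).copy()
    for g in flat:
        g[np.argsort(np.abs(g))[:2]] = 0.0   # zero the two smallest-magnitude weights
    return flat.reshape(w.shape)

w = np.random.randn(4, 8)
ws = prune_2_of_4(w)
print((ws.reshape(-1, 4) != 0).sum(axis=1))  # at most 2 non-zeros per group of 4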

Rangharajan Venkatesan ISSCC 2021 80 of 93


NVIDIA A100 Performance

Ref: NVIDIA A100 Tensor Core GPU Architecture whitepaper

Rangharajan Venkatesan ISSCC 2021 81 of 93


Unstructured Sparsity
 SCNN
 Only compute partial products where both operands are non-zero
 Get rid of the idea of sliding convolution: doesn’t make sense when most of the
operands are 0
 Vector ops are questionable: most elements of your vector are 0, don’t know a
priori which ones or how many


Parashar et al. ISCA 2017
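A simplified 1-D illustration of the idea above — compress out zeros, multiply only non-zero weight/activation pairs, and scatter each partial product to its output coordinate (the real SCNN dataflow is tiled and two-dimensional):

import numpy as np

def sparse_conv1d(inp, wgt):
    E = len(inp) - len(wgt) + 1
    out = np.zeros(E)
    nz_in  = [(i, v) for i, v in enumerate(inp) if v != 0]
    nz_wgt = [(r, v) for r, v in enumerate(wgt) if v != 0]
    for i, x in nz_in:                  # Cartesian product of non-zero operands
        for r, w in nz_wgt:
            e = i - r                   # output coordinate for this pair
            if 0 <= e < E:
                out[e] += w * x         # scatter-accumulate the partial product
    return out

inp = np.array([0, 2.0, 0, 0, 3.0, 0, 0, 1.0])
wgt = np.array([1.0, 0, -1.0])
print(sparse_conv1d(inp, wgt))
print(np.convolve(inp, wgt[::-1], mode='valid'))   # dense reference, matches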


Rangharajan Venkatesan ISSCC 2021 82 of 93
Unstructured Sparsity
 SCNN

Parashar et al. ISCA 2017


Rangharajan Venkatesan ISSCC 2021 83 of 93
Unstructured Sparsity
 SCNN

 SCNN achieves a 2.3X improvement in energy efficiency over a Dense CNN (DCNN) accelerator
Parashar et al. ISCA 2017
Rangharajan Venkatesan ISSCC 2021 84 of 93
An End-to-End Optimization Flow

Venkatesan et al.,
ICCAD 2019
Rangharajan Venkatesan ISSCC 2021 85 of 93
Summary
 Deep neural networks are increasingly used across a wide range of applications
 Large amounts of data
 High computation demand

 Hardware acceleration is key for continued growth

 Co-design across algorithm-compiler-hardware can greatly improve efficiency

Rangharajan Venkatesan ISSCC 2021 86 of 93


Papers to watch @ ISSCC 2021
 Session 9: ML Processors From Cloud to Edge
 9.1 A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-
Aware Throttling
 9.2 A 28nm 12.1TOPS/W Dual-Mode CNN Processor Using Effective-Weight-Based Convolution and Error-
Compensation-Based Prediction
 9.3 A 40nm 4.81TFLOPS/W 8b Floating-Point Training Processor for Non-Sparse Neural Networks Using Shared
Exponent Bias and 24-Way Fused Multiply-Add Tree
 9.4 PIU: A 248GOPS/W Stream-Based Processor for Irregular Probabilistic Inference Networks Using Precision-
Scalable Posit Arithmetic in 28nm
 9.5 A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile
 9.6 A 1/2.3inch 12.3Mpixel with On-Chip 4.97TOPS/W CNN Processor Back-Illuminated Stacked CMOS Image
 9.7 A 184μW Real-Time Hand-Gesture Recognition System with Hybrid Tiny Classifiers for Smart Wearable
Devices
 9.8 A 25mm2 SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech
Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm
 9.9 A Background-Noise and Process-Variation-Tolerant 109nW Acoustic Feature Extractor Based on Spike-
Domain Divisive-Energy Normalization for an Always-On Keyword Spotting Device

Rangharajan Venkatesan ISSCC 2021 87 of 93


Papers to watch @ ISSCC 2021
 Session 15: Compute-in-Memory Processors for Deep Neural Networks
 15.1 A Programmable Neural-Network Inference Accelerator Based on Scalable In-Memory Computing
 15.2 A 2.75-to-75.9TOPS/W Computing-in-Memory NN Processor Supporting Set-Associate Block-Wise Zero
Skipping and Ping-Pong CIM with Simultaneous Computation and Weight Updating
 15.3 A 65nm 3T Dynamic Analog RAM-Based Computing-in-Memory Macro and CNN Accelerator with Retention
Enhancement, Adaptive Analog Sparsity and 44TOPS/W System Energy Efficiency
 15.4 A 5.99-to-691.1TOPS/W Tensor-Train In-Memory-Computing Processor Using Bit-Level-Sparsity- Based
Optimization and Variable-Precision Quantization
 Session 16: Compute-in-Memory
 16.1 A 22nm 4Mb 8b-Precision ReRAM Computing-in-Memory Macro with 11.91 to 195.7TOPS/W for Tiny AI
Edge Devices
 16.2 eDRAM-CIM: Compute-In-Memory Design with Reconfigurable Embedded-Dynamic-Memory Array
Realizing Adaptive Data Converters and Charge-Domain Computing
 16.3 A 28nm 384kb 6T-SRAM Computation-in-Memory Macro with 8b of Precision for AI Edge Chips
 16.4 An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In Memory Macro in
22nm for Machine-Learning Edge Applications

Rangharajan Venkatesan ISSCC 2021 88 of 93


References
 Overview and Benchmarking
 V. Sze et al., “Efficient Processing of Deep Neural Networks: A Tutorial and
Survey,” Proceedings of the IEEE 2017
 V. Sze et al., “Efficient Processing of Deep Neural Networks,” Synthesis Lectures
on Computer Architecture 2020
 K. Guo et al., “Neural network accelerator comparison,”
[online] https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/.
 V. J. Reddi et al., “MLPerf Inference Benchmark,” arXiv 2019
 V. Camus et al., “Survey of precision-scalable multiply-accumulate units for
neural-network processing,” AICAS 2019
 H. Wu et al. “Integer quantization for deep learning inference: Principles and
empirical evaluation” arXiv 2020

Rangharajan Venkatesan ISSCC 2021 89 of 93


References
 Deep Learning Hardware
 F. Sijstermans, “The NVIDIA deep learning accelerator,” in Hot Chips 2018
 NVIDIA A100 Tensor Core GPU Architecture whitepaper
 B. Zimmer et al., “A 0.11pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-
based Deep Neural Network Accelerator with Ground-Reference Signaling in
16nm,” VLSI 2019
 N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing
unit,” in ISCA 2017
 A. Parashar et al., “SCNN: An accelerator for compressed-sparse convolutional
neural networks,” in ISCA 2017
 Y. Chen et al., “Eyeriss: An energy-efficient reconfigurable accelerator for deep
convolutional neural networks,” ISSCC 2016
 S. Han et al., “EIE: Efficient inference engine on compressed deep neural
network,” in ISCA 2016

Rangharajan Venkatesan ISSCC 2021 90 of 93


References
 Deep Learning Hardware
 E. H. Lee et al., “LogNet: Energy-efficient neural networks using logarithmic
computation,” ICASSP 2017
 H. Wu, “Low precision inference on GPUs” GTC 2019
 J. R. Stevens et al. “Manna: An Accelerator for Memory-Augmented Neural
Networks” MICRO 2019.
 S. Han et al. “Deep Compression: Compressing Deep Neural Networks with
Pruning, Trained Quantization and Huffman Coding” NeurIPS 2015
 S. Venkataramani et al. “AxNN: energy-efficient neuromorphic systems using
approximate computing” ISLPED 2014
 Y. LeCun et al., “Optimal Brain Damage,” NeurIPS 1990

Rangharajan Venkatesan ISSCC 2021 91 of 93


References
 Deep Neural Networks
 A. Krizhevsky et al. “Imagenet classification with deep convolutional neural
networks,” NeurIPS 2012
 K. He et al. “Deep residual learning for image recognition,” CVPR 2016
 M. Tan et al., “EfficientNet: Rethinking Model Scaling for Convolutional Neural
Networks,” ICML 2019
 A. Vaswani et al., “Attention is all you need,” NeurIPS 2017
 M. Shoeybi et al., “Megatron-LM: Training MultiBillion Parameter Language Models
Using Model Parallelism,” arXiv 2019
 J. Choi et al. “PACT: Parameterized Clipping Activation for Quantized Neural
Networks”, arxiv 2018
 R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient
inference: A whitepaper”, arxiv 2018
 A. Mishra et al. “Apprentice: Using knowledge distillation techniques to improve
low-precision network accuracy” ICLR 2018

Rangharajan Venkatesan ISSCC 2021 92 of 93


References
 Modeling, Mapping and Exploration Tools
 A. Parashar et al., “Timeloop: A Systematic Approach to DNN Accelerator
Evaluation,” ISPASS 2019
 R. Venkatesan et al., “MAGNet: A modular accelerator generator for neural
networks,” ICCAD 2019
 Y. N. Wu et al., “Accelergy: An Architecture-Level Energy Estimation Methodology
for Accelerator Designs,” ICCAD 2019
 X. Yang et al., “Interstellar: using Halide’s scheduling language to analyze DNN
accelerators,” ASPLOS 2020
 S. Jain et al., “RxNN: a framework for evaluating deep neural networks on resistive
crossbars,” TCAD 2020
 T. Chen et al., “TVM: an automated end-to-end optimizing compiler for deep
learning,” OSDI 2018
 L. Mei et al., “ZigZag: A memory-centric rapid DNN accelerator design space
exploration framework,” arXiv 2020

Rangharajan Venkatesan ISSCC 2021 93 of 93
