
MCA - HW - Lecture 7and8 - Prelim

The document covers lectures on data-level parallelism in modern computer architecture, focusing on vector processors, SIMD extensions, and GPUs. It discusses Flynn's taxonomy, the efficiency of SIMD architectures, and the structure and execution of vector processors. Key topics include vector operations, memory access patterns, and optimization techniques for vector processing.


CESE 4085 Modern Computer Architecture

Lectures 7 & 8: Data-Level Parallelism

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 456
Computer Architecture
A Quantitative Approach, Sixth Edition

Chapter 4

Data-Level Parallelism in Vector, SIMD,


and GPU Architectures

CESE4085 - Modern Computer Architectures 457


Lecture Overview
Lecture 7:
• Vector Processors
• SIMD Extensions
• Multi-threading (Chapter 3)

(Combining topics of the 6th and earlier editions of the CA book)

Lecture 8: (ambitious)
• GPUs
• Loop-level parallelism (skipped, but is exam material)
• Thread-Level Parallelism – Multi-processors (Chapter 5)
• Memory Coherence (Chapter 5)
• Snooping vs. Directory Protocols

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 459
Flynn's Taxonomy
•SISD (Single Instruction, Single Data)
•Uniprocessors
•SIMD (Single Instruction, Multiple Data)
•Exploits data-level parallelism
•Vector architectures also belong to this class
•Multimedia extensions (MMX, SSE, VIS, AltiVec, …)
•Examples: Illiac-IV, CM-2, MasPar MP-1/2
•MISD (Multiple Instruction, Single Data)
•???
•MIMD (Multiple Instruction, Multiple Data)
•Examples: Sun Enterprise 5000, Cray T3D/T3E, SGI
Origin
•exploits thread-level parallelism; flexible
•Most widely used

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 460
QCE Department, EEMCS Faculty
CESE4085 - Modern Computer Architectures 461
Introduction
Introduction
SIMD architectures can exploit significant data-level
parallelism for:
Matrix-oriented scientific computing
Media-oriented image and sound processors

SIMD is more energy efficient than MIMD


Only needs to fetch one instruction to launch many data operations
Makes SIMD attractive for personal mobile devices

SIMD allows programmer to continue to think sequentially

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 463
Introduction
SIMD Parallelism
Vector architectures
SIMD extensions
Graphics Processor Units (GPUs)

For x86 processors:


Expect two additional cores per chip per year
SIMD width to double every four years
Potential speedup from SIMD to be twice that from MIMD!

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 464
Intermezzo – processing vectors
Question – what takes a lot of time when processing
vectors/arrays in a scalar processor?

For example:
for(i=0; i<N; i++)
{Y(i)=a*X(i)+Y(i)}

Question: what do we need to do every iteration?

Answer:
-

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 467
Intermezzo – processing vectors
Question – what takes a lot of time when processing
vectors/arrays in a scalar processor?

For example:

for(i=0; i<N; i++)


{Y(i)=a*X(i)+Y(i)}

Question: what do we need to do every iteration?

Answer: check i (compare), increment i, branch


(Also: load/store per loop iteration resulting in potential
(cache) misses, thus long memory access latencies)

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 468
Vector processors

SCALAR (1 operation): add r3, r1, r2
one ALU combines scalar registers r1 and r2 into r3

VECTOR (N operations): add.vv v3, v1, v2
vector registers v1 and v2 are combined element by element, over the
vector length, into v3

•if N is the vector length, not necessarily N ALUs
•number of ALUs = lanes

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 469
Vector processors
•Single vector instruction specifies lots of work
•equivalent to executing an entire loop
•no control hazards for the loop branches
•fewer instructions to fetch and decode (also: smaller program size)
•Computation of each result in the vector is independent of the
computation of other results in the same vector
•deep pipeline without data hazards; high clock rate
•HW has to check for data hazards only between vector instructions
(once per vector, not per vector element)
•Access memory with known pattern
•elements are all adjacent in memory => highly interleaved memory
banks provides high bandwidth
•access is initiated for entire vector => high memory latency is
amortised (no data caches are needed)

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 470
Vector processors
•Vector operations: arithmetic (add, sub, mul, div), memory
accesses, effective address calculations
•Multiple vector instructions can be in progress at the same time
=> more parallelism
•The application has to be as such …
•Regular loops
•Large scientific and engineering applications (car crash
simulations, weather forecasting, …)
•Multimedia applications
•.. to benefit from using a vector processor

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 471
Basic Vector Architectures
•Vector processor: ordinary pipelined scalar unit + vector
unit
•Types of vector processors
•Memory-memory processors: all vector operations are memory-to-
memory
•Vector-register processors: all vector operations
except load and store are among the vector registers

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 472
Vector-register processor

•Vector Registers: each vector register is a fixed length


bank holding a single vector
•has at least 2 read and 1 write ports (overlap operations)
•typically 8-32 vector registers, each holding 64-128 elements (words)
•VMIPS: 8 vector registers, each holding 64 elements (16 Rd ports, 8 Wr
ports)

•Vector Functional Units (FUs): multiple resources, fully


pipelined, start new operation every clock
•typically 4 to 8 FUs: FP add, FP mult, FP reciprocal (1/X), integer add,
logical, shift;
•may have multiple of same unit
•VMIPS: 5 FUs (FP add/sub, FP mul, FP div, FP integer, FP logical)

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 473
Vector-register processor

•Vector Load-Store Units (LSUs)


•fully pipelined unit to load or store a vector; may have
multiple LSUs
•VMIPS: 1 VLSU, bandwidth is 1 word per cycle after initial delay
•Scalar registers
•single element for FP scalar or address
•VMIPS: 32 GPRs, 32 FPRs; they are read out and latched at
one input of the FUs
•Cross-bar to connect FUs, LSUs, registers
•cross-bar to connect Rd/Wr ports and FUs

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 474
VMIPS: Basic Structure

•8 x 64-element vector
registers
•5 FUs – each fully
pipelined
•Load/store unit – fully
pipelined
•Scalar registers

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 475
VMIPS Vector Instructions

. . .
QCE Department, EEMCS Faculty
CESE4085 - Modern Computer Architectures 476
Vector Execution Time
•Execution time is a function of vector length, data dependencies,
and structural hazards
•Initiation rate: rate at which a FU consumes vector elements
•number of lanes = the number of parallel pipelines used to execute
operations within each vector instruction; up to 8 (e.g. Cray X1)
•the time for a single vector instruction ≈ vector length / initiation rate
•Convoy
•set of vector instructions that can begin execution in same clock (no
struct. or data hazards); assumption: convoys do not overlap in time (no
forwarding).
•Chime: approx. time to execute a convoy
•e.g. m convoys take m chimes; if each vector length is n, then they take
approx. m x n clock cycles (ignores overhead; good approximation for
long vectors)

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 477
Vector Execution Time
•Example:
•4 convoys, 1 lane,
•vector length=64 elements; 1 element/cycle (1 lane
pipelined execution unit)
•a convoy has to finish before another starts
•4 x 64 = 256 clocks

1: LV V1,Rx ;load vector X


2: MULVS.D V2, V1,F0 ;vector-scalar mult.
LV V3,Ry ;load vector Y
3: ADDV.D V4,V2,V3 ;add
4: SV Ry,V4 ;store the result

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 478
Start-up time & execution time
•Instructions in a convoy may start at 1 cycle distance (pipelined
instruction issue) – small overhead
•important source of overhead: start-up time = pipeline latency
time (depth of FU pipeline, e.g. 12 for MULV, 6 for ADDV);

Time (start-up latency followed by n element cycles per convoy):
1: LV V1,Rx          12 + n
2: MULV V2,F0,V1
   LV V3,Ry          12 + n
3: ADDV V4,V2,V3      6 + n
4: SV Ry,V4          12 + n

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 479
Vector Load/Store Units & Memories

•Start-up overheads usually longer for LSUs


•Memory system must sustain (number of lanes x word) /clock
cycle
•Many Vector Procs. use banks (vs. simple interleaving):
•support multiple loads/stores per cycle
=> multiple banks & address banks independently
•support non-sequential accesses
•Note: Number of memory banks > memory latency to avoid
stalls
•m banks => m words per memory latency l clocks
•if m < l, then there are gaps in the memory pipeline
•may have 1024 banks in SRAM

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 480
Variable Vector Length
•What to do when vector length is not exactly VL (64)?

for(i=0; i<N; i++)


{Y(i)=a*X(i)+Y(i)}

•N can be unknown at compile time


•Vector-Length Register (VLR): controls the length of any
vector operation, including a vector load or store (cannot
be > the length of vector registers)
•What if N > Max. Vector Length (MVL)? ⇒ Strip mining (see the sketch below)
•1st loop does short piece (N mod MVL), rest VL = MVL
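A minimal C sketch of strip mining, assuming (for illustration only) a maximum vector length MVL of 64; each pass of the inner loop corresponds to one vector operation whose length would be placed in the VLR:

#include <stddef.h>

#define MVL 64                        /* assumed maximum vector length */

void daxpy_stripmined(size_t n, double a, const double *X, double *Y)
{
    size_t low = 0;
    size_t vl  = n % MVL;             /* 1st piece: N mod MVL elements */
    for (size_t j = 0; j <= n / MVL; j++) {
        /* one vector operation of length vl (VLR = vl) */
        for (size_t i = low; i < low + vl; i++)
            Y[i] = a * X[i] + Y[i];
        low += vl;
        vl = MVL;                     /* all remaining pieces use the full MVL */
    }
}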

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 481
Vector Stride
•Adjacent elements not sequential in memory (e.g. matrix
multiplication)
for(i=0; i<100; i++)
  for(j=0; j<100; j++) {
    x(i,j)=0.0;
    for(k=0; k<100; k++)
      x(i,j)=x(i,j)+B(i,k)*C(k,j);
  }

Row-major memory layout: element 0 is x[0][0], 1 is x[0][1], 2 is x[0][2], …, 100 is x[1][0], 101 is x[1][1], 102 is x[1][2], …
•Matrix C accesses are not adjacent (800 bytes between)
•Stride: distance separating elements that are to be merged into a
single vector ⇒ LVWS (load vector with stride) instruction (see the C sketch after this list)
•Strides can cause bank conflicts (e.g., stride=32 and 16 banks)
•e.g. x[i, stride], x[i, 2 x stride], x[i, 3 x stride], … map in the same bank
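As an illustration only (not the hardware mechanism), a C sketch of what a strided load like LVWS conceptually gathers, assuming a row-major 100x100 matrix of doubles so that walking down a column uses a stride of 100 elements (800 bytes):

#define N 100

/* Conceptual effect of a strided vector load: collect column j of the
 * row-major matrix C into a contiguous vector. */
void load_with_stride(double vec[N], const double C[N][N], int j)
{
    for (int k = 0; k < N; k++)
        vec[k] = C[k][j];   /* consecutive elements are 100 doubles (800 bytes) apart */
}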

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 482
Vector Opt.1: Chaining (fwd. for vector regs.)

•MULV.D V1,V2,V3
ADDV.D V4,V1,V5 ; separate convoy
•Chaining: if vector register (V1) is not a single entity but a
group of individual registers ⇒ pipeline forwarding of individual
elements of a vector
•Flexible chaining: allow vector to chain to any other active
vector operation ⇒ simultaneous access to same register
•more read/write ports
•organize the registers in individual banks (⇒ simultaneous
access to different banks)
•As long as enough HW, increases convoy size

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 483
Vector Opt.2: Conditional Execution

do i = 1, 64
if (A(i) <> 0) then
A(i) = A(i) – B(i) Vectorizable?
endif

•Loop is vectorizable only if there are no “if”s (no control dependency)


•Conditional execution: turns a control dependency into a data
dependency
•Vector-mask control:
•vector instructions operate only on vector elements whose
corresponding entries in the vector-mask register are 1
•vector-mask register loaded from vector test
•Some VP use vector mask only to disable the storing of the
result (the operation still executes)

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 484
Vector Opt.3: Sparse Matrices
•Sparse matrix: elements of a vector are usually stored in some
compacted form and then accessed indirectly, e.g.:
do 100 i = 1, n
  A(K(i)) = A(K(i)) + C(M(i))
Note: vectors A and C are potentially stored using different compaction techniques, hence the separate index vectors K and M to unpack them.
•Mechanism to support sparse matrices: scatter-gather operations
•to support moving between a dense representation (0’s not
included) and normal matrix representation (0’s included)
•Gather (LVI) fetches the vector whose elements are given by:
•address = a base address + offsets given in an index vector
•sort of register indirect addressing; get elements in dense form
•Elements are operated on in dense form,
•(if needed) the sparse vector can be stored in expanded form:
scatter store (SVI), using the same index vector
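A plain C sketch of the gather / operate / scatter pattern described above, assuming K and M are index vectors of length n (n is assumed to fit in the dense temporaries); it mirrors the indexed load/store code shown later for RV64V:

#include <stddef.h>

#define MAXN 1024                    /* assumed upper bound for the dense temporaries */

void sparse_update(double *A, const double *C,
                   const int *K, const int *M, size_t n)
{
    double a_dense[MAXN], c_dense[MAXN];

    for (size_t i = 0; i < n; i++) { /* gather: indexed loads into dense vectors */
        a_dense[i] = A[K[i]];
        c_dense[i] = C[M[i]];
    }
    for (size_t i = 0; i < n; i++)   /* operate on the dense form */
        a_dense[i] += c_dense[i];
    for (size_t i = 0; i < n; i++)   /* scatter: indexed stores back via the same index vector */
        A[K[i]] = a_dense[i];
}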

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 485
Summary
•fine-grained vs. coarse-grained multi-threading
•SMT – fine-grained based on OoO superscalar
•single-thread performance should not be penalized
•shares processor resources, except: PC, register renaming, ROB
•reduces the impact of branch miss-prediction
•Vector processing
•No control&data dependencies between elements of a vector
•Vector instructions access memory with known pattern
•Reduces branches and branch problems in pipelines
•A single vector instruction does lots of work (≈ an entire loop)
•Components of a vector processor: vector registers, functional
units, load/store, crossbar....
•VP optimisation: chaining, conditional execution, sparse matrices

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 486
Vector Architectures
Vector Architectures
Basic idea:
Read sets of data elements into “vector registers”
Operate on those registers
Disperse the results back into memory

Registers are controlled by compiler


Used to hide memory latency
Leverage memory bandwidth

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 488
Vector Architectures
VMIPS
Example architecture: RV64V
Loosely based on Cray-1
32 64-bit vector registers
Register file has 16 read ports and 8 write ports
Vector functional units
Fully pipelined
Data and control hazards are detected
Vector load-store unit
Fully pipelined
One word per clock cycle after initial latency
Scalar registers
31 general-purpose registers
32 floating-point registers

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 489
Vector Architectures
VMIPS Instructions
.vv: two vector operands
.vs and .sv: vector and scalar operands
LV/SV: vector load and vector store from address
Example: DAXPY
vsetdcfg 4*FP64 # Enable 4 DP FP vregs
fld f0,a # Load scalar a
vld v0,x5 # Load vector X
vmul v1,v0,f0 # Vector-scalar mult
vld v2,x6 # Load vector Y
vadd v3,v1,v2 # Vector-vector add
vst v3,x6 # Store the sum
vdisable # Disable vector regs
8 instructions, 258 for RV64G (scalar code)
QCE Department, EEMCS Faculty
CESE4085 - Modern Computer Architectures 490
Vector Architectures
Vector Execution Time
Execution time depends on three factors:
Length of operand vectors
Structural hazards
Data dependencies

RV64V functional units consume one element per clock cycle


Execution time is approximately the vector length

Convoy
Set of vector instructions that could potentially execute
together

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 491
Vector Architectures
Chimes
Sequences with read-after-write dependency hazards placed
in same convoy via chaining

Chaining
Allows a vector operation to start as soon as the individual
elements of its vector source operand become available

Chime
Unit of time to execute one convoy
m convoys execute in m chimes
For a vector length of n, this requires approximately m x n clock cycles

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 492
Vector Architectures
Example
vld v0,x5 # Load vector X
vmul v1,v0,f0 # Vector-scalar multiply
vld v2,x6 # Load vector Y
vadd v3,v1,v2 # Vector-vector add
vst v3,x6 # Store the sum

Convoys:
1 vld vmul
2 vld vadd
3 vst

3 chimes, 2 FP ops per result, cycles per FLOP = 1.5


For 32-element vectors, requires 32 x 3 = 96 clock cycles

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 493
Vector Architectures
Challenges
Start up time
Latency of vector functional unit
Assume the same as Cray-1
Floating-point add => 6 clock cycles
Floating-point multiply => 7 clock cycles
Floating-point divide => 20 clock cycles
Vector load => 12 clock cycles

Improvements:
> 1 element per clock cycle
Non-64 wide vectors
IF statements in vector code
Memory system optimizations to support vector processors
Multiple dimensional matrices
Sparse matrices
Programming a vector computer

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 494
Vector Architectures
Multiple Lanes

Element n of vector register A is “hardwired” to element n


of vector register B
Allows for multiple hardware lanes

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 495
Vector Architectures
Vector Length Register
for (i=0; i <n; i=i+1) Y[i] = a * X[i] + Y[i];
vsetdcfg 2 DP FP # Enable 2 64b Fl.Pt. registers
fld f0,a # Load scalar a
loop: setvl t0,a0 # vl = t0 = min(mvl,n)
vld v0,x5 # Load vector X
slli t1,t0,3 # t1 = vl * 8 (in bytes)
add x5,x5,t1 # Increment pointer to X by vl*8
vmul v0,v0,f0 # Vector-scalar mult
vld v1,x6 # Load vector Y
vadd v1,v0,v1 # Vector-vector add
sub a0,a0,t0 # n -= vl (t0)
vst v1,x6 # Store the sum into Y
add x6,x6,t1 # Increment pointer to Y by vl*8
bnez a0,loop # Repeat if n != 0
vdisable # Disable vector regs

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 496
Vector Architectures
Vector Mask Registers
Consider:
for (i = 0; i < 64; i=i+1)
if (X[i] != 0)
X[i] = X[i] – Y[i];
Use predicate register to “disable” elements:
vsetdcfg 2*FP64 # Enable 2 64b FP vector regs
vsetpcfgi 1 # Enable 1 predicate register
vld v0,x5 # Load vector X into v0
vld v1,x6 # Load vector Y into v1
fmv.d.x f0,x0 # Put (FP) zero into f0
vpne p0,v0,f0 # Set p0(i) to 1 if v0(i)!=f0
vsub v0,v0,v1 # Subtract under vector mask
vst v0,x5 # Store the result in X
vdisable # Disable vector registers
vpdisable # Disable predicate registers

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 497
Vector Architectures
Memory Banks
Memory system must be designed to support high
bandwidth for vector loads and stores
Spread accesses across multiple banks
Control bank addresses independently
Load or store non sequential words (need independent bank
addressing)
Support multiple vector processors sharing the same memory

Example:
32 processors, each generating 4 loads and 2 stores/cycle
Processor cycle time is 2.167 ns, SRAM cycle time is 15 ns
How many memory banks needed?
32x(4+2)x15/2.167 = ~1330 banks

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 498
Vector Architectures
Stride
Consider:
for (i = 0; i < 100; i=i+1)
for (j = 0; j < 100; j=j+1) {
A[i][j] = 0.0;
for (k = 0; k < 100; k=k+1)
A[i][j] = A[i][j] + B[i][k] * D[k][j];
}

Must vectorize multiplication of rows of B with columns of D


Use non-unit stride
Bank conflict (stall) occurs when the same bank is hit faster than bank
busy time:
#banks / LCM(stride,#banks) < bank busy time
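For instance (the case from the earlier stride slide): with 16 memory banks and a stride of 32 words, every element of the strided access maps to the same bank (32 mod 16 = 0), so that bank is re-referenced every cycle; unless the bank busy time is a single cycle, each access stalls waiting for the previous one.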

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 499
Vector Architectures
Scatter-Gather
Consider:
for (i = 0; i < n; i=i+1)
A[K[i]] = A[K[i]] + C[M[i]];
Use index vector:
vsetdcfg 4*FP64 # 4 64b FP vector registers
vld v0, x7 # Load K[]
vldx v1, x5, v0 # Load A[K[]]
vld v2, x28 # Load M[]
vldi v3, x6, v2 # Load C[M[]]
vadd v1, v1, v3 # Add them
vstx v1, x5, v0 # Store A[K[]]
vdisable # Disable vector registers

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 500
Vector Architectures
Programming Vec. Architectures
Compilers can provide feedback to programmers
Programmers can provide hints to compiler

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 501
SIMD Instruction Set Extensions for Multimedia
SIMD Extensions
Media applications operate on data types narrower than the
native word size
Example: disconnect carry chains to “partition” adder

Limitations, compared to vector instructions:


Number of data operands encoded into op code
No sophisticated addressing modes (strided, scatter-gather)
No mask registers

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 503
SIMD Instruction Set Extensions for Multimedia
SIMD Implementations
Implementations:
Intel MMX (1996)
Eight 8-bit integer ops or four 16-bit integer ops
Streaming SIMD Extensions (SSE) (1999)
Eight 16-bit integer ops
Four 32-bit integer/fp ops or two 64-bit integer/fp ops
Advanced Vector Extensions (2010)
Four 64-bit integer/fp ops
AVX-512 (2017)
Eight 64-bit integer/fp ops
Operands must be consecutive and aligned memory locations

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 504
SIMD Instruction Set Extensions for Multimedia
Example SIMD Code
Example DAXPY:
fld f0,a # Load scalar a
splat.4D f0,f0 # Make 4 copies of a
addi x28,x5,#256 # Last address to load
Loop: fld.4D f1,0(x5) # Load X[i] ... X[i+3]
fmul.4D f1,f1,f0 # a x X[i] ... a x X[i+3]
fld.4D f2,0(x6) # Load Y[i] ... Y[i+3]
fadd.4D f2,f2,f1 # a x X[i]+Y[i]...
# a x X[i+3]+Y[i+3]
fsd.4D f2,0(x6) # Store Y[i]... Y[i+3]
addi x5,x5,#32 # Increment index to X
addi x6,x6,#32 # Increment index to Y
bne x28,x5,Loop # Check if done

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 505
SIMD Instruction Set Extensions for Multimedia
Roofline Performance Model
Basic idea:
Plot peak floating-point throughput as a function of arithmetic
intensity
Ties together floating-point performance and memory
performance for a target machine
Arithmetic intensity
Floating-point operations per byte read

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 507
SIMD Instruction Set Extensions for Multimedia
Examples
Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic
Intensity, Peak Floating-Point Perf.)
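A minimal C sketch of this bound, using made-up peak numbers purely for illustration:

#include <stdio.h>

/* Roofline: attainable GFLOP/s = min(peak memory BW x arithmetic intensity, peak FP perf.) */
static double attainable_gflops(double peak_bw_gbs, double peak_fp_gflops, double ai_flops_per_byte)
{
    double bw_bound = peak_bw_gbs * ai_flops_per_byte;
    return bw_bound < peak_fp_gflops ? bw_bound : peak_fp_gflops;
}

int main(void)
{
    /* Hypothetical machine: 100 GB/s memory bandwidth, 50 GFLOP/s peak FP performance. */
    for (double ai = 0.125; ai <= 8.0; ai *= 2)
        printf("AI = %5.3f FLOP/byte -> %6.2f GFLOP/s\n",
               ai, attainable_gflops(100.0, 50.0, ai));
    return 0;
}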

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 508
Computer Architecture
A Quantitative Approach, Sixth Edition

Chapter 5

Thread-Level Parallelism

CESE4085 - Modern Computer Architectures 509


Introduction
Introduction
Thread-Level parallelism
Have multiple program counters
Uses MIMD model
Targeted for tightly-coupled shared-memory multiprocessors

For n processors, need n threads

Amount of computation assigned to each thread = grain size


Threads can be used for data-level parallelism, but the
overheads may outweigh the benefit

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 510
Start – multi-threading slides
There are multithreading slides – check for usefulness!

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 511
Start – Multithreading slides

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 512
Trends in microarchitecture

•Higher clock speeds


•To achieve high clock frequency make pipeline deeper (super-
pipelining)
•Events that disrupt pipeline (branch mispredictions, cache misses, TLB
misses, etc) become very expensive in terms of lost clock cycles
•ILP
•Extract parallelism in a single program
•Superscalar processors - multiple execution units working in parallel
•Challenge to find enough instructions that can be executed
concurrently (limit: instruction dependencies)
•Out-of-order execution => instructions are sent to execution units
based on instruction dependencies rather than program order

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 515
ILP limitations: most applications stall 80% or
more of the time during “execution”
(Figure: issue-slot utilization of an 8-way superscalar; on average only about 18% of the CPU is usefully busy)

[Tullsen ISCA95]

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 516
Resource waste
(Figure: issue slots plotted as functional units vs. cycles)

Vertical waste:
• due to stalls in the execution flow
Horizontal waste:
• due to low ILP (not able to “use” all FUs)

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 517
Multi-threading idea
(Figure: issue slots for single-thread execution are limited by the branches along one path; with multi-threaded execution the issue slots are filled from threads 1, 2 and 3, so fewer slots depend on any single branch)

• Instead of enlarging the depth of the instruction window considered
for issue (more speculation, lowering confidence),
• enlarge its “width” by fetching from multiple threads
• Reduces the impact of branch misprediction!

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 518
OoO superscalar – recap
(Figure: OoO superscalar pipeline – in-order front end (IF, ID, dispatch buffer), out-of-order execution stage with register file and units ALU, MEM1/MEM2, FP1–FP3, BR, and in-order back end (reorder buffer, WB))

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 519
Exploiting thread-level parallelism

•Multi-processors
•Each with a full set of architectural resources

•Multiple threads – concurrently on the same processor Why?


•Inexpensive – one CPU, no external interconnects
•No remote or coherence misses (more capacity misses)
•Threads can share resources Þ we can increase threads without
a corresponding linear increase in area

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 520
Multi-threading
When to switch (to which) thread?
•Cycle-by-cycle interleaving: fine-grained multi-threading
•Processor switches between software threads after a predefined
time slice (a cycle)
•Round-robin among threads, skip when thread is stalled.
•Blocking interleaving: coarse-grained multi-threading
•processor switches to another thread when:
•a long latency operation stalls the current thread (L2 miss),
•max number of cycles/thread exceeded
•scheduling difficulties encountered
•Still, some execution slots are wasted
•Fundamental: single-thread performance is not penalized

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 521
Multi-threading
(Figure: issue-slot occupancy (functional units × cycles, peak throughput = 4 IPC) shown per thread (Threads 1–4) plus idle slots, compared for Superscalar, Coarse-grained MT, Fine-grained MT, and Simultaneous MT)

• Superscalar processor – high under-utilization


• Fine-grained multithreading – can only issue instructions from a single
thread in a cycle
• Coarse-grained multithreading – same as fine-grain multithreading, but
switches threads only at L2 miss
• Simultaneous multithreading can issue instructions from any thread every
cycle – has the highest probability of finding work for every issue slot

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 522
What resources are shared?

•Multiple threads are simultaneously active


•For correctness, each thread needs its own
•PC
•logical registers (and its own mapping from logical to
physical registers),
•For performance, each thread could have its own:
•Reorder buffer (ROB) – so that a stall in one thread does not
stall commit in other threads,
•branch predictor,
•I-cache, D-cache, TLB,
•… replicate resources for low interference, although more
sharing ⇒ better resource utilization

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 524
Resource sharing&replication
(Figure: the OoO superscalar pipeline from the recap, with the PC replicated per thread at the in-order fetch stage and per-thread queues at the reorder buffer; the dispatch buffer, execution units, and register file are shared across threads)

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 525
SMT design – based on OoO superscalar
• Instruction issue
• When to switch (to which) thread, such that the execution
unit are fully used, and single-thread performance is not
penalized?
• Prioritized scheme
• thread 1 is preferred; when thread 1 stalls, thread 2 is preferred

• Round-Robin
• all threads compete for resources (fair)
• Execution
• no changes in the execution path
• Retiring
• same functionality per thread queues

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 526
SMT design

• Instruction fetch – which thread to fetch from?


• While fetching instructions, thread priority can dramatically
influence total throughput
• Static solutions: Round-robin; fetch such that each thread has
an equal share of processor resources
• Each cycle 8 instructions from 1 thread
• Each cycle 4 instructions from 2 threads, 2 from 4,…
• Dynamic solutions: check execution queues and favor some
threads
• with minimal # of in-flight branches
• with minimal # of outstanding misses
• with minimal # of in-flight instructions

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 527
Bottlenecks in SMT
•8 threads SMT yields throughput improvements of 2x-4x
•Fetch and memory throughput remain bottlenecks
•Try not to affect clock cycle time, especially in
•Instruction issue - more candidate instructions need to be
considered
•Instruction completion - choosing which instructions to commit
may be challenging
•Larger register file needed to hold multiple contexts
•register file access is likely to limit the number of threads
•Ensuring that cache and TLB conflicts generated by SMT do
not degrade performance
•increased cache associativity to reduce conflicts

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 528
SMT: performance vs. cost per unit

[Silc 1999]

•superscalar is cheap only when the number of slots is small


•SMT improves a lot on superscalar already with 2 threads
•with more than 4 threads, the SMT cost-performance
decreases

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 529
SMT – what to do with it?

•We have HW threading support.


•Who builds the threads?:
•programmer, compiler,
•software marks all potential parallelism, hardware dynamically
modulates application parallelism (thread create/schedule/switch
in hardware) [Chen2005]
•just-in-time compilation with multithreading code generation?
•Who schedules the threads?
•at coarse grain the OS
•fine-grain HW

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 530
Programmer control on HT
•pthreads library
•create a thread that the OS will map on a (physical or logical)
processor.
•pthread_setschedparam() sets thread scheduling parameters
(policy: SCHED_OTHER, SCHED_RR, or SCHED_FIFO, …).
•sched_setaffinity() (defines the set of CPUs on which a process's
threads are eligible to run)
•Windows – similar function calls, e.g.
Get/SetProcessAffinityMask(), Get/SetThreadAffinityMask(),
Get/SetThreadIdealProcessor(), GetLogicalProcessorInformation()
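A minimal Linux/pthreads sketch of the affinity calls mentioned above, using pthread_setaffinity_np (the pthreads wrapper around the same mechanism); compile with -pthread; the choice of CPU 1 is arbitrary:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(1, &set);   /* this thread becomes eligible to run only on (logical) CPU 1 */

    /* Pin the calling thread; for another thread, pass its pthread_t instead of pthread_self(). */
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed\n");
        return 1;
    }

    printf("now running on CPU %d\n", sched_getcpu());
    return 0;
}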

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 532
Parallelism - revisited

•Data parallel
•bit-level parallel: wider processor data-paths (8 → 16 → 32 → 64 …)
•word-level parallel: vector processors (SIMD)

•“Functional” parallel
•ILP
•pipelining, (OoO) superscalar, VLIW, EPIC
•TLP
•processes: multi-processors (centralized/distributed)
•threads (lighter processes, same data space): hardware multi-
threading (fine/coarse/SMT)

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 534
End – Multithreading slides

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 535
CESE 4085 Modern Computer Architecture

Lectures 7 & 8: Data-Level Parallelism

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 537
Lecture Overview
Lecture 7:
• Vector Processors
• SIMD Extensions
• Multi-threading (Chapter 3)

(Combining topics of the 6th and earlier editions of the CA book)

Lecture 8: (ambitious)
• GPUs
• Loop-level parallelism (skipped, but is exam material)
• Thread-Level Parallelism – Multi-processors (Chapter 5)
• Memory Coherence (Chapter 5)
• Snooping vs. Directory Protocols

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 538
Some personal thoughts
These are personal observations about GPUs – before we delve
more deeply into technical details:
• GPUs focused on graphics and still focus on graphics
• Floating-point data and processing (—> SIMD)
• (introduction of) CUDA allowed for (more) general-purpose
application of GPUs – when certain conditions apply, e.g.,
embarrassingly parallel data processing
• “grouping” of data for processing — independent data at
different levels —> need to learn new terminology and forcing
programmers to break up their data in processable sizes
• Abstraction layer across different GPUs requiring recompilation
to map operations to hardware (instructions)
• High bandwidth
• Pay attention to comparison with vector processors and SIMD

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 539
Graphical Processing Units
Graphical Processing Units
Basic idea:
Heterogeneous execution model
CPU is the host, GPU is the device
Develop a C-like programming language for GPU
Unify all forms of GPU parallelism as CUDA thread
Programming model is “Single Instruction Multiple Thread”

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 540
Graphical Processing Units
Threads and Blocks
A thread is associated with each data element
Threads are organized into blocks
Blocks are organized into a grid

GPU hardware handles thread management, not applications


or OS

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 541
Graphical Processing Units
NVIDIA GPU Architecture
Similarities to vector machines:
Works well with data-level parallel problems
Scatter-gather transfers
Mask registers
Large register files

Differences:
No scalar processor
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply
pipelined units like a vector processor

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 542
Graphical Processing Units
Example (Figure 4.13 in book)
Code that works over all elements is the grid
e.g., 8192 elements
Thread blocks break this down into manageable sizes
(up to) 512 threads per block
Thus grid size = 16 blocks (= 8192 / 512)
SIMD instruction executes 32 elements at a time
A thread block is analogous to a strip-mined vector loop with
vector length of 32 (in this example: 16 (=512/32))
Block is assigned to a multithreaded SIMD processor by the
thread block scheduler
Current-generation GPUs have 7-15 multithreaded SIMD
processors
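A small C sketch of the bookkeeping in this example (8192 elements, 512 threads per block, 32-wide SIMD instructions); the variable names are illustrative only:

#include <stdio.h>

int main(void)
{
    const int n_elements        = 8192;  /* the grid covers all elements      */
    const int threads_per_block = 512;   /* one CUDA thread per data element  */
    const int warp_width        = 32;    /* elements per SIMD instruction     */

    int blocks_per_grid = (n_elements + threads_per_block - 1) / threads_per_block;
    int warps_per_block = threads_per_block / warp_width;

    printf("grid size       : %d thread blocks\n", blocks_per_grid);  /* 16 */
    printf("warps per block : %d SIMD threads\n",  warps_per_block);  /* 16 */
    return 0;
}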

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 543
Graphical Processing Units
Terminology (specific to Pascal GPU)
Each thread is limited to 256 (vector) registers (page 319)
32 threads * 256 vector registers * 32 registers (per vector register) would be too
much storage and potentially unused, thus: Dynamic Hardware Register
allocation (read page 320)
Groups of 32 threads combined into a SIMD thread or “warp”
Mapped to 16 physical lanes
Up to 32 warps are scheduled on a single SIMD processor
Each warp has its own PC
Thread scheduler uses scoreboard to dispatch warps
By definition, no data dependencies between warps
Dispatch warps into pipeline, hide memory latency
Thread block scheduler schedules blocks to SIMD processors
Within each SIMD processor:
32 SIMD lanes
Wide and shallow compared to vector processors

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 544
Graphical Processing Units
Example (Figure 4.13 in book)

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 545
Graphical Processing Units
GPU Organization

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 546
Graphical Processing Units
NVIDIA Instruction Set Arch.
ISA is an abstraction of the hardware instruction set
“Parallel Thread Execution (PTX)”
opcode.type d,a,b,c;
Uses virtual registers —> compiler must allocate these to hw. regs.
Translation to machine code is performed in software
Example:
shl.s32 R8, blockIdx, 9 ; Thread Block ID * Block size (512 or 2^9)
add.s32 R8, R8, threadIdx ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4 ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2 ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0 ; Y[i] = sum (X[i]*a + Y[i])

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 547
Graphical Processing Units
Conditional Branching
Like vector architectures, GPU branch hardware uses
internal masks
Also uses
Branch synchronization stack
Entries consist of masks for each SIMD lane
I.e. which threads commit their results (all threads execute)
Instruction markers to manage when a branch diverges into multiple
execution paths
Push on divergent branch
…and when paths converge
Act as barriers
Pops stack
Per-thread-lane 1-bit predicate register, specified by
programmer

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 548
Graphical Processing Units
Example Skip

if (X[i] != 0)
X[i] = X[i] – Y[i];
else X[i] = Z[i];

ld.global.f64 RD0, [X+R8] ; RD0 = X[i]


setp.neq.s32 P1, RD0, #0 ; P1 is predicate register 1
@!P1, bra ELSE1, *Push ; Push old mask, set new mask bits
; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2 ; Difference in RD0
st.global.f64 [X+R8], RD0 ; X[i] = RD0
@P1, bra ENDIF1, *Comp ; complement mask bits
; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop ; pop to restore old mask

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 549
Graphical Processing Units
NVIDIA GPU Memory Structures
Each SIMD Lane has private section of off-chip DRAM
“Private memory”
Contains stack frame, spilling registers, and private variables
Each multithreaded SIMD processor also has local memory
Shared by SIMD lanes / threads within a block
Memory shared by SIMD processors is GPU Memory
Host can read and write GPU memory

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 550
Graphical Processing Units
Pascal Architecture Innovations
Each SIMD processor has
Two or four SIMD thread schedulers, two instruction dispatch units
16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load-store
units, 4 special function units
Two threads of SIMD instructions are scheduled every two clock
cycles
Fast single-, double-, and half-precision
High Bandwidth Memory 2 (HBM2) at 732 GB/s
NVLink between multiple GPUs (20 GB/s in each direction)
Unified virtual memory and paging support

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 551
Graphical Processing Units
Pascal Multithreaded SIMD Proc.

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 552
Graphical Processing Units
Vector Architectures vs GPUs
SIMD processor analogous to vector processor, both have
MIMD
Registers
RV64V register file holds entire vectors
GPU distributes vectors across the registers of SIMD lanes
RV64 has 32 vector registers of 32 elements (1024)
GPU has 256 registers with 32 elements each (8K)
RV64 has 2 to 8 lanes with vector length of 32, chime is 4 to 16
cycles
SIMD processor chime is 2 to 4 cycles
GPU vectorized loop is grid
All GPU loads are gather instructions and all GPU stores are scatter
instructions

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 553
Graphical Processing Units
SIMD Architectures vs GPUs
GPUs have more SIMD lanes
GPUs have hardware support for more threads
Both have 2:1 ratio between double- and single-precision
performance
Both have 64-bit addresses, but GPUs have smaller memory
SIMD architectures have no scatter-gather support

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 554
Detecting and Enhancing Loop-Level Parallelism
Loop-Level Parallelism Skip

Focuses on determining whether data accesses in later


iterations are dependent on data values produced in earlier
iterations
Loop-carried dependence

Example 1:
for (i=999; i>=0; i=i-1)
x[i] = x[i] + s;

No loop-carried dependence

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 555
Detecting and Enhancing Loop-Level Parallelism
Loop-Level Parallelism Skip

Example 2:
for (i=0; i<100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}

S1 and S2 use values computed by S1 in previous iteration


S2 uses value computed by S1 in same iteration

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 556
Detecting and Enhancing Loop-Level Parallelism
Loop-Level Parallelism Skip

Example 3:
for (i=0; i<100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
S1 uses value computed by S2 in previous iteration but dependence is not
circular so loop is parallel
Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 557
Detecting and Enhancing Loop-Level Parallelism
Loop-Level Parallelism Skip

Example 4:
for (i=0;i<100;i=i+1) {
A[i] = B[i] + C[i];
D[i] = A[i] * E[i];
}

Example 5:
for (i=1;i<100;i=i+1) {
Y[i] = Y[i-1] + Y[i];
}
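For reference: Example 4 has no loop-carried dependence (each iteration uses only values produced within the same iteration), so the loop is parallel; Example 5 carries a true dependence (Y[i] depends on Y[i-1] from the previous iteration), i.e. a recurrence, so it is not directly parallelizable.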

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 558
Detecting and Enhancing Loop-Level Parallelism
Finding dependencies Skip

Assume indices are affine:


a x i + b (i is loop index)

Assume:
Store to a x i + b, then
Load from c x i + d
i runs from m to n
Dependence exists if:
Given j, k such that m ≤ j ≤ n, m ≤ k ≤ n
Store to a x j + b, load from c x k + d, and a x j + b = c x k + d

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 559
Detecting and Enhancing Loop-Level Parallelism
Finding dependencies Skip

Generally cannot determine at compile time


Test for absence of a dependence:
GCD test:
If a dependency exists, GCD(c,a) must evenly divide (d-b)

Example:
for (i=0; i<100; i=i+1) {
X[2*i+3] = X[2*i] * 5.0;
}
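Applying the GCD test to this loop: the store index is 2*i+3 (a=2, b=3) and the load index is 2*i (c=2, d=0); GCD(c,a) = 2 does not evenly divide d-b = -3, so no dependence can exist.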

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 560
Detecting and Enhancing Loop-Level Parallelism
Finding dependencies Skip

Example 2:
for (i=0; i<100; i=i+1) {
Y[i] = X[i] / c; /* S1 */
X[i] = X[i] + c; /* S2 */
Z[i] = Y[i] + c; /* S3 */
Y[i] = c - Y[i]; /* S4 */
}

Watch for antidependencies and output dependencies

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 561
Detecting and Enhancing Loop-Level Parallelism
Reductions Skip

Reduction Operation:
for (i=9999; i>=0; i=i-1)
sum = sum + x[i] * y[i];

Transform to…
for (i=9999; i>=0; i=i-1)
sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
finalsum = finalsum + sum[i];

Do on p processors:
for (i=999; i>=0; i=i-1)
finalsum[p] = finalsum[p] + sum[i+1000*p];
Note: assumes associativity!

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 563
Fallacies and Pitfalls
Fallacies and Pitfalls Skip

GPUs suffer from being coprocessors


GPUs have flexibility to change ISA
Concentrating on peak performance in vector architectures
and ignoring start-up overhead
Overheads require long vector lengths to achieve speedup
Increasing vector performance without comparable
increases in scalar performance
You can get good vector performance without providing
memory bandwidth
On GPUs, just add more threads if you don’t have enough
memory performance

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 564
END of NEW CH4

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 565
NEW CH5

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 567
Computer Architecture
A Quantitative Approach, Sixth Edition

Chapter 5

Thread-Level Parallelism

CESE4085 - Modern Computer Architectures 568


Introduction
Types
Symmetric multiprocessors (SMP)
Small number of cores
Share single memory with uniform
memory access/latency (UMA)

Distributed shared memory (DSM)


Memory distributed among
processors
Non-uniform memory access/latency
(NUMA)
Processors connected via direct
(switched) and non-direct (multi-
hop) interconnection networks

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 623
Shared vs. Distributed Memory
•Centralized shared-memory architectures
•Main memory is shared between the processors
•At most few dozen processors
•Also called
•Symmetric (shared-memory) Multiprocessors (SMPs)
•single main memory that has symmetric relationship to all
processors
•Uniform Memory Access (UMA) architectures
•memory has uniform access time from any processor (in absence
of contention)
•Common belief: easier to program than distributed memory
architectures

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 624
Centralized shared-memory MP

(Figure: four processors, each with its own cache, connected to a shared main memory and I/O system; the connection can be one bus, multiple busses, a switch, or some other interconnection network)

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 626
Distributed-Memory MPs
•Distributed-memory architectures
•Memory is physically distributed among the processors
•Typically have more processors than SMPs
•More difficult to program than SMPs
•Require some kind of interconnect
•direct (switches)
•indirect (2- or higher dimensional meshes, hypercubes, fat
trees, etc.)
•Also called Non-uniform Memory Access (NUMA) architectures

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 627
Distributed-Memory MPs

(Figure: nodes, each consisting of a processor + cache with local memory (LM) and I/O, connected through an interconnection network)

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 628
SMPs – memory architecture
•Private vs. shared data
•private data: used by a single processor
•when cached, in only one cache => no problem
•shared data: used by multiple processors, allows processors
to communicate
•when cached, can be in multiple caches at the same time

•To ameliorate the problem of long memory access latency,


cache both private and shared data
•this also reduces the bandwidth demand on the interconnect

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 629
Centralized Shared-Memory Architectures
Cache Coherence
Processors may see different values through their caches:

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 630
Coherence and Consistency
•Informally:
•"Any read must return the most recent write“
•Too difficult (and expensive) to enforce
•Too vague and simplistic; it tells 2 things:
•what values can be returned by a read (coherence)
•when a written value will be returned by a read (consistency).
•coherence defines behavior to same location, consistency
defines behavior to other locations (order among accesses)

/* Assume initial value of A and flag is 0 */
P1:  A = 1;                 P2:  while (flag == 0); /* spin idly */
     flag = 1;                   print A;

•the intuition might not necessarily be true


•a problem even without caches

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 631
Centralized Shared-Memory Architectures
Cache Coherence
Coherence
All reads by any processor must return the most recently
written value
Writes to the same location by any two processors are seen in
the same order by all processors

Consistency
When a written value will be returned by a read
If a processor writes location A followed by location B, any
processor that sees the new value of B must also see the new
value of A

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 632
Centralized Shared-Memory Architectures
Enforcing Coherence
Coherent caches provide:
Migration: movement of data
Replication: multiple copies of data

Cache coherence protocols


Directory based
Sharing status of each block kept in one location
Snooping
Each core tracks sharing status of each block

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 633
Cache Coherence Protocols
•Snooping
•Send address of all request for data to all processors
•Processors snoop to see if they have copy (cache tags!) of requested data
and respond accordingly
•Requires broadcast, since caching information is at processors
•Works well with bus (natural broadcast medium)
•Dominates for small scale machines (most of the market)
•The data “sharing” status is kept in each cache
•Directory-based
•Keep track of what is being shared in one centralized place
•Distributed memory => distributed directory for scalability (avoids
bottlenecks, hot spots)
•Scales better than snooping
•Actually existed before snooping

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 634
Centralized Shared-Memory Architectures
Snoopy Coherence Protocols
Write invalidate (most used approach, achieve exclusivity)
On write, invalidate all other copies
Use bus itself to serialize
Write cannot complete until bus access is obtained

Write update
On write, update all copies – lots of bandwidth needed

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 635
Basics of Write Invalidate
•Use the bus to perform invalidates
•To perform an invalidate, acquire bus access and broadcast
the address to be invalidated
•all processors (actually their cache controllers) snoop the bus,
listening to addresses
•if the address is in my cache, invalidate my copy
•Serialization of bus access enforces write serialization
•On a read miss (may also be generated by an invalidation),
where is the most recent value?
•Easy for write-through caches (it is in the memory)
•For write-back caches, again use snooping

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 636
Centralized Shared-Memory Architectures
Snoopy Coherence Protocols
Locating an item when a read miss occurs
In write-back cache, the updated value must be sent to the
requesting processor

Cache lines marked as shared or exclusive/modified


Only writes to shared lines need an invalidate broadcast
After this, the line is marked as exclusive

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 637
Centralized Shared-Memory Architectures
Snoopy Coherence Protocols Self-study

Leads to
FSMs on next
slide = same
information

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 638
Centralized Shared-Memory Architectures
Snoopy Coherence Protocols

Also called “Modified”


state (see previous slide)

Thus: MSI protocol
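As a rough illustration only (not the exact FSM from the figure referenced above), a C sketch of the state transitions of a write-invalidate MSI protocol for a single cache block:

typedef enum { INVALID, SHARED, MODIFIED } line_state_t;   /* the three MSI states */

/* Bus transactions are modelled as flags the surrounding cache controller would act on. */
typedef struct { int bus_read; int bus_read_exclusive; int write_back; } bus_action_t;

/* Transition on a request from this cache's own processor. */
line_state_t cpu_request(line_state_t s, int is_write, bus_action_t *act)
{
    act->bus_read = act->bus_read_exclusive = act->write_back = 0;

    switch (s) {
    case INVALID:
        if (is_write) { act->bus_read_exclusive = 1; return MODIFIED; } /* write miss: invalidate others */
        act->bus_read = 1;            return SHARED;                    /* read miss                     */
    case SHARED:
        if (is_write) { act->bus_read_exclusive = 1; return MODIFIED; } /* write hit on shared: upgrade  */
        return SHARED;                                                  /* read hit                      */
    case MODIFIED:
        return MODIFIED;                                                /* read or write hit             */
    }
    return s;
}

/* Transition on a snooped bus request from another processor for this block. */
line_state_t snoop_request(line_state_t s, int other_is_write, bus_action_t *act)
{
    act->bus_read = act->bus_read_exclusive = act->write_back = 0;

    if (s == MODIFIED) act->write_back = 1;        /* flush the dirty data                 */
    if (other_is_write) return INVALID;            /* another writer: invalidate our copy  */
    return (s == INVALID) ? INVALID : SHARED;      /* another reader: demote to shared     */
}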

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 639
Centralized Shared-Memory Architectures
Snoopy Coherence Protocols
Complications for the basic MSI protocol:
Operations are not atomic
E.g. detect miss, acquire bus, receive a response
Creates possibility of deadlock and races
One solution: processor that sends invalidate can hold bus until
other processors receive the invalidate

Extensions:
Add exclusive state to indicate clean block in only one cache
(MESI protocol)
Prevents needing to write invalidate on a write
Owned state

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 640
Centralized Shared-Memory Architectures
Coherence Protocols: Extensions
Shared memory bus and
snooping bandwidth is
bottleneck for scaling
symmetric multiprocessors
Duplicating tags
Place directory in outermost
cache
Use crossbars or point-to-
point networks with banked
memory

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 641
Centralized Shared-Memory Architectures
Coherence Protocols
Every multicore with >8 processors uses an interconnect
other than bus
Makes it difficult to serialize events
Write and upgrade misses are not atomic
How can the processor know when all invalidates are
complete?
How can we resolve races when two processors write at the
same time?
Solution: associate each block with a single bus

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 642
Limitations in SMP and snooping protocols
•Number of processors increases ⇒ bus traffic increases
•Processor speed increases ⇒ bus traffic increases
•bus must support both coherence traffic & normal memory traffic
•⇒ Multiple buses or interconnection networks (cross bar or small point-to-point)
•however snooping still requires broadcast; caches have to respond to
requests from all other caches ⇒ limits the scalability of SMPs with snooping

•AMD Opteron example:


•A memory connected directly to each dual-core chip
•Point-to-point connections for up to 4 chips
•coherence: broadcast to find shared copies, but ACKs to order operations
•Remote memory and local memory latency are similar, allowing the OS to treat the Opteron
as a UMA computer, though the memory is distributed

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 643
True vs. False Sharing (* write invalidate, * 1 valid bit per cache block)

•True sharing: the word(s) being read is (are) the same as
the word(s) being written
•False sharing: the word being read is different from the
word being written, but they are in the same cache block
Need to (write) invalidate to gain exclusivity
Assume X1 and X2 are in the same cache block, which is in the shared state (MSI)
in the caches of P1 & P2 before the first write (page 394)

Time  P1        P2        Comment
1     Write X1            True sharing miss (assume X1 was read by P2); invalidation required
2               Read X2   False sharing miss, since X2 is invalidated by the write of X1 by P1
3     Write X1            False sharing miss, since the block is shared again after P2 read it
4               Write X2  False sharing miss, since X2 is written while the block is invalid after the X1 write
5     Read X2             True sharing miss, since it involves a read of X2 which was invalidated

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 644
Performance of Symmetric Shared-Memory Multiprocessors

•Cache performance is combination of


•Uniprocessor cache miss traffic
•Traffic caused by communication
•Results in invalidations and subsequent cache misses
•4th C: coherence miss (more of those in tightly coupled
applications that share large amounts of data)
•Joins Compulsory, Capacity, Conflict
•Programmers must prevent false sharing
•if we distribute data between processors, make sure it's aligned at
cache block boundaries
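A small C sketch of the recommended fix, assuming a 64-byte cache block: per-thread counters are padded and aligned so that writes by different processors never land in the same block.

#include <stdalign.h>

#define CACHE_BLOCK 64              /* assumed cache block size in bytes */
#define NTHREADS     4

/* Bad: counters of different threads share one cache block -> false sharing misses. */
long counters_bad[NTHREADS];

/* Better: each thread's counter occupies its own cache block. */
struct padded_counter {
    alignas(CACHE_BLOCK) long value;
    char pad[CACHE_BLOCK - sizeof(long)];
};
struct padded_counter counters_good[NTHREADS];

void tick(int tid)                  /* called concurrently, one tid per thread */
{
    counters_good[tid].value++;     /* writes by different threads now hit different blocks */
}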

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 645
Performance of Symmetric Shared-Memory Multiprocessors
Performance
Coherence influences cache miss rate
Coherence misses
True sharing misses
Write to shared block (transmission of invalidation)
Read an invalidated block
False sharing misses
Read an unmodified word in an invalidated block

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 646
Performance of Symmetric Shared-Memory Multiprocessors
Performance Study: Commercial Workload SKIP

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 647
Performance of Symmetric Shared-Memory Multiprocessors
Performance Study: Commercial Workload SKIP

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 648
Performance of Symmetric Shared-Memory Multiprocessors
Performance Study: Commercial Workload SKIP

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 649
Performance of Symmetric Shared-Memory Multiprocessors
Performance Study: Commercial Workload SKIP

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 650
Qualitative Performance DifferencesSelf-study
•Performance differences between write invalidate and write
update:
•Multiple writes to same word require
•multiple write broadcasts in write update protocol
•one invalidation in write invalidate
•When a cache block consists of multiple words, each word written
to a cache block requires
•multiple write broadcasts in write update protocol
•one invalidation in write invalidate
•write invalidate works on cache blocks, write update on
words/bytes
•Delay between writing a word in one processor and reading the
new value in another is less in write update
•And the winner is: write invalidate, because bus bandwidth is the
most precious resource (see the traffic sketch below)
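
A small, hypothetical counting sketch (not from the book) of the first point above:
for a burst of writes by one processor to a word of a shared block, write update
broadcasts every write, while write invalidate pays for one invalidation and then
writes locally. The write count is an assumption for illustration.

/* Hypothetical message count for N writes by one processor to a shared block. */
#include <stdio.h>

int main(void) {
    int writes = 8;                 /* assumed: 8 consecutive writes to the same word      */

    int update_msgs = writes;       /* write update: every write is broadcast              */
    int invalidate_msgs = 1;        /* write invalidate: one invalidation, then local hits */

    printf("write update:     %d bus messages\n", update_msgs);
    printf("write invalidate: %d bus message\n", invalidate_msgs);
    return 0;
}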

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 651
Distributed Shared Memory and Directory-Based Coherence
Directory Protocols Self-study

Snooping schemes require communication among all caches


on every cache miss
Limits scalability
Another approach: Use centralized directory to keep track of
every block
Which caches have each block
Dirty status of each block
Implement in shared L3 cache
Keep bit vector of size = # cores for each block in L3
Not scalable beyond a shared L3 (a sketch of such an entry follows below)
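
A minimal sketch of such a directory entry, under assumed names and an assumed
8-core chip: one coherence state per block plus a presence-bit vector with one
bit per core.

/* Hypothetical directory entry kept next to each block in a shared L3. */
#include <stdint.h>

#define NUM_CORES 8                       /* assumed core count            */

typedef enum { UNCACHED, SHARED, MODIFIED } dir_state_t;

typedef struct {
    dir_state_t state;                    /* coherence state of the block  */
    uint8_t     sharers;                  /* one presence bit per core     */
    uint8_t     owner;                    /* only meaningful when MODIFIED */
} dir_entry_t;

/* Record that core c now holds a shared copy of the block tracked by e. */
static inline void dir_add_sharer(dir_entry_t *e, int c) {
    e->sharers |= (uint8_t)(1u << c);
    e->state = SHARED;
}

The bit vector is exactly why this organization stops scaling: it grows linearly
with the number of cores tracked by the shared L3.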

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 652
Distributed Shared Memory and Directory-Based Coherence
Directory Protocols Self-study

Alternative approach:
Distribute memory

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 653
Distributed Shared Memory and Directory-Based Coherence
Directory Protocols Self-study

For each block, maintain state:


Shared
One or more nodes have the block cached, value in memory is
up-to-date
Set of node IDs
Uncached
No node has a copy of the cache block
Modified
Exactly one node has a copy of the cache block, value in
memory is out-of-date
Owner node ID

Directory maintains block states and sends invalidation


messages

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 654
Distributed Shared Memory and Directory-Based Coherence
Messages Self-study

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 655
Distributed Shared Memory and Directory-Based Coherence
Directory Protocols Self-study

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 656
Distributed Shared Memory and Directory-Based Coherence
Directory Protocols Self-study

For uncached block:


Read miss
Requesting node is sent the requested data and is made the
only sharing node, block is now shared
Write miss
The requesting node is sent the requested data and becomes
the sharing node, block is now exclusive
For shared block:
Read miss
The requesting node is sent the requested data from memory,
node is added to sharing set
Write miss
The requesting node is sent the value, all nodes in the sharing
set are sent invalidate messages, sharing set only contains
requesting node, block is now exclusive

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 657
Distributed Shared Memory and Directory-Based Coherence
Directory Protocols Self-study

For exclusive block:


Read miss
The owner is sent a data fetch message, block becomes shared,
owner sends data to the directory, data written back to
memory, sharers set contains old owner and requestor
Data write back
Block becomes uncached, sharer set is empty
Write miss
Message is sent to old owner to invalidate and send the value to
the directory, requestor becomes new owner, block remains
exclusive
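
Putting the transitions of this and the previous slide together, a simplified
sketch of how the home directory could react to read and write misses. Types,
helper names, and the stubbed-out messaging are assumptions; real protocols also
handle races and explicit write-backs.

/* Simplified home-directory transitions (illustrative only). */
#include <stdint.h>

#define NUM_NODES 8

typedef enum { UNCACHED, SHARED, MODIFIED } dstate_t;
typedef struct { dstate_t state; uint32_t sharers; int owner; } dentry_t;

/* Stubs standing in for the interconnect and memory. */
static void send_data(int node)        { (void)node; }
static void send_invalidate(int node)  { (void)node; }
static void fetch_from_owner(int node) { (void)node; }  /* owner supplies block, memory updated */

static void read_miss(dentry_t *d, int node) {
    if (d->state == MODIFIED) {          /* exclusive elsewhere: fetch from owner     */
        fetch_from_owner(d->owner);      /* data written back, old owner stays sharer */
        d->sharers = 1u << d->owner;
    }
    send_data(node);                     /* reply with up-to-date data                */
    d->sharers |= 1u << node;
    d->state = SHARED;
}

static void write_miss(dentry_t *d, int node) {
    if (d->state == SHARED) {            /* invalidate every other sharer             */
        for (int n = 0; n < NUM_NODES; n++)
            if ((d->sharers & (1u << n)) && n != node) send_invalidate(n);
    } else if (d->state == MODIFIED) {   /* old owner invalidates and forwards data   */
        fetch_from_owner(d->owner);
    }
    send_data(node);                     /* requestor becomes the exclusive owner     */
    d->owner = node;
    d->sharers = 1u << node;
    d->state = MODIFIED;
}

int main(void) {
    dentry_t d = { UNCACHED, 0, -1 };
    read_miss(&d, 0);                    /* uncached -> shared                        */
    write_miss(&d, 1);                   /* shared   -> modified (node 0 invalidated) */
    read_miss(&d, 2);                    /* modified -> shared (old owner 1 + node 2) */
    return 0;
}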

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 658
Synchronization
Synchronization Skip

Basic building blocks:


Atomic exchange
Swaps register with memory location
Test-and-set
Sets under condition
Fetch-and-increment
Reads original value from memory and increments it in memory
Requires a memory read and write in one uninterruptible operation

RISC-V: load reserved/store conditional


If the contents of the memory location specified by the load reserved are
changed before the store conditional to the same address, the store
conditional fails
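
At the programming-language level these primitives are exposed as C11 atomics,
which compilers typically lower to lr/sc pairs (or AMO instructions) on RISC-V.
A small hedged sketch, separate from the slides' assembly:

#include <stdatomic.h>
#include <stdio.h>

int main(void) {
    atomic_int lock = 0;
    atomic_int counter = 0;

    /* Atomic exchange: swap a new value with the memory location, get the old one. */
    int old = atomic_exchange(&lock, 1);        /* old == 0 means the lock was free  */

    /* Fetch-and-increment: return the original value and add 1 in memory. */
    int before = atomic_fetch_add(&counter, 1);

    printf("exchange saw %d, counter was %d and is now %d\n",
           old, before, atomic_load(&counter));
    return 0;
}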

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 659
Synchronization
Implementing Locks Skip

Atomic exchange (EXCH):


try: mov x3,x4      ;move exchange value into x3
     lr x2,0(x1)    ;load reserved from 0(x1)
     sc x3,0(x1)    ;store conditional to 0(x1)
     bnez x3,try    ;branch if the store fails
     mov x4,x2      ;put the loaded value in x4

Atomic increment:
try: lr x2,0(x1)    ;load reserved from 0(x1)
     addi x3,x2,1   ;increment
     sc x3,0(x1)    ;store conditional to 0(x1)
     bnez x3,try    ;branch if the store fails

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 660
Synchronization
Implementing Locks Skip

Lock (no cache coherence)


        addi x2,x0,1    ;x2 = 1, the locked value
lockit: EXCH x2,0(x1)   ;atomic exchange
        bnez x2,lockit  ;already locked? then spin

Lock (with cache coherence):


lockit: ld x2,0(x1)     ;load the lock
        bnez x2,lockit  ;not available: spin on the cached copy
        addi x2,x0,1    ;x2 = 1, the locked value
        EXCH x2,0(x1)   ;swap
        bnez x2,lockit  ;branch if the lock wasn't 0

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 661
Synchronization
Implementing Locks Skip

Advantage of this scheme: spinning is done on a locally cached copy of the lock,
which greatly reduces bus and memory traffic (see the C11 sketch below)
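
A C11 sketch of the same idea, a "test-and-test-and-set" spin lock: spin on ordinary
loads that hit in the local cache, and only issue the expensive atomic exchange when
the lock looks free. Names and memory orderings are assumptions, not the book's code.

#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

static void spin_lock(spinlock_t *l) {
    for (;;) {
        /* Spin on the locally cached copy: no bus traffic while the lock is held. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            ;
        /* Lock looks free: try to claim it with a single atomic exchange. */
        if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
            return;                              /* we stored 1 and saw a 0: got it */
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

The exchange is attempted only after a plain read sees 0, so contending processors
mostly spin in their own caches instead of on the interconnect.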

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 662
Models of Memory Consistency: An Introduction
Models of Memory Consistency Skip

Processor 1: Processor 2:
A=0 B=0
… …
A=1 B=1
if (B==0) … if (A==0) …

•Should be impossible for both if-statements to be
evaluated as true
•But a delayed write invalidate could make that outcome possible

•Sequential consistency:
•The result of any execution should be the same as if:
•Accesses on each processor were kept in program order
•Accesses from different processors were arbitrarily interleaved
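
The same example written with C11 atomics; with the default memory_order_seq_cst
accesses, the "both branches taken" outcome is forbidden, which is exactly the
sequential-consistency guarantee. A hedged sketch, the thread names are mine:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int A, B;            /* both initially 0 */
int r1, r2;                 /* set to 1 if the corresponding branch is taken */

static void *p1(void *arg) {
    atomic_store(&A, 1);                  /* A = 1 (seq_cst)  */
    if (atomic_load(&B) == 0) r1 = 1;     /* if (B == 0) ...  */
    return arg;
}

static void *p2(void *arg) {
    atomic_store(&B, 1);                  /* B = 1 (seq_cst)  */
    if (atomic_load(&A) == 0) r2 = 1;     /* if (A == 0) ...  */
    return arg;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("r1=%d r2=%d (r1=1 and r2=1 never happens)\n", r1, r2);
    return 0;
}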

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 663
Models of Memory Consistency: An Introduction
Implementing Sequential Consistency Skip

To implement, delay completion of all memory accesses


until all invalidations caused by the access are completed
Reduces performance!

Alternatives:
Program-enforced synchronization to force the write on one
processor to occur before the read on the other processor
Requires a synchronization object for A and another for B
“Unlock” after write
“Lock” before read

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 664
Models of Memory Consistency: An Introduction
Relaxed Consistency Models Skip

Rules:
X→Y
Operation X must complete before operation Y is done
Sequential consistency requires:
R → W, R → R, W → R, W → W

Relax W → R
“Total store ordering”

Relax W → W
“Partial store order”

Relax R → W and R → R
“Weak ordering” and “release consistency”
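
A sketch of how programs live with a relaxed model: ordinary stores may be
reordered, but a release store on a flag paired with an acquire load restores
exactly the orderings the hand-off needs. Names are assumptions for the example.

#include <stdatomic.h>

int payload;                     /* ordinary data with no ordering of its own */
atomic_int ready;                /* flag used to publish the payload          */

/* Producer: release keeps the payload store before the flag store (a W -> W edge). */
void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: acquire keeps the flag load before the payload load (an R -> R edge). */
int consume(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                        /* spin until the producer has published      */
    return payload;              /* guaranteed to see the producer's value     */
}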

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 665
Models of Memory Consistency: An Introduction
Relaxed Consistency Models Skip

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 666
Models of Memory Consistency: An Introduction
Relaxed Consistency Models Skip

Consistency model is multiprocessor specific

Programmers will often implement explicit synchronization

Speculation gives much of the performance advantage of


relaxed models with sequential consistency
Basic idea: if an invalidation arrives for a result that has not
been committed, use speculation recovery

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 667
Fallacies and Pitfalls
Fallacies and Pitfalls SKIP

Fallacy = common misbelief


Pitfall = easily made mistake

P: Measuring performance of multiprocessors by linear


speedup versus execution time
F: Amdahl’s Law doesn’t apply to parallel computers
F: Linear speedups are needed to make multiprocessors
cost-effective
Doesn’t consider cost of other system components
P: Not developing the software to take advantage of, or
optimize for, a multiprocessor architecture

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 668
END of NEW CH5

QCE Department, EEMCS Faculty


CESE4085 - Modern Computer Architectures 669
