MCA - HW - Lecture 7and8 - Prelim
Chapter 4
Data-Level Parallelism in Vector, SIMD, and GPU Architectures
For example:
for (i = 0; i < N; i++)
{ Y[i] = a * X[i] + Y[i]; }   /* DAXPY: Y = a*X + Y */
Answer:
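A minimal sketch of the expected vector code, using the VMIPS-style mnemonics that appear later in these slides (F0 holding the scalar a, and Rx/Ry holding the base addresses of X and Y, are assumptions):
LV      V1,Rx       ; load vector X
MULVS.D V2,V1,F0    ; vector-scalar multiply: a * X
LV      V3,Ry       ; load vector Y
ADDV.D  V4,V2,V3    ; a * X + Y
SV      Ry,V4       ; store the result back into Y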
For example:
[Figure: scalar vs. vector add. "add r3, r1, r2" uses the ALU once on registers r1 and r2 to produce r3; "add.vv v3, v1, v2" applies ALUs across all elements of vector registers v1 and v2, up to the vector length.]
•8 x 64-element vector registers
•5 FUs, each fully pipelined
•Load/store unit, fully pipelined
•Scalar registers
Vector Execution Time
•Execution time is a function of vector length, data dependencies, and structural hazards
•Initiation rate: rate at which a FU consumes vector elements
•number of lanes = the number of parallel pipelines used to execute
operations within each vector instruction; up to 8 (e.g. Cray X1)
•time for a single vector instruction ≈ initiation rate (in cycles per element) x vector length
•Convoy
•set of vector instructions that can begin execution in same clock (no
struct. or data hazards); assumption: convoys do not overlap in time (no
forwarding).
•Chime: approx. time to execute a convoy
•e.g. m convoys take m chimes; if each vector length is n, then they take
approx. m x n clock cycles (ignores overhead; good approximation for
long vectors)
[Table fragment: convoy schedule for DAXPY. Surviving rows: 3: ADDV V4,V2,V3 and 4: SV Ry,V4; the start-time columns (… 6 n … 12 n …) did not survive extraction.]
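As a worked instance of the chime approximation (the convoy count of 4 follows the numbering in the fragment above; the vector length of 64 is an assumption):
m = 4 convoys, n = 64 elements: execution time ≈ m x n = 4 x 64 = 256 clock cycles, ignoring start-up overhead.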
for (i = 0; i < 100; i++)
  for (j = 0; j < 100; j++) {
    x[i][j] = 0.0;
    for (k = 0; k < 100; k++)
      x[i][j] = x[i][j] + B[i][k] * C[k][j];
  }
[Figure: row-major memory layout of the 100 x 100 matrix: x[0][0] … x[0][99] are followed in memory by x[1][0], x[1][1], x[1][2], …]
•Matrix C accesses are not adjacent (800 bytes between consecutive elements)
•Stride: distance separating elements that are to be merged into a single vector ⇒ LVWS (load vector with stride) instruction
•Strides can cause bank conflicts (e.g., stride = 32 and 16 banks)
•e.g. x[i], x[i + stride], x[i + 2 x stride], … map to the same bank
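A small C sketch of the bank-conflict claim above, using the slide's numbers (the modulo mapping of element index to bank is the usual interleaving assumption):
#include <stdio.h>

int main(void) {
    const int banks  = 16;   /* number of memory banks (from the example above) */
    const int stride = 32;   /* access stride, in elements */

    /* successive strided accesses: element 0, stride, 2*stride, ... */
    for (int k = 0; k < 4; k++)
        printf("access %d -> element %4d -> bank %d\n",
               k, k * stride, (k * stride) % banks);
    /* stride % banks == 0, so every access maps to bank 0 and serializes
       on the bank busy time */
    return 0;
}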
•MULV.D V1,V2,V3
ADDV.D V4,V1,V5 ; separate convoy
•Chaining: if a vector register (V1) is not a single entity but a group of individual registers ⇒ pipeline forwarding of individual elements of a vector
•Flexible chaining: allow a vector to chain to any other active vector operation ⇒ simultaneous access to the same register
•more read/write ports
•organize the registers in individual banks (⇒ simultaneous access to different banks)
•With enough hardware, chaining increases the effective convoy size
do i = 1, 64
  if (A(i) /= 0) then
    A(i) = A(i) - B(i)
  endif
enddo
Vectorizable?
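Yes, with vector-mask control: a compare sets a mask register, and the subtract executes only in enabled lanes. A minimal C emulation of that idea (the explicit mask array is the illustration; a real vector machine keeps it in a vector-mask register):
#include <stdio.h>

#define VL 64                      /* vector length, as in the loop above */

int main(void) {
    double A[VL], B[VL];
    int mask[VL];                  /* emulates the vector-mask register */

    for (int i = 0; i < VL; i++) { A[i] = (i % 2) ? 1.0 : 0.0; B[i] = 0.5; }

    /* compare step: set mask where A(i) /= 0 */
    for (int i = 0; i < VL; i++)
        mask[i] = (A[i] != 0.0);

    /* masked subtract: only lanes with the mask bit set are modified */
    for (int i = 0; i < VL; i++)
        if (mask[i])
            A[i] = A[i] - B[i];

    printf("A[0]=%g A[1]=%g\n", A[0], A[1]);   /* 0 and 0.5 */
    return 0;
}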
Convoy
Set of vector instructions that could potentially execute together
Chaining
Allows a vector operation to start as soon as the individual elements of its vector source operand become available
Chime
Unit of time to execute one convoy
m convoys execute in m chimes; for vector length n, this requires approximately m x n clock cycles
Convoys:
1 vld vmul
2 vld vadd
3 vst
Improvements:
> 1 element per clock cycle
Non-64 wide vectors
IF statements in vector code
Memory system optimizations to support vector processors
Multidimensional matrices
Sparse matrices
Programming a vector computer
Example:
32 processors, each generating 4 loads and 2 stores per cycle
Processor cycle time is 2.167 ns; SRAM cycle time is 15 ns
How many memory banks are needed?
Each bank is busy for 15 / 2.167 ≈ 7 processor cycles, so 32 x (4 + 2) x 15 / 2.167 ≈ 1330 banks
Chapter 5
Thread-Level Parallelism
[Figure: issue-slot waste in a superscalar processor (Tullsen, ISCA '95). Vertical waste: whole cycles lost to stalls in the execution flow. Horizontal waste: issue slots lost within a cycle to low ILP (not able to "use" all FUs).]
[Figure: issue slots over time. A single thread's execution is fragmented by branches; interleaving threads 1-3 fills the issue slots.]
[Figure: superscalar pipeline. In-order fetch/ID into a dispatch buffer; out-of-order execution across ALU, MEM1/MEM2, FP1-FP3, and BR units, with a register file and reorder buffer; in-order WB.]
•Multi-processors
•Each with a full set of architectural resources
[Figure: multithreaded superscalar pipeline (after [Silc 1999]). Same in-order ID / out-of-order EX / in-order WB structure as above, but with one PC per thread and per-thread register files and queues (RQ).]
•Data parallel
•bit-level parallel: wider processor data-paths (8 → 16 → 32 → 64 …)
•word-level parallel: vector processors (SIMD)
•"Functional" parallel
•ILP
•pipelining, (OoO) superscalar, VLIW, EPIC
•TLP
•processes: multi-processors (centralized/distributed)
•threads (lighter processes, same data space): hardware multi-threading (fine/coarse/SMT)
Differences (GPU vs. vector processor):
No scalar processor
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply
pipelined units like a vector processor
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];
Example 1:
for (i=999; i>=0; i=i-1)
x[i] = x[i] + s;
No loop-carried dependence
Example 2:
for (i=0; i<100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
S1 and S2 each use the value they computed in the previous iteration, so the dependence is loop-carried and the loop is not parallel.
Example 3:
for (i=0; i<100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
S1 uses the value computed by S2 in the previous iteration, but the dependence is not circular, so the loop is parallel
Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[100] = C[99] + D[99];
Example 4:
for (i=0;i<100;i=i+1) {
A[i] = B[i] + C[i];
D[i] = A[i] * E[i];
}
The dependence on A[i] is within the same iteration (not loop-carried), so the loop is parallel.
Example 5:
for (i=1;i<100;i=i+1) {
Y[i] = Y[i-1] + Y[i];
}
Loop-carried dependence in the form of a recurrence: iteration i needs the value produced by iteration i-1.
Assume:
Store to a x i + b, then
Load from c x i + d
i runs from m to n
Dependence exists if:
Given j, k such that m ≤ j ≤ n, m ≤ k ≤ n
Store to a x j + b, load from c x k + d, and a x j + b = c x k + d
Example:
for (i=0; i<100; i=i+1) {
X[2*i+3] = X[2*i] * 5.0;
}
GCD test: a dependence is possible only if GCD(c, a) divides (d - b). Here a = 2, b = 3, c = 2, d = 0: GCD(2, 2) = 2 does not divide d - b = -3, so no dependence exists.
Example 2:
for (i=0; i<100; i=i+1) {
Y[i] = X[i] / c; /* S1 */
X[i] = X[i] + c; /* S2 */
Z[i] = Y[i] + c; /* S3 */
Y[i] = c - Y[i]; /* S4 */
}
Dependences in the loop above:
True dependences: S1 to S3 and S1 to S4, through Y[i] (not loop-carried)
Antidependences: S1 to S2 on X[i], and S3 to S4 on Y[i]
Output dependence: S1 to S4 on Y[i]
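These name dependences can be removed by renaming, along the lines of the standard transformation (T and X1 are fresh names introduced for illustration; uses of X after the loop must then refer to X1):
for (i=0; i<100; i=i+1) {
T[i] = X[i] / c; /* S1: result renamed Y -> T */
X1[i] = X[i] + c; /* S2: result renamed X -> X1 */
Z[i] = T[i] + c; /* S3: reads T */
Y[i] = c - T[i]; /* S4: reads T, writes the final Y */
}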
Reduction Operation:
for (i=9999; i>=0; i=i-1)
sum = sum + x[i] * y[i];
Transform to…
for (i=9999; i>=0; i=i-1)
sum [i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
finalsum = finalsum + sum[i];
Do on each processor p (10 processors, 1000 elements each):
for (i=999; i>=0; i=i-1)
finalsum[p] = finalsum[p] + sum[i+1000*p];
Note: assumes associativity!
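A self-contained C sketch of the full transformation (the initialization values and the serial emulation of the 10 processors are illustrative assumptions; on real hardware each p would run concurrently):
#include <stdio.h>

#define N 10000
#define P 10                           /* processors, 1000 elements each */

double x[N], y[N], sum[N], finalsum[P];

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* step 1: fully parallel elementwise products */
    for (int i = N - 1; i >= 0; i--)
        sum[i] = x[i] * y[i];

    /* step 2: each processor p reduces its own 1000-element strip */
    for (int p = 0; p < P; p++)
        for (int i = 999; i >= 0; i--)
            finalsum[p] = finalsum[p] + sum[i + 1000 * p];

    /* step 3: combine the P partial sums; valid only because + is
       treated as associative */
    double total = 0.0;
    for (int p = 0; p < P; p++)
        total += finalsum[p];

    printf("total = %g\n", total);     /* 20000 */
    return 0;
}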
[Figure: shared-memory multiprocessor. Processors P1, P2, … access memory through an interconnection network.]
/* Assume initial value of A and flag is 0 */
P1:              P2:
A = 1;           while (flag == 0); /* spin idly */
flag = 1;        print A;
Consistency
When a written value will be returned by a read
If a processor writes location A followed by location B, any
processor that sees the new value of B must also see the new
value of A
Write update
On write, update all copies – lots of bandwidth needed
Extensions:
Add an exclusive state to indicate a clean block present in only one cache (MESI protocol)
Avoids having to send an invalidate on a write to such a block
Owned state (MOESI protocol): a cache can hold a modified block and supply it to sharers without writing it back to memory
Alternative approach:
Distribute memory
Atomic increment:
try: lr x2,0(x1)   ;load reserved from 0(x1)
     addi x3,x2,1  ;increment
     sc x3,0(x1)   ;store conditional to 0(x1)
     bnez x3,try   ;branch if store fails
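For comparison, the same atomic increment at the language level, using C11 atomics (a sketch; on RISC-V a compiler may lower atomic_fetch_add to an lr/sc loop like the one above, or to an AMO instruction):
#include <stdatomic.h>
#include <stdio.h>

atomic_int counter = 0;

int main(void) {
    atomic_fetch_add(&counter, 1);              /* atomic: counter += 1 */
    printf("%d\n", atomic_load(&counter));      /* prints 1 */
    return 0;
}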
Processor 1:        Processor 2:
A = 0;              B = 0;
...                 ...
A = 1;              B = 1;
if (B == 0) ...     if (A == 0) ...
•Sequential consistency:
•Result of execution should be the same as long as:
•Accesses on each processor were kept in order
•Accesses on different processors were arbitrarily interleaved
Alternatives:
Program-enforced synchronization to force the write on one processor to occur before the read on the other processor
Requires a synchronization object for A and another for B
“Unlock” after write
“Lock” after read
Rules:
X→Y
Operation X must complete before operation Y is done
Sequential consistency requires:
R → W, R → R, W → R, W → W
Relax W → R
“Total store ordering”
Relax W → W
“Partial store order”
Relax R → W and R → R
“Weak ordering” and “release consistency”
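As a sketch of how the A/B example above stays correct even on a machine that relaxes W → R: C11 sequentially consistent atomics (the default memory order) forbid reordering the store past the following load. Variable and function names follow the example; the serial main is just to exercise the code:
#include <stdatomic.h>
#include <stdio.h>

atomic_int A = 0, B = 0;

void p1(void) {                 /* runs on processor 1 */
    atomic_store(&A, 1);        /* seq_cst: completes before the load below */
    if (atomic_load(&B) == 0)
        printf("p1 saw B == 0\n");
}

void p2(void) {                 /* runs on processor 2 */
    atomic_store(&B, 1);
    if (atomic_load(&A) == 0)
        printf("p2 saw A == 0\n");
}

int main(void) {
    /* under seq_cst, concurrent p1/p2 can never BOTH take their branch */
    p1();
    p2();
    return 0;
}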