04 Progbasics
Parallel Computing
Stanford CS149, Fall 2021
REVIEW
Each *ISPC program instance* executes the code in the function ispc_sinx serially (parallelism exists because there are multiple program instances, not because of parallelism in the code that defines an ispc function). When ispc_sinx() returns, all program instances have completed and sequential execution of the calling C code resumes.
[Figure: shared address space memory hierarchy — Core 1 through Core 8, each with its own L1 cache (32 KB) and L2 cache (256 KB), sharing an L3 cache (20 MB) and DRAM (32 GB)]
Thread 1:

int x = 0;
spawn_thread(foo, &x);

// write to address holding
// contents of variable x
x = 1;

Thread 2:

void foo(int* x) {
  // read from addr storing
  // contents of variable x
  while (*x == 0) {}
  print(*x);
}

[Figure: Thread 1's store to x is visible to Thread 2 through the shared address space]

(Pseudocode provided in a fake C-like language for brevity.)
A common metaphor: a shared address space is like a bulletin board

Image credit: https://thetab.com/us/stanford/2016/07/28/honest-packing-list-freshman-stanford-1278
Coordinating access to shared variables with synchronization

Thread 1:

int x = 0;
Lock my_lock;

spawn_thread(foo, &x, &my_lock);

my_lock.lock();
x++;
my_lock.unlock();

Thread 2:

void foo(int* x, Lock* my_lock) {
  my_lock->lock();
  (*x)++;
  my_lock->unlock();
  print(*x);
}
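The pseudocode above is in a fake C-like language; as a concrete point of reference, here is a minimal runnable sketch of the same pattern using C++11 std::thread and std::mutex (my sketch, not code from the slides; names like my_lock and foo mirror the pseudocode):

#include <cstdio>
#include <mutex>
#include <thread>

int x = 0;              // variable shared through the address space
std::mutex my_lock;     // lock protecting x

void foo() {
  my_lock.lock();
  int seen = ++x;       // update and read x while holding the lock
  my_lock.unlock();
  std::printf("thread 2 saw x = %d\n", seen);
}

int main() {
  std::thread t(foo);   // plays the role of spawn_thread(foo, &x, &my_lock)
  my_lock.lock();
  x++;
  my_lock.unlock();
  t.join();
  std::printf("final x = %d\n", x);
  return 0;
}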
Examples of interconnects

[Figures: a shared interconnect connecting cores, memory, and I/O; a crossbar connecting cores to memory banks; a multi-stage network]

* Caches (not shown) are another implementation of a shared address space (more on this in a later lecture)
Shared address space hardware architecture

Any processor can directly reference any memory location.

[Figure: a four-core processor (Core 1-4) with an integrated GPU, all sharing a memory controller and memory]
SUN Niagara 2 (UltraSPARC T2): crossbar interconnect

Note the area of the crossbar (CCX): about the same area as one core on the chip.

[Figure: cores connected through a crossbar switch to L2 cache banks and memory]

▪ 72 cores, arranged as a 6x6 mesh of tiles (2 cores/tile)

* In practice, you’ll find NUMA behavior on a single-socket system as well (recall: different cache slices are a different distance from each core)
Summary: shared address space model
▪ Communication abstraction
- Threads read/write variables in shared address space
- Threads manipulate synchronization primitives: locks, atomic ops, etc.
- Logical extension of uniprocessor programming *
* But NUMA implementations require reasoning about locality for performance optimization
Message passing model of communication

[Figure: threads with private address spaces communicating only by sending and receiving messages, e.g., send(X, 2, my_msg_id)]

Illustration adapted from Culler, Singh, and Gupta
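The slides do not name a particular message passing API here; as one common concrete instance (my choice), the same idea in MPI, where each process has its own address space and data moves only via explicit send/receive:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int MSG_TAG = 0;   // plays the role of my_msg_id in the figure
  float X = 0.f;

  if (rank == 0) {
    X = 42.f;
    // send the contents of X to the process with rank 1
    MPI_Send(&X, 1, MPI_FLOAT, 1, MSG_TAG, MPI_COMM_WORLD);
  } else if (rank == 1) {
    // block until the message arrives; no variables are shared between processes
    MPI_Recv(&X, 1, MPI_FLOAT, 0, MSG_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank 1 received X = %f\n", X);
  }

  MPI_Finalize();
  return 0;
}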
A common metaphor: snail mail

[Figure: a cluster of workstations connected by an Infiniband network]

* We’ll have multiple lectures in the course about data-parallel programming and data-parallel thinking: this is just a taste
Key data type of data-parallel code: sequences

▪ A sequence is an ordered collection of elements
▪ For example, in a C++-like language: Sequence<T>
▪ Scala lists: List[T]
▪ In a functional language (like Haskell): seq T
▪ In C++ (std::transform; a usage sketch follows below):

template<class InputIt, class OutputIt, class UnaryOperation>
OutputIt transform(InputIt first1, InputIt last1,
                   OutputIt d_first,
                   UnaryOperation unary_op);
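A small usage sketch of std::transform as a map over a sequence (my example; it applies the absolute-value function used in the slides that follow):

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  std::vector<float> x = {-1.f, 2.f, -3.f, 4.f};
  std::vector<float> y(x.size());

  // f :: float -> float, side-effect free: absolute value
  std::transform(x.begin(), x.end(), y.begin(),
                 [](float v) { return std::fabs(v); });

  for (float v : y)
    std::printf("%f\n", v);   // prints 1, 2, 3, 4
  return 0;
}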
Parallelizing map

▪ Since f :: a -> b is a side-effect-free function, applying f to all elements of the sequence can be done in any order without changing the output of the program.

absolute_value(N, x, y);

Given this program, it is reasonable to think of the program as using foreach to “map the loop body onto each element” of the arrays x and y.
// ISPC code:
export void absolute_value(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[i] = -x[i];
        else
            y[i] = x[i];
    }
}

But if we want to be more precise: a sequence is not a first-class ISPC concept. It is implicitly defined by how the program has implemented the array indexing logic in the foreach loop. (There is no operation in ISPC with the semantics: “map this code over all elements of this sequence.”)
// ISPC code:
export void absolute_repeat(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[2*i] = -x[i];
        else
            y[2*i] = x[i];
        y[2*i+1] = y[2*i];
    }
}

This is also a valid ISPC program! It takes the absolute value of the elements of x, then repeats each value twice in the output array y. (It is less obvious how to think of this code as mapping the loop body onto existing sequences.)
Data parallelism in ISPC

// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x

shift_negative(N, x, y);

Think of the loop body as a function. The input/output sequences being mapped over are implicitly defined by the array indexing logic.

What about a sum? If an ISPC function accumulates x[i] directly into a uniform variable (e.g., sum += x[i]; return sum;), it will not compile:
- sum is of type uniform float (one copy of the variable for all program instances)
- x[i] is not a uniform expression (a different value for each program instance)
- Result: compile-time type error
ISPC discussion: sum “reduction”

export uniform float sumall2(
    uniform int N,
    uniform float* x)
{
    uniform float sum;
    float partial = 0.0f;
    foreach (i = 0 ... N)
    {
        partial += x[i];
    }

    // from ISPC math library
    sum = reduce_add(partial);
    return sum;
}

Each instance accumulates a private partial sum (no communication).

Partial sums are added together using the reduce_add() cross-instance communication primitive. The result is the same total sum for all program instances (reduce_add() returns a uniform float).

The ISPC code above will execute in a manner similar to a handwritten C + AVX intrinsics implementation (not shown here).
▪ Other data-parallel operators express more complex patterns on sequences: gather, scatter, reduce, scan, shift, etc. (see the sketch after this list)
- This will be a topic of a later lecture
▪ You will think in terms of data-parallel primitives often in this class, but many modern performance-oriented data-parallel languages do not enforce this structure in the language
- Many languages (like ISPC, CUDA, etc.) choose flexibility/familiarity of imperative C-style syntax over the safety of a more functional form
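As a concrete (and hedged) illustration of two of these primitives outside of ISPC: C++17's parallel algorithms expose reduce and scan directly (whether std::execution::par actually runs in parallel depends on your compiler and standard library; GCC/Clang typically require linking against TBB):

#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
  std::vector<float> x(1024, 1.0f);
  std::vector<float> prefix(x.size());

  // reduce: combine all elements of the sequence into a single value
  float total = std::reduce(std::execution::par, x.begin(), x.end(), 0.0f);

  // exclusive scan: prefix sum of the sequence
  std::exclusive_scan(std::execution::par, x.begin(), x.end(),
                      prefix.begin(), 0.0f);

  std::printf("total = %f, prefix[10] = %f\n", total, prefix[10]);
  return 0;
}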
Summary
▪ Programming models provide a way to think about the organization of parallel programs.
▪ I want you to always be thinking about abstraction vs. implementation for the remainder of
this course.
Speedup(P processors) = Time(1 processor) / Time(P processors)
[Plot: parallelism vs. execution time for the example two-phase program; each phase performs N² total work]
First attempt at parallelism (P processors)

▪ Strategy:
- Step 1: execute in parallel
  - time for phase 1: N²/P
- Step 2: execute serially
  - time for phase 2: N²
▪ Speedup = (N² + N²) / (N²/P + N²) ≤ 2 (the serial second phase limits speedup to at most 2, no matter how large P is)

[Plots: parallelism vs. execution time for the sequential program (two phases of N² work, each at parallelism 1) and for the parallel program (phase 1: N²/P at parallelism P; phase 2: N² at parallelism 1)]
Parallelizing step 2

▪ Strategy:
- Step 1: execute in parallel
  - time for phase 1: N²/P
- Step 2: compute partial sums in parallel, combine results serially
  - time for phase 2: N²/P + P
▪ Overall performance:
- Speedup = (N² + N²) / (N²/P + N²/P + P)
- Overhead of the parallel algorithm: combining the partial sums (the extra P term)

[Plot: parallelism vs. execution time for the parallel program: two phases of length N²/P at parallelism P, followed by a combine step of length P]
Amdahl’s law

▪ Let S = the fraction of total work that is inherently sequential
▪ Max speedup on P processors given by:

  speedup ≤ 1 / (S + (1 - S) / P)

[Plot: max speedup vs. number of processors for S = 0.05 and S = 0.1]
A small serial region can limit speedup on a large parallel machine
Summit supercomputer: 27,648 GPUs x (5,376 ALUs/GPU) = 148,635,648 ALUs
Machine can perform 148 million single precision operations in parallel
What is max speedup if 0.1% of application is serial?
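Working it out with Amdahl's law (S = 0.001): speedup ≤ 1 / (0.001 + 0.999/148,635,648) ≈ 1 / 0.001 = 1000. Even with roughly 148-million-way parallelism available, a 0.1% serial fraction caps the achievable speedup at about 1000x.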
Decomposition: subproblems (a.k.a. “tasks”, “work to do”)
Assignment: parallel threads ** (“workers”)
Orchestration: parallel program (communicating threads)
Mapping: execution on parallel machine

** I had to pick a term
List of tasks: task 0, task 1, task 2, task 3, task 4, ..., task 99
▪ Question: If you were implementing the OS, how would you map the two threads to the four execution contexts?

[Figure: a processor with SIMD execution units and multiple execution contexts (Exec 1, Exec 2)]

Grid solver example from: Culler, Singh, and Gupta
Grid solver algorithm: find the dependencies

C-like pseudocode for the sequential algorithm is provided below.

const int n;   // grid size
float* A;      // assume allocated for a grid of (n+2) x (n+2) elements

void solve(float* A) {
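The rest of solve() is not included above. Based on the update rule and convergence check used in the SPMD versions later in this section, a sequential sketch might look like the following (the row-major indexing helper, loop bounds, and TOLERANCE value are my assumptions):

#include <math.h>

#define TOLERANCE 1e-3f                            // assumed convergence threshold
#define A_AT(A, i, j) ((A)[(i) * (n + 2) + (j)])   // row-major (n+2) x (n+2) grid

void solve(float* A) {
  bool done = false;
  while (!done) {
    float diff = 0.f;
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= n; j++) {
        float prev = A_AT(A, i, j);
        // average a cell with its four neighbors (same update as the SPMD code)
        A_AT(A, i, j) = 0.2f * (A_AT(A, i-1, j) + A_AT(A, i, j-1) + A_AT(A, i, j) +
                                A_AT(A, i+1, j) + A_AT(A, i, j+1));
        diff += fabsf(A_AT(A, i, j) - prev);
      }
    }
    if (diff / (n * n) < TOLERANCE)   // converged when the average change is small
      done = true;
  }
}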
Step 1: identify dependencies (problem decomposition phase)

Each row element depends on the element to its left.

[Figure: N x N grid of cells; dependencies run left-to-right within each row]

Update all red cells in parallel.
Shared address space (with SPMD threads) expression of solver

[Figure: grid rows partitioned among processors P1, P2, P3, P4]

Each thread computes the rows it is responsible for updating.

while (!done) {
  float myDiff = 0.f;
  diff = 0.f;
  barrier(myBarrier, NUM_PROCESSORS);
  for (j=myMin to myMax) {
    for (i = red cells in this row) {
      float prev = A[i,j];
      A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] + A[i+1,j] + A[i,j+1]);
      myDiff += abs(A[i,j] - prev);
    }
  }
  lock(myLock);
  diff += myDiff;
  unlock(myLock);
  barrier(myBarrier, NUM_PROCESSORS);
  if (diff/(n*n) < TOLERANCE)   // check convergence, all threads get same answer
    done = true;
  barrier(myBarrier, NUM_PROCESSORS);
}
Shared address space solver (pseudocode in SPMD execution model)

int n;                // grid size
bool done = false;
float diff = 0.0;
LOCK myLock;
BARRIER myBarrier;

void solve(float* A) {
  float myDiff;
  int threadId = getThreadId();
  int myMin = 1 + (threadId * n / NUM_PROCESSORS);
  int myMax = myMin + (n / NUM_PROCESSORS);

  while (!done) {
    myDiff = 0.f;
    diff = 0.f;
    barrier(myBarrier, NUM_PROCESSORS);
    for (j=myMin to myMax) {
      for (i = red cells in this row) {
        float prev = A[i,j];
        A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] + A[i+1,j] + A[i,j+1]);
        myDiff += abs(A[i,j] - prev);
      }
    }
    lock(myLock);
    diff += myDiff;
    unlock(myLock);
    barrier(myBarrier, NUM_PROCESSORS);
    if (diff/(n*n) < TOLERANCE)   // check convergence, all threads get same answer
      done = true;
    barrier(myBarrier, NUM_PROCESSORS);
  }
}

Is there a problem with this implementation?
Shared address space solver (pseudocode in SPMD execution model)

Improvement: each thread sums locally, then completes the global reduction at the end of the iteration.

int n;                // grid size
bool done = false;

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
  float myDiff;
  int threadId = getThreadId();
  int myMin = 1 + (threadId * n / NUM_PROCESSORS);
  int myMax = myMin + (n / NUM_PROCESSORS);

  while (!done) {
    myDiff = 0.f;
    diff = 0.f;
    barrier(myBarrier, NUM_PROCESSORS);
    for (j=myMin to myMax) {
      for (i = red cells in this row) {
        float prev = A[i,j];
        A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] + A[i+1,j] + A[i,j+1]);
        myDiff += abs(A[i,j] - prev);   // compute partial sum per worker
      }
    }
    lock(myLock);
    diff += myDiff;                     // now we only lock once per thread,
    unlock(myLock);                     // not once per (i,j) loop iteration!
    barrier(myBarrier, NUM_PROCESSORS);
    if (diff/(n*n) < TOLERANCE)   // check convergence, all threads get same answer
      done = true;
    barrier(myBarrier, NUM_PROCESSORS);
  }
}
Barrier synchronization primitive

▪ barrier(num_threads): wait until num_threads threads arrive at this point in the program
▪ Barriers are a conservative way to express dependencies

[Figure: threads P1-P4 compute their red cells, then all wait at a barrier]

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
  float myDiff;
  int threadId = getThreadId();
  int myMin = 1 + (threadId * n / NUM_PROCESSORS);
  int myMax = myMin + (n / NUM_PROCESSORS);

  while (!done) {
    myDiff = 0.f;
    diff = 0.f;
    barrier(myBarrier, NUM_PROCESSORS);
    for (j=myMin to myMax) {
      for (i = red cells in this row) {
        float prev = A[i,j];
        A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] + A[i+1,j] + A[i,j+1]);
        myDiff += abs(A[i,j] - prev);
      }
    }
    lock(myLock);
    diff += myDiff;
    unlock(myLock);
    barrier(myBarrier, NUM_PROCESSORS);
    if (diff/(n*n) < TOLERANCE)   // check convergence, all threads get same answer
      done = true;
    barrier(myBarrier, NUM_PROCESSORS);
  }
}

Note that this version uses three barriers per iteration of the while loop.
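For reference, the barrier(myBarrier, NUM_PROCESSORS) primitive used above maps closely onto C++20's std::barrier (a sketch of mine, not code from the slides; note std::barrier fixes the participant count at construction rather than passing it on every call):

#include <barrier>
#include <thread>
#include <vector>

int main() {
  const int NUM_THREADS = 4;
  std::barrier sync_point(NUM_THREADS);   // participant count fixed here

  auto worker = [&](int id) {
    (void)id;   // id would select this worker's rows of the grid
    for (int iter = 0; iter < 3; iter++) {
      // ... compute this thread's share of the grid update ...
      sync_point.arrive_and_wait();       // wait until all NUM_THREADS arrive
    }
  };

  std::vector<std::thread> threads;
  for (int i = 0; i < NUM_THREADS; i++)
    threads.emplace_back(worker, i);
  for (auto& t : threads)
    t.join();
  return 0;
}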
Shared address space solver: one barrier

Idea: remove dependencies by using a different diff variable in successive loop iterations. Trade off footprint for removing dependencies! (a common parallel programming technique)

int n;             // grid size
bool done = false;
LOCK myLock;
BARRIER myBarrier;
float diff[3];     // global diff, but now 3 copies
float *A = allocate(n+2, n+2);

void solve(float* A) {
  float myDiff;    // thread local variable
  int index = 0;   // thread local variable

  diff[0] = 0.0f;
  barrier(myBarrier, NUM_PROCESSORS);   // one-time only: just for init

  while (!done) {
    myDiff = 0.0f;
    //
    // perform computation (accumulate locally into myDiff)
    //
    lock(myLock);
    diff[index] += myDiff;              // atomically update global diff
    unlock(myLock);
    diff[(index+1) % 3] = 0.0f;
    barrier(myBarrier, NUM_PROCESSORS);
    if (diff[index]/(n*n) < TOLERANCE)
      break;
    index = (index + 1) % 3;
  }
}
Grid solver implementation in two programming models

▪ Data-parallel programming model
- Synchronization:
  - Single logical thread of control, but iterations of the forall loop may be parallelized by the system (implicit barrier at end of forall loop body)
- Communication:
  - Implicit in loads and stores (like shared address space)
  - Special built-in primitives for more complex communication patterns: e.g., reduce (a sketch follows below)
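The slides express this with generic forall pseudocode; as one concrete, hedged rendering of "parallel loop iterations with an implicit barrier at the end, plus a built-in reduce," here is an OpenMP sketch of a single red-cell sweep (my code, not from the slides; indexing assumes the (n+2) x (n+2) row-major grid used earlier, and which parity counts as "red" is arbitrary):

#include <math.h>

// Performs one sweep over the red cells and returns the accumulated diff.
float solve_red_sweep(float* A, int n) {
  float diff = 0.f;

  // Iterations of the loop may run in parallel; there is an implicit barrier
  // at the end of the parallel for, and reduction(+:diff) performs the
  // cross-thread reduce of the per-thread partial diffs.
  #pragma omp parallel for reduction(+:diff)
  for (int i = 1; i <= n; i++) {
    for (int j = 1 + (i % 2); j <= n; j += 2) {   // red cells only: no dependencies among them
      float prev = A[i * (n + 2) + j];
      A[i * (n + 2) + j] = 0.2f * (A[(i - 1) * (n + 2) + j] + A[i * (n + 2) + j - 1] +
                                   A[i * (n + 2) + j] + A[(i + 1) * (n + 2) + j] +
                                   A[i * (n + 2) + j + 1]);
      diff += fabsf(A[i * (n + 2) + j] - prev);
    }
  }
  return diff;   // caller checks diff/(n*n) < TOLERANCE for convergence
}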