04 Progbasics

This lecture covers the basics of parallel programming, focusing on different programming models such as shared address space, message passing, and data parallelism. It discusses the implementation of these models, including how threads communicate and the importance of synchronization mechanisms. The lecture also highlights the complexities of hardware architecture that support these parallel programming abstractions.


Lecture 4:

Parallel Programming Basics

Parallel Computing
Stanford CS149, Fall 2021
REVIEW

Stanford CS149, Fall 2021


Quiz: reviewing ISPC abstractions

This is an ISPC function. It contains two nested for loops.

Consider one ISPC program instance. Which iterations of the two loops are executed in parallel by the ISPC program instance?

Hint: this is a trick question
Answer: none

export void ispc_sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    // assume N % programCount = 0
    for (uniform int i=0; i<N; i+=programCount)
    {
        int idx = i + programIndex;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6; // 3!
        uniform int sign = -1;

        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[idx] = value;
    }
}

Stanford CS149, Fall 2021


Program instances (that run in parallel) are created when the ispc_sinx() function is called

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
ispc_sinx(N, terms, x, result);

Execution timeline: main() begins with sequential execution (C code). The call to ispc_sinx() begins executing programCount instances (0 through 7 in the figure) of ispc_sinx() (ISPC code). When ispc_sinx() returns, all ISPC program instances have completed and sequential execution (C code) resumes.

Each *ISPC program instance* executes the code in the function ispc_sinx serially.
(Parallelism exists because there are multiple program instances, not because of parallelism in the code that defines an ispc function.)
Stanford CS149, Fall 2021


WHAT WE DIDN’T GET TO LAST TIME
Three ways of thinking about parallel computation

(Recall: abstraction vs. implementation)

Stanford CS149, Fall 2021


Three programming models (abstractions)
1. Shared address space
2. Message passing
3. Data parallel

Stanford CS149, Fall 2021


Shared address space model

Stanford CS149, Fall 2021


Review: a program's memory address space

▪ A computer's memory is organized as an array of bytes
▪ Each byte is identified by its "address" in memory (its position in this array)
  (in this class we assume memory is byte-addressable)

"The byte stored at address 0x8 has the value 32."
"The byte stored at address 0x10 (16) has the value 128."

In the illustration on the right, the program's memory address space is 32 bytes in size
(so valid addresses range from 0x0 to 0x1F)

[Figure: a 32-byte address space shown as a column of addresses 0x0-0x1F with example byte values; e.g., address 0x8 holds 32 and address 0x10 holds 128]

Stanford CS149, Fall 2021


The implementation of the linear memory address space abstraction
on a modern computer is complex
The instruction “load the value stored at address X into register R0” might involve a
complex sequence of operations by multiple data caches and access to DRAM

[Figure: eight cores; each core has its own L1 cache (32 KB) and L2 cache (256 KB), all sharing an L3 cache (20 MB) that connects to DRAM (32 GB)]

Stanford CS149, Fall 2021


Shared address space model (abstraction)
Threads communicate by reading/writing to locations in a shared address space (shared variables)

Thread 1:
    int x = 0;
    spawn_thread(foo, &x);

    // write to address holding
    // contents of variable x
    x = 1;

Thread 2:
    void foo(int* x) {
        // read from addr storing
        // contents of variable x
        while (*x == 0) {}
        print(*x);
    }

[Figure: Thread 1 performs a store to x; Thread 2 performs a load from x; both operate on the same location x in the shared address space]
(Communication operations shown in red)

(Pseudocode provided in a fake C-like language for brevity.) Stanford CS149, Fall 2021
A common metaphor:
A shared address space is
like a bulletin board

(Everyone can read/write)

Image credit:
https://thetab.com/us/stanford/2016/07/28/honest-packing-list-freshman-stanford-1278
Stanford CS149, Fall 2021
Coordinating access to shared variables with
synchronization
Thread 1:
    int x = 0;
    Lock my_lock;

    spawn_thread(foo, &x, &my_lock);

    my_lock.lock();
    x++;
    my_lock.unlock();

Thread 2:
    void foo(int* x, Lock* my_lock) {
        my_lock->lock();
        x++;
        my_lock->unlock();

        print(x);
    }
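
For concreteness, the same pattern written against real C++11 primitives might look like the sketch below (a minimal sketch assuming std::thread/std::mutex stand in for the pseudocode's spawn_thread/Lock; this is not the course's reference code).

#include <cstdio>
#include <mutex>
#include <thread>

// Hypothetical concrete version of the pseudocode above:
// both threads increment a shared counter under a mutex.
int x = 0;
std::mutex my_lock;

void foo() {
    std::lock_guard<std::mutex> guard(my_lock);  // lock() on construction, unlock() on destruction
    x++;
}

int main() {
    std::thread t1(foo);              // "thread 2" runs foo()

    {                                 // "thread 1" (main) also updates x under the lock
        std::lock_guard<std::mutex> guard(my_lock);
        x++;
    }

    t1.join();                        // wait for the spawned thread before reading x
    std::printf("x = %d\n", x);       // always prints 2
    return 0;
}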

Stanford CS149, Fall 2021


Review: why do we need mutual exclusion?
▪ Each thread executes:
-Load the value of variable x from a location in memory into register r1
(this stores a copy of the value in memory in the register)
- Add the contents of register r2 to register r1
- Store the value of register r1 into the address storing the program variable x
▪ One possible interleaving: (let starting value of x=0, r2=1)
T1 T2
r1 ← x T1 reads value 0
r1 ← x T2 reads value 0
r1 ← r1 + r2 T1 sets value of its r1 to 1
r1 ← r1 + r2 T2 sets value of its r1 to 1
X ← r1 T1 stores 1 to address of x
X ← r1 T2 stores 1 to address of x

▪ This set of three instructions needs to be "atomic"


Stanford CS149, Fall 2021
Examples of mechanisms for preserving atomicity
▪ Lock/unlock mutex around a critical section
mylock.lock();
// critical section
mylock.unlock();

▪ Some languages have first-class support for atomicity of code blocks


atomic {
// critical section
}

▪ Intrinsics for hardware-supported atomic read-modify-write operations


atomicAdd(x, 10);
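
As a hedged illustration of the last mechanism, the sketch below uses C++'s std::atomic, whose fetch_add provides the same kind of hardware-supported atomic read-modify-write as the atomicAdd intrinsic shown above.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Each thread performs hardware-supported atomic read-modify-write adds,
// so no lock is needed around the increment.
std::atomic<int> x{0};

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; t++)
        workers.emplace_back([] {
            for (int i = 0; i < 1000; i++)
                x.fetch_add(10);              // analogous to atomicAdd(x, 10)
        });

    for (auto& w : workers)
        w.join();

    std::printf("x = %d\n", x.load());        // always 4 * 1000 * 10 = 40000
    return 0;
}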

Stanford CS149, Fall 2021


Review: shared address space model
▪ Threads communicate by:
- Reading/writing to shared variables in a shared address space
- Inter-thread communication is implicit in memory loads/stores
- Manipulating synchronization primitives
- e.g., ensuring mutual exclusion via use of locks

▪ This is a natural extension of sequential programming


- In fact, all our discussions in class have assumed a shared address space so far!

Stanford CS149, Fall 2021


Hardware implementation of a shared address space
Key idea: any processor can directly reference contents of any memory location

Examples of interconnects:

[Figure: three interconnect topologies connecting cores (each with a local cache) to memory and I/O: a shared bus, a crossbar, and a multi-stage network]
* Caches (not shown) are another implementation of a shared address space (more on this in a later lecture)
Stanford CS149, Fall 2021
Shared address space hardware architecture
Any processor can directly reference any memory location

[Figure: Intel Core i7 (quad core): four CPU cores plus an integrated GPU share a memory controller attached to memory]

Example: Intel Core i7 processor (Kaby Lake) (interconnect is a ring)

Stanford CS149, Fall 2021


Intel’s ring interconnect
Introduced in Sandy Bridge microarchitecture
▪ Four rings, for different types of messages:
  - request
  - snoop
  - ack
  - data (32 bytes)

▪ Six interconnect nodes: four "slices" of L3 cache + system agent + graphics

▪ Each bank of L3 connected to ring bus twice

▪ Theoretical peak BW from cores to L3 at 3.4 GHz ~ 435 GB/sec
  - When each core is accessing its local slice

[Figure: ring interconnect joining the system agent, four cores each paired with a 2 MB L3 cache slice, and the graphics unit]
Stanford CS149, Fall 2021
SUN Niagara 2 (UltraSPARC T2): crossbar interconnect
Note area of crossbar (CCX):
about same area as one core on chip

[Figure: eight-core processor; eight cores connected through the crossbar switch to four L2 cache banks, each with its own memory interface]


Stanford CS149, Fall 2021
KNL Mesh Interconnect
Intel Xeon Phi (Knights Landing)
[Figure: 6x6 mesh of tiles surrounded by EDC/MCDRAM (OPIO), DDR memory controllers (iMC), and PCIe/misc I/O blocks]

▪ 72 cores, arranged as 6x6 mesh of tiles (2 cores/tile)

▪ YX routing of messages:
  - Message travels in Y direction
  - "Turn"
  - Message travels in X direction

Stanford CS149, Fall 2021


Non-uniform memory access (NUMA)
The latency of accessing a memory location may be different from different processing cores in the system
Bandwidth from any one location may also be different to different CPU cores *

Example: modern multi-socket configuration


[Figure: two sockets connected by an interconnect; each socket has four cores and a memory controller attached to its own local memory, so a core reaches remote memory through the other socket]

* In practice, you’ll find NUMA behavior on a single-socket system as well (recall: different cache slices are a different distance from each core)
Stanford CS149, Fall 2021
Summary: shared address space model
▪ Communication abstraction
- Threads read/write variables in shared address space
- Threads manipulate synchronization primitives: locks, atomic ops, etc.
- Logical extension of uniprocessor programming *

▪ Requires hardware support to implement efficiently


- Any processor can load and store from any address
- Can be costly to scale to large numbers of processors
(one of the reasons why high-core count processors are expensive)

* But NUMA implementations require reasoning about locality for performance optimization Stanford CS149, Fall 2021
Message passing model of
communication

Stanford CS149, Fall 2021


Message passing model (abstraction)
▪ Threads operate within their own private address spaces
▪ Threads communicate by sending/receiving messages
  - send: specifies recipient, buffer to be transmitted, and optional message identifier ("tag")
  - receive: specifies sender, buffer to store data, and optional message identifier
  - Sending messages is the only way to exchange data between threads 1 and 2
    - Why?

[Figure: Thread 1's address space holds variable X; Thread 2's address space holds variable Y]

send(X, 2, my_msg_id)
  semantics: send contents of local variable X as a message to thread 2 and tag the message with the id "my_msg_id"

recv(Y, 1, my_msg_id)
  semantics: receive message with id "my_msg_id" from thread 1 and store contents in local variable Y

(Communication operations shown in red)

Illustration adopted from Culler, Singh, Gupta Stanford CS149, Fall 2021
A common metaphor: snail mail

Stanford CS149, Fall 2021


Message passing (implementation)
▪ Hardware need not implement system-wide loads and stores to execute message passing programs (it need
only communicate messages between nodes)
- Can connect commodity systems together to form a large parallel machine
(message passing is a programming model for clusters and supercomputers)

Cluster of workstations
(Infiniband network)
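
As a hedged example of what message passing looks like on such a cluster, here is a minimal MPI sketch of the send/recv semantics from the abstraction slide (MPI is assumed here as the messaging library; compile with mpicxx and run with mpirun -np 2).

// Minimal MPI sketch of the send/recv semantics shown earlier.
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int MSG_TAG = 17;   // plays the role of "my_msg_id"

    if (rank == 0) {
        float X = 42.0f;
        // send contents of local variable X to rank 1, tagged MSG_TAG
        MPI_Send(&X, 1, MPI_FLOAT, 1, MSG_TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        float Y;
        // receive the message tagged MSG_TAG from rank 0 into local variable Y
        MPI_Recv(&Y, 1, MPI_FLOAT, 0, MSG_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received Y = %f\n", Y);
    }

    MPI_Finalize();
    return 0;
}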

Stanford CS149, Fall 2021


The data-parallel model

Stanford CS149, Fall 2021


Programming models provide a way to think about the organization
of parallel programs (by imposing structure)

▪ Shared address space: very little structure to communication


- All threads can read and write to all shared variables

▪ Message passing: communication is structured in the form of messages


- All communication occurs in the form of messages
- Communication is explicit in source code (the sends and receives)

▪ Data parallel structure: more rigid structure to computation


- Perform same function on elements of large collections

Stanford CS149, Fall 2021


Data-parallel model *
▪ Organize computation as operations on sequences of elements
- e.g., perform same function on all elements of a sequence

▪ A well-known modern example: NumPy: C = A + B


(A, B, and C are vectors of same length)

Something you've seen earlier in the lecture…

* We’ll have multiple lectures in the course about data-parallel programming and data-parallel thinking: this is just a taste
Stanford CS149, Fall 2021
Key data type of data-parallel code: sequences
▪ A sequence is an ordered collection of elements
▪ For example, in a C++-like language: Sequence<T>
▪ Scala lists: List[T]
▪ In a functional language (like Haskell): seq T

▪ Program can only access elements of sequence through sequence operators:


- map, reduce, scan, shift, etc.

Stanford CS149, Fall 2021


Map
▪ Higher order function (function that takes a function as an argument) that operates on sequences
▪ Applies side-effect-free unary function f :: a -> b to all elements of input sequence, to produce
output sequence of the same length
▪ In a functional language (e.g., Haskell)
- map :: (a -> b) -> seq a -> seq b

▪ In C++:
template<class InputIt, class OutputIt, class UnaryOperation>
OutputIt transform(InputIt first1, InputIt last1,
OutputIt d_first,
UnaryOperation unary_op);
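
A short usage sketch of the std::transform signature above (the vectors and lambda are illustrative):

#include <algorithm>
#include <cmath>
#include <vector>

int main() {
    std::vector<float> x = {1.0f, -2.0f, 3.0f};
    std::vector<float> y(x.size());

    // "map" the side-effect-free unary function f(v) = |v| over the input sequence
    std::transform(x.begin(), x.end(), y.begin(),
                   [](float v) { return std::fabs(v); });
    // y now holds {1, 2, 3}
    return 0;
}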
Stanford CS149, Fall 2021
Parallelizing map
▪ Since f :: a -> b is a function (side-effect free), then applying f to all elements of
the sequence can be done in any order without changing the output of the program

▪ The implementation of map has flexibility to reorder/parallelize processing of elements of the sequence however it sees fit
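
One concrete (and hedged) way a C++ implementation exposes this freedom is the C++17 parallel execution policy; whether the call actually runs in parallel or vectorized depends on the standard library and hardware (e.g., GCC requires linking against TBB):

#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

int main() {
    std::vector<float> x(1 << 20, -1.0f);
    std::vector<float> y(x.size());

    // Because the lambda is side-effect free, the library is free to process
    // elements in any order, including in parallel and vectorized.
    std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                   [](float v) { return std::fabs(v); });
    return 0;
}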

Stanford CS149, Fall 2021


Data parallelism in ISPC
// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x here

absolute_value(N, x, y);

// ISPC code:
export void absolute_value(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[i] = -x[i];
        else
            y[i] = x[i];
    }
}

foreach construct: think of the loop body as a function.

Given this program, it is reasonable to think of the program as using foreach to "map the loop body onto each element" of the arrays x and y.

But if we want to be more precise: a sequence is not a first-class ISPC concept. It is implicitly defined by how the program has implemented array indexing logic in the foreach loop.
(There is no operation in ISPC with the semantic: "map this code over all elements of this sequence.")

Stanford CS149, Fall 2021


Data parallelism in ISPC
// main C++ code:
const int N = 1024;
float* x = new float[N/2];
float* y = new float[N];

// initialize N/2 elements of x here

absolute_repeat(N/2, x, y);

// ISPC code:
export void absolute_repeat(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[2*i] = -x[i];
        else
            y[2*i] = x[i];
        y[2*i+1] = y[2*i];
    }
}

Think of the loop body as a function. The input/output sequences being mapped over are implicitly defined by array indexing logic.

This is also a valid ISPC program! It takes the absolute value of elements of x, then repeats each result twice in the output array y.
(Less obvious how to think of this code as mapping the loop body onto existing sequences.)
Stanford CS149, Fall 2021
Data parallelism in ISPC
// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x

shift_negative(N, x, y);

// ISPC code:
export void shift_negative(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (i >= 1 && x[i] < 0)
            y[i-1] = x[i];
        else
            y[i] = x[i];
    }
}

Think of the loop body as a function. The input/output sequences being mapped over are implicitly defined by array indexing logic.

The output of this program is undefined! It is possible for multiple iterations of the loop body to write to the same memory location.

The data-parallel model (foreach) provides no specification of the order in which iterations occur.

Stanford CS149, Fall 2021


ISPC discussion: sum “reduction”
Compute the sum of all array elements in parallel
export uniform float sumall1(uniform int N, uniform float* x)
{
    uniform float sum = 0.0f;
    foreach (i = 0 ... N)
    {
        sum += x[i];
    }
    return sum;
}

export uniform float sumall2(uniform int N, uniform float* x)
{
    uniform float sum;
    float partial = 0.0f;
    foreach (i = 0 ... N)
    {
        partial += x[i];
    }

    // from ISPC math library
    sum = reduce_add(partial);

    return sum;
}

sumall2 is the correct ISPC solution.

In sumall1: sum is of type uniform float (one copy of the variable for all program instances), but x[i] is not a uniform expression (different value for each program instance).
Result: compile-time type error.
Stanford CS149, Fall 2021
ISPC discussion: sum “reduction”
Each instance accumulates a private partial sum (no communication).

Partial sums are added together using the reduce_add() cross-instance communication primitive. The result is the same total sum for all program instances (reduce_add() returns a uniform float).

export uniform float sumall2(
    uniform int N,
    uniform float* x)
{
    uniform float sum;
    float partial = 0.0f;
    foreach (i = 0 ... N)
    {
        partial += x[i];
    }

    // from ISPC math library
    sum = reduce_add(partial);

    return sum;
}

The ISPC code above will execute in a manner similar to the handwritten C + AVX intrinsics implementation below. *

float sumall2(int N, float* x) {
    float tmp[8];   // assume 32-byte alignment
    __m256 partial = _mm256_setzero_ps();

    for (int i=0; i<N; i+=8)
        partial = _mm256_add_ps(partial, _mm256_load_ps(&x[i]));

    _mm256_store_ps(tmp, partial);

    float sum = 0.f;
    for (int i=0; i<8; i++)
        sum += tmp[i];

    return sum;
}

* Self-test: If you understand why this implementation complies with the semantics of the ISPC gang abstraction, then you've got a good command of ISPC.

Stanford CS149, Fall 2021


Summary: data-parallel model
▪ Data-parallelism is about imposing rigid program structure to facilitate simple programming
and advanced optimizations

▪ Basic structure: map a function onto a large collection of data


- Functional: side-effect free execution
- No communication among distinct function invocations
(allow invocations to be scheduled in any order, including in parallel)

▪ Other data parallel operators express more complex patterns on sequences: gather, scatter,
reduce, scan, shift, etc.
- This will be a topic of a later lecture

▪ You will think in terms of data-parallel primitives often in this class, but many modern
performance-oriented data-parallel languages do not enforce this structure in the language
- Many languages (like ISPC, CUDA, etc.) choose flexibility/familiarity of imperative C-style syntax over the safety of a more
functional form
Stanford CS149, Fall 2021
Summary
▪ Programming models provide a way to think about the organization of parallel programs.

▪ They provide abstractions that permit multiple valid implementations.

▪ I want you to always be thinking about abstraction vs. implementation for the remainder of
this course.

Stanford CS149, Fall 2021


Parallel Programming Basics

Stanford CS149, Fall 2021


Creating a parallel program
▪ Thought process:
1. Identify work that can be performed in parallel
2. Partition work (and also data associated with the work)
3. Manage data access, communication, and synchronization

▪ A common goal is maximizing speedup *


For a fixed computation:

Speedup(P processors) = Time(1 processor) / Time(P processors)

* Other goals include high efficiency (cost, area, power, etc.)


or working on bigger problems than can fit on one machine Stanford CS149, Fall 2021
Creating a parallel program
Problem to solve
    ↓ Decomposition
Subproblems (a.k.a. "tasks", "work to do")
    ↓ Assignment
Parallel threads ** ("workers")
    ↓ Orchestration
Parallel program (communicating threads)
    ↓ Mapping
Execution on parallel machine

** I had to pick a term

These responsibilities may be assumed by the programmer, by the system (compiler, runtime, hardware), or by both!
Adopted from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Problem decomposition
▪ Break up problem into tasks that can be carried out in parallel
▪ In general: create at least enough tasks to keep all execution units on a machine busy

Key challenge of decomposition:


identifying dependencies
(or... a lack of dependencies)

Stanford CS149, Fall 2021


Amdahl’s Law: dependencies limit maximum speedup
due to parallelism

▪ You run your favorite sequential program...

▪ Let S = the fraction of sequential execution that is inherently sequential (dependencies prevent parallel execution)

▪ Then maximum speedup due to parallel execution ≤ 1/S

Stanford CS149, Fall 2021


A simple example
▪ Consider a two-step computation on a N x N image
- Step 1: multiply brightness of all pixels by two
(independent computation on each pixel)
- Step 2: compute average of all pixel values

▪ Sequential implementation of program


- Both steps take ~N^2 time, so total time is ~2N^2

[Figure: parallelism vs. execution time for the sequential program: two phases, each taking N^2 time, both at parallelism 1]
Stanford CS149, Fall 2021
First attempt at parallelism (P processors)
▪ Strategy:
  - Step 1: execute in parallel
    - time for phase 1: N^2/P
  - Step 2: execute serially
    - time for phase 2: N^2

▪ Overall performance:
  - Speedup = 2N^2 / (N^2/P + N^2) = 2P / (P + 1), so Speedup ≤ 2

[Figure: sequential program (two N^2 phases at parallelism 1) vs. parallel program (phase 1 at parallelism P taking N^2/P, followed by a serial phase taking N^2)]
Stanford CS149, Fall 2021
Parallelizing step 2
▪ Strategy:
  - Step 1: execute in parallel
    - time for phase 1: N^2/P
  - Step 2: compute partial sums in parallel, combine results serially
    - time for phase 2: N^2/P + P

▪ Overall performance:
  - Speedup = 2N^2 / (2N^2/P + P)
  - Overhead of parallel algorithm: combining the partial sums
  - Note: speedup → P when N >> P

[Figure: parallel program timeline: both phases run at parallelism P, plus a small serial portion (of length P) for combining the partial sums]
Stanford CS149, Fall 2021
Amdahl’s law
▪ Let S = the fraction of total work that is inherently sequential
▪ Max speedup on P processors given by:
  Speedup(P) ≤ 1 / (S + (1 - S)/P)

[Figure: max speedup vs. number of processors for S = 0.01, 0.05, and 0.1; each curve flattens out at 1/S]
Stanford CS149, Fall 2021
A small serial region can limit speedup on a large parallel machine
Summit supercomputer: 27,648 GPUs x (5,376 ALUs/GPU) = 148,635,648 ALUs
Machine can perform 148 million single precision operations in parallel
What is max speedup if 0.1% of application is serial? (At most 1/0.001 = 1000x, no matter how many of the 148 million ALUs are used.)

Stanford CS149, Fall 2021


Decomposition
▪ Who is responsible for decomposing a program into independent tasks?
- In most cases: the programmer

▪ Automatic decomposition of sequential programs continues to be a challenging


research problem
(very difficult in general case)
- Compiler must analyze program, identify dependencies
- What if dependencies are data dependent (not known at compile time)?
- Researchers have had modest success with simple loop nests
- The “magic parallelizing compiler” for complex, general-purpose code has not yet been achieved

Stanford CS149, Fall 2021


Assignment
[Same pipeline as before: Problem to solve → (Decomposition) → Subproblems ("tasks") → (Assignment) → Parallel threads ("workers") → (Orchestration) → Parallel program → (Mapping) → Execution on parallel machine. This slide highlights the Assignment step. ** I had to pick a term]

Stanford CS149, Fall 2021


Assignment
▪ Assigning tasks to threads ** ** I had to pick a term
(will explain in a second)
- Think of “tasks” as things to do
- Think of threads as “workers”
▪ Goals: achieve good workload balance, reduce communication costs

▪ Can be performed statically (before application is run), or dynamically as program executes

▪ Although programmer is often responsible for decomposition, many languages/runtimes take


responsibility for assignment.

Stanford CS149, Fall 2021


Assignment examples in ISPC
export void ispc_sinx_interleaved(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    // assumes N % programCount = 0
    for (uniform int i=0; i<N; i+=programCount)
    {
        int idx = i + programIndex;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6; // 3!
        uniform int sign = -1;

        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[idx] = value;
    }
}

Decomposition of work by loop iteration.
Programmer-managed assignment: static assignment; iterations are assigned to ISPC program instances in an interleaved fashion.

export void ispc_sinx_foreach(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    foreach (i = 0 ... N)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        uniform int denom = 6; // 3!
        uniform int sign = -1;

        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[i] = value;
    }
}

Decomposition of work by loop iteration.
The foreach construct exposes independent work to the system, which manages assignment of iterations (work) to ISPC program instances (the abstraction leaves room for dynamic assignment, but the current ISPC implementation is static).
Stanford CS149, Fall 2021
Example 2: static assignment using C++11 threads
void my_thread_start(int N, int terms, float* x, float* result) {
    sinx(N, terms, x, result);   // do work
}

void parallel_sinx(int N, int terms, float* x, float* result) {
    int half = N/2;

    // launch thread to do work on first half of array
    std::thread t1(my_thread_start, half, terms, x, result);

    // do work on second half of array in main thread
    sinx(N - half, terms, x + half, result + half);

    t1.join();
}

Decomposition of work by loop iteration.
Programmer-managed static assignment: this program assigns loop iterations to threads in a blocked fashion (first half of the array is assigned to the spawned thread, second half to the main thread).

Stanford CS149, Fall 2021


Dynamic assignment using ISPC tasks
void foo(uniform float* input,
         uniform float* output,
         uniform int N)
{
    // create a bunch of tasks
    launch[100] my_ispc_task(input, output, N);
}

The ISPC runtime assigns tasks to worker threads.

[Figure: a list of tasks (task 0, task 1, task 2, ..., task 99) with a "next task" pointer, consumed by worker threads 0-3]

Implementation of task assignment to threads: after completing its current task, a worker thread inspects the list and assigns itself the next uncompleted task.
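
A hedged C++ sketch of the "grab the next uncompleted task" idea (this is not ISPC's actual runtime; it just illustrates dynamic assignment with an atomic next-task counter, and my_task is a hypothetical task body):

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical task body: in ISPC this would be my_ispc_task(input, output, N).
void my_task(int task_id) {
    std::printf("completed task %d\n", task_id);
}

// Worker threads repeatedly claim the next task index until all tasks are done.
void run_tasks(int num_tasks, int num_workers) {
    std::atomic<int> next_task{0};

    auto worker = [&] {
        while (true) {
            int id = next_task.fetch_add(1);   // atomically claim the next uncompleted task
            if (id >= num_tasks) break;        // task list exhausted
            my_task(id);
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < num_workers; i++)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();
}

int main() {
    run_tasks(100, 4);   // mirrors launch[100] consumed by four worker threads
    return 0;
}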

Stanford CS149, Fall 2021


Orchestration
[Same pipeline as before: Problem to solve → (Decomposition) → Subproblems ("tasks") → (Assignment) → Parallel threads ("workers") → (Orchestration) → Parallel program → (Mapping) → Execution on parallel machine. This slide highlights the Orchestration step. ** I had to pick a term]

Stanford CS149, Fall 2021


Orchestration
▪ Involves:
- Structuring communication
- Adding synchronization to preserve dependencies if necessary
- Organizing data structures in memory
- Scheduling tasks

▪ Goals: reduce costs of communication/sync, preserve locality of data reference, reduce


overhead, etc.

▪ Machine details impact many of these decisions


- If synchronization is expensive, the programmer might use it more sparingly

Stanford CS149, Fall 2021


Mapping to hardware
[Same pipeline as before: Problem to solve → (Decomposition) → Subproblems ("tasks") → (Assignment) → Parallel threads ("workers") → (Orchestration) → Parallel program → (Mapping) → Execution on parallel machine. This slide highlights the Mapping step. ** I had to pick a term]

Stanford CS149, Fall 2021


Mapping to hardware
▪ Mapping “threads” (“workers”) to hardware execution units
▪ Example 1: mapping by the operating system
- e.g., map a thread to HW execution context on a CPU core

▪ Example 2: mapping by the compiler


- Map ISPC program instances to vector instruction lanes

▪ Example 3: mapping by the hardware


- Map CUDA thread blocks to GPU cores (discussed in future lecture)

▪ Some interesting mapping decisions:


- Place related threads (cooperating threads) on the same processor
(maximize locality, data sharing, minimize costs of comm/sync)
- Place unrelated threads on the same processor (one might be bandwidth limited and another might be compute limited) to
use machine more efficiently
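
As a hedged, Linux-specific sketch of what explicit thread-to-core placement can look like (pthread_setaffinity_np is a glibc extension; the helper name pin_to_core is illustrative, and most programs simply let the OS choose):

// Linux-specific sketch: pin a std::thread to a specific core (glibc extension).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>

void pin_to_core(std::thread& t, int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    // Ask the OS to schedule this thread only on the given execution context.
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
}

int main() {
    std::thread t([] { /* ... thread's work ... */ });
    pin_to_core(t, 0);   // request placement on core 0
    t.join();
    return 0;
}

Pinning like this is one way a program can realize the "place related threads on the same processor" decision described above, rather than leaving the choice entirely to the OS.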
Stanford CS149, Fall 2021
Example: mapping to hardware
▪ Consider an application that creates two threads
▪ The application runs on the processor shown below
  - Two cores, two execution contexts per core, up to two instructions per clock, one of which can be an 8-wide SIMD instruction.

▪ Question: "who" is responsible for mapping the application's threads to the processor's thread execution contexts?
  Answer: the operating system

▪ Question: If you were implementing the OS, how would you map the two threads to the four execution contexts?

▪ Another question: How would you map threads to execution contexts if your C program spawned five threads?

[Figure: two cores; each core has two fetch/decode units, a scalar execution unit, an 8-wide SIMD execution unit, and two execution contexts]

Stanford CS149, Fall 2021


A parallel programming example

Stanford CS149, Fall 2021


A 2D-grid based solver
▪ Problem: solve partial differential equation (PDE) on (N+2) x (N+2) grid
▪ Solution uses iterative algorithm:
- Perform Gauss-Seidel sweeps over grid until convergence
N
A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j]
+ A[i,j+1] + A[i+1,j]);

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Grid solver algorithm: find the dependencies
C-like pseudocode for sequential algorithm is provided below
const int n;
float* A;   // assume allocated for grid of N+2 x N+2 elements

void solve(float* A) {

    float diff, prev;
    bool done = false;

    while (!done) {                       // outermost loop: iterations
        diff = 0.f;
        for (int i=1; i<n; i++) {         // iterate over non-border points of grid
            for (int j=1; j<n; j++) {
                prev = A[i,j];
                A[i,j] = 0.2f * (A[i,j] + A[i,j-1] + A[i-1,j] +
                                 A[i,j+1] + A[i+1,j]);
                diff += fabs(A[i,j] - prev);   // compute amount of change
            }
        }

        if (diff/(n*n) < TOLERANCE)       // quit if converged
            done = true;
    }
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Step 1: identify dependencies
(problem decomposition phase)
Each row element depends on the element to its left.
Each row depends on the previous row.

[Figure: N x N grid of cells with arrows showing left-to-right dependencies within a row and top-to-bottom dependencies between rows]

Note: the dependencies illustrated on this slide are grid element data dependencies in one iteration of the solver (in one iteration of the "while not done" loop).

Stanford CS149, Fall 2021


Step 1: identify dependencies
(problem decomposition phase)
There is independent work along the diagonals!

Good: parallelism exists!

Possible implementation strategy:
1. Partition grid cells on a diagonal into tasks
2. Update values in parallel
3. When complete, move to next diagonal

Bad: independent work is hard to exploit.
Not much parallelism at beginning and end of computation.
Frequent synchronization (after completing each diagonal).

[Figure: N x N grid with anti-diagonals highlighted; cells along one diagonal can be updated independently]

Stanford CS149, Fall 2021


Let’s make life easier on ourselves
▪ Idea: improve performance by changing the algorithm to one that is more amenable
to parallelism
- Change the order in which grid cells are updated
- New algorithm iterates to same solution (approximately), but converges to solution
differently
- Note: floating-point values computed are different, but solution still converges to within error threshold
- Yes, we needed domain knowledge of the Gauss-Seidel method to realize this
change is permissible
- But this is a common technique in parallel programming

Stanford CS149, Fall 2021


New approach: reorder grid cell update via red-black coloring
Reorder grid traversal: red-black coloring

[Figure: N x N grid colored in a red/black checkerboard pattern]

Update all red cells in parallel.

When done updating red cells, update all black cells in parallel (respect dependency on red cells).

Repeat until convergence.

Stanford CS149, Fall 2021


Possible assignments of work to processors
Reorder grid traversal: red-black coloring

Question: Which is better? Does it matter?


Answer: it depends on the system this program is running on
Stanford CS149, Fall 2021
Consider dependencies in the program
1. Perform red cell update in parallel
2. Wait until all processors done with update
3. Communicate updated red cells to other processors
4. Perform black cell update in parallel
5. Wait until all processors done with update
6. Communicate updated black cells to other processors
7. Repeat

[Figure: timeline for processors P1-P4: compute red cells, wait, compute black cells, wait]

Stanford CS149, Fall 2021


Communication resulting from assignment
Reorder grid traversal: red-black coloring

= data that must be sent to P2 each iteration


Blocked assignment requires less data to be communicated between processors
Stanford CS149, Fall 2021
Two ways to think about writing this program
▪ Data parallel thinking

▪ SPMD / shared address space

Stanford CS149, Fall 2021


Data-parallel expression of solver

Stanford CS149, Fall 2021


Data-parallel expression of grid solver
Note: to simplify the pseudocode, only the red-cell update is shown.

const int n;
float* A = allocate(n+2, n+2);   // allocate grid

void solve(float* A) {

    bool done = false;
    float diff = 0.f;

    while (!done) {
        for_all (red cells (i,j)) {
            float prev = A[i,j];
            A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                             A[i+1,j] + A[i,j+1]);
            reduceAdd(diff, abs(A[i,j] - prev));
        }

        if (diff/(n*n) < TOLERANCE)
            done = true;
    }
}

Decomposition: processing individual grid elements constitutes independent work.

Assignment: ??? (left to the system)

Orchestration: handled by the system (built-in communication primitive: reduceAdd; the end of the for_all block is an implicit wait for all workers before returning to sequential control).
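
A hedged sketch of how the same red-cell sweep might be written with OpenMP, whose parallel for and reduction(+:diff) clause play roughly the roles of for_all and reduceAdd in the pseudocode (OpenMP is an assumption here, not the notation used by the lecture; compile with -fopenmp):

#include <cmath>

// One red-cell sweep over an (n+2) x (n+2) grid stored row-major in A.
// The reduction(+:diff) clause combines per-thread partial sums, like reduceAdd.
float red_sweep(int n, float* A) {
    float diff = 0.f;
    #pragma omp parallel for reduction(+:diff)
    for (int i = 1; i <= n; i++) {
        // red cells: one color of the checkerboard ((i + j) odd here)
        for (int j = 1 + (i % 2); j <= n; j += 2) {
            float prev = A[i*(n+2) + j];
            A[i*(n+2) + j] = 0.2f * (A[(i-1)*(n+2) + j] + A[i*(n+2) + (j-1)] +
                                     A[i*(n+2) + j]     + A[(i+1)*(n+2) + j] +
                                     A[i*(n+2) + (j+1)]);
            diff += std::fabs(A[i*(n+2) + j] - prev);
        }
    }
    return diff;
}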

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Shared address space (with SPMD
threads) expression of solver

Stanford CS149, Fall 2021


Shared address space expression of solver
SPMD execution model
▪ Programmer is responsible for synchronization

▪ Common synchronization primitives:
  - Locks (provide mutual exclusion): only one thread in the critical region at a time
  - Barriers: wait for all threads to reach this point

[Figure: timeline for processors P1-P4: compute red cells, wait, compute black cells, wait]

Stanford CS149, Fall 2021


Shared address space solver (pseudocode in SPMD execution model)
Assume these are global variables (accessible to all threads).
Assume the solve function is executed by all threads (SPMD-style).
The value of threadId is different for each SPMD instance: use it to compute the region of the grid to work on.
Each thread computes the rows it is responsible for updating.

int     n;              // grid size
bool    done = false;
float   diff = 0.0;
LOCK    myLock;
BARRIER myBarrier;

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
    int threadId = getThreadId();
    int myMin = 1 + (threadId * n / NUM_PROCESSORS);
    int myMax = myMin + (n / NUM_PROCESSORS);

    while (!done) {
        diff = 0.f;
        barrier(myBarrier, NUM_PROCESSORS);
        for (j=myMin to myMax) {
            for (i = red cells in this row) {
                float prev = A[i,j];
                A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                                 A[i+1,j] + A[i,j+1]);
                lock(myLock);
                diff += abs(A[i,j] - prev);
                unlock(myLock);
            }
        }
        barrier(myBarrier, NUM_PROCESSORS);
        if (diff/(n*n) < TOLERANCE)   // check convergence, all threads get same answer
            done = true;
        barrier(myBarrier, NUM_PROCESSORS);
    }
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Shared address space solver (pseudocode in SPMD execution model)
Do you see a potential performance problem with this implementation?

int     n;              // grid size
bool    done = false;
float   diff = 0.0;
LOCK    myLock;
BARRIER myBarrier;

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
    int threadId = getThreadId();
    int myMin = 1 + (threadId * n / NUM_PROCESSORS);
    int myMax = myMin + (n / NUM_PROCESSORS);

    while (!done) {
        diff = 0.f;
        barrier(myBarrier, NUM_PROCESSORS);
        for (j=myMin to myMax) {
            for (i = red cells in this row) {
                float prev = A[i,j];
                A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                                 A[i+1,j] + A[i,j+1]);
                lock(myLock);
                diff += abs(A[i,j] - prev);
                unlock(myLock);
            }
        }
        barrier(myBarrier, NUM_PROCESSORS);
        if (diff/(n*n) < TOLERANCE)   // check convergence, all threads get same answer
            done = true;
        barrier(myBarrier, NUM_PROCESSORS);
    }
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Shared address space solver (pseudocode in SPMD execution model)
Improve performance by accumulating into a partial sum locally, then complete the global reduction at the end of the iteration.

int     n;              // grid size
bool    done = false;
float   diff = 0.0;
LOCK    myLock;
BARRIER myBarrier;

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
    float myDiff;
    int threadId = getThreadId();
    int myMin = 1 + (threadId * n / NUM_PROCESSORS);
    int myMax = myMin + (n / NUM_PROCESSORS);

    while (!done) {
        myDiff = 0.f;
        diff = 0.f;
        barrier(myBarrier, NUM_PROCESSORS);
        for (j=myMin to myMax) {
            for (i = red cells in this row) {
                float prev = A[i,j];
                A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                                 A[i+1,j] + A[i,j+1]);
                myDiff += abs(A[i,j] - prev);   // compute partial sum per worker
            }
        }
        lock(myLock);
        diff += myDiff;     // now only lock once per thread, not once per (i,j) loop iteration!
        unlock(myLock);
        barrier(myBarrier, NUM_PROCESSORS);
        if (diff/(n*n) < TOLERANCE)   // check convergence, all threads get same answer
            done = true;
        barrier(myBarrier, NUM_PROCESSORS);
    }
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Barrier synchronization primitive
▪ barrier(num_threads)
▪ Barriers are a conservative way to express dependencies
▪ Barriers divide computation into phases
▪ All computations by all threads before the barrier complete before any computation in any thread after the barrier begins
  - In other words, all computations after the barrier are assumed to depend on all computations before the barrier

[Figure: timeline for processors P1-P4: compute red cells, barrier, compute black cells, barrier]
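
For reference, C++20 provides this primitive directly as std::barrier; below is a minimal sketch of the red/black phase structure under that assumption (the pseudocode's barrier(myBarrier, NUM_PROCESSORS) plays the same role):

#include <barrier>
#include <thread>
#include <vector>

// Each thread alternates phases; arrive_and_wait() ensures all threads finish
// the red phase before any thread starts the black phase.
int main() {
    const int num_threads = 4;
    std::barrier sync_point(num_threads);

    auto worker = [&](int /*tid*/) {
        for (int iter = 0; iter < 10; iter++) {
            // ... compute this thread's share of red cells ...
            sync_point.arrive_and_wait();
            // ... compute this thread's share of black cells ...
            sync_point.arrive_and_wait();
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; t++)
        threads.emplace_back(worker, t);
    for (auto& t : threads)
        t.join();
    return 0;
}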

Stanford CS149, Fall 2021


Shared address space solver
Why are there three barriers?

int     n;              // grid size
bool    done = false;
float   diff = 0.0;
LOCK    myLock;
BARRIER myBarrier;

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
    float myDiff;
    int threadId = getThreadId();
    int myMin = 1 + (threadId * n / NUM_PROCESSORS);
    int myMax = myMin + (n / NUM_PROCESSORS);

    while (!done) {
        myDiff = 0.f;
        diff = 0.f;
        barrier(myBarrier, NUM_PROCESSORS);
        for (j=myMin to myMax) {
            for (i = red cells in this row) {
                float prev = A[i,j];
                A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                                 A[i+1,j] + A[i,j+1]);
                myDiff += abs(A[i,j] - prev);
            }
        }
        lock(myLock);
        diff += myDiff;
        unlock(myLock);
        barrier(myBarrier, NUM_PROCESSORS);
        if (diff/(n*n) < TOLERANCE)   // check convergence, all threads get same answer
            done = true;
        barrier(myBarrier, NUM_PROCESSORS);
    }
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Shared address space solver: one barrier
Idea: remove dependencies by using a different diff variable in successive loop iterations.
Trade off footprint for removing dependencies! (a common parallel programming technique)

int     n;              // grid size
bool    done = false;
LOCK    myLock;
BARRIER myBarrier;
float   diff[3];        // global diff, but now 3 copies

float* A = allocate(n+2, n+2);

void solve(float* A) {
    float myDiff;     // thread local variable
    int index = 0;    // thread local variable

    diff[0] = 0.0f;
    barrier(myBarrier, NUM_PROCESSORS);   // one-time only: just for init

    while (!done) {
        myDiff = 0.0f;
        //
        // perform computation (accumulate locally into myDiff)
        //
        lock(myLock);
        diff[index] += myDiff;            // atomically update global diff
        unlock(myLock);
        diff[(index+1) % 3] = 0.0f;
        barrier(myBarrier, NUM_PROCESSORS);
        if (diff[index]/(n*n) < TOLERANCE)
            break;
        index = (index + 1) % 3;
    }
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Grid solver implementation in two programming models
▪ Data-parallel programming model
- Synchronization:
- Single logical thread of control, but iterations of forall loop may be parallelized by the system (implicit
barrier at end of forall loop body)
- Communication
- Implicit in loads and stores (like shared address space)
- Special built-in primitives for more complex communication patterns:
e.g., reduce

▪ Shared address space


- Synchronization:
- Mutual exclusion required for shared variables (e.g., via locks)
- Barriers used to express dependencies (between phases of computation)
- Communication
- Implicit in loads/stores to shared variables
Stanford CS149, Fall 2021
Summary
▪ Amdahl’s Law
- Overall maximum speedup from parallelism is limited by amount of serial execution in a program

▪ Aspects of creating a parallel program


- Decomposition to create independent work, assignment of work to workers, orchestration (to coordinate
processing of work by workers), mapping to hardware
- We’ll talk a lot about making good decisions in each of these phases in the coming lectures (in practice,
they are very inter-related)

▪ Focus today: identifying dependencies


▪ Focus soon: identifying locality, reducing synchronization

Stanford CS149, Fall 2021
