04 Progbasics

This lecture covers the basics of parallel programming, focusing on different programming models such as shared address space, message passing, and data parallelism. It discusses the implementation of these models, including how threads communicate and the importance of synchronization mechanisms. The lecture also highlights the complexities of hardware architecture that support these parallel programming abstractions.


Lecture 4:

Parallel Programming Basics

Parallel Computing
Stanford CS149, Fall 2021
REVIEW

Stanford CS149, Fall 2021


Quiz: reviewing ISPC abstractions

This is an ISPC function. It contains two nested for loops.

Consider one ISPC program instance. Which iterations of the two loops are executed in parallel by the ISPC program instance?

Hint: this is a trick question
Answer: none

export void ispc_sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    // assume N % programCount = 0
    for (uniform int i=0; i<N; i+=programCount)
    {
        int idx = i + programIndex;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6; // 3!
        uniform int sign = -1;

        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[idx] = value;
    }
}

Stanford CS149, Fall 2021


Program instances (that run in parallel) are created when the ispc_sinx() function is called

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
ispc_sinx(N, terms, x, result);

Execution timeline: main() begins with sequential execution (C code). The call to ispc_sinx() begins executing programCount instances (0 through 7 in the figure) of ispc_sinx() (ISPC code). When ispc_sinx() returns, all ISPC program instances have completed and sequential execution (C code) resumes.

Each *ISPC program instance* executes the code in the function ispc_sinx serially.
(Parallelism exists because there are multiple program instances, not because of parallelism in the code that defines an ispc function.)
Stanford CS149, Fall 2021


WHAT WE DIDN’T GET TO LAST TIME
Three ways of thinking about parallel computation

(Recall: abstraction vs. implementation)

Stanford CS149, Fall 2021


Three programming models (abstractions)
1. Shared address space
2. Message passing
3. Data parallel

Stanford CS149, Fall 2021


Shared address space model

Stanford CS149, Fall 2021


Review: a program's memory address space

▪ A computer's memory is organized as an array of bytes
▪ Each byte is identified by its "address" in memory (its position in this array)
  (in this class we assume memory is byte-addressable)

"The byte stored at address 0x8 has the value 32."
"The byte stored at address 0x10 (16) has the value 128."

In the illustration on the right, the program's memory address space is 32 bytes in size
(so valid addresses range from 0x0 to 0x1F)

[Figure: a 32-byte address space shown as a column of addresses 0x0-0x1F with example byte values; e.g., address 0x8 holds 32 and address 0x10 holds 128]

Stanford CS149, Fall 2021


The implementation of the linear memory address space abstraction
on a modern computer is complex
The instruction “load the value stored at address X into register R0” might involve a
complex sequence of operations by multiple data caches and access to DRAM

[Figure: eight cores; each core has its own L1 cache (32 KB) and L2 cache (256 KB), all sharing an L3 cache (20 MB) that connects to DRAM (32 GB)]

Stanford CS149, Fall 2021


Shared address space model (abstraction)
Threads communicate by reading/writing to locations in a shared address space (shared variables)

Thread 1:
    int x = 0;
    spawn_thread(foo, &x);

    // write to address holding
    // contents of variable x
    x = 1;

Thread 2:
    void foo(int* x) {
        // read from addr storing
        // contents of variable x
        while (*x == 0) {}
        print(*x);
    }

[Figure: Thread 1 performs a store to x; Thread 2 performs a load from x; both operate on the same location x in the shared address space]
(Communication operations shown in red)

(Pseudocode provided in a fake C-like language for brevity.) Stanford CS149, Fall 2021
A common metaphor:
A shared address space is
like a bulletin board

(Everyone can read/write)

Image credit:
https://thetab.com/us/stanford/2016/07/28/honest-packing-list-freshman-stanford-1278
Stanford CS149, Fall 2021
Coordinating access to shared variables with
synchronization
Thread 1:
    int x = 0;
    Lock my_lock;

    spawn_thread(foo, &x, &my_lock);

    my_lock.lock();
    x++;
    my_lock.unlock();

Thread 2:
    void foo(int* x, Lock* my_lock) {
        my_lock->lock();
        x++;
        my_lock->unlock();

        print(x);
    }
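
For concreteness, the same pattern written against real C++11 primitives might look like the sketch below (a minimal sketch assuming std::thread/std::mutex stand in for the pseudocode's spawn_thread/Lock; this is not the course's reference code).

#include <cstdio>
#include <mutex>
#include <thread>

// Hypothetical concrete version of the pseudocode above:
// both threads increment a shared counter under a mutex.
int x = 0;
std::mutex my_lock;

void foo() {
    std::lock_guard<std::mutex> guard(my_lock);  // lock() on construction, unlock() on destruction
    x++;
}

int main() {
    std::thread t1(foo);              // "thread 2" runs foo()

    {                                 // "thread 1" (main) also updates x under the lock
        std::lock_guard<std::mutex> guard(my_lock);
        x++;
    }

    t1.join();                        // wait for the spawned thread before reading x
    std::printf("x = %d\n", x);       // always prints 2
    return 0;
}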

Stanford CS149, Fall 2021


Review: why do we need mutual exclusion?
▪ Each thread executes:
-Load the value of variable x from a location in memory into register r1
(this stores a copy of the value in memory in the register)
- Add the contents of register r2 to register r1
- Store the value of register r1 into the address storing the program variable x
▪ One possible interleaving: (let starting value of x=0, r2=1)
T1 T2
r1 ← x T1 reads value 0
r1 ← x T2 reads value 0
r1 ← r1 + r2 T1 sets value of its r1 to 1
r1 ← r1 + r2 T2 sets value of its r1 to 1
X ← r1 T1 stores 1 to address of x
X ← r1 T2 stores 1 to address of x

▪ This set of three instructions needs to be "atomic"


Stanford CS149, Fall 2021
Examples of mechanisms for preserving atomicity
▪ Lock/unlock mutex around a critical section
mylock.lock();
// critical section
mylock.unlock();

▪ Some languages have first-class support for atomicity of code blocks


atomic {
// critical section
}

▪ Intrinsics for hardware-supported atomic read-modify-write operations


atomicAdd(x, 10);
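
As a hedged illustration of the last mechanism, the sketch below uses C++'s std::atomic, whose fetch_add provides the same kind of hardware-supported atomic read-modify-write as the atomicAdd intrinsic shown above.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Each thread performs hardware-supported atomic read-modify-write adds,
// so no lock is needed around the increment.
std::atomic<int> x{0};

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; t++)
        workers.emplace_back([] {
            for (int i = 0; i < 1000; i++)
                x.fetch_add(10);              // analogous to atomicAdd(x, 10)
        });

    for (auto& w : workers)
        w.join();

    std::printf("x = %d\n", x.load());        // always 4 * 1000 * 10 = 40000
    return 0;
}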

Stanford CS149, Fall 2021


Review: shared address space model
▪ Threads communicate by:
- Reading/writing to shared variables in a shared address space
- Inter-thread communication is implicit in memory loads/stores
- Manipulating synchronization primitives
- e.g., ensuring mutual exclusion via use of locks

▪ This is a natural extension of sequential programming


- In fact, all our discussions in class have assumed a shared address space so far!

Stanford CS149, Fall 2021


Hardware implementation of a shared address space
Key idea: any processor can directly reference contents of any memory location

Examples of interconnects:

[Figure: three interconnect topologies connecting cores (each with a local cache) to memory and I/O: a shared bus, a crossbar, and a multi-stage network]
* Caches (not shown) are another implementation of a shared address space (more on this in a later lecture)
Stanford CS149, Fall 2021
Shared address space hardware architecture
Any processor can directly reference any memory location

[Figure: Intel Core i7 (quad core): four CPU cores plus an integrated GPU share a memory controller attached to memory]

Example: Intel Core i7 processor (Kaby Lake) (interconnect is a ring)

Stanford CS149, Fall 2021


Intel’s ring interconnect
Introduced in Sandy Bridge microarchitecture
▪ Four rings, for different types of messages:
  - request
  - snoop
  - ack
  - data (32 bytes)

▪ Six interconnect nodes: four "slices" of L3 cache + system agent + graphics

▪ Each bank of L3 connected to ring bus twice

▪ Theoretical peak BW from cores to L3 at 3.4 GHz ~ 435 GB/sec
  - When each core is accessing its local slice

[Figure: ring interconnect joining the system agent, four cores each paired with a 2 MB L3 cache slice, and the graphics unit]
Stanford CS149, Fall 2021
SUN Niagara 2 (UltraSPARC T2): crossbar interconnect
Note area of crossbar (CCX):
about same area as one core on chip

[Figure: eight-core processor; eight cores connected through the crossbar switch to four L2 cache banks, each with its own memory interface]


Stanford CS149, Fall 2021
KNL Mesh Interconnect
Intel Xeon Phi (Knights Landing)
[Figure: 6x6 mesh of tiles surrounded by EDC/MCDRAM (OPIO), DDR memory controllers (iMC), and PCIe/misc I/O blocks]

▪ 72 cores, arranged as 6x6 mesh of tiles (2 cores/tile)

▪ YX routing of messages:
  - Message travels in Y direction
  - "Turn"
  - Message travels in X direction

Stanford CS149, Fall 2021


Non-uniform memory access (NUMA)
The latency of accessing a memory location may be different from different processing cores in the system
Bandwidth from any one location may also be different to different CPU cores *

Example: modern multi-socket configuration


[Figure: two sockets connected by an interconnect; each socket has four cores and a memory controller attached to its own local memory, so a core reaches remote memory through the other socket]

* In practice, you’ll find NUMA behavior on a single-socket system as well (recall: different cache slices are a different distance from each core)
Stanford CS149, Fall 2021
Summary: shared address space model
▪ Communication abstraction
- Threads read/write variables in shared address space
- Threads manipulate synchronization primitives: locks, atomic ops, etc.
- Logical extension of uniprocessor programming *

▪ Requires hardware support to implement efficiently


- Any processor can load and store from any address
- Can be costly to scale to large numbers of processors
(one of the reasons why high-core count processors are expensive)

* But NUMA implementations require reasoning about locality for performance optimization Stanford CS149, Fall 2021
Message passing model of
communication

Stanford CS149, Fall 2021


Message passing model (abstraction)
▪ Threads operate within their own private address spaces
▪ Threads communicate by sending/receiving messages
  - send: specifies recipient, buffer to be transmitted, and optional message identifier ("tag")
  - receive: specifies sender, buffer to store data, and optional message identifier
  - Sending messages is the only way to exchange data between threads 1 and 2
    - Why?

[Figure: Thread 1's address space holds variable X; Thread 2's address space holds variable Y]

send(X, 2, my_msg_id)
  semantics: send contents of local variable X as a message to thread 2 and tag the message with the id "my_msg_id"

recv(Y, 1, my_msg_id)
  semantics: receive message with id "my_msg_id" from thread 1 and store contents in local variable Y

(Communication operations shown in red)

Illustration adopted from Culler, Singh, Gupta Stanford CS149, Fall 2021
A common metaphor: snail mail

Stanford CS149, Fall 2021


Message passing (implementation)
▪ Hardware need not implement system-wide loads and stores to execute message passing programs (it need
only communicate messages between nodes)
- Can connect commodity systems together to form a large parallel machine
(message passing is a programming model for clusters and supercomputers)

Cluster of workstations
(Infiniband network)
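
As a hedged example of what message passing looks like on such a cluster, here is a minimal MPI sketch of the send/recv semantics from the abstraction slide (MPI is assumed here as the messaging library; compile with mpicxx and run with mpirun -np 2).

// Minimal MPI sketch of the send/recv semantics shown earlier.
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int MSG_TAG = 17;   // plays the role of "my_msg_id"

    if (rank == 0) {
        float X = 42.0f;
        // send contents of local variable X to rank 1, tagged MSG_TAG
        MPI_Send(&X, 1, MPI_FLOAT, 1, MSG_TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        float Y;
        // receive the message tagged MSG_TAG from rank 0 into local variable Y
        MPI_Recv(&Y, 1, MPI_FLOAT, 0, MSG_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received Y = %f\n", Y);
    }

    MPI_Finalize();
    return 0;
}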

Stanford CS149, Fall 2021


The data-parallel model

Stanford CS149, Fall 2021


Programming models provide a way to think about the organization
of parallel programs (by imposing structure)

▪ Shared address space: very little structure to communication


- All threads can read and write to all shared variables

▪ Message passing: communication is structured in the form of messages


- All communication occurs in the form of messages
- Communication is explicit in source code (the sends and receives)

▪ Data parallel structure: more rigid structure to computation


- Perform same function on elements of large collections

Stanford CS149, Fall 2021


Data-parallel model *
▪ Organize computation as operations on sequences of elements
- e.g., perform same function on all elements of a sequence

▪ A well-known modern example: NumPy: C = A + B


(A, B, and C are vectors of same length)

Something you've seen earlier in the lecture…

* We’ll have multiple lectures in the course about data-parallel programming and data-parallel thinking: this is just a taste
Stanford CS149, Fall 2021
Key data type of data-parallel code: sequences
▪ A sequence is an ordered collection of elements
▪ For example, in a C++-like language: Sequence<T>
▪ Scala lists: List[T]
▪ In a functional language (like Haskell): seq T

▪ Program can only access elements of sequence through sequence operators:


- map, reduce, scan, shift, etc.

Stanford CS149, Fall 2021


Map
▪ Higher order function (function that takes a function as an argument) that operates on sequences
▪ Applies side-effect-free unary function f :: a -> b to all elements of input sequence, to produce
output sequence of the same length
▪ In a functional language (e.g., Haskell)
- map :: (a -> b) -> seq a -> seq b

▪ In C++:
template<class InputIt, class OutputIt, class UnaryOperation>
OutputIt transform(InputIt first1, InputIt last1,
OutputIt d_first,
UnaryOperation unary_op);
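
A short usage sketch of the std::transform signature above (the vectors and lambda are illustrative):

#include <algorithm>
#include <cmath>
#include <vector>

int main() {
    std::vector<float> x = {1.0f, -2.0f, 3.0f};
    std::vector<float> y(x.size());

    // "map" the side-effect-free unary function f(v) = |v| over the input sequence
    std::transform(x.begin(), x.end(), y.begin(),
                   [](float v) { return std::fabs(v); });
    // y now holds {1, 2, 3}
    return 0;
}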
Stanford CS149, Fall 2021
Parallelizing map
▪ Since f :: a -> b is a function (side-effect free), then applying f to all elements of
the sequence can be done in any order without changing the output of the program

▪ The implementation of map has flexibility to reorder/parallelize processing of elements of the sequence however it sees fit
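
One concrete (and hedged) way a C++ implementation exposes this freedom is the C++17 parallel execution policy; whether the call actually runs in parallel or vectorized depends on the standard library and hardware (e.g., GCC requires linking against TBB):

#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

int main() {
    std::vector<float> x(1 << 20, -1.0f);
    std::vector<float> y(x.size());

    // Because the lambda is side-effect free, the library is free to process
    // elements in any order, including in parallel and vectorized.
    std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                   [](float v) { return std::fabs(v); });
    return 0;
}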

Stanford CS149, Fall 2021


Data parallelism in ISPC
// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x here

absolute_value(N, x, y);

// ISPC code:
export void absolute_value(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[i] = -x[i];
        else
            y[i] = x[i];
    }
}

foreach construct: think of the loop body as a function.

Given this program, it is reasonable to think of the program as using foreach to "map the loop body onto each element" of the arrays x and y.

But if we want to be more precise: a sequence is not a first-class ISPC concept. It is implicitly defined by how the program has implemented array indexing logic in the foreach loop.
(There is no operation in ISPC with the semantic: "map this code over all elements of this sequence.")

Stanford CS149, Fall 2021


Data parallelism in ISPC
// main C++ code:
const int N = 1024;
float* x = new float[N/2];
float* y = new float[N];

// initialize N/2 elements of x here

absolute_repeat(N/2, x, y);

// ISPC code:
export void absolute_repeat(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[2*i] = -x[i];
        else
            y[2*i] = x[i];
        y[2*i+1] = y[2*i];
    }
}

Think of the loop body as a function. The input/output sequences being mapped over are implicitly defined by array indexing logic.

This is also a valid ISPC program! It takes the absolute value of elements of x, then repeats each result twice in the output array y.
(Less obvious how to think of this code as mapping the loop body onto existing sequences.)
Stanford CS149, Fall 2021
Data parallelism in ISPC
// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x

shift_negative(N, x, y);

// ISPC code:
export void shift_negative(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (i >= 1 && x[i] < 0)
            y[i-1] = x[i];
        else
            y[i] = x[i];
    }
}

Think of the loop body as a function. The input/output sequences being mapped over are implicitly defined by array indexing logic.

The output of this program is undefined! It is possible for multiple iterations of the loop body to write to the same memory location.

The data-parallel model (foreach) provides no specification of the order in which iterations occur.

Stanford CS149, Fall 2021


ISPC discussion: sum “reduction”
Compute the sum of all array elements in parallel
export uniform float sumall1(uniform int N, uniform float* x)
{
    uniform float sum = 0.0f;
    foreach (i = 0 ... N)
    {
        sum += x[i];
    }
    return sum;
}

export uniform float sumall2(uniform int N, uniform float* x)
{
    uniform float sum;
    float partial = 0.0f;
    foreach (i = 0 ... N)
    {
        partial += x[i];
    }

    // from ISPC math library
    sum = reduce_add(partial);

    return sum;
}

sumall2 is the correct ISPC solution.

In sumall1: sum is of type uniform float (one copy of the variable for all program instances), but x[i] is not a uniform expression (different value for each program instance).
Result: compile-time type error.
Stanford CS149, Fall 2021
ISPC discussion: sum “reduction”
Each instance accumulates a private partial sum (no communication).

Partial sums are added together using the reduce_add() cross-instance communication primitive. The result is the same total sum for all program instances (reduce_add() returns a uniform float).

export uniform float sumall2(
    uniform int N,
    uniform float* x)
{
    uniform float sum;
    float partial = 0.0f;
    foreach (i = 0 ... N)
    {
        partial += x[i];
    }

    // from ISPC math library
    sum = reduce_add(partial);

    return sum;
}

The ISPC code above will execute in a manner similar to the handwritten C + AVX intrinsics implementation below. *

float sumall2(int N, float* x) {
    float tmp[8];   // assume 32-byte alignment
    __m256 partial = _mm256_setzero_ps();

    for (int i=0; i<N; i+=8)
        partial = _mm256_add_ps(partial, _mm256_load_ps(&x[i]));

    _mm256_store_ps(tmp, partial);

    float sum = 0.f;
    for (int i=0; i<8; i++)
        sum += tmp[i];

    return sum;
}

* Self-test: If you understand why this implementation complies with the semantics of the ISPC gang abstraction, then you've got a good command of ISPC.

Stanford CS149, Fall 2021


Summary: data-parallel model
▪ Data-parallelism is about imposing rigid program structure to facilitate simple programming
and advanced optimizations

▪ Basic structure: map a function onto a large collection of data


- Functional: side-effect free execution
- No communication among distinct function invocations
(allow invocations to be scheduled in any order, including in parallel)

▪ Other data parallel operators express more complex patterns on sequences: gather, scatter,
reduce, scan, shift, etc.
- This will be a topic of a later lecture

▪ You will think in terms of data-parallel primitives often in this class, but many modern
performance-oriented data-parallel languages do not enforce this structure in the language
- Many languages (like ISPC, CUDA, etc.) choose flexibility/familiarity of imperative C-style syntax over the safety of a more
functional form
Stanford CS149, Fall 2021
Summary
▪ Programming models provide a way to think about the organization of parallel programs.

▪ They provide abstractions that permit multiple valid implementations.

▪ I want you to always be thinking about abstraction vs. implementation for the remainder of
this course.

Stanford CS149, Fall 2021


Parallel Programming Basics

Stanford CS149, Fall 2021


Creating a parallel program
▪ Thought process:
1. Identify work that can be performed in parallel
2. Partition work (and also data associated with the work)
3. Manage data access, communication, and synchronization

▪ A common goal is maximizing speedup *


For a fixed computation:

Speedup(P processors) = Time(1 processor) / Time(P processors)

* Other goals include high efficiency (cost, area, power, etc.)


or working on bigger problems than can fit on one machine Stanford CS149, Fall 2021
Creating a parallel program
Problem to solve
    ↓ Decomposition
Subproblems (a.k.a. "tasks", "work to do")
    ↓ Assignment
Parallel threads ** ("workers")
    ↓ Orchestration
Parallel program (communicating threads)
    ↓ Mapping
Execution on parallel machine

** I had to pick a term

These responsibilities may be assumed by the programmer, by the system (compiler, runtime, hardware), or by both!
Adopted from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Problem decomposition
▪ Break up problem into tasks that can be carried out in parallel
▪ In general: create at least enough tasks to keep all execution units on a machine busy

Key challenge of decomposition:


identifying dependencies
(or... a lack of dependencies)

Stanford CS149, Fall 2021


Amdahl’s Law: dependencies limit maximum speedup
due to parallelism

▪ You run your favorite sequential program...

▪ Let S = the fraction of sequential execution that is inherently sequential (dependencies prevent parallel execution)

▪ Then maximum speedup due to parallel execution ≤ 1/S

Stanford CS149, Fall 2021


A simple example
▪ Consider a two-step computation on a N x N image
- Step 1: multiply brightness of all pixels by two
(independent computation on each pixel)
- Step 2: compute average of all pixel values

▪ Sequential implementation of program


- Both steps take ~N^2 time, so total time is ~2N^2

[Figure: parallelism vs. execution time for the sequential program: two phases, each taking N^2 time, both at parallelism 1]
Stanford CS149, Fall 2021
First attempt at parallelism (P processors)
▪ Strategy:
  - Step 1: execute in parallel
    - time for phase 1: N^2/P
  - Step 2: execute serially
    - time for phase 2: N^2

▪ Overall performance:
  - Speedup = 2N^2 / (N^2/P + N^2) = 2P / (P + 1), so Speedup ≤ 2

[Figure: sequential program (two N^2 phases at parallelism 1) vs. parallel program (phase 1 at parallelism P taking N^2/P, followed by a serial phase taking N^2)]
Stanford CS149, Fall 2021
Parallelizing step 2
▪ Strategy:
  - Step 1: execute in parallel
    - time for phase 1: N^2/P
  - Step 2: compute partial sums in parallel, combine results serially
    - time for phase 2: N^2/P + P

▪ Overall performance:
  - Speedup = 2N^2 / (2N^2/P + P)
  - Overhead of parallel algorithm: combining the partial sums
  - Note: speedup → P when N >> P

[Figure: parallel program timeline: both phases run at parallelism P, plus a small serial portion (of length P) for combining the partial sums]
Stanford CS149, Fall 2021
Amdahl’s law
▪ Let S = the fraction of total work that is inherently sequential
▪ Max speedup on P processors given by:
  Speedup(P) ≤ 1 / (S + (1 - S)/P)

[Figure: max speedup vs. number of processors for S = 0.01, 0.05, and 0.1; each curve flattens out at 1/S]
Stanford CS149, Fall 2021
A small serial region can limit speedup on a large parallel machine
Summit supercomputer: 27,648 GPUs x (5,376 ALUs/GPU) = 148,635,648 ALUs
Machine can perform 148 million single precision operations in parallel
What is max speedup if 0.1% of application is serial? (At most 1/0.001 = 1000x, no matter how many of the 148 million ALUs are used.)

Stanford CS149, Fall 2021


Decomposition
▪ Who is responsible for decomposing a program into independent tasks?
- In most cases: the programmer

▪ Automatic decomposition of sequential programs continues to be a challenging


research problem
(very difficult in general case)
- Compiler must analyze program, identify dependencies
- What if dependencies are data dependent (not known at compile time)?
- Researchers have had modest success with simple loop nests
- The “magic parallelizing compiler” for complex, general-purpose code has not yet been achieved

Stanford CS149, Fall 2021


Assignment
[Same pipeline as before: Problem to solve → (Decomposition) → Subproblems ("tasks") → (Assignment) → Parallel threads ("workers") → (Orchestration) → Parallel program → (Mapping) → Execution on parallel machine. This slide highlights the Assignment step. ** I had to pick a term]

Stanford CS149, Fall 2021


Assignment
▪ Assigning tasks to threads ** ** I had to pick a term
(will explain in a second)
- Think of “tasks” as things to do
- Think of threads as “workers”
▪ Goals: achieve good workload balance, reduce communication costs

▪ Can be performed statically (before application is run), or dynamically as program executes

▪ Although programmer is often responsible for decomposition, many languages/runtimes take


responsibility for assignment.

Stanford CS149, Fall 2021


Assignment examples in ISPC
export void ispc_sinx_interleaved(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    // assumes N % programCount = 0
    for (uniform int i=0; i<N; i+=programCount)
    {
        int idx = i + programIndex;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6; // 3!
        uniform int sign = -1;

        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[idx] = value;
    }
}

Decomposition of work by loop iteration.
Programmer-managed assignment: static assignment; iterations are assigned to ISPC program instances in an interleaved fashion.

export void ispc_sinx_foreach(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    foreach (i = 0 ... N)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        uniform int denom = 6; // 3!
        uniform int sign = -1;

        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[i] = value;
    }
}

Decomposition of work by loop iteration.
The foreach construct exposes independent work to the system, which manages assignment of iterations (work) to ISPC program instances (the abstraction leaves room for dynamic assignment, but the current ISPC implementation is static).
Stanford CS149, Fall 2021
Example 2: static assignment using C++11 threads
void my_thread_start(int N, int terms, float* x, float* result) {
    sinx(N, terms, x, result);   // do work
}

void parallel_sinx(int N, int terms, float* x, float* result) {
    int half = N/2;

    // launch thread to do work on first half of array
    std::thread t1(my_thread_start, half, terms, x, result);

    // do work on second half of array in main thread
    sinx(N - half, terms, x + half, result + half);

    t1.join();
}

Decomposition of work by loop iteration.
Programmer-managed static assignment: this program assigns loop iterations to threads in a blocked fashion (first half of the array is assigned to the spawned thread, second half to the main thread).

Stanford CS149, Fall 2021


Dynamic assignment using ISPC tasks
void foo(uniform float* input,
         uniform float* output,
         uniform int N)
{
    // create a bunch of tasks
    launch[100] my_ispc_task(input, output, N);
}

The ISPC runtime assigns tasks to worker threads.

[Figure: a list of tasks (task 0, task 1, task 2, ..., task 99) with a "next task" pointer, consumed by worker threads 0-3]

Implementation of task assignment to threads: after completing its current task, a worker thread inspects the list and assigns itself the next uncompleted task.
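
A hedged C++ sketch of the "grab the next uncompleted task" idea (this is not ISPC's actual runtime; it just illustrates dynamic assignment with an atomic next-task counter, and my_task is a hypothetical task body):

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical task body: in ISPC this would be my_ispc_task(input, output, N).
void my_task(int task_id) {
    std::printf("completed task %d\n", task_id);
}

// Worker threads repeatedly claim the next task index until all tasks are done.
void run_tasks(int num_tasks, int num_workers) {
    std::atomic<int> next_task{0};

    auto worker = [&] {
        while (true) {
            int id = next_task.fetch_add(1);   // atomically claim the next uncompleted task
            if (id >= num_tasks) break;        // task list exhausted
            my_task(id);
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < num_workers; i++)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();
}

int main() {
    run_tasks(100, 4);   // mirrors launch[100] consumed by four worker threads
    return 0;
}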

Stanford CS149, Fall 2021


Orchestration
[Same pipeline as before: Problem to solve → (Decomposition) → Subproblems ("tasks") → (Assignment) → Parallel threads ("workers") → (Orchestration) → Parallel program → (Mapping) → Execution on parallel machine. This slide highlights the Orchestration step. ** I had to pick a term]

Stanford CS149, Fall 2021


Orchestration
▪ Involves:
- Structuring communication
- Adding synchronization to preserve dependencies if necessary
- Organizing data structures in memory
- Scheduling tasks

▪ Goals: reduce costs of communication/sync, preserve locality of data reference, reduce


overhead, etc.

▪ Machine details impact many of these decisions


- If synchronization is expensive, the programmer might use it more sparingly

Stanford CS149, Fall 2021


Mapping to hardware
[Same pipeline as before: Problem to solve → (Decomposition) → Subproblems ("tasks") → (Assignment) → Parallel threads ("workers") → (Orchestration) → Parallel program → (Mapping) → Execution on parallel machine. This slide highlights the Mapping step. ** I had to pick a term]

Stanford CS149, Fall 2021


Mapping to hardware
▪ Mapping “threads” (“workers”) to hardware execution units
▪ Example 1: mapping by the operating system
- e.g., map a thread to HW execution context on a CPU core

▪ Example 2: mapping by the compiler


- Map ISPC program instances to vector instruction lanes

▪ Example 3: mapping by the hardware


- Map CUDA thread blocks to GPU cores (discussed in future lecture)

▪ Some interesting mapping decisions:


- Place related threads (cooperating threads) on the same processor
(maximize locality, data sharing, minimize costs of comm/sync)
- Place unrelated threads on the same processor (one might be bandwidth limited and another might be compute limited) to
use machine more efficiently
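
As a hedged, Linux-specific sketch of what explicit thread-to-core placement can look like (pthread_setaffinity_np is a glibc extension; the helper name pin_to_core is illustrative, and most programs simply let the OS choose):

// Linux-specific sketch: pin a std::thread to a specific core (glibc extension).
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>

void pin_to_core(std::thread& t, int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    // Ask the OS to schedule this thread only on the given execution context.
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
}

int main() {
    std::thread t([] { /* ... thread's work ... */ });
    pin_to_core(t, 0);   // request placement on core 0
    t.join();
    return 0;
}

Pinning like this is one way a program can realize the "place related threads on the same processor" decision described above, rather than leaving the choice entirely to the OS.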
Stanford CS149, Fall 2021
Example: mapping to hardware
▪ Consider an application that creates two threads
▪ The application runs on the processor shown below
  - Two cores, two execution contexts per core, up to two instructions per clock, one of which can be an 8-wide SIMD instruction.

▪ Question: "who" is responsible for mapping the application's threads to the processor's thread execution contexts?
  Answer: the operating system

▪ Question: If you were implementing the OS, how would you map the two threads to the four execution contexts?

▪ Another question: How would you map threads to execution contexts if your C program spawned five threads?

[Figure: two cores; each core has two fetch/decode units, a scalar execution unit, an 8-wide SIMD execution unit, and two execution contexts]

Stanford CS149, Fall 2021


A parallel programming example

Stanford CS149, Fall 2021


A 2D-grid based solver
▪ Problem: solve partial differential equation (PDE) on (N+2) x (N+2) grid
▪ Solution uses iterative algorithm:
- Perform Gauss-Seidel sweeps over grid until convergence
N
A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j]
+ A[i,j+1] + A[i+1,j]);

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Grid solver algorithm: find the dependencies
C-like pseudocode for sequential algorithm is provided below
const int n;
float* A;   // assume allocated for grid of N+2 x N+2 elements

void solve(float* A) {

    float diff, prev;
    bool done = false;

    while (!done) {                       // outermost loop: iterations
        diff = 0.f;
        for (int i=1; i<n; i++) {         // iterate over non-border points of grid
            for (int j=1; j<n; j++) {
                prev = A[i,j];
                A[i,j] = 0.2f * (A[i,j] + A[i,j-1] + A[i-1,j] +
                                 A[i,j+1] + A[i+1,j]);
                diff += fabs(A[i,j] - prev);   // compute amount of change
            }
        }

        if (diff/(n*n) < TOLERANCE)       // quit if converged
            done = true;
    }
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Step 1: identify dependencies
(problem decomposition phase)
Each row element depends on the element to its left.
Each row depends on the previous row.

[Figure: N x N grid of cells with arrows showing left-to-right dependencies within a row and top-to-bottom dependencies between rows]

Note: the dependencies illustrated on this slide are grid element data dependencies in one iteration of the solver (in one iteration of the "while not done" loop).

Stanford CS149, Fall 2021


Step 1: identify dependencies
(problem decomposition phase)
There is independent work along the diagonals!

Good: parallelism exists!

Possible implementation strategy:
1. Partition grid cells on a diagonal into tasks
2. Update values in parallel
3. When complete, move to next diagonal

Bad: independent work is hard to exploit.
Not much parallelism at beginning and end of computation.
Frequent synchronization (after completing each diagonal).

[Figure: N x N grid with anti-diagonals highlighted; cells along one diagonal can be updated independently]

Stanford CS149, Fall 2021


Let’s make life easier on ourselves
▪ Idea: improve performance by changing the algorithm to one that is more amenable
to parallelism
- Change the order in which grid cells are updated
- New algorithm iterates to same solution (approximately), but converges to solution
differently
- Note: floating-point values computed are different, but solution still converges to within error threshold
- Yes, we needed domain knowledge of the Gauss-Seidel method to realize this
change is permissible
- But this is a common technique in parallel programming

Stanford CS149, Fall 2021


New approach: reorder grid cell update via red-black coloring
Reorder grid traversal: red-black coloring

[Figure: N x N grid colored in a red/black checkerboard pattern]

Update all red cells in parallel.

When done updating red cells, update all black cells in parallel (respect dependency on red cells).

Repeat until convergence.

Stanford CS149, Fall 2021


Possible assignments of work to processors
Reorder grid traversal: red-black coloring

Question: Which is better? Does it matter?


Answer: it depends on the system this program is running on
Stanford CS149, Fall 2021
Consider dependencies in the program
1. Perform red cell update in parallel
2. Wait until all processors done with update
3. Communicate updated red cells to other processors
4. Perform black cell update in parallel
5. Wait until all processors done with update
6. Communicate updated black cells to other processors
7. Repeat

[Figure: timeline for processors P1-P4: compute red cells, wait, compute black cells, wait]

Stanford CS149, Fall 2021


Communication resulting from assignment
Reorder grid traversal: red-black coloring

= data that must be sent to P2 each iteration


Blocked assignment requires less data to be communicated between processors
Stanford CS149, Fall 2021
Two ways to think about writing this program
▪ Data parallel thinking

▪ SPMD / shared address space

Stanford CS149, Fall 2021


Data-parallel expression of solver

Stanford CS149, Fall 2021


Data-parallel expression of grid solver
Note: to simplify the pseudocode, only the red-cell update is shown.

const int n;
float* A = allocate(n+2, n+2);   // allocate grid

void solve(float* A) {

    bool done = false;
    float diff = 0.f;

    while (!done) {
        for_all (red cells (i,j)) {
            float prev = A[i,j];
            A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                             A[i+1,j] + A[i,j+1]);
            reduceAdd(diff, abs(A[i,j] - prev));
        }

        if (diff/(n*n) < TOLERANCE)
            done = true;
    }
}

Decomposition: processing individual grid elements constitutes independent work.

Assignment: ??? (left to the system)

Orchestration: handled by the system (built-in communication primitive: reduceAdd; the end of the for_all block is an implicit wait for all workers before returning to sequential control).
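
A hedged sketch of how the same red-cell sweep might be written with OpenMP, whose parallel for and reduction(+:diff) clause play roughly the roles of for_all and reduceAdd in the pseudocode (OpenMP is an assumption here, not the notation used by the lecture; compile with -fopenmp):

#include <cmath>

// One red-cell sweep over an (n+2) x (n+2) grid stored row-major in A.
// The reduction(+:diff) clause combines per-thread partial sums, like reduceAdd.
float red_sweep(int n, float* A) {
    float diff = 0.f;
    #pragma omp parallel for reduction(+:diff)
    for (int i = 1; i <= n; i++) {
        // red cells: one color of the checkerboard ((i + j) odd here)
        for (int j = 1 + (i % 2); j <= n; j += 2) {
            float prev = A[i*(n+2) + j];
            A[i*(n+2) + j] = 0.2f * (A[(i-1)*(n+2) + j] + A[i*(n+2) + (j-1)] +
                                     A[i*(n+2) + j]     + A[(i+1)*(n+2) + j] +
                                     A[i*(n+2) + (j+1)]);
            diff += std::fabs(A[i*(n+2) + j] - prev);
        }
    }
    return diff;
}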

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Shared address space (with SPMD
threads) expression of solver

Stanford CS149, Fall 2021


Shared address space expression of solver
SPMD execution model
▪ Programmer is responsible for synchronization

▪ Common synchronization primitives:
  - Locks (provide mutual exclusion): only one thread in the critical region at a time
  - Barriers: wait for all threads to reach this point

[Figure: timeline for processors P1-P4: compute red cells, wait, compute black cells, wait]

Stanford CS149, Fall 2021


Shared address space solver (pseudocode in SPMD execution model)
Assume these are global variables (accessible to all threads).
Assume the solve function is executed by all threads (SPMD-style).
The value of threadId is different for each SPMD instance: use it to compute the region of the grid to work on.
Each thread computes the rows it is responsible for updating.

int     n;              // grid size
bool    done = false;
float   diff = 0.0;
LOCK    myLock;
BARRIER myBarrier;

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
    int threadId = getThreadId();
    int myMin = 1 + (threadId * n / NUM_PROCESSORS);
    int myMax = myMin + (n / NUM_PROCESSORS);

    while (!done) {
        diff = 0.f;
        barrier(myBarrier, NUM_PROCESSORS);
        for (j=myMin to myMax) {
            for (i = red cells in this row) {
                float prev = A[i,j];
                A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                                 A[i+1,j] + A[i,j+1]);
                lock(myLock);
                diff += abs(A[i,j] - prev);
                unlock(myLock);
            }
        }
        barrier(myBarrier, NUM_PROCESSORS);
        if (diff/(n*n) < TOLERANCE)   // check convergence, all threads get same answer
            done = true;
        barrier(myBarrier, NUM_PROCESSORS);
    }
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Shared address space solver (pseudocode in SPMD execution model)
Do you see a potential performance problem with this implementation?

int     n;              // grid size
bool    done = false;
float   diff = 0.0;
LOCK    myLock;
BARRIER myBarrier;

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
    int threadId = getThreadId();
    int myMin = 1 + (threadId * n / NUM_PROCESSORS);
    int myMax = myMin + (n / NUM_PROCESSORS);

    while (!done) {
        diff = 0.f;
        barrier(myBarrier, NUM_PROCESSORS);
        for (j=myMin to myMax) {
            for (i = red cells in this row) {
                float prev = A[i,j];
                A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                                 A[i+1,j] + A[i,j+1]);
                lock(myLock);
                diff += abs(A[i,j] - prev);
                unlock(myLock);
            }
        }
        barrier(myBarrier, NUM_PROCESSORS);
        if (diff/(n*n) < TOLERANCE)   // check convergence, all threads get same answer
            done = true;
        barrier(myBarrier, NUM_PROCESSORS);
    }
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Shared address space solver (pseudocode in SPMD execution model)
Improve performance by accumulating into a partial sum locally, then complete the global reduction at the end of the iteration.

int     n;              // grid size
bool    done = false;
float   diff = 0.0;
LOCK    myLock;
BARRIER myBarrier;

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
    float myDiff;
    int threadId = getThreadId();
    int myMin = 1 + (threadId * n / NUM_PROCESSORS);
    int myMax = myMin + (n / NUM_PROCESSORS);

    while (!done) {
        myDiff = 0.f;
        diff = 0.f;
        barrier(myBarrier, NUM_PROCESSORS);
        for (j=myMin to myMax) {
            for (i = red cells in this row) {
                float prev = A[i,j];
                A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                                 A[i+1,j] + A[i,j+1]);
                myDiff += abs(A[i,j] - prev);   // compute partial sum per worker
            }
        }
        lock(myLock);
        diff += myDiff;     // now only lock once per thread, not once per (i,j) loop iteration!
        unlock(myLock);
        barrier(myBarrier, NUM_PROCESSORS);
        if (diff/(n*n) < TOLERANCE)   // check convergence, all threads get same answer
            done = true;
        barrier(myBarrier, NUM_PROCESSORS);
    }
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Barrier synchronization primitive
▪ barrier(num_threads)
▪ Barriers are a conservative way to express dependencies
▪ Barriers divide computation into phases
▪ All computations by all threads before the barrier complete before any computation in any thread after the barrier begins
  - In other words, all computations after the barrier are assumed to depend on all computations before the barrier

[Figure: timeline for processors P1-P4: compute red cells, barrier, compute black cells, barrier]
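
For reference, C++20 provides this primitive directly as std::barrier; below is a minimal sketch of the red/black phase structure under that assumption (the pseudocode's barrier(myBarrier, NUM_PROCESSORS) plays the same role):

#include <barrier>
#include <thread>
#include <vector>

// Each thread alternates phases; arrive_and_wait() ensures all threads finish
// the red phase before any thread starts the black phase.
int main() {
    const int num_threads = 4;
    std::barrier sync_point(num_threads);

    auto worker = [&](int /*tid*/) {
        for (int iter = 0; iter < 10; iter++) {
            // ... compute this thread's share of red cells ...
            sync_point.arrive_and_wait();
            // ... compute this thread's share of black cells ...
            sync_point.arrive_and_wait();
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; t++)
        threads.emplace_back(worker, t);
    for (auto& t : threads)
        t.join();
    return 0;
}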

Stanford CS149, Fall 2021


Shared address space solver
Why are there three barriers?

int     n;              // grid size
bool    done = false;
float   diff = 0.0;
LOCK    myLock;
BARRIER myBarrier;

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
    float myDiff;
    int threadId = getThreadId();
    int myMin = 1 + (threadId * n / NUM_PROCESSORS);
    int myMax = myMin + (n / NUM_PROCESSORS);

    while (!done) {
        myDiff = 0.f;
        diff = 0.f;
        barrier(myBarrier, NUM_PROCESSORS);
        for (j=myMin to myMax) {
            for (i = red cells in this row) {
                float prev = A[i,j];
                A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                                 A[i+1,j] + A[i,j+1]);
                myDiff += abs(A[i,j] - prev);
            }
        }
        lock(myLock);
        diff += myDiff;
        unlock(myLock);
        barrier(myBarrier, NUM_PROCESSORS);
        if (diff/(n*n) < TOLERANCE)   // check convergence, all threads get same answer
            done = true;
        barrier(myBarrier, NUM_PROCESSORS);
    }
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Shared address space solver: one barrier
Idea: remove dependencies by using a different diff variable in successive loop iterations.
Trade off footprint for removing dependencies! (a common parallel programming technique)

int     n;              // grid size
bool    done = false;
LOCK    myLock;
BARRIER myBarrier;
float   diff[3];        // global diff, but now 3 copies

float* A = allocate(n+2, n+2);

void solve(float* A) {
    float myDiff;     // thread local variable
    int index = 0;    // thread local variable

    diff[0] = 0.0f;
    barrier(myBarrier, NUM_PROCESSORS);   // one-time only: just for init

    while (!done) {
        myDiff = 0.0f;
        //
        // perform computation (accumulate locally into myDiff)
        //
        lock(myLock);
        diff[index] += myDiff;            // atomically update global diff
        unlock(myLock);
        diff[(index+1) % 3] = 0.0f;
        barrier(myBarrier, NUM_PROCESSORS);
        if (diff[index]/(n*n) < TOLERANCE)
            break;
        index = (index + 1) % 3;
    }
}

Grid solver example from: Culler, Singh, and Gupta Stanford CS149, Fall 2021
Grid solver implementation in two programming models
▪ Data-parallel programming model
- Synchronization:
- Single logical thread of control, but iterations of forall loop may be parallelized by the system (implicit
barrier at end of forall loop body)
- Communication
- Implicit in loads and stores (like shared address space)
- Special built-in primitives for more complex communication patterns:
e.g., reduce

▪ Shared address space


- Synchronization:
- Mutual exclusion required for shared variables (e.g., via locks)
- Barriers used to express dependencies (between phases of computation)
- Communication
- Implicit in loads/stores to shared variables
Stanford CS149, Fall 2021
Summary
▪ Amdahl’s Law
- Overall maximum speedup from parallelism is limited by amount of serial execution in a program

▪ Aspects of creating a parallel program


- Decomposition to create independent work, assignment of work to workers, orchestration (to coordinate
processing of work by workers), mapping to hardware
- We’ll talk a lot about making good decisions in each of these phases in the coming lectures (in practice,
they are very inter-related)

▪ Focus today: identifying dependencies


▪ Focus soon: identifying locality, reducing synchronization

Stanford CS149, Fall 2021
