Parallel Computing
Parallel Computers
Parallel processing/computing:
– at least two processors have to cooperate
– by means of exchanging data
– while working on different parts of one and the same problem
Parallel computers: make use of multiple processors for parallel processing
Speed-up
Basic idea of parallel processing: execution time can be reduced by employing more than one processor; the larger the number of processors, the smaller the execution time.
Speed-up
s(p) = T1/Tp
T1 -- execution time on one processor
Tp -- execution time on p processors
• Best case: s(p) = p (linear speed-up)
• Worst case: s(p) = 1
• Generally: 1 ≤ s(p) ≤ p
Execution on one processor
T1 = Tseq + Tpar
Tseq -- execution time of sequential part
Tpar -- execution time of parallelisable part
Execution on p processors (assuming the parallelisable part divides evenly over the processors)
Tp = Tseq + Tpar/p
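Combining the two execution-time formulas gives the classical Amdahl's-law form of the speed-up. A minimal Python sketch (function and variable names are illustrative):

```python
def speedup(t_seq, t_par, p):
    """Speed-up s(p) = T1 / Tp with T1 = Tseq + Tpar
    and Tp = Tseq + Tpar / p (Amdahl's law)."""
    t1 = t_seq + t_par
    tp = t_seq + t_par / p
    return t1 / tp

# Even a small sequential part limits the speed-up:
# with 10% sequential work, s(p) can never exceed 10.
for p in (1, 2, 8, 64, 1000):
    print(p, round(speedup(1.0, 9.0, p), 2))
```

Note how the speed-up saturates: no matter how many processors are added, s(p) is bounded by (Tseq + Tpar)/Tseq.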
Shared memory parallel computers
processors have equally fast access to any location in memory
Distributed memory parallel computers
access to own memory is faster than access to other memory
Non-Uniform Memory Access (NUMA) Architectures
memory is globally addressable, but access time depends on which node physically holds the data
Performance
• Theoretical peak performance
• Linpack benchmark performance
Theoretical peak performance
• Theoretical peak performance Rpeak – the maximal number of
arithmetical operations (additions and/or multiplications) a processor
can carry out per second
Rpeak,1 = f · µpr
where:
– f – clock frequency
– µpr – maximum number of operations per clock cycle
• The theoretical peak performance of a parallel computer is equal to
the product of the number of processors and the theoretical peak
performance of one processor.
Rpeak,p = p · Rpeak,1
Examples:
• Cray J32
– f = 100 MHz, µpr = 2, Rpeak,1 = 200 Mflops
– p = 32, Rpeak,32 = 6.4 Gflops
• NEC SX-5
– f = 250 MHz, µpr = 32, Rpeak,1 = 8 Gflops
– p = 16, Rpeak,16 = 128 Gflops
– n = 32 nodes (NUMA), Rpeak,32×16 = 4 Tflops
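The example figures follow directly from the peak-performance formula. A small Python check (function name is illustrative; note that the 4 Tflops figure for the full 512-processor NEC SX-5 is a rounded value of 4.096 Tflops):

```python
def rpeak_flops(f_hz, mu_pr, p=1):
    """Theoretical peak performance: p * f * mu_pr operations per second."""
    return p * f_hz * mu_pr

# Cray J32: 100 MHz, 2 ops/cycle, 32 processors
assert rpeak_flops(100e6, 2) == 200e6            # 200 Mflops per processor
assert rpeak_flops(100e6, 2, p=32) == 6.4e9      # 6.4 Gflops

# NEC SX-5: 250 MHz, 32 ops/cycle, 16 processors/node, 32 nodes
assert rpeak_flops(250e6, 32) == 8e9             # 8 Gflops per processor
assert rpeak_flops(250e6, 32, p=16) == 128e9     # 128 Gflops per node
assert rpeak_flops(250e6, 32, p=16 * 32) == 4.096e12  # ~4 Tflops
```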
Benchmark performance
• Benchmark
– a program for a specific problem
– the number of operations which are executed is known
– used to measure the run time in single-user mode
– the measured run time determines the benchmark performance (operations per second)
Linpack benchmark
• Linpack – a popular library of Fortran subroutines for the
numerical solution of linear algebra problems
• Linpack benchmark – based on one particular subroutine which
is used for the solution of a dense system of linear equations
– algorithm: LU factorization by Gaussian elimination with partial
pivoting
– number of operations: 2n³/3 + O(n²) (n – number of unknowns)
• Top-500 list of most powerful computer installations
http://www.top500.org/
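A benchmark performance figure is simply the known operation count divided by the measured run time. A sketch for the Linpack case, assuming the commonly used nominal count 2n³/3 + 2n² for the O(n²) term (function names are illustrative):

```python
def linpack_flops(n):
    """Nominal operation count of solving a dense n x n linear system
    by LU factorization: 2n^3/3 + 2n^2 (the 2n^2 term is the
    convention commonly used by the Linpack benchmark)."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2

def linpack_performance(n, seconds):
    """Benchmark performance in flop/s from a measured run time."""
    return linpack_flops(n) / seconds

# e.g. solving an n = 1000 system in 0.1 s sustains about 6.7 Gflops
rate = linpack_performance(1000, 0.1)
```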
Interconnection structures for parallel computers
Bisection or cross-section bandwidth
• Definition: the effective rate at which one half of the processing
nodes can send data to the other half (for worst case division of
the processors).
• It does not scale linearly with the number of processing nodes in
most interconnection schemes.
Complete communication graph
• The bisection bandwidth grows in proportion to the number of
nodes.
• The number of edges: n(n-1)/2
Bus
• The bisection bandwidth of the system is constant and equal to the
bandwidth of the bus.
• Simple software and hardware.
Crossbar switch
• Bisection bandwidth scales with the number of processing
nodes.
• Total number of communication network ports -- Θ(n)
• Number of links -- Θ(n²).
• In practice the crossbar switch is used only to interconnect a
relatively small number of processors.
Multistage switching networks
• A series of switches which are grouped in stages realizes the
connection between pairs of inputs and outputs.
• Can be organized in many different topologies fitted to particular
applications.
• Number of links -- Θ(n log(n))
• Bisection bandwidth -- Θ(n)
Example - Beneš network
Regular grids: 1-D arrays
• Linear processor array and ring.
• Bisection bandwidth -- Ω(1)
• Remote communication -- O(n)
Regular grids: 2-D arrays
• 2-D mesh
• Torus.
• Twisted torus.
• Remote communication needs time O(n^{1/2}).
• Bisection bandwidth -- Ω(n^{1/2}).
A two-dimensional mesh
Regular grids: 3-D arrays
• Remote communication -- O(n^{1/3})
• Bisection bandwidth -- Ω(n^{2/3})
Example: Cray T3E -- 10 x 10 x 10 grid
Trees
Binary tree
• Remote communication -- O(log(n)).
• Fit the communication requirements of reduction operations well, as well as those of a number of optimal algorithms based on divide-and-conquer techniques.
• Less suited for regular data-array redistribution operations.
• The decreasing aggregate bandwidth of a tree network in its
upper levels and in particular around the root presents a severe
bottleneck for massive communication.
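The log-depth pattern that makes trees a good fit for reductions can be illustrated with a sequential simulation of a tree-shaped sum (a sketch; names are illustrative):

```python
def tree_reduce(values):
    """Simulate a reduction (here: a sum) as a binary tree performs it:
    each level combines pairs, so n values need only ceil(log2(n))
    parallel steps rather than n - 1 sequential ones."""
    steps = 0
    while len(values) > 1:
        # one tree level: combine neighbours pairwise
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, steps = tree_reduce(list(range(8)))
assert (total, steps) == (28, 3)  # sum of 0..7 in log2(8) = 3 levels
```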
Fat tree
The aggregate bandwidth of a fat tree network is kept constant at all levels of the tree.
Binary Hypercubes
• A binary hypercube of degree d consists of n = 2^d nodes labeled by distinct d-bit binary numbers.
• Two nodes are connected by an edge iff their respective labels differ in exactly one bit position.
• O(n log(n)) links
• Bisection bandwidth scales in proportion with the number of nodes.
• Remote communication -- O(log(n))
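The definition translates directly into a small edge generator, which also confirms the link count (d·n/2 links in total) and the bisection claim (a sketch; names are illustrative):

```python
def hypercube_edges(d):
    """Edges of a binary hypercube of degree d: nodes are the d-bit
    labels 0 .. 2^d - 1; two nodes are adjacent iff their labels
    differ in exactly one bit (flip bit b with XOR)."""
    n = 2 ** d
    return [(u, u ^ (1 << b)) for u in range(n) for b in range(d)
            if u < u ^ (1 << b)]  # keep each edge once

edges = hypercube_edges(4)           # 16-node hypercube
assert len(edges) == 4 * 2**4 // 2   # d*n/2 = (n/2) log2(n) links

# Bisection: splitting on the top label bit cuts exactly n/2 = 8 links,
# so the bisection bandwidth grows in proportion to the number of nodes.
cut = [e for e in edges if (e[0] < 8) != (e[1] < 8)]
assert len(cut) == 8
```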
Examples of binary hypercubes