Parallel Computing
Parallel Computers
Parallel processing/computing:
– at least two processors have to cooperate
– by means of exchanging data
– while working on different parts of one and the same problem
Parallel computers: make use of multiple processors for parallel processing
Speed-up
Basic idea of parallel processing: execution time can be reduced by employing more than one processor; the larger the number of processors, the smaller the execution time.
Speed-up
s(p) = T1/Tp
T1 -- execution time on one processor
Tp -- execution time on p processors
• Best case: s(p) = p (linear speed-up)
• Worst case: s(p) = 1
• Generally: 1 ≤ s(p) ≤ p
Execution on one processor
T1 = Tseq + Tpar
Tseq -- execution time of sequential part
Tpar -- execution time of parallelisable part
Execution on p processors (assuming the parallelisable part divides evenly over the processors)
Tp = Tseq + Tpar/p
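Combining the two execution-time formulas gives the classical Amdahl's-law form of the speed-up. A minimal Python sketch (function and variable names are illustrative):

```python
def speedup(t_seq, t_par, p):
    """Speed-up s(p) = T1 / Tp with T1 = Tseq + Tpar
    and Tp = Tseq + Tpar / p (Amdahl's law)."""
    t1 = t_seq + t_par
    tp = t_seq + t_par / p
    return t1 / tp

# Even a small sequential part limits the speed-up:
# with 10% sequential work, s(p) can never exceed 10.
for p in (1, 2, 8, 64, 1000):
    print(p, round(speedup(1.0, 9.0, p), 2))
```

Note how the speed-up saturates: no matter how many processors are added, s(p) is bounded by (Tseq + Tpar)/Tseq.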
Shared memory parallel computers
processors have equally fast access to any location in memory
Distributed memory parallel computers
access to own memory is faster than access to other memory
Non-Uniform Memory Access (NUMA) Architectures
memory is globally addressable, but access time depends on which node physically holds the data
Performance
• Theoretical peak performance
• Linpack benchmark performance
Theoretical peak performance
• Theoretical peak performance Rpeak – the maximal number of
arithmetical operations (additions and/or multiplications) a processor
can carry out per second
Rpeak,1 = f · µpr
where:
– f – clock frequency
– µpr – maximum number of operations per clock cycle
• The theoretical peak performance of a parallel computer is equal to
the product of the number of processors and the theoretical peak
performance of one processor.
Rpeak,p = p · Rpeak,1
Examples:
• Cray J32
– f = 100 MHz, µpr = 2, Rpeak,1 = 200 Mflops
– p = 32, Rpeak,32 = 6.4 Gflops
• NEC SX-5
– f = 250 MHz, µpr = 32, Rpeak,1 = 8 Gflops
– p = 16, Rpeak,16 = 128 Gflops
– n = 32 nodes (NUMA), Rpeak,32×16 = 4 Tflops
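The example figures follow directly from the peak-performance formula. A small Python check (function name is illustrative; note that the 4 Tflops figure for the full 512-processor NEC SX-5 is a rounded value of 4.096 Tflops):

```python
def rpeak_flops(f_hz, mu_pr, p=1):
    """Theoretical peak performance: p * f * mu_pr operations per second."""
    return p * f_hz * mu_pr

# Cray J32: 100 MHz, 2 ops/cycle, 32 processors
assert rpeak_flops(100e6, 2) == 200e6            # 200 Mflops per processor
assert rpeak_flops(100e6, 2, p=32) == 6.4e9      # 6.4 Gflops

# NEC SX-5: 250 MHz, 32 ops/cycle, 16 processors/node, 32 nodes
assert rpeak_flops(250e6, 32) == 8e9             # 8 Gflops per processor
assert rpeak_flops(250e6, 32, p=16) == 128e9     # 128 Gflops per node
assert rpeak_flops(250e6, 32, p=16 * 32) == 4.096e12  # ~4 Tflops
```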
Benchmark performance
• Benchmark
– a program for a specific problem
– the number of operations which are executed is known
– used to measure the run time in single-user mode
– the measured run time determines the benchmark performance (operations per second)
Linpack benchmark
• Linpack – a popular library of Fortran subroutines for the
numerical solution of linear algebra problems
• Linpack benchmark – based on one particular subroutine which
is used for the solution of a dense system of linear equations
– algorithm: LU factorization by Gaussian elimination with partial
pivoting
– number of operations: 2n³/3 + O(n²) (n – number of unknowns)
• Top-500 list of most powerful computer installations
http://www.top500.org/
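A benchmark performance figure is simply the known operation count divided by the measured run time. A sketch for the Linpack case, assuming the commonly used nominal count 2n³/3 + 2n² for the O(n²) term (function names are illustrative):

```python
def linpack_flops(n):
    """Nominal operation count of solving a dense n x n linear system
    by LU factorization: 2n^3/3 + 2n^2 (the 2n^2 term is the
    convention commonly used by the Linpack benchmark)."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2

def linpack_performance(n, seconds):
    """Benchmark performance in flop/s from a measured run time."""
    return linpack_flops(n) / seconds

# e.g. solving an n = 1000 system in 0.1 s sustains about 6.7 Gflops
rate = linpack_performance(1000, 0.1)
```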
Interconnection structures for parallel computers
Bisection or cross-section bandwidth
• Definition: the effective rate at which one half of the processing
nodes can send data to the other half (for worst case division of
the processors).
• It does not scale linearly with the number of processing nodes in
most interconnection schemes.
Complete communication graph
• The bisection bandwidth grows in proportion to the number of
nodes.
• The number of edges: n(n-1)/2
Bus
• The bisection bandwidth of the system is constant and equal to the
bandwidth of the bus.
• Simple software and hardware.
Crossbar switch
• Bisection bandwidth scales with the number of processing
nodes.
• Total number of communication network ports -- Θ(n)
• Number of links -- Θ(n²).
• In practice the crossbar switch is used only to interconnect a
relatively small number of processors.
Multistage switching networks
• A series of switches which are grouped in stages realizes the
connection between pairs of inputs and outputs.
• Can be organized in many different topologies fitted to particular
applications.
• Number of links -- Θ(n log(n))
• Bisection bandwidth -- Θ(n)
Example - Beneš network
Regular grids: 1-D arrays
• Linear processor array and ring.
• Bisection bandwidth -- Ω(1)
• Remote communication -- O(n)
Regular grids: 2-D arrays
• 2-D mesh
• Torus.
• Twisted torus.
• Remote communication needs time O(n^{1/2}).
• Bisection bandwidth -- Ω(n^{1/2}).
A two-dimensional mesh
Regular grids: 3-D arrays
• Remote communication -- O(n^{1/3})
• Bisection bandwidth -- Ω(n^{2/3})
Example: Cray T3E -- 10 x 10 x 10 grid
Trees
Binary tree
• Remote communication -- O(log(n)).
• Fit the communication requirements of reduction operations well, as well as those of a number of optimal algorithms based on divide-and-conquer techniques.
• Less suited for regular data-array redistribution operations.
• The decreasing aggregate bandwidth of a tree network in its
upper levels and in particular around the root presents a severe
bottleneck for massive communication.
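The log-depth pattern that makes trees a good fit for reductions can be illustrated with a sequential simulation of a tree-shaped sum (a sketch; names are illustrative):

```python
def tree_reduce(values):
    """Simulate a reduction (here: a sum) as a binary tree performs it:
    each level combines pairs, so n values need only ceil(log2(n))
    parallel steps rather than n - 1 sequential ones."""
    steps = 0
    while len(values) > 1:
        # one tree level: combine neighbours pairwise
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, steps = tree_reduce(list(range(8)))
assert (total, steps) == (28, 3)  # sum of 0..7 in log2(8) = 3 levels
```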
Fat tree
The aggregate bandwidth of a fat tree network is kept constant at all levels of the tree.
Binary Hypercubes
• A binary hypercube of degree d consists of n = 2^d nodes labeled by distinct d-bit binary numbers.
• Two nodes are connected by an edge iff their respective labels differ in exactly one bit position.
• O(n log(n)) links
• Bisection bandwidth scales in proportion with the number of nodes.
• Remote communication -- O(log(n))
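The definition translates directly into a small edge generator, which also confirms the link count (d·n/2 links in total) and the bisection claim (a sketch; names are illustrative):

```python
def hypercube_edges(d):
    """Edges of a binary hypercube of degree d: nodes are the d-bit
    labels 0 .. 2^d - 1; two nodes are adjacent iff their labels
    differ in exactly one bit (flip bit b with XOR)."""
    n = 2 ** d
    return [(u, u ^ (1 << b)) for u in range(n) for b in range(d)
            if u < u ^ (1 << b)]  # keep each edge once

edges = hypercube_edges(4)           # 16-node hypercube
assert len(edges) == 4 * 2**4 // 2   # d*n/2 = (n/2) log2(n) links

# Bisection: splitting on the top label bit cuts exactly n/2 = 8 links,
# so the bisection bandwidth grows in proportion to the number of nodes.
cut = [e for e in edges if (e[0] < 8) != (e[1] < 8)]
assert len(cut) == 8
```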
Examples of binary hypercubes