CAQA5e ch1

Uploaded by

Minal Patil
Computer Architecture

A Quantitative Approach, Fifth Edition

Chapter 1
Fundamentals of Quantitative
Design and Analysis

Copyright 2012, Elsevier Inc. All rights reserved.

CAPP

CA Definition
PC Definition

If 1 man takes 10 hours to complete a job,
how long will it take if 10 men work together?

PARALLEL Computing!
Save wall clock time
Solve larger problems
Provide concurrency
(do multiple things at the same time)
Parallelism is the future of computing!

WHERE IS IT USEFUL?
Application Drivers for HPC
Weather Forecasting:
Atmosphere modeled by dividing it into 3-dimensional cells.
Calculations of each cell repeated many times to model passage of time.

Weather Forecasting Example

Consider the atmosphere to be divided into cells of size 1 mile × 1 mile × 1 mile, to a height of 10 miles.
The global region can then be modeled as about 2 × 10^9 cells.

Each cell has many physical variables (temperature, pressure, humidity, wind speed and direction, etc.) to be computed.

Suppose each calculation requires 200 floating point operations.

In one time step, 10^11 floating point operations are necessary.

To forecast the weather over 7 days using 1-minute intervals takes 7 × 24 × 60 × 10^11 = 10080 × 10^11 ≈ 10^15 floating point operations.

A computer operating at 1 Gflops (10^9 floating point operations/s) takes 10^6 seconds, or over 10 days.

To perform the calculation in 5 minutes requires a computer operating at 3.4 Tflops (3.4 × 10^12 floating point operations/s).
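The arithmetic in this example can be checked with a short sketch. The figures (10^11 operations per time step, 1-minute steps over 7 days) are taken from the slide as stated; the function names are illustrative only.

```python
# Figures taken from the slide as stated.
FLOPS_PER_STEP = 1e11        # floating point operations per time step
STEPS = 7 * 24 * 60          # number of 1-minute steps in 7 days


def total_operations(flops_per_step=FLOPS_PER_STEP, steps=STEPS):
    """Total floating point operations for the whole forecast."""
    return flops_per_step * steps


def runtime_seconds(total_ops, flops_rate):
    """Time needed at a given sustained rate (operations per second)."""
    return total_ops / flops_rate


def required_rate(total_ops, seconds):
    """Sustained rate needed to finish within the given time."""
    return total_ops / seconds


if __name__ == "__main__":
    ops = total_operations()
    print(f"total operations: {ops:.3e}")
    print(f"days at 1 Gflop/s: {runtime_seconds(ops, 1e9) / 86400:.1f}")
    print(f"rate for a 5-minute run: {required_rate(ops, 300) / 1e12:.2f} Tflop/s")
```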

Units of Measurement in HPC

Mflop/s 10^6 flop/sec
Gflop/s 10^9 flop/sec
Tflop/s 10^12 flop/sec
Pflop/s 10^15 flop/sec

MEMORY
Mbyte 10^6 byte (also 2^20 = 1048576)
Gbyte 10^9 byte (also 2^30 = 1073741824)
Tbyte 10^12 byte (also 2^40 = 1099511627776)
Pbyte 10^15 byte (also 2^50 = 1125899906842624)
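The gap between the decimal and binary readings of each memory prefix grows with the prefix. A small sketch, with hypothetical helper names, makes the difference concrete:

```python
# Index 0..3 corresponds to the prefixes M, G, T, P.
PREFIXES = ["M", "G", "T", "P"]


def decimal_size(prefix):
    """Decimal reading: 10^6, 10^9, 10^12, 10^15 bytes."""
    return 10 ** (6 + 3 * PREFIXES.index(prefix))


def binary_size(prefix):
    """Binary reading: 2^20, 2^30, 2^40, 2^50 bytes."""
    return 2 ** (20 + 10 * PREFIXES.index(prefix))


if __name__ == "__main__":
    for p in PREFIXES:
        ratio = binary_size(p) / decimal_size(p)
        print(f"{p}byte: binary / decimal = {ratio:.4f}")
```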

Applications of Parallel Computing


- Prediction of weather, climate, global changes
- Modeling motion of astronomical bodies
- Human genome
- Challenges in materials sciences
- Semiconductor design
- Superconductivity
- Structural biology
- Design of drugs
- Challenges in transportation
- Vehicle dynamics
- Nuclear fusion
- Enhanced oil and gas recovery
- Computational ocean sciences
- Speech
- Vision
- Visualization and graphics

High Performance Computing


Definition 1 (Wikipedia)
High Performance Computing (HPC) uses supercomputers
and computer clusters to solve advanced computational problems.
Today, computer systems approaching the teraflops region are
counted as HPC computers.

Definition 2
High Performance Computing (HPC) is the use of parallel
processing for running advanced application programs efficiently,
reliably and quickly.
The term applies especially to systems that function above a
teraflop, or 10^12 floating-point operations per second.

TOP500
Jack Dongarra, H. Simon, E. Strohmaier and H. Meuer
Listing of the 500 most powerful supercomputers in the world
Based on the LINPACK benchmark: a measure of a system's
floating point computing power
How fast a computer solves a dense N by N system of linear
equations
Result is reported as rate of execution in Mflop/s

Updated twice a year:
Supercomputing Conference (SC) in the United States in November
International Supercomputing Conference (ISC) in Europe in June

www.top500.org
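To make the LINPACK idea concrete, here is a minimal sketch that solves a dense N by N system with Gaussian elimination and converts the elapsed time into a rate. It is not the actual benchmark code: the solver is a plain-Python illustration, the function names are hypothetical, and the operation count 2/3 N^3 + 2 N^2 is the conventional count used for LINPACK-style reporting.

```python
import random
import time


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    a = [row[:] for row in a]   # work on copies, leave the inputs untouched
    b = b[:]
    for k in range(n):
        # Pick the row with the largest pivot candidate in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def linpack_flops(n):
    """Conventional operation count for solving a dense n by n system."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2


def measure_mflops(n=60, seed=0):
    """Time one solve of a random system and report Mflop/s."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    solve_dense(a, b)
    elapsed = time.perf_counter() - start
    return linpack_flops(n) / elapsed / 1e6


if __name__ == "__main__":
    print(f"{measure_mflops():.2f} Mflop/s (interpreted Python, for illustration)")
```

An interpreted solver like this runs far below hardware peak; the point is only how the reported rate is derived from problem size and elapsed time.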

Indian Supercomputers

PC TO PP: LANGUAGES

OpenMP, MPI

OpenCL

CUDA

OpenACC

OPENMP
Open Specifications for Multi Processing
What is OpenMP?
De-facto standard API for writing shared memory
parallel applications in C, C++, and Fortran
Consists of:
Compiler Directives
Runtime Routines
Environment variables
Specification maintained by the OpenMP Architecture
Review Board (http://www.openmp.org)
Version 4.0 was released in July 2013

History

OPENMP
When to consider
When compiler cannot find parallelism
The granularity is not high enough

USE EXPLICIT PARALLELIZATION OpenMP

OPENMP
Memory Model
Shared Memory, Thread Based Parallelism
Explicit Parallelism
Fork - Join Model
Compiler Directive Based
Nested Parallelism Support
Dynamic Thread
Memory Model : Flush often

OPENMP
Data is private or shared.
All threads have access to the same globally shared memory.
Shared data is accessible by all threads.
Private data is accessed only by the thread that owns it.
Data transfer is transparent to the programmer.
Synchronization takes place, but it is mostly implicit.
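The shared/private distinction and the fork-join pattern can be illustrated with ordinary Python threads. This is an analogy only, not OpenMP: the `threading` module stands in for OpenMP directives, the function name is hypothetical, and in CPython the global interpreter lock means this sketch shows the data-sharing model rather than any speedup.

```python
import threading


def parallel_sum(values, num_threads=4):
    """Sum a list with a team of threads, fork-join style."""
    total = 0                    # shared: one copy, visible to every thread
    lock = threading.Lock()      # explicit synchronization for the shared update

    def worker(tid):
        nonlocal total
        partial = 0              # private: each thread has its own copy
        for v in values[tid::num_threads]:
            partial += v
        with lock:               # only one thread updates the shared total at a time
            total += partial

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()                # fork: the team of threads begins work
    for t in threads:
        t.join()                 # join: wait for every thread before continuing
    return total


if __name__ == "__main__":
    print(parallel_sum(list(range(101))))
```

Here the lock is written out by hand; in OpenMP much of this synchronization is expressed through directives and clauses rather than explicit calls.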

OPENMP: Compilation
GNU Compiler Example :
gcc -o omp_helloc -fopenmp omp_hello.c

Advantages of OpenMP
Good performance and scalability
If you do it right ....

De-facto and mature standard


An OpenMP program is portable

Supported by a large number of compilers


Requires little programming effort
Allows the program to be parallelized incrementally

Introduction

Computer Technology

Performance improvements:
- Improvements in semiconductor technology
  - Feature size, clock speed
- Improvements in computer architectures
  - Enabled by HLL compilers, UNIX
  - Led to RISC architectures
- Together have enabled:
  - Lightweight computers
  - Productivity-based managed/interpreted programming languages

Introduction

Single Processor Performance

[Figure: single processor performance over time, annotated with "RISC" and "Move to multi-processor"]

Introduction

Current Trends in Architecture

Cannot continue to leverage instruction-level parallelism (ILP)
- Single processor performance improvement ended in 2003

New models for performance:
- Data-level parallelism (DLP)
- Thread-level parallelism (TLP)
- Request-level parallelism (RLP)

These require explicit restructuring of the application

Classes of Computers

Personal Mobile Device (PMD)
- e.g. smart phones, tablet computers
- Emphasis on energy efficiency and real-time

Desktop Computing
- Emphasis on price-performance

Servers
- Emphasis on availability, scalability, throughput

Clusters / Warehouse Scale Computers
- Used for Software as a Service (SaaS)
- Emphasis on availability and price-performance
- Sub-class: Supercomputers, emphasis: floating-point performance and fast internal networks

Embedded Computers
- Emphasis: price

Classes of Computers

Parallelism

Classes of parallelism in applications:
- Data-Level Parallelism (DLP)
- Task-Level Parallelism (TLP)

Classes of architectural parallelism:
- Instruction-Level Parallelism (ILP)
- Vector architectures/Graphic Processor Units (GPUs)
- Thread-Level Parallelism
- Request-Level Parallelism

Classes of Computers

Flynn's Taxonomy

Single instruction stream, single data stream (SISD)

Single instruction stream, multiple data streams (SIMD)
- Vector architectures
- Multimedia extensions
- Graphics processor units

Multiple instruction streams, single data stream (MISD)
- No commercial implementation

Multiple instruction streams, multiple data streams (MIMD)
- Tightly-coupled MIMD
- Loosely-coupled MIMD

Defining Computer Architecture

Old view of computer architecture:
- Instruction Set Architecture (ISA) design
- i.e. decisions regarding: registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding

Real computer architecture:
- Specific requirements of the target machine
- Design to maximize performance within constraints: cost, power, and availability
- Includes ISA, microarchitecture, hardware

Trends in Technology

Integrated circuit technology
- Transistor density: 35%/year
- Die size: 10-20%/year
- Integration overall: 40-55%/year

DRAM capacity: 25-40%/year (slowing)

Flash capacity: 50-60%/year
- 15-20X cheaper/bit than DRAM

Magnetic disk technology: 40%/year
- 15-25X cheaper/bit than Flash
- 300-500X cheaper/bit than DRAM
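Annual growth rates such as these compound, so it can help to convert them into doubling times. A minimal sketch, assuming steady compound growth (the function name is illustrative):

```python
import math


def doubling_time_years(annual_growth):
    """Years needed to double at a steady annual growth rate (0.35 means 35%/year)."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


if __name__ == "__main__":
    # Rates quoted on the slide.
    for label, rate in [("transistor density", 0.35),
                        ("magnetic disk", 0.40),
                        ("flash capacity, upper bound", 0.60)]:
        print(f"{label}: doubles in about {doubling_time_years(rate):.1f} years")
```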

Trends in Technology

Bandwidth and Latency

Bandwidth or throughput
- Total work done in a given time
- 10,000-25,000X improvement for processors
- 300-1200X improvement for memory and disks

Latency or response time
- Time between start and completion of an event
- 30-80X improvement for processors
- 6-8X improvement for memory and disks

Trends in Technology

Bandwidth and Latency

[Figure: log-log plot of bandwidth and latency milestones]

Trends in Technology

Transistors and Wires

Feature size
- Minimum size of transistor or wire in x or y dimension
- 10 microns in 1971 to 0.032 microns in 2011
- Transistor performance scales linearly
  - Wire delay does not improve with feature size!
- Integration density scales quadratically

Trends in Power and Energy

Power and Energy

Problem: Get power in, get power out

Thermal Design Power (TDP)
- Characterizes sustained power consumption
- Used as target for power supply and cooling system
- Lower than peak power, higher than average power consumption

Clock rate can be reduced dynamically to limit power consumption

Energy per task is often a better measurement

Trends in Power and Energy

Dynamic Energy and Power

Dynamic energy
- Transistor switch from 0 -> 1 or 1 -> 0
- 1/2 × Capacitive load × Voltage^2

Dynamic power
- 1/2 × Capacitive load × Voltage^2 × Frequency switched

Reducing clock rate reduces power, not energy
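The two relations, dynamic energy = 1/2 × capacitive load × voltage^2 and dynamic power = 1/2 × capacitive load × voltage^2 × frequency switched, can be explored with a short sketch. The function names and the 15% scaling figures are illustrative assumptions, not values from the slide.

```python
def dynamic_energy(capacitive_load, voltage):
    """Energy of one 0 -> 1 or 1 -> 0 transition."""
    return 0.5 * capacitive_load * voltage ** 2


def dynamic_power(capacitive_load, voltage, frequency):
    """Power when transitions occur at the given switching frequency."""
    return 0.5 * capacitive_load * voltage ** 2 * frequency


def energy_per_task(capacitive_load, voltage, frequency, cycles):
    """Energy for a task of a fixed number of cycles: power times run time."""
    run_time = cycles / frequency
    return dynamic_power(capacitive_load, voltage, frequency) * run_time


if __name__ == "__main__":
    # Assumed scenario: voltage and frequency both reduced by 15%.
    e_ratio = dynamic_energy(1.0, 0.85) / dynamic_energy(1.0, 1.0)
    p_ratio = dynamic_power(1.0, 0.85, 0.85) / dynamic_power(1.0, 1.0, 1.0)
    print(f"energy ratio: {e_ratio:.2f}, power ratio: {p_ratio:.2f}")
```

Because the run time of a fixed task grows as the clock slows, `energy_per_task` does not depend on frequency at all, which is the sense in which reducing clock rate reduces power but not energy.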

Trends in Power and Energy

Power

- Intel 80386 consumed ~2 W
- 3.3 GHz Intel Core i7 consumes 130 W
- Heat must be dissipated from a 1.5 x 1.5 cm chip
- This is the limit of what can be cooled by air

Trends in Power and Energy

Reducing Power

Techniques for reducing power:
- Do nothing well
- Dynamic Voltage-Frequency Scaling
- Low power state for DRAM, disks
- Overclocking, turning off cores

Trends in Power and Energy

Static Power

Static power consumption
- Current_static × Voltage
- Scales with number of transistors
- To reduce: power gating

Trends in Cost

Cost driven down by learning curve
- Yield

DRAM: price closely tracks cost

Microprocessors: price depends on volume
- 10% less for each doubling of volume
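One simple reading of "10% less for each doubling of volume" is a price that is multiplied by 0.9 every time volume doubles. The sketch below encodes that interpretation; the function name and the smooth extension to non-power-of-two ratios are assumptions for illustration.

```python
import math


def relative_price(volume_ratio, drop_per_doubling=0.10):
    """Price relative to the baseline after volume grows by volume_ratio."""
    doublings = math.log2(volume_ratio)
    return (1.0 - drop_per_doubling) ** doublings


if __name__ == "__main__":
    for ratio in (1, 2, 4, 8, 16):
        print(f"{ratio:>2}x volume -> {relative_price(ratio):.3f} of baseline price")
```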

Trends in Cost

Integrated Circuit Cost

Integrated circuit

Bose-Einstein formula:
Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N
- Defects per unit area = 0.016-0.057 defects per square cm (2010)
- N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
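A quick numerical sketch of the Bose-Einstein yield model, Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N. The sample inputs are assumptions chosen from within the ranges on the slide: 0.031 defects per square cm, N = 13.5, a 1.5 cm × 1.5 cm die, and a wafer yield of 100%.

```python
def die_yield(defects_per_cm2, die_area_cm2, n, wafer_yield=1.0):
    """Fraction of good dies under the Bose-Einstein yield model."""
    return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** n


if __name__ == "__main__":
    # Assumed inputs: 1.5 cm x 1.5 cm die, mid-range defect density and N.
    area = 1.5 * 1.5
    print(f"die yield: {die_yield(0.031, area, 13.5):.2f}")
    # A die with a quarter of the area yields considerably better.
    print(f"quarter-area die yield: {die_yield(0.031, area / 4, 13.5):.2f}")
```

The exponent N makes yield fall off sharply with die area, which is one reason large dies are disproportionately expensive.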

Dependability

Module reliability
- Mean time to failure (MTTF)
- Mean time to repair (MTTR)
- Mean time between failures (MTBF) = MTTF + MTTR
- Availability = MTTF / MTBF
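The definitions above translate directly into code. In this sketch the function names and the sample figures (an MTTF of 1,000,000 hours and an MTTR of 24 hours) are illustrative assumptions.

```python
def mtbf(mttf, mttr):
    """Mean time between failures = MTTF + MTTR (same time unit for both)."""
    return mttf + mttr


def availability(mttf, mttr):
    """Fraction of time the module is in service: MTTF / MTBF."""
    return mttf / mtbf(mttf, mttr)


if __name__ == "__main__":
    # Assumed example: MTTF of 1,000,000 hours, MTTR of 24 hours.
    print(f"availability: {availability(1_000_000, 24):.6f}")
```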

Measuring Performance

Typical performance metrics:
- Response time
- Throughput

Speedup of X relative to Y
- Execution time_Y / Execution time_X

Execution time
- Wall clock time: includes all system overheads
- CPU time: only computation time

Benchmarks
- Kernels (e.g. matrix multiply)
- Toy programs (e.g. sorting)
- Synthetic benchmarks (e.g. Dhrystone)
- Benchmark suites (e.g. SPEC06fp, TPC-C)
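The difference between wall clock time and CPU time, and the speedup ratio, can be seen in a small sketch. The helper names are hypothetical; `time.perf_counter` and `time.process_time` are the standard-library clocks used here for wall clock and CPU time respectively.

```python
import time


def measure(func):
    """Run func once and return (wall clock seconds, CPU seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def speedup(time_y, time_x):
    """Speedup of X relative to Y: execution time of Y over execution time of X."""
    return time_y / time_x


if __name__ == "__main__":
    # Sleeping consumes wall clock time but almost no CPU time.
    wall, cpu = measure(lambda: time.sleep(0.05))
    print(f"wall: {wall:.3f} s, cpu: {cpu:.3f} s")
    print(f"speedup of a 2 s run over a 10 s run: {speedup(10.0, 2.0):.1f}")
```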

Principles

Principles of Computer Design

Take Advantage of Parallelism
- e.g. multiple processors, disks, memory banks, pipelining, multiple functional units

Principle of Locality
- Reuse of data and instructions

Focus on the Common Case
- Amdahl's Law
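Amdahl's Law is named here without its formula. In its usual form, the overall speedup is 1 / ((1 - f) + f / s), where f is the fraction of execution time that benefits from an enhancement and s is the speedup of that fraction. A minimal sketch, with illustrative names and example values:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is sped up."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def amdahl_limit(fraction_enhanced):
    """Upper bound on speedup as the enhanced part becomes infinitely fast."""
    return 1.0 / (1.0 - fraction_enhanced)


if __name__ == "__main__":
    # Assumed example: 40% of the time is sped up by a factor of 10.
    print(f"overall speedup: {amdahl_speedup(0.4, 10.0):.4f}")
    print(f"upper bound:     {amdahl_limit(0.4):.4f}")
```

The bound shows why the common case matters: speeding up a part that accounts for 40% of the time can never give more than about a 1.67X overall gain.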

Principles of Computer Design

The Processor Performance Equation
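The equation itself did not survive on this slide. In its standard form, CPU time = instruction count × cycles per instruction (CPI) × clock cycle time, or equivalently instruction count × CPI / clock rate. A short sketch under that assumption (names and sample values are illustrative):

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


if __name__ == "__main__":
    # Assumed example: 1e9 instructions, CPI of 2.0, 1 GHz clock.
    print(f"CPU time: {cpu_time_seconds(1e9, 2.0, 1e9):.2f} s")
    # Halving CPI or doubling the clock rate has the same effect on CPU time.
    print(f"CPI halved:    {cpu_time_seconds(1e9, 1.0, 1e9):.2f} s")
    print(f"clock doubled: {cpu_time_seconds(1e9, 2.0, 2e9):.2f} s")
```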

Principles of Computer Design

Different instruction types having different CPIs
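When instruction types have different CPIs, the overall CPI is the weighted average: the sum over types of (instruction count × CPI for that type), divided by the total instruction count. The instruction mix below is a made-up example, and the function name is illustrative.

```python
def overall_cpi(mix):
    """Weighted CPI for a mix given as {type: (instruction_count, cpi)}."""
    total_instructions = sum(count for count, _ in mix.values())
    total_cycles = sum(count * cpi for count, cpi in mix.values())
    return total_cycles / total_instructions


if __name__ == "__main__":
    # Hypothetical instruction mix, counts per 100 instructions.
    mix = {
        "alu": (50, 1.0),
        "load/store": (30, 2.0),
        "branch": (20, 3.0),
    }
    print(f"overall CPI: {overall_cpi(mix):.2f}")
```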
