
Please read this disclaimer before proceeding:

This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
COMPUTER
ARCHITECTURE AND
ORGANIZATION
Department: Electronics and Communication Engineering
Batch/Year: 2019-23 / III
Created by:
Ms. P. Latha, Assoc. Prof/ECE, RMKEC
Mr. R. Babuji, Asst. Prof/ECE, RMKEC

Date: 13.10.2021
Table of Contents
1 Course Objectives
2 Pre Requisites
3 Syllabus
4 Course Outcomes
5 CO-PO/PSO Mapping
6 Unit 5 - Advanced Computer Architecture
  6.1 Lecture Plan
  6.2 Activity Based Learning
  6.3 Lecture Notes
      Parallel Processing Architectures and Challenges
      Processor Organization (Flynn's Classification)
      Hardware Multithreading
      Multicore and Shared Memory Multiprocessors
      Introduction to Graphics Processing Units
      Clusters
      Warehouse Scale Computers
      Introduction to Multiprocessor Network Topologies
      Video Lecture Links
      E-Book Reference & Lecture PPT Links
  6.4 Assignments
  6.5 Part A Questions & Answers
  6.6 Part B Questions
  6.7 Supportive Online Certification Courses
  6.8 Real Time Applications in Day-to-Day Life and to Industry
  6.9 Content Beyond the Syllabus
7 Assessment Schedule
8 Prescribed Text Books & Reference Books
9 Mini Project Suggestions


1. COURSE OBJECTIVES

The student should be made to:


To make students understand the basic structure and operation of digital computers.
To familiarize students with the implementation of fixed-point and floating-point arithmetic operations.
To study the design of the data path unit and control unit of a processor.
To understand the concepts of various memories and interfacing.
To introduce parallel processing techniques.
2. PRE REQUISITES

Subject Name : Fundamentals of Data Structures in C


Subject Code : EC8393
Semester : 3
Reason : Students should be familiar with developing C programs with conditionals and loops.

Subject Name : Digital Electronics


Subject Code : EC8392
Semester : 3
Reason : Students should be familiar with the digital fundamentals of number systems, the design of various combinational digital circuits, and semiconductor memories and related technology.

Subject Name : Fundamentals of Data Structures in C Laboratory


Subject Code : EC8381
Semester : 3
Reason : Students should be familiar with implementing C programs with conditionals and loops.

Subject Name : Analog and Digital Circuits Laboratory


Subject Code : EC8361
Semester : 3
Reason : Students should be familiar with designing and implementing combinational logic circuits.
3. SYLLABUS

EC8552 COMPUTER ARCHITECTURE AND ORGANIZATION

L T P C
3 0 0 3
UNIT I COMPUTER ORGANIZATION & INSTRUCTIONS 9
Basics of a computer system: Evolution, Ideas, Technology, Performance, Power
wall, Uniprocessors to Multiprocessors. Addressing and addressing modes.
Instructions: Operations and Operands, Representing instructions, Logical
operations, control operations.

UNIT II ARITHMETIC 9
Fixed point Addition, Subtraction, Multiplication and Division. Floating Point
arithmetic, High performance arithmetic, Subword parallelism

UNIT III THE PROCESSOR 9


Introduction, Logic Design Conventions, Building a Datapath - A Simple
Implementation scheme - An Overview of Pipelining - Pipelined Datapath and
Control. Data Hazards: Forwarding versus Stalling, Control Hazards, Exceptions,
Parallelism via Instructions.

UNIT IV MEMORY AND I/O ORGANIZATION 9


Memory hierarchy, Memory Chip Organization, Cache memory, Virtual memory.
Parallel Bus Architectures, Internal Communication Methodologies, Serial Bus
Architectures, Mass storage, Input and Output Devices.

UNIT V ADVANCED COMPUTER ARCHITECTURE 9


Parallel processing architectures and challenges, Hardware multithreading,
Multicore and shared memory multiprocessors, Introduction to Graphics
Processing Units, Clusters and Warehouse scale computers - Introduction to
Multiprocessor network topologies.
TOTAL:45 PERIODS
4. COURSE OUTCOMES

After successful completion of the course, the students should be able to

Course Outcome | Description | Level in Bloom's Taxonomy
C303.1 | Describe the basic structure and operation of digital computers. | K2
C303.2 | Experiment with fixed-point and floating-point arithmetic operations. | K3
C303.3 | Study the design of the data path unit and control unit of a processor. | K2
C303.4 | Discuss pipelined control units and the various types of hazards in instructions. | K2
C303.5 | Describe the concept of various memories and interfacing. | K2
C303.6 | Summarize the latest advancements in computer architecture. | K2
5. CO – PO/PSO MAPPING

Course Outcomes | Level of CO | PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 | PSO1 PSO2 PSO3
C303.1 | K2 | 2 1 1 1 - - - - - 2 - - | - - 2
C303.2 | K3 | 3 3 3 2 1 - - - - 2 - - | - - 3
C303.3 | K2 | 2 1 1 1 - - - - - 2 - - | - - 2
C303.4 | K2 | 2 1 1 1 - - - - - 2 - - | - - 2
C303.5 | K2 | 2 1 1 1 - - - - - 2 - - | - - 2
C303.6 | K2 | 2 1 1 1 - - - - - 2 - - | - - 2
C303 (overall) | | 3 3 3 2 1 - - - - 2 - - | - - 3
UNIT 5
ADVANCED COMPUTER
ARCHITECTURE
LECTURE PLAN
6.1 LECTURE PLAN

UNIT 5 – ADVANCED COMPUTER ARCHITECTURE

S.No | Topic | No. of Periods | Pertaining CO | Taxonomy Level | Delivery Mode
1 | Parallel processing architectures and challenges | 2 | CO6 | K2 | PPT through Online
2 | Hardware multithreading | 1 | CO6 | K2 | PPT through Online
3 | Multicore and shared memory multiprocessors | 1 | CO6 | K2 | PPT through Online
4 | Content Beyond Syllabus: Memory Management Techniques | 1 | CO6 | K2 | PPT through Online
5 | Introduction to Graphics Processing Units | 1 | CO6 | K2 | PPT through Online
6 | Clusters and Warehouse scale computers | 1 | CO6 | K2 | PPT through Online
7 | Introduction to Multiprocessor network topologies | 1 | CO6 | K2 | PPT through Online
8 | Content Beyond Syllabus: Memory Management Techniques | 1 | CO6 | K2 | PPT through Online

Total No. of Periods : 09


ACTIVITY BASED
LEARNING
6.2 ACTIVITY BASED LEARNING

UNIT 5
ADVANCED COMPUTER ARCHITECTURE

Sample Live Quizzes in www.myquiz.org


Connexions Event: https://tinyurl.com/CAOActivity
LECTURE NOTES
6.3 LECTURE NOTES

UNIT 5
ADVANCED COMPUTER ARCHITECTURE
Parallel processing architectures and challenges, Hardware multithreading, Multicore
and shared memory multiprocessors, Introduction to Graphics Processing Units,
Clusters and Warehouse scale computers - Introduction to Multiprocessor network
topologies.

5.1 PARALLEL PROCESSING ARCHITECTURES AND


CHALLENGES
Parallel processing increases the performance of the processor and reduces the time taken to execute a task. But obtaining parallel processing is not easy: the real difficulty lies in writing the programs that run in parallel.
The difficulty with parallelism is not on the hardware side; it is on the software side, because too few important application programs have been rewritten to complete tasks sooner on multiprocessors.
It is difficult to write software that uses multiple processors to complete one task faster, and the problem gets worse as the number of processors increases.
Developing parallel processing programs is much harder than developing sequential programs, for the following reasons:
The first reason is that we must get better performance or better energy efficiency from a parallel processing program on a multiprocessor; otherwise we would simply use a sequential program on a uniprocessor, because sequential programming is simpler.
Moreover, uniprocessor design techniques such as superscalar and out-of-order execution take advantage of instruction-level parallelism without involving the programmer, which reduces the demand for rewriting programs for multiprocessors.
As the number of processors increases, it becomes even more difficult to write parallel processing programs that are faster than sequential programs, because of the following:
a) Scheduling
b) Partitioning the work into parallel pieces
c) Balancing the load evenly between the workers
d) Time to synchronize
e) Overhead for communication between the parties
These are the five main challenges developers have to face when writing a parallel processing program.

a) Scheduling
Scheduling is the method by which threads, processes or data flows are given access to system resources (e.g. processor time, communication bandwidth).
Scheduling is done to balance the load, to share system resources effectively, or to achieve a target quality of service.
Scheduling arises in various contexts; process scheduling is the most important one here, because in parallel processing the processes must be scheduled correctly. Process scheduling can be done in the following ways:
➢ Long term scheduling
➢ Medium term scheduling
➢ Short term scheduling
➢ Dispatcher

b) Partitioning the work
The task must be broken into pieces of equal size; otherwise some processors may sit idle while waiting for the ones with larger pieces to finish.
To obtain parallel processing, the task must be divided equally among all the processors. Only then can the idle time of every processor be avoided.

c) Balancing the load
Load balancing is the process of dividing the amount of work that a computer has to do between two or more processors, so that more work gets done in the same amount of time and, in general, all processes get served faster.
The workload has to be distributed evenly between the processors to obtain a parallel processing task.

d) Time to synchronize
Synchronization is one of the most important challenges in parallel processing. Because all the processors share the workload, each one must complete its task within a specific time period.
A parallel processing program must budget time for synchronization: if any process does not complete its task within the expected period, the others must wait for it, and the benefit of parallel processing is lost.

e) Communication between the parties
Inter-processor communication virtually always implies overhead.
Knowing which tasks must communicate with each other is critical during the design stage of a parallel code.
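
To make these challenges concrete, here is a minimal C sketch (an illustrative addition, not part of the original notes) that partitions an array sum across POSIX threads; the array size, thread count and names are assumptions, and a system with pthreads is assumed.

/* Illustrative sketch only: partitioning an array sum across NTHREADS
 * POSIX threads and synchronizing the partial results with a lock. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double data[N];
static double total = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;
    long chunk = N / NTHREADS;                 /* partitioning the work */
    long lo = id * chunk;
    long hi = (id == NTHREADS - 1) ? N : lo + chunk;
    double partial = 0.0;
    for (long i = lo; i < hi; i++)
        partial += data[i];
    pthread_mutex_lock(&lock);                 /* time to synchronize */
    total += partial;                          /* communicating results */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++) data[i] = 1.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("sum = %.0f\n", total);
    return 0;
}

The equal-sized chunks illustrate partitioning and load balancing, the joins illustrate time to synchronize, and the lock around the shared total is the communication overhead the text warns about.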

PROBLEMS :

Example 1:
To achieve a speed-up of 90 times with 100 processors, what percentage of the original computation can be sequential?

Answer:
Amdahl's Law says

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

We can reformulate Amdahl's Law in terms of speed-up versus the original execution time:

Speed-up = Execution time before / [(Execution time before - Execution time affected) + (Execution time affected / Amount of improvement)]

This formula is usually rewritten assuming that the execution time before is 1 for some unit of time, and the execution time affected by improvement is considered the fraction of the original execution time:

Speed-up = 1 / [(1 - Fraction time affected) + (Fraction time affected / Amount of improvement)]

Substituting 90 for speed-up and 100 for amount of improvement into the formula above:

90 = 1 / [(1 - Fraction time affected) + (Fraction time affected / 100)]

Then simplifying the formula and solving for the fraction of time affected:
90 * (1 - 0.99 * Fraction time affected) = 1
90 - (90 * 0.99 * Fraction time affected) = 1
90 - 1 = 90 * 0.99 * Fraction time affected
Fraction time affected = 89/89.1 = 0.999
Thus, to achieve a speed-up of 90 from 100 processors, the sequential percentage can only be 0.1%.
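
As a quick check of this algebra, the short program below (an illustrative sketch, not from the notes) solves Amdahl's Law for the affected fraction given a target speed-up and processor count:

#include <stdio.h>

/* Illustrative sketch: solve Amdahl's Law for the fraction of execution
 * time that must be parallelizable to reach a target speed-up. */
int main(void) {
    double speedup = 90.0;      /* target speed-up */
    double n = 100.0;           /* amount of improvement (processors) */
    /* speedup = 1 / ((1 - f) + f/n)  =>  f = (1 - 1/speedup) / (1 - 1/n) */
    double f = (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n);
    printf("fraction affected = %.4f, sequential = %.4f%%\n",
           f, (1.0 - f) * 100.0);
    return 0;
}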

Example: 2
Speed-up Challenge: Bigger Problem
Suppose you want to perform two sums: one is a sum of 10 scalar variables, and
one is a matrix sum of a pair of two-dimensional arrays, with dimensions 10 by 10.
For now let’s assume only the matrix sum is parallelizable; we’ll see soon how to
parallelize scalar sums. What speed-up do you get with 10 versus 40 processors?
Next, calculate the speed-ups assuming the matrices grow to 20 by 20.

Answer:
If we assume performance is a function of the time for an addition, t, then there are 10 additions that do not benefit from parallel processors and 100 additions that do. If the time for a single processor is 110t, the execution time for 10 processors is

Execution time after improvement = 100t/10 + 10t = 20t

So the speed-up with 10 processors is 110t/20t = 5.5. The execution time for 40 processors is

Execution time after improvement = 100t/40 + 10t = 12.5t

So the speed-up with 40 processors is 110t/12.5t = 8.8.
Thus, for this problem size, we get about 55% of the potential speed-up with 10 processors, but only 22% with 40.
When the matrices grow to 20 by 20, the sequential program now takes 10t + 400t = 410t.
The execution time for 10 processors is

Execution time after improvement = 400t/10 + 10t = 50t

So the speed-up with 10 processors is 410t/50t = 8.2.
The execution time for 40 processors is

Execution time after improvement = 400t/40 + 10t = 20t

So the speed-up with 40 processors is 410t/20t = 20.5.
Thus, for this larger problem size, we get 82% of the potential speed-up with 10 processors and 51% with 40.
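
The same arithmetic can be tabulated for any processor count and matrix size with a short program (an illustrative sketch, not from the notes; the function name is arbitrary):

#include <stdio.h>

/* Illustrative sketch: speed-up for 10 serial scalar additions plus a
 * parallelizable dim x dim matrix sum, as in the example above. */
static double speedup(int procs, int dim) {
    double serial = 10.0;                /* scalar sums: not parallelizable */
    double parallel = (double)dim * dim; /* matrix sum: fully parallelizable */
    double before = serial + parallel;
    double after = serial + parallel / procs;
    return before / after;
}

int main(void) {
    printf("10x10, 10 procs: %.1f\n", speedup(10, 10)); /* 5.5 */
    printf("10x10, 40 procs: %.1f\n", speedup(40, 10)); /* 8.8 */
    printf("20x20, 10 procs: %.1f\n", speedup(10, 20)); /* 8.2 */
    printf("20x20, 40 procs: %.1f\n", speedup(40, 20)); /* 20.5 */
    return 0;
}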

5.2 PROCESSOR ORGANIZATION [FLYNN’S


CLASSIFICATION]
Processor organization deals with how the parts of the processor, such as the control units and processing elements (ALUs), are linked together to improve the performance of the processor. Flynn's classification defines four categories:

a) SISD (Single Instruction stream Single Data stream)


b) SIMD (Single Instruction stream Multiple Data stream)
c) MISD (Multiple Instruction stream Single Data stream)
d) MIMD (Multiple Instruction stream Multiple Data stream)
a) SISD (Single Instruction stream Single Data stream)
SISD has a single control unit and gets a single instruction from the main memory at a time.
It has one processing element (ALU) and uses one data stream connected to the main memory.
The processing unit may have multiple functional units (add, multiply, load, etc.)
Instruction stream (Is) = Data stream (Ds) = 1
An example of SISD is the uniprocessor.

Figure 5.2.1 SISD Processor

b) SIMD (Single Instruction stream Multiple Data stream)


SIMD has one instruction stream and multiple data streams.
SIMD is most suitable for handling arrays in for loops, where data-level parallelism is achieved.
SIMD has a single control unit producing a single stream of instructions and multiple streams of data.
SIMD has more than one processing unit, and each processing unit has its own associated data memory unit.
SIMD fetches only one copy of the code from the main memory, and the operation is performed on multiple independent data items obtained from the main memory, as shown in Figure 5.2.2.
This reduces the required instruction bandwidth and space.
SIMD is not suitable for case or switch statements, because each execution unit (processing unit) must perform a different operation on its data.
The processor must complete the current instruction before it takes up the next instruction, i.e. execution of instructions is synchronous.
Examples of SIMD are vector processors and array processors.
Instruction stream (Is) = 1, Data stream (Ds) > 1

Figure 5.2.2 SIMD Processor (PE: Processing Element, MM: Main Memory)

c) MISD (Multiple Instruction stream, Single Data stream)


In this organization, multiple control units are used to control multiple processing units.
Each control unit handles one instruction and processes it through its corresponding processing element.
Only one data stream passes through all processing elements at a time, from a common shared memory.
All processing elements interact with the common shared memory for the organization of the single data stream, as shown in Figure 5.2.3.
The only known example of a computer capable of MISD operation is the C.mmp built by Carnegie-Mellon University.
Instruction stream (Is) > 1, Data stream (Ds) = 1
Figure 5.2.3 MISD Processor

d) MIMD (Multiple Instruction stream, Multiple Data stream)


In this organization, multiple control units handle multiple instructions at the same time.
Multiple processing elements are used, with a separate data stream drawn from main memory for each processing element.
Each processor works on its own instructions and its own data.
Tasks executed by different processors are asynchronous, i.e. each task can start or finish at a different time.
Therefore, for handling multiple instruction streams, multiple control units and multiple processing elements are organized such that the processing elements handle multiple data streams from the main memory, as shown in Figure 5.2.4.
This organization represents a real parallel computer - for example, the Graphics Processing Unit (GPU).
Instruction stream (Is) > 1, Data stream (Ds) > 1
Figure 5.2.4 MIMD Processor

5.2.1 SIMD-VECTOR ARCHITECTURE


SIMD includes vector architecture, which is used for data-level parallelism.
In a vector architecture, data are collected from memory and put in proper order into a set of vector registers. These registers are operated on sequentially using pipelined execution units, and the results are written back to memory.
Old array processors used 64 ALUs to do 64 additions simultaneously.
SIMD vector architecture instead uses a smaller number of ALUs (even one) and passes the data through lanes and pipelines, which reduces the hardware cost.
In the MIPS vector architecture, 32 vector registers are provided, and each register holds 64 vector elements, each 64 bits in size.
The hardware gets the addresses of the vector elements from these vector registers. Such indexed accesses are called gather-scatter.
The data need not be contiguous in main memory: an indexed load instruction gathers data from the main memory and puts them into contiguous vector elements, while an indexed store instruction scatters vector elements across main memory.
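
As a plain-C picture of these semantics (an illustrative sketch, not from the notes; names are arbitrary), a gather reads scattered elements into a contiguous vector and a scatter writes them back:

/* Illustrative sketch of gather-scatter semantics in C. An indexed load
 * (gather) reads scattered elements into a contiguous vector; an indexed
 * store (scatter) writes them back to scattered locations. */
void gather(double *vec, const double *mem, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        vec[i] = mem[idx[i]];   /* gather: scattered -> contiguous */
}

void scatter(double *mem, const double *vec, const int *idx, int n) {
    for (int i = 0; i < n; i++)
        mem[idx[i]] = vec[i];   /* scatter: contiguous -> scattered */
}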
The number of elements in a vector operation is not encoded in the instruction or opcode; it is held in a separate register.
MIPS vector instructions are obtained by appending the letter 'v' to MIPS instructions, for example:
addv.d   # adds two double precision vectors; accesses the inputs using two vector registers
addvs.d  # takes one input from a scalar register and the other from a vector register; the scalar input is added to each element of the vector
lv       # load vector (double precision data)
sv       # store vector (double precision data)

Vector versus Multimedia Extensions:


If two vectors A and B are to be added and the result stored in vector C, this can be written as C = A + B, where A(1), A(2), etc. are the vector elements.
Figure 5.2.5 (a) shows a single 'add' pipeline adding vectors A and B and storing the result in vector C. Here one addition is performed per cycle.

Figure 5.2.5 Using multiple functional units to improve the performance of a single vector add instruction, C = A + B
Figure 5.2.5 (b) uses four add pipelines, or lanes, so it completes four additions per cycle.
The number of clock cycles required to execute a vector addition is thus reduced by a factor of 4.
Each vector lane uses a portion of the vector register file.

Vector Lane:
Figure 5.2.6 uses four lanes. Each vector lane contains more than one functional unit.
Three functional units are provided: an FP add unit, an FP multiply unit, and a load-store unit.
The elements of a single vector are interleaved across the four lanes, and each lane uses a portion of the vector register file.
The vector storage is divided across the four lanes, and each lane holds every fourth element of each vector register.
Figure 5.2.6 Structure of a Vector Unit Containing 4 Lanes


Problem:
Write an MIPS program using vector instructions to solve
Y = a*X + Y
Where X and Y are vectors (arrays) with 64 double precision floating
point numbers i.e. 64 numbers of 64-bit size each stored in the memory.
Assume that the starting address of X is in $s0 and starting address of Y
is in $s1

Solution:

l.d      $f0, a($sp)     # load scalar 'a' into register f0
lv       $v1, 0($s0)     # load vector X (pointed to by s0) into v1
mulvs.d  $v2, $v1, $f0   # multiply vector v1 by scalar f0; result in v2
lv       $v3, 0($s1)     # load vector Y into v3
addv.d   $v4, $v2, $v3   # add vector Y (in v3) to the product (in v2); result in v4
sv       $v4, 0($s1)     # store v4 back to vector Y (pointed to by s1)
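
For comparison with multimedia extensions, the same Y = a*X + Y loop can be written in C with x86 AVX intrinsics, which process four doubles per 256-bit register much like a 4-lane vector unit. This is an illustrative sketch, not part of the original notes; it assumes an AVX-capable compiler and CPU, and that n is a multiple of 4 (n = 64 above):

#include <immintrin.h>

/* Illustrative DAXPY (Y = a*X + Y) using x86 AVX intrinsics: four doubles
 * per 256-bit register, analogous to a 4-lane vector unit. */
void daxpy(int n, double a, const double *x, double *y) {
    __m256d va = _mm256_set1_pd(a);          /* broadcast scalar a to all lanes */
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]); /* load 4 elements of X */
        __m256d vy = _mm256_loadu_pd(&y[i]); /* load 4 elements of Y */
        vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
        _mm256_storeu_pd(&y[i], vy);         /* store result back to Y */
    }
}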

5.3 HARDWARE MULTITHREADING


Thread:
A thread is a sequence of instructions that can run independently of other programs.
When a sequence of instructions is being executed, the processor may have to wait if the next instruction or data is not available. This is called stalling.
Instead of waiting, the processor may switch to another thread, execute it, and come back to this thread.

Multithreading:
Switching from one thread (a stalled thread) to another thread is known as multithreading. All the threads generally share a single address space, while each thread keeps its own program counter, stack and register state.

Process:
A process includes one or more threads and their address space. Switching from one process to another invokes the operating system (OS).
The main difference between a thread switch and a process switch: multithreading stays within a single address space and does not invoke the OS, whereas a process switch crosses address spaces and requires the operating system's help. So a thread can be regarded as a lightweight process; it is smaller than a process.

Types of multithreading:

5.3.1 Fine-grained multithreading


5.3.2 Coarse-grained multithreading
5.3.3 Simultaneous multithreading

5.3.1 Fine-grained multithreading


Switching between threads happens on each instruction, i.e. on every clock cycle.
Switching is done in a round-robin fashion, as shown in Figure 5.3.2 (a): the 1st instruction from the 1st thread is executed, then the 1st instruction from the 2nd thread, and so on until the 1st instruction of the last thread completes; then the 2nd instruction from the 1st thread is executed, then the 2nd instruction from the 2nd thread, and so on.
Advantage: if a thread is stalled, it is skipped and the next thread continues. This is called interleaving, and it improves throughput.
Disadvantage: a thread that is ready to execute its next instruction without stalling must still wait for its turn behind the other threads. This slows down the execution of individual threads.
5.3.2 Coarse-grained multithreading
Threads are switched only when costly stalls (e.g. a last-level cache miss) occur.
So frequent thread switching is avoided, and the slow-down of individual thread execution seen in fine-grained multithreading is avoided.
Disadvantage: when a switch does occur, the pipeline must be emptied or frozen, and the new thread starts execution only after the pipeline has been refilled. This is called start-up overhead.
Advantage: the processor keeps issuing instructions from the same thread when shorter stalls occur, so a switch (and its start-up cost) is paid only for high-cost stalls, where the pipeline refill time is negligible compared to the stall time. Hence the throughput cost is minimized and the penalty of high-cost stalls is reduced.

5.3.3 Simultaneous multithreading


Multiple instructions are executed from multiple independent threads using register renaming. If the threads are dependent, the dependencies are handled by dynamic scheduling. SMT does not switch resources every clock cycle: it is always executing instructions from multiple threads, leaving it to the hardware to associate instruction slots and renamed registers with their proper threads.
SMT processors have dynamically scheduled pipelines that exploit both thread-level parallelism and instruction-level parallelism, and they have more functional units to implement this parallelism.
Figure 5.3.1 shows three threads that execute independently, with stalls. The empty rows indicate unused clock cycles (stalls). One row of each thread is issued to the pipeline in each clock cycle.
Figure 5.3.2 (a) shows how the three threads shown in Figure 5.3.1 are
executed when fine grained multithreading is applied
Figure 5.3.2 (b) shows how the three threads shown in Figure 5.3.1 are
executed when coarse grained multithreading is applied
Figure 5.3.2 (c) shows how the three threads shown in Figure 5.3.1 are
executed when simultaneous multithreading is applied
Figure 5.3.1 Independent threads

Figure 5.3.2 The above three threads executed with Multithreading

5.4 MULTICORE AND OTHER SHARED MEMORY


MULTIPROCESSORS

Multiprocessor : A computer system with at least two processors is called a multiprocessor.
Multicore : A chip containing more than one processor (core) is called a multicore chip.
Figure 5.4.1 Classic organization of a shared memory multiprocessor

The conventional multiprocessor system is commonly referred to as a shared memory multiprocessor system.
A Shared Memory Multiprocessor (SMM) is one that offers the programmer a single physical address space across all processors, which is nearly always the case for multicore chips.
Processors communicate through shared variables in memory, with all processors capable of accessing any memory location via loads and stores.
Systems can still run independent jobs in their own virtual address spaces, even if they all share a physical address space.
Use of shared data must be coordinated via synchronization primitives (locks) that allow only one processor at a time to access the data.
Two types of multiprocessing:
➢ More than one processor in a single chip - a multicore processor.
➢ More than one processor connected in a single system - a multiprocessor system.
5.4.1. SHARED MEMORY SYSTEM [TIGHTLY COUPLED SYSTEM]
All the processors share a single global memory. The global memory may be divided into many modules, but a single address space is used (e.g. a multicore processor).
Processors communicate using shared locations (variables) in the global memory.
These shared data are coordinated using locks (synchronization primitives), which allow the data to be accessed by only one processor at a time.
Shared memory systems use a common bus, a crossbar, or a multistage network to connect the processors, memory and I/O devices.
Programs stored in the virtual address space of each processor can run independently.
This organization is used for high speed real-time processing and provides higher throughput than loosely coupled systems.

Figure 5.4.2 Tightly Coupled System Organization
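
A minimal shared-memory programming sketch in C with OpenMP (an illustrative addition, not from the notes; assumes a compiler with OpenMP support, e.g. gcc -fopenmp): all threads read the same array in a single address space, and the reduction clause coordinates updates to the shared sum, playing the role of the locks described above.

#include <omp.h>
#include <stdio.h>

/* Illustrative shared-memory parallelism with OpenMP: all threads see the
 * same array; the reduction clause synchronizes updates to 'sum'. */
int main(void) {
    static double a[1000000];
    double sum = 0.0;
    for (int i = 0; i < 1000000; i++) a[i] = 1.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += a[i];            /* each thread sums a slice; OpenMP combines */

    printf("sum = %.0f (threads share one address space)\n", sum);
    return 0;
}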


5.4.1.1 Uniform Memory Access (UMA) System:
In this model, main memory is uniformly shared by all processors in
multiprocessor systems and each processor has equal access time to shared
memory.
This model is used for time-sharing applications in a multiuser environment.
UMA systems are divided into two,

A) Symmetric UMA (SUMA)


B) Asymmetric UMA (AUMA)

In the case of SUMA, all processors are identical. Processors may have local cache memories and I/O devices. Physical memory is uniformly shared by all processors, with equal access time to all words.
In the case of AUMA, one master processor executes the operating system, and the other processors may be dedicated to special tasks such as graphics rendering or mathematical functions.

Figure 5.4.3 UMA Processor


5.4.1.2 Non-uniform Memory Access (NUMA) System:
In a NUMA system, each processor may have a local memory. Each local memory holds its own private program and private data. The collection of all the local memories forms the global memory, i.e. the local memory of one processor may be accessed by another processor using shared variables.
The time taken for one processor to access the local memory of another, remote processor is not uniform; it depends on the location of the processor and the memory.
NUMA machines can scale to larger sizes, with lower latency (access time) to local memory.

Figure 5.4.4 NUMA Processor


5.4.2 DISTRIBUTED MEMORY SYSTEM (DMS) [LOOSELY
COUPLED SYSTEMS]
DMS systems do not use a global shared memory, since use of a global memory creates memory conflicts and slows down execution.
A DMS has multiple processors, and each processor has a large local memory and a set of I/O devices that are not shared by any other processor. Such a system is therefore called a distributed multicomputer system.
The group of computers connected together is called a cluster, and each computer is called a node.
These computers communicate with each other by passing messages through an interconnection network.
To pass a message to another computer in the cluster, a 'send message' routine is used.
To receive a message from another computer in the cluster, a 'receive message' routine is used.
Figure 5.4.5 Loosely Coupled system Configuration

5.5 INTRODUCTION TO GRAPHICS PROCESSING UNITS


GPU is a processor specially designed for handling graphics rendering tasks.
GPUs are used to accelerate processing in video editing, video game rendering, 3D modelling (AutoCAD), AI-based tasks etc.
A GPU breaks complex problems into many tasks and works on them in parallel.
GPUs are highly multithreaded.
Central Processing Unit (CPU) | Graphics Processing Unit (GPU)
CPU has general purpose instructions and is more suitable for serial processing. | GPU has its own special instructions to handle graphics and is more suitable for parallel processing.
CPUs have just a few cores with cache and can handle only a few threads at a time. | GPUs have hundreds of cores and can handle thousands of threads simultaneously.
CPU has a large main memory which is oriented toward low latency. | GPU has a separate large main memory which is oriented toward bandwidth rather than latency, and provides high throughput.
CPU is mainly designed for instruction level parallelism. | GPU is designed for data level parallelism.
5.5.1 GPU ARCHITECTURE - NVIDIA
NVIDIA is an American company. They developed the 'Compute Unified Device Architecture' (CUDA) for their graphics processing units; its commercial name is Fermi.
GeForce is a brand of graphics processing units designed by NVIDIA.
A GPU contains a collection of multithreaded SIMD processors, and hence it is a MIMD processor.
The number of SIMD processors is 7, 11, 14 or 15 in the Fermi architecture.
A CUDA program calls kernels that run in parallel.
The GPU executes a kernel on a grid. A grid consists of many thread blocks, and each thread block consists of many threads which are executed in parallel (Figure 5.5.2).
Each thread within a thread block is called a machine object, and it has a thread ID, program instructions, a program counter, registers, per-thread private memory, inputs and outputs.
Machine objects are created, managed, scheduled and executed by the GPU.
Figure 5.5.1 shows a simplified block diagram of a multithreaded SIMD processor.

Figure 5.5.1 Simplified block diagram of the datapath of a multithreaded SIMD processor
The GPU has two schedulers

a) Thread Block Scheduler: Assigns blocks of threads to multithreaded


SIMD processors.
b) SIMD Thread Scheduler: This is available within each SIMD processor and has its own controller. It identifies the threads that are ready to run, schedules them for execution, and sends them to the dispatcher when needed.
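
A minimal CUDA C sketch of this kernel/grid/thread-block model (an illustrative addition, not from the notes; the kernel name, block size and array size are assumptions):

#include <cuda_runtime.h>

/* Illustrative CUDA kernel: each thread computes one element. The grid is
 * made of thread blocks; blockIdx/threadIdx give a thread's position. */
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread ID */
    if (i < n)
        data[i] *= factor;
}

/* Host side: launch a grid with enough 256-thread blocks to cover n. */
void launch_scale(float *d_data, float factor, int n) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, factor, n);
}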

5.5.2 NVIDIA GPU MEMORY STRUCTURES


Figure 5.5.2 shows the memory structures of an NVIDIA GPU.
The on-chip memory that is local to each multithreaded SIMD processor is called Local Memory. It is shared by the SIMD lanes within a multithreaded SIMD processor, but it is not shared between multithreaded SIMD processors.
The off-chip DRAM is shared by all thread blocks and is called global memory or GPU memory.

Figure 5.5.2 NVIDIA GPU Memory structures


5.6 CLUSTERS AND WAREHOUSE SCALE COMPUTERS
The alternative approach to sharing an address space is for the processors to
each have their own private physical address space.
Figure 5.6.1 shows the classic organization of a multiprocessor with multiple
private address spaces.

Figure 5.6.1 Organization of a multiprocessor with multiple private address spaces

Since there is no shared memory space, this alternative multiprocessor must communicate via explicit message passing. Hence such processors are also known as message-passing multiprocessors.
In multiprocessor systems with no shared memory, we use a message passing mechanism to perform inter-process communication. The communication between processors is achieved by passing messages through I/O channels.
When a processor wants to communicate with another processor, it uses a special procedure that initiates communication. It identifies the destination processor, and once the source and destination processors are identified, a communication channel is established. A message is then sent through the communication channel.
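
In practice, such message passing is usually written against a library like MPI. The hedged C sketch below (an illustrative addition, not from the notes; assumes an MPI installation and mpicc) shows one processor sending a value to another through explicit send and receive routines:

#include <mpi.h>
#include <stdio.h>

/* Illustrative message passing: rank 0 sends one integer to rank 1.
 * Each process has its own private address space; data moves only via
 * explicit send/receive. */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* from rank 0 */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}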

5.6.1 CLUSTERS
Clusters are collections of desktop computers or servers connected by local area networks
to act as a single larger computer. Each node runs its own operating system, and nodes
communicate using a networking protocol.
Since a cluster consists of independent computers connected through a local area
network, it is much easier to replace a computer without bringing down the system in a
cluster than in a shared memory multiprocessor.
It is also easy for clusters to scale down gracefully when a server fails,
thereby improving dependability.
Since the cluster software is a layer that runs on top of the local operating
systems running on each computer, it is much easier to disconnect and
replace a broken computer.
Lower cost, higher availability and rapid, incremental expandability make clusters attractive to Internet service providers, despite their poorer communication performance when compared to large-scale shared memory multiprocessors.

5.6.2 WAREHOUSE SCALE COMPUTERS


The largest of the clusters are called Warehouse-scale computers
(WSCs).
A warehouse-scale computer (WSC) is a cluster composed of tens of thousands of servers.
WSCs share some common goals with servers:
➢ Ample, easy parallelism: In the case of a server, we need to worry about whether the applications have enough parallelism to justify the amount of parallel hardware. This is not the case with WSCs: most jobs are totally independent and exploit "Request-Level Parallelism".
➢ Operational Costs Count: Server architects normally design for peak performance within a cost budget; power concerns matter only insofar as the cooling requirements are met, and the operational costs are otherwise ignored. WSCs, however, have longer lifetimes, and the building, electrical and cooling costs are very high, so the operational costs cannot be ignored. In a WSC, power consumption is a primary, not secondary, constraint when designing the system.
➢ Scale and the Opportunities/Problems Associated with Scale: WSCs are so massive internally that you get economy of scale even if there are not many WSCs, whereas custom hardware can be very expensive, particularly if only small numbers are manufactured. These economies of scale led to cloud computing, as the lower per-unit costs of a WSC meant that cloud companies could rent servers at a profitable rate and still be below what it costs outsiders to do it themselves. The other side of scale is failures: even if a server had a Mean Time To Failure (MTTF) of twenty-five years, the WSC architect should design for five server failures per day.
5.7 INTRODUCTION TO MULTIPROCESSOR NETWORK
TOPOLOGIES
Network topology indicates how the nodes are connected in a network.
Multicore chips require on-chip networks to connect cores together.
Clusters require local area networks to connect servers together.
Networks are drawn as graphs, and the edges of the graph represent the links of the communication network.
Nodes (computers or processor-memory nodes) are connected to this graph through network switches.
In the following diagrams, coloured circles represent switches and black squares represent processor-memory nodes.
Network costs include the number of switches, the number of links on a switch, the length of the links, and the width (number of bits) per link.

a) BUS TOPOLOGY
Uses a shared set of wires that allows broadcasting messages to all nodes at the
same time.
Bandwidth of the network = bandwidth of the bus.

Figure 5.7.1 Bus Topology

b) Ring Topology
Messages travel along intermediate nodes until they arrive at the final destination. A ring is capable of many simultaneous transfers.
Bandwidth of the network = bandwidth of each link x number of links.
If a link is as fast as the bus, then a ring is P times faster than the bus in the best case.
At the other extreme is a fully connected network, in which every processor (P) has a bidirectional link to every other processor.
For the fully connected network, the total bandwidth = P x (P-1)/2 times the link bandwidth, and the bisection bandwidth = (P/2)^2 times the link bandwidth. (Bisection bandwidth is calculated by dividing the machine into two halves and summing the bandwidth of the links that cross the cut.)

Figure 5.7.2 Ring Topology
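
The bandwidth formulas above are easy to tabulate (an illustrative C sketch, not from the notes; link bandwidth is normalized to 1):

#include <stdio.h>

/* Illustrative sketch: total and bisection bandwidth (in units of one
 * link's bandwidth) for a ring versus a fully connected network of P nodes. */
int main(void) {
    for (int p = 4; p <= 64; p *= 2) {
        int ring_total = p;                     /* P links in the ring */
        int ring_bisection = 2;                 /* cutting a ring severs 2 links */
        int full_total = p * (p - 1) / 2;       /* P*(P-1)/2 links in total */
        int full_bisection = (p / 2) * (p / 2); /* (P/2)^2 links cross the cut */
        printf("P=%2d  ring: %3d/%d  fully connected: %4d/%4d\n",
               p, ring_total, ring_bisection, full_total, full_bisection);
    }
    return 0;
}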

c) Star topology
In a star topology, all nodes are connected to a central device called a hub using point-to-point connections.

Figure 5.7.3 Star Topology

Instead of placing a processor at every node in a network, a switch can be placed at some of the nodes.
Switches are smaller than processor-memory-switch nodes.
Boolean cube tree network
➢ n = 3 for the cube
➢ So n (here n = 3) links per switch are used in these networks
➢ 2^n nodes are connected (2^3 = 8 here)
➢ One extra link goes to the processor

Figure 5.7.4 Boolean cube tree network

2D Grid or Mesh network
➢ Here n = 2
➢ So n (here n = 2) links per switch are used in these networks
➢ One extra link goes to the processor

Figure 5.7.5 2D Grid or Mesh network


5.7.1 IMPLEMENTING NETWORK TOPOLOGIES

Multistage networks: messages can travel in multiple steps.
Fully connected or crossbar networks: any node can communicate with any other node in one pass through the network.
Omega networks: use less hardware than the crossbar network, but contention may occur between messages.

a) Crossbar Network:
n = number of processors = 8
Number of switches = n^2 = 64
Any node can communicate with any other node in one pass through the network.

Figure 5.7.6 Crossbar Network


b) Omega Network:
Uses less hardware than the crossbar network.
There are 12 switch boxes in the network shown in the figure, and each switch box contains 4 smaller switches.
Number of switches used = 12 x 4 = 48.
This is given by the formula 2n log2(n), where n = number of processors = 8, so the number of switches used = 2 x 8 x log2(8) = 2 x 8 x 3 = 48, as given above.
This network cannot support all combinations of message passing, and contention may occur between messages:
i.e. P0 cannot send messages to P6, because these two are not connected.
If P1 sends a message to P4, then P0 may not send messages to P4 or P5 at the same time.

Figure 5.7.7 Omega Network and Switch Box
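
The hardware trade-off between the crossbar and omega networks can be checked numerically (an illustrative C sketch, not from the notes; compile with -lm):

#include <math.h>
#include <stdio.h>

/* Illustrative sketch: switch counts for a crossbar (n^2) versus an
 * omega network (2n log2 n) as the processor count n grows. */
int main(void) {
    for (int n = 8; n <= 1024; n *= 2) {
        long crossbar = (long)n * n;
        long omega = lround(2.0 * n * log2((double)n));
        printf("n=%4d  crossbar=%8ld  omega=%6ld\n", n, crossbar, omega);
    }
    return 0;
}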


VIDEO LECTURE LINKS

YouTube Channel :
https://www.youtube.com/channel/UC5bAzaok-uwWO7CKaHbTVzA
https://www.youtube.com/AKSHARAM

Topic Wise Videos

UNIT 5

1. Processor Organization https://youtu.be/zjFTP_Y4Dhk


2. Hardware Multithreading https://youtu.be/WvMUvurjocA
3. Multiprocessing https://youtu.be/TQNXMYclw5Y
4. Graphics Processing Unit https://youtu.be/KGhf58XMwK4
5. Network Topologies https://youtu.be/ddRNSjBXvnc
E-BOOK REFERENCE & LECTURE PPT LINKS

E BOOK
Text Book : https://tinyurl.com/CAOTextBook
Reference Book 1: https://tinyurl.com/CAOReferenceBook1
Reference Book 2: https://tinyurl.com/CAOReferenceBook2

Slide Share Link: https://www.slideshare.net/babuece

Topic wise PPT

UNIT 5

1. Advanced Computer Architecture


http://dx.doi.org/10.13140/RG.2.2.10526.15682
ASSIGNMENTS
6.4 ASSIGNMENTS

UNIT 5
ADVANCED COMPUTER ARCHITECTURE
Q.No Questions BT

Level
CO
Level
1 https://tinyurl.com/AssignmentUnit5 CO6 K3
PART A Q & A
6.5 PART A Q & A

UNIT 5 ADVANCED COMPUTER ARCHITECTURE


Q.No QUESTIONS & ANSWERS BT CO
Level Level
1 What is meant by multiprocessor? K1 CO6
Multiprocessor is a computer system with at least two processors, in contrast to a uniprocessor.

2 What is cluster? K1 CO6


Cluster is a set of computers connected over a local area
network that function as a single large multiprocessor.

3 What are the challenges includes in parallel K2 CO6


programming?
Parallel programming challenges include scheduling, partitioning the work into parallel pieces, balancing the load evenly between the workers, time to synchronize, and overhead for communication between the parties.

4 Define strong scaling and weak scaling. [NOV/DEC K1 CO6


2017]
STRONG SCALING : In this method, speed-up is achieved on a multiprocessor without increasing the size of the problem.
"Strong scaling means measuring speed-up while keeping the problem size fixed."

WEAK SCALING : In this method, speed-up is achieved on a multiprocessor while increasing the size of the problem proportionally to the increase in the number of processors.
5 Write two methods used to increase the scale up. K2 CO6
Two methods are used to increase the scale-up:
➢ Strong scaling
➢ Weak scaling

6 What is multithreading? [NOV/DEC 2016] K1 CO6


Switching from one thread (stalled thread) to another
thread is known as multithreading. All the threads
generally share a single address space using a program
counter, and stack and register states.
7 What is meant by hardware multithreading? [NOV/DEC K1 CO6
2019]
Hardware multithreading allows multiple threads to share the
functional units of a single processor in an overlapping fashion
to try to utilize the hardware resources efficiently.

8 What are the approaches involved in hardware K2 CO6


multithreading process?
The main approaches to hardware multithreading are:
➢ Fine-grained multithreading
➢ Coarse-grained multithreading
➢ Simultaneous multithreading

9 Difference between Fine-grained Multithreading and K2 CO6


Coarse-grained multithreading. [NOV/DEC 2017]

Fine-grained Multithreading: A version of hardware multithreading that switches between threads after every instruction; also called interleaved multithreading.

Coarse-grained multithreading: A version of hardware multithreading that switches between threads only after significant events, such as a cache miss; also called blocked multithreading.

10 Distinguish implicit multithreading and explicit K2 CO6


multithreading. [APRIL/MAY 2017]

Implicit Multithreading: the concurrent execution of multiple threads extracted from a single sequential program.

Explicit multithreading: the concurrent execution of instructions from different explicit threads, either by interleaving instructions from different threads on shared pipelines or by parallel execution on parallel pipelines.
11 Give example for each class in Flynn’s classification. K2 CO6
[APRIL/MAY 2018]
CATEGORY | EXAMPLE
SISD | IBM 704, VAX 11/780, CRAY-I
SIMD | ILLIAC-IV, MPP, CM-2, STARAN
MISD | Systolic Array Computers
MIMD | CRAY XMP, IBM 370/168 M

12 Classify shared memory multiprocessor based on the K2 CO6


memory access latency. [NOV/DEC 2018]
Based on memory access latency, shared memory multiprocessors are classified into two types:
➢ Uniform Memory Access (UMA)
➢ Non-Uniform Memory Access (NUMA)

13 Compare UMA and NUMA processors. [APRIL/MAY K2 CO6


2015]
Uniform Memory Access | Non-Uniform Memory Access
Programming challenges are easy. | Programming challenges are hard.
UMA machines can scale to small sizes. | NUMA machines can scale to larger sizes.
It has higher latency. | It has lower latency to nearby memory.

14 What is vector lane? K1 CO6


A vector lane is one or more vector functional units and a portion of the vector register file. Inspired by lanes on highways that increase traffic speed, multiple lanes execute vector operations simultaneously.
15 List the network topologies in parallel processor. K1 CO6
➢ Bus Topology
➢ Ring Topology
➢ Star Topology
➢ Boolean cube tree network
➢ 2D Grid or Mesh network

16 What is GPU? K1 CO6


➢ GPU (Graphics Processing Unit) is a processor specially
designed for handling graphics rendering tasks.
➢ GPU is used to accelerate the processing in video editing,
video game rendering, 3D modelling (AUTO CAD), AI
based tasks etc.

17 Define uniform memory access (UMA). K1 CO6


Uniform Memory Access is a multiprocessor in which the latency to any word in main memory is the same, no matter which processor requests the access.

18 What is non uniform memory access (NUMA)? K1 CO6


Non Uniform Memory Access is a type of single address space
multiprocessor in which some memory accesses are much
faster than others depending on which processor asks for
which word.

19 What is meant by thread? K1 CO6


Thread is a lightweight process which includes the program
counter, the register state and stack. It shares a single address
space.

20 Write Flynn’s classification for parallel hardware. K1 CO6


Flynn’s classification divides parallel hardware into four groups
based on the number of instruction streams and the number of
data streams.
➢ Single Instruction stream and Single Data stream(SISD)
➢ Single Instruction stream Multiple Data stream (SIMD)
➢ Multiple Instruction stream and Single Data stream(MISD)
➢ Multiple Instruction stream and Multiple data stream(MIMD)
PART B Qs
6.6 PART B Qs

UNIT 5 ADVANCED COMPUTER ARCHITECTURE


Q. QUESTIONS & ANSWERS BT CO
No Level Level
1 What are the challenges in parallel processing? [APRIL/MAY K2 CO6
2018]
(OR)
Discuss the challenges in parallel processing in enhancing
computer architecture. [NOV/DEC 2018]
(OR)
Discuss the challenges in parallel processing with necessary
examples. [APRIL/MAY 2017]

2 Explain Flynn’s classification with neat diagrams. [APRIL/MAY K2 CO6


2019] [NOV/DEC 2017] [MAY/JUNE 2016]
(OR)
Define the classes in Flynn’s Taxonomy of computer
architectures. Give one example for each class. [NOV/DEC
2018]
(OR)
Explain Flynn’s classification of parallel processing with
necessary diagrams. [APRIL/MAY 2017] [NOV/DEC 2016 &
2015]

3 Compare and contrast fine-grained multi-threading, coarse- K2 CO6


grained multi-threading and simultaneous multi-threading.
[APRIL/MAY 2018] [APRIL/MAY 2015]

4 Explain hardware multithreading with neat diagrams. K2 CO6


[APRIL/MAY 2019] [NOV/DEC 2015]
(OR)
Explain any three types of hardware multithreading.
[NOV/DEC 2018]
(OR)
Explain the three principal approaches to multithreading with
necessary diagrams. [APRIL/MAY 2017]

5 Write short notes on: K2 CO6


(i) Hardware Multithreading
(ii) Multicore Processors. [MAY/JUNE 2016]
6 (i) List the characteristics of Graphics Processing Units. K2 CO6
(ii) Differentiate in-order execution and out - of order
execution. [NOV/DEC 2019]
7 Explain in detail, the shared memory multiprocessor, with a neat K2 CO6
diagram. [NOV/DEC 2019]
(OR)
Classify shared memory multiprocessor based on the memory
access latency. [APRIL/MAY 2018]
(OR)
Discuss the steps involved in the address translation of virtual
memory with necessary block diagram. [NOV/DEC 2016]
8 Compare UMA and NUMA multiprocessors. [NOV/DEC 2019] K2 CO6

9 Apply your knowledge of graphics processing units and explain how K2 CO6
they help the computer to improve processor performance.
10 Illustrate the following in detail K2 CO6
i). Clusters
ii). Warehouse scale computers
11 Discuss the multiprocessor network topologies in detail. K2 CO6
SUPPORTIVE ONLINE
CERTIFICATION COURSES
6.7 SUPPORTIVE ONLINE CERTIFICATION COURSES

UNIT 5

UDEMY COURSE

Advanced Computer Architecture & Organization: HD Course

https://www.udemy.com/course/advance-computer-architecture-and-organization/

Course Overview

This course provides a comprehensive overview of Computer Architecture and Organization from a practical perspective. It includes video and text explanations covering everything from Computer Architecture and Computer Organization, and consists of different sections, each covering a specific module related to computer architecture.


REAL-TIME
APPLICATIONS
6.8 REAL-TIME APPLICATIONS

UNIT 5
Design of a 4 bit Processor

Project Description: https://tinyurl.com/CAOMiniProject


CONTENTS BEYOND THE
SYLLABUS
6.9 CONTENTS BEYOND THE SYLLABUS

CONTENTS BEYOND THE SYLLABUS


UNIT 5

Q.No | Topic | CO Level | BT Level
1 | Hardware Support for Exposing Parallelism | CO6 | K2
2 | Heterogeneous Multi-core Processors & IBM Cell Processor | CO6 | K3
ASSESSMENT SCHEDULE
7. ASSESSMENT SCHEDULE

Assessment Proposed Date Actual Date

Unit 1 Assignment
Assessment
Unit Test 1

Unit 2 Assignment
Assessment
Internal Assessment 1

Retest for IA 1

Unit 3 Assignment
Assessment
Unit Test 2

Unit 4 Assignment
Assessment
Internal Assessment 2

Retest for IA 2

Unit 5 Assignment
Assessment
Revision Test 1

Revision Test 2

Model Exam

Remodel Exam

University Exam
TEXT BOOKS &
REFERENCE BOOKS
8. TEXT BOOKS & REFERENCE BOOKS

TEXT BOOKS:
1. David A. Patterson and John L. Hennessy, "Computer Organization and Design", Fifth edition, Morgan Kauffman / Elsevier, 2014. (UNIT I-V)
2. Miles J. Murdocca and Vincent P. Heuring, "Computer Architecture and Organization: An Integrated Approach", Second edition, Wiley India Pvt Ltd, 2015. (UNIT IV, V)
REFERENCES
1. V. Carl Hamacher, Zvonko G. Vranesic and Safwat G. Zaky, "Computer Organization", Fifth edition, McGraw-Hill Education India Pvt Ltd, 2014.
2. William Stallings, "Computer Organization and Architecture", Seventh Edition, Pearson Education, 2006.
3. Govindarajalu, "Computer Architecture and Organization, Design Principles and Applications", Second edition, McGraw-Hill Education India Pvt Ltd, 2014.

E BOOK
Text Book : https://tinyurl.com/CAOTextBook
Reference Book 1: https://tinyurl.com/CAOReferenceBook1
Reference Book 2: https://tinyurl.com/CAOReferenceBook2
MINI PROJECT
9. MINI PROJECT

1. Design of a 4 bit Processor

Project Description: https://tinyurl.com/CAOMiniProject

2. Program for Integer Multiplication and Division


Develop a program in C, C++, Java or Python for an application which does the following:
Accepts inputs from the user: a, b, the operation to be done, and the algorithm to be used; a and b are integers (in decimal format), and the operation is multiplication or division.
Displays a and b in binary format.
Displays the contents of the registers at each step:
Multiplicand, multiplier and product for multiplication;
Dividend, divisor, quotient and remainder for division.
3. Program for Floating point Multiplication and Division
Same as above, with a and b as floating point numbers; it should be possible for the user to give inputs as floating point numbers in normal form (e.g. 123.678) or in scientific notation (e.g. 123.567 * 10^5 or 456.93 * 10^-3).

4. Demonstration system for Data path and Control path :


Develop a visual demonstration system which accepts as input an assembly-level instruction with operands, and displays the steps in the execution of the instruction as a step-by-step display of the data path and control path.
Hint : Use Microsoft PowerPoint's animation features for the step-by-step display. Use the hyperlink features to get the instruction, identify its type (R, I, J types), and branch to the respective slides.

5. Database of MIPS instructions


Create a database of all MIPS instructions in a file, with details such as opcode, operands, cycle times etc.

6. Program Execution Demonstration


Create a C, C++, Java or Python program to demonstrate the execution of an instruction set for an input of assembly code of 5 or 6 instructions.
Use the above file of the instruction set.
List all the steps done by the processor in each cycle while executing the input program.
Demonstrate the program with various input programs showing parallelism, hazards, avoiding hazards etc.
Thank you

