Ec8552-Cao Unit 5
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
COMPUTER
ARCHITECTURE AND
ORGANIZATION
Department: Electronics and Communication Engineering
Batch/Year: 2019-23 / III
Created by:
Ms.P. Latha, Assoc.Prof/ECE,RMKEC
Mr. R.Babuji, Asst.Prof/ECE,RMKEC
Date:13.10.2021
Table of Contents

S.No  Contents
1     Course Objectives
2     Pre Requisites
3     Syllabus
4     Course Outcomes
      Hardware Multithreading
      Clusters
6.4   Assignments
7     Assessment Schedule
L T P C
3 0 0 3
UNIT I COMPUTER ORGANIZATION & INSTRUCTIONS 9
Basics of a computer system: Evolution, Ideas, Technology, Performance, Power
wall, Uniprocessors to Multiprocessors. Addressing and addressing modes.
Instructions: Operations and Operands, Representing instructions, Logical
operations, control operations.
UNIT II ARITHMETIC 9
Fixed point Addition, Subtraction, Multiplication and Division. Floating Point
arithmetic, High performance arithmetic, Subword parallelism
Course Outcomes | Description | Level in Bloom's Taxonomy
C303.1 | Describe the basic structure and operation of a digital computer. | K2
C303.2 | Experiment with fixed-point and floating-point arithmetic operations. | K3
C303.3 | Study the design of the data path unit and control unit for a processor. | K2
C303.4 | Discuss pipelined control units and various types of hazards in instructions. | K2
C303.5 | Describe the concept of various memories and interfacing. | K2
C303.6 | Summarize the latest advancements in computer architecture. | K2
5. CO – PO/PSO MAPPING

(PO attainment levels: PO1-K3, PO2-K3/K5/K6, PO3-K4, PO4-K4, PO5-K5, PO6-A3, PO7-A2, PO8-A3, PO9-A3, PO10-A3, PO11-A3, PO12-A2; PSO levels: PSO1-K5, PSO2-K5, PSO3-K3)

Course Outcomes | Level of CO | PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 | PSO1 PSO2 PSO3
C303.1 | K2 | 2 1 1 1 - - - - - 2 - - | - - 2
C303.2 | K3 | 3 3 3 2 1 - - - - 2 - - | - - 3
C303.3 | K2 | 2 1 1 1 - - - - - 2 - - | - - 2
C303.4 | K2 | 2 1 1 1 - - - - - 2 - - | - - 2
C303.5 | K2 | 2 1 1 1 - - - - - 2 - - | - - 2
C303.6 | K2 | 2 1 1 1 - - - - - 2 - - | - - 2
C303   |    | 3 3 3 2 1 - - - - 2 - - | - - 3
UNIT 5
ADVANCED COMPUTER
ARCHITECTURE
6.1 LECTURE PLAN

S.No | Topic | No. of Periods | Pertaining CO | Taxonomy Level | Delivery Mode
1 | Parallel processing architectures and challenges | 2 | CO6 | K2 | PPT through Online
2 | Hardware multithreading | 1 | CO6 | K2 | PPT through Online
3 | Multicore and shared memory multiprocessors | 1 | CO6 | K2 | PPT through Online
4 | Content Beyond Syllabus: Memory Management Techniques | 1 | CO6 | K2 | PPT through Online
5 | Introduction to Graphics Processing Units | 1 | CO6 | K2 | PPT through Online
6 | Clusters and Warehouse scale computers | 1 | CO6 | K2 | PPT through Online
7 | Introduction to Multiprocessor network topologies | 1 | CO6 | K2 | PPT through Online
8 | Content Beyond Syllabus: Memory Management Techniques | 1 | CO6 | K2 | PPT through Online

(Proposed Date, Actual Date and Reason for Deviation columns are left blank in the plan.)
UNIT 5
ADVANCED COMPUTER ARCHITECTURE
Parallel processing architectures and challenges, Hardware multithreading, Multicore
and shared memory multiprocessors, Introduction to Graphics Processing Units,
Clusters and Warehouse scale computers - Introduction to Multiprocessor network
topologies.
a) Scheduling
Scheduling is the method by which threads, processes or data flows are given access to system resources (e.g., processor time, communication bandwidth). Scheduling is done to load-balance and share system resources effectively, or to achieve a target quality of service.
Scheduling arises in several areas; among them, process scheduling is the most important, because parallel processing requires processes to be scheduled correctly. Process scheduling can be done in the following ways:
➢ Long term scheduling
➢ Medium term scheduling
➢ Short term scheduling
➢ Dispatcher
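The four mechanisms above operate at different time scales; the short-term scheduler and the dispatcher are the ones that matter most for parallel workloads. A minimal Python sketch of round-robin short-term scheduling (illustrative only; the function name and queue representation are our own, not from the syllabus material):

```python
from collections import deque

def round_robin(bursts, quantum):
    """Simulate round-robin short-term scheduling.

    bursts: dict mapping process id -> remaining CPU time.
    quantum: length of one time slice.
    Returns the order in which processes finish.
    """
    ready = deque(bursts.items())          # the ready queue
    finished = []
    while ready:
        pid, remaining = ready.popleft()   # dispatcher picks the next process
        if remaining <= quantum:
            finished.append(pid)           # completes within this slice
        else:
            ready.append((pid, remaining - quantum))  # preempted, re-queued
    return finished

# Three processes with burst times 4, 1 and 3 units, quantum = 2:
print(round_robin({"P1": 4, "P2": 1, "P3": 3}, 2))  # ['P2', 'P1', 'P3']
```

The shortest job (P2) finishes first because the longer jobs are preempted and sent to the back of the ready queue after each quantum.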
d) Time to synchronize
Synchronization is the most important challenge in parallel processing. Because all the processors carry an equal work load, each must complete its task within a specific time period.
A parallel processing program must allow time for the synchronization process, because if any process does not complete its task within the specific time period, we cannot obtain parallel processing.
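A common way to enforce such a synchronization point is a barrier: no thread proceeds past it until all threads have arrived. A small Python sketch using the standard threading module (illustrative; the squaring "work" is a placeholder):

```python
import threading

N = 4
results = [0] * N
barrier = threading.Barrier(N)   # releases only when all N threads arrive

def worker(i):
    results[i] = i * i   # this thread's share of the work
    barrier.wait()       # wait until every thread has finished its share
    # past this point, every results[j] is guaranteed to be written

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [0, 1, 4, 9]
```

If one worker is slow, all the others wait at `barrier.wait()` — exactly the cost described above: the slowest processor determines when the parallel phase ends.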
e) Communication between the parties
Inter-processor communication virtually always implies overhead.
Knowing which tasks must communicate with each other is critical during the
design stage of a parallel code.
PROBLEMS:
Example 1:
To achieve a speed-up of 90 times faster with 100 processors, what percentage of the original computation can be sequential?
Answer:
Amdahl's Law says

Execution time after improvement =
(Execution time affected by improvement / Amount of improvement) + Execution time unaffected

This formula is usually rewritten assuming that the execution time before is 1 for some unit of time, and the execution time affected by improvement is considered the fraction of the original execution time:

Speed-up = 1 / ((1 - Fraction time affected) + Fraction time affected / Amount of improvement)

Substituting 90 for speed-up and 100 for amount of improvement into the formula above:

90 = 1 / ((1 - Fraction time affected) + Fraction time affected / 100)

Then simplifying the formula and solving for fraction time affected:

90 × (1 - 0.99 × Fraction time affected) = 1
90 - (90 × 0.99 × Fraction time affected) = 1
90 - 1 = 90 × 0.99 × Fraction time affected
Fraction time affected = 89/89.1 = 0.999
Thus, to achieve a speed-up of 90 from 100 processors, the sequential
percentage can only be 0.1%.
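The calculation above can be checked numerically. A short Python sketch (the function names are ours, not from the textbook):

```python
def amdahl_speedup(fraction_affected, improvement):
    """Speed-up from Amdahl's Law, taking execution time before = 1."""
    return 1.0 / ((1.0 - fraction_affected) + fraction_affected / improvement)

def required_fraction(speedup, improvement):
    """Solve Amdahl's Law for the fraction of time that must be improvable."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / improvement)

f = required_fraction(90, 100)
print(f)                       # ~0.9989, i.e. only ~0.1% may stay sequential
print(amdahl_speedup(f, 100))  # ~90
```

`required_fraction` is just the closed-form solution of 90 = 1/((1 - F) + F/100) for F, matching the 89/89.1 value derived above.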
Example 2:
Speed-up Challenge: Bigger Problem
Suppose you want to perform two sums: one is a sum of 10 scalar variables, and
one is a matrix sum of a pair of two-dimensional arrays, with dimensions 10 by 10.
For now let’s assume only the matrix sum is parallelizable; we’ll see soon how to
parallelize scalar sums. What speed-up do you get with 10 versus 40 processors?
Next, calculate the speed-ups assuming the matrices grow to 20 by 20.
Answer:
If we assume performance is a function of the time for an addition, t, then there are 10 additions that do not benefit from parallel processors and 100 additions that do. If the time for a single processor is 110t, the execution time for 10 processors is

Execution time after improvement = 100t/10 + 10t = 20t

so the speed-up with 10 processors is 110t/20t = 5.5. The execution time for 40 processors is

Execution time after improvement = 100t/40 + 10t = 12.5t

So the speed-up with 40 processors is 110t/12.5t = 8.8.
Thus, for this problem size, we get about 55% of the potential speed-up with 10 processors, but only 22% with 40.
When the matrices grow to 20 by 20, the sequential program now takes 10t + 400t = 410t. The execution time for 10 processors is

Execution time after improvement = 400t/10 + 10t = 50t

so the speed-up with 10 processors is 410t/50t = 8.2. The execution time for 40 processors is 400t/40 + 10t = 20t, so the speed-up with 40 processors is 410t/20t = 20.5. Hence, the larger problem obtains about 82% of the potential speed-up with 10 processors and 51% with 40.
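These speed-ups can be reproduced with a few lines of Python (a sketch; `speedup` is our own helper name):

```python
def speedup(serial_ops, parallel_ops, processors):
    """Speed-up when only the parallelizable additions are spread over
    the processors.  Each addition takes time t, which cancels in the ratio."""
    time_before = serial_ops + parallel_ops             # single processor
    time_after = serial_ops + parallel_ops / processors # parallel machine
    return time_before / time_after

# 10 x 10 matrices: 10 serial additions + 100 parallelizable additions
print(speedup(10, 100, 10))   # 5.5
print(speedup(10, 100, 40))   # 8.8
# 20 x 20 matrices: 10 serial additions + 400 parallelizable additions
print(speedup(10, 400, 10))   # 8.2
print(speedup(10, 400, 40))   # 20.5
```

Growing the problem raises the parallelizable share of the work, which is why the larger matrices get much closer to the ideal speed-up.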
[Figure: SIMD architecture — the control unit broadcasts a single instruction stream (IS) to processing elements PE1 … PEn; each PE executes on its own data stream (DS1 … DSn) through its ALU and main memory module (MM1 … MMn). PE = Processing Element, MM = Main Memory.]
Vector Lane:
Figure 5.2.6 uses four lanes. Each vector lane has more than one functional unit.
Three functional units are provided: an FP add unit, an FP multiply unit, and a load/store unit.
The elements of a single vector are interleaved across the four lanes, and each lane uses a portion of the vector register.
The vector storage is divided across the four lanes, and each lane holds every fourth element of each vector register.
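The interleaving rule is simply "element i goes to lane i mod 4". A short Python sketch of how a 16-element vector register is split across the four lanes (illustrative only):

```python
LANES = 4
vector = list(range(16))   # a 16-element vector register

# Lane k holds every fourth element, starting at element k.
lanes = [vector[k::LANES] for k in range(LANES)]

print(lanes[0])  # [0, 4, 8, 12]
print(lanes[1])  # [1, 5, 9, 13]
```

Because the lanes operate in lock-step, this layout lets all four functional-unit pipelines work on consecutive elements of the same vector simultaneously.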
Process:
A process includes one or more threads and their address space. Switching from one process to another invokes the operating system (OS).
The main difference between thread switching and process switching: multithreading uses a single address space and does not invoke the OS, whereas a process switch moves between threads in different address spaces and requires the help of the operating system to do the switching. So you can say that a thread is a lightweight process, smaller than a full process.
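The shared address space is easy to see in code: threads of one process update the same objects directly, with no OS-mediated copying. A small Python sketch (illustrative only):

```python
import threading

shared = {"count": 0}        # lives in the single address space of the process
lock = threading.Lock()

def bump():
    for _ in range(1000):
        with lock:           # synchronize access to the shared data
            shared["count"] += 1

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["count"])  # 4000 -- all four threads wrote to the same memory
```

With separate processes, `shared` would be copied per process and the updates would not be visible across them without explicit OS-level inter-process communication.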
Types of multithreading:
In the case of SUMA, all processors are identical. Processors may have local cache memories and I/O devices. Physical memory is uniformly shared by all processors, with equal access time to all words.
In the case of AUMA, one master processor executes the operating system, and the other processors may be dedicated to special tasks such as graphics rendering, performing mathematical functions, etc.
CPU | GPU
CPU has general-purpose instructions and is more suitable for serial processing. | GPU has its own special instructions to handle graphics and is more suitable for parallel processing.
CPUs have just a few cores with cache and can handle only a few threads at a time. | GPUs have hundreds of cores and can handle thousands of threads simultaneously.
CPU has a large main memory that is oriented toward low latency. | GPU has a separate large main memory that is oriented toward bandwidth rather than latency and provides high throughput.
CPU is mainly designed for instruction-level parallelism. | GPU is designed for data-level parallelism.
5.5.1 GPU ARCHITECTURE - NVIDIA
NVIDIA is an American company. It developed the 'Compute Unified Device Architecture' (CUDA) for its graphics processing units; the architecture's commercial name is Fermi.
GeForce is a brand of graphics processing units designed by NVIDIA.
A GPU contains a collection of multithreaded SIMD processors and hence is a MIMD processor.
The number of SIMD processors is 7, 11, 14 or 15 in the Fermi architecture.
A CUDA program launches kernels that execute in parallel.
The GPU executes a kernel on a grid; a grid has many thread blocks, and each thread block consists of many threads that execute in parallel (Figure 5.5.2).
Each thread within a thread block is called a machine object and has a thread ID, program instructions, a program counter, registers, per-thread private memory, inputs and outputs.
The machine object is created, managed, scheduled and executed by the GPU.
Figure 5.5.1 shows a simplified block diagram of a multithreaded SIMD processor.
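The grid/block/thread hierarchy boils down to index arithmetic: a CUDA kernel identifies its element as blockIdx.x * blockDim.x + threadIdx.x. A Python sketch of that flattening (illustrative pseudo-model, not actual CUDA code):

```python
def global_thread_ids(grid_dim, block_dim):
    """Enumerate the global IDs of all threads in a 1-D grid of 1-D blocks."""
    ids = []
    for block_idx in range(grid_dim):         # thread blocks in the grid
        for thread_idx in range(block_dim):   # threads within one block
            ids.append(block_idx * block_dim + thread_idx)
    return ids

# A grid of 3 blocks, 4 threads each, covers 12 distinct element indices:
print(global_thread_ids(3, 4))  # ids 0 through 11
```

On the real GPU these loops do not exist — every (block, thread) pair runs concurrently on the SIMD processors — but each thread computes the same global ID to pick its element of the data.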
5.6.1 CLUSTERS
Clusters are collections of desktop computers or servers connected by local area networks
to act as a single larger computer. Each node runs its own operating system, and nodes
communicate using a networking protocol.
Since a cluster consists of independent computers connected through a local area
network, it is much easier to replace a computer without bringing down the system in a
cluster than in a shared memory multiprocessor.
It is also easy for clusters to scale down gracefully when a server fails,
thereby improving dependability.
Since the cluster software is a layer that runs on top of the local operating
systems running on each computer, it is much easier to disconnect and
replace a broken computer.
Lower cost, higher availability and rapid, incremental expandability make clusters attractive to Internet service providers, despite their poorer communication performance when compared to large-scale shared memory multiprocessors.
a) BUS TOPOLOGY
Uses a shared set of wires that allows broadcasting messages to all nodes at the
same time.
Bandwidth of the network = bandwidth of the bus.
b) Ring Topology
Messages will have to travel along the intermediate nodes until they arrive at the
final destination. A ring is capable of many simultaneous transfers
Bandwidth of the network = bandwidth of each link × number of links
If a link is as fast as the bus, then a ring is P times faster than the bus in the best case.
At the other extreme from a ring is a fully connected network, in which every processor (P) has a bidirectional link to every other processor. For a fully connected network:
Total bandwidth = P × (P - 1)/2
Bisection bandwidth = (P/2)²
Bisection bandwidth is calculated by dividing the machine into two halves and counting the bandwidth of the links that cross the dividing line.
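The two formulas can be checked directly in Python (function names are ours):

```python
def total_links(p):
    """Fully connected network: each of p processors links to the other
    p - 1, and each bidirectional link is shared by two processors."""
    return p * (p - 1) // 2

def bisection_bandwidth(p):
    """Links crossing a cut of the machine into two equal halves:
    each of the p/2 processors on one side links to all p/2 on the other."""
    return (p // 2) ** 2

print(total_links(8))          # 28 links in total
print(bisection_bandwidth(8))  # 16 links cross the bisection
```

For 8 processors, 28 of the links stay within or across the halves in total, and 16 of them cross the cut — far more than the 2 crossing links of a ring.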
c) Star topology
In Star topology all nodes are connected to a central device called hub using a
point-to-point connection.
a) Crossbar Network:
n = number of processors = 8
Number of switches = n² = 64
Any node can communicate with any other node in one pass through the network.
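The quadratic growth of the crossbar is what motivates multistage networks such as the omega network; a commonly cited count (Patterson and Hennessy) is 2n log2 n switches instead of n². A quick Python comparison (illustrative):

```python
import math

def crossbar_switches(n):
    """An n x n crossbar places a switch at every row/column crossing."""
    return n * n

def omega_switches(n):
    """Omega network: n/2 switch boxes per stage, log2(n) stages,
    4 small switches per box -> 2 * n * log2(n) switches in total."""
    return 2 * n * int(math.log2(n))

print(crossbar_switches(8))  # 64
print(omega_switches(8))     # 48 -- cheaper, but some transfers can block
```

The saving grows with n; the trade-off is that, unlike a crossbar, an omega network cannot support every combination of simultaneous transfers.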
b) Omega Network
An omega network is built from stages of 2×2 switch boxes.
YouTube Channel :
https://www.youtube.com/channel/UC5bAzaok-uwWO7CKaHbTVzA
https://www.youtube.com/AKSHARAM
UNIT 5
E BOOK
Text Book : https://tinyurl.com/CAOTextBook
Reference Book 1: https://tinyurl.com/CAOReferenceBook1
Reference Book 2: https://tinyurl.com/CAOReferenceBook2
UNIT 5
ADVANCED COMPUTER ARCHITECTURE
Q.No | Questions | CO Level | BT Level
1 | https://tinyurl.com/AssignmentUnit5 | CO6 | K3
6.5 PART A Q & A
Fine-grained Multithreading vs. Coarse-grained Multithreading
UNIT 5
UDEMY COURSE
https://www.udemy.com/course/advance-computer-architecture-and-organization/
Course Overview
This course covers Computer Architecture and Organization from a practical perspective, and includes video and text lectures. The course consists of different sections; each section covers a specific topic.
UNIT 5
Design of a 4 bit Processor
Q.No | Topic | CO Level | BT Level
1. | Hardware Support for exposing parallelism | CO6 | K2
2. | Heterogeneous Multi-core processors & IBM Cell processor | CO6 | K3
7. ASSESSMENT SCHEDULE
Unit 1 Assignment
Assessment
Unit Test 1
Unit 2 Assignment
Assessment
Internal Assessment 1
Retest for IA 1
Unit 3 Assignment
Assessment
Unit Test 2
Unit 4 Assignment
Assessment
Internal Assessment 2
Retest for IA 2
Unit 5 Assignment
Assessment
Revision Test 1
Revision Test 2
Model Exam
Remodel Exam
University Exam
8. TEXT BOOKS & REFERENCE BOOKS
TEXT BOOKS:
1. David A. Patterson and John L. Hennessy, "Computer Organization and Design", Fifth edition, Morgan Kaufmann / Elsevier, 2014. (UNIT I-V)
2. Miles J. Murdocca and Vincent P. Heuring, "Computer Architecture and Organization: An Integrated Approach", Second edition, Wiley India Pvt Ltd, 2015. (UNIT IV, V)
REFERENCES
1. V. Carl Hamacher, Zvonko G. Vranesic and Safwat G. Zaky, "Computer Organization", Fifth edition, McGraw-Hill Education India Pvt Ltd, 2014.
2. William Stallings, "Computer Organization and Architecture", Seventh edition, Pearson Education, 2006.
3. Govindarajalu, "Computer Architecture and Organization: Design Principles and Applications", Second edition, McGraw-Hill Education India Pvt Ltd, 2014.
9. MINI PROJECT