Unit 1 Principles of Computer Design 5
Introduction
Copy from page-12, BSIT-301, PTU
Software
Software, or a program, enables a computer to perform specific tasks, as
opposed to the physical components of the system (hardware). This
includes application software such as a word processor, which enables a
user to perform a task; system software such as an operating system,
which enables other software to run properly by interfacing with
hardware and with other software; and custom software made to user
specifications.
Types of Software
Practical computer systems divide software into three major classes:
system software, programming software and application software, although
the distinction is arbitrary, and often blurred.
• System software helps run the computer hardware and computer system. It
includes operating systems, device drivers, diagnostic tools, servers, windowing
systems, utilities and more. The purpose of systems software is to insulate the
applications programmer as much as possible from the details of the particular
computer complex being used, especially memory and other hardware features,
and such accessory devices as communications, printers, readers, displays,
keyboards, etc.
• Programming software usually provides tools to assist a programmer in writing
computer programs and software using different programming languages in a
more convenient way. The tools include text editors, compilers, interpreters,
linkers, debuggers, and so on. An Integrated Development Environment (IDE)
merges those tools into a single software bundle, and a programmer may not need
to type multiple commands for compiling, interpreting, debugging, and tracing,
because the IDE usually has an advanced graphical user interface, or GUI.
• Application software allows end users to accomplish one or more specific (non-
computer related) tasks. Typical applications include industrial automation,
business software, educational software, medical software, databases, and
computer games. Businesses are probably the biggest users of application
software, but almost every field of human activity now uses some form of
application software. It is used to automate all sorts of functions.
Operation
Computer software has to be "loaded" into the computer's storage (such
as a hard drive, memory, or RAM). Once the software is loaded, the
computer is able to execute the software. Computers operate by executing
the computer program. This involves passing instructions from the
application software, through the system software, to the hardware which
ultimately receives the instruction as machine code. Each instruction
causes the computer to carry out an operation -- moving data, carrying
out a computation, or altering the control flow of instructions.
Data movement is typically from one place in memory to another.
Sometimes it involves moving data between memory and registers which
enable high-speed data access in the CPU. Moving data, especially large
amounts of it, can be costly. So, this is sometimes avoided by using
"pointers" to data instead. Computations include simple operations such
as incrementing the value of a variable data element. More complex
computations may involve many operations and data elements together.
Instructions may be performed sequentially, conditionally, or
iteratively. Sequential instructions are those operations that are
performed one after another. Conditional instructions are performed such
that different sets of instructions execute depending on the value(s) of
some data. In some languages this is known as an "if" statement.
Iterative instructions are performed repetitively and may depend on some
data value. This is sometimes called a "loop." Often, one instruction
may "call" another set of instructions that are defined in some other
program or module. When more than one computer processor is used,
instructions may be executed simultaneously.
A simple example of the way software operates is what happens when a
user selects an entry such as "Copy" from a menu. In this case, a
conditional instruction is executed to copy text from data in a
'document' area residing in memory, perhaps to an intermediate storage
area known as a 'clipboard' data area. If a different menu entry such as
"Paste" is chosen, the software may execute the instructions to copy the
text from the clipboard data area to a specific location in the same or
another document in memory.
Depending on the application, even the example above could become
complicated. The field of software engineering endeavors to manage the
complexity of how software operates. This is especially true for
software that operates in the context of a large or powerful computer
system.
Currently, almost the only limitations on the use of computer software
in applications is the ingenuity of the designer/programmer.
Consequently, large areas of activities (such as playing grand master
level chess) formerly assumed to be incapable of software simulation are
now routinely programmed. The only area that has so far proved
reasonably secure from software simulation is the realm of human art—
especially, pleasing music and literature.
Kinds of software by operation: computer program as executable, source
code or script, configuration.
Hardware
Computer hardware is the physical part of a computer, including the
digital circuitry, as distinguished from the computer software that
executes within the hardware. The hardware of a computer is infrequently
changed, in comparison with software and data, which are "soft" in the
sense that they are readily created, modified or erased on the computer.
Firmware is a special type of software that rarely, if ever, needs to be
changed and so is stored on hardware devices such as read-only memory
(ROM) where it is not readily changed (and is therefore "firm" rather
than just "soft").
Typical Motherboard found in a computer
• Motherboard or system board with slots for expansion cards and holding parts
o Central processing unit (CPU)
Computer fan - used to cool down the CPU
o Random Access Memory (RAM) - for program execution and short term
data storage, so the computer does not have to take the time to access the
hard drive to find the file(s) it requires. More RAM will normally
contribute to a faster PC. RAM is almost always removable as it sits in
slots in the motherboard, attached with small clips. The RAM slots are
normally located next to the CPU socket.
o Basic Input-Output System (BIOS) or Extensible Firmware Interface (EFI)
in some newer computers
o Buses
• Power supply - a case that holds a transformer, voltage control, and (usually) a
cooling fan
• Storage controllers of IDE, SATA, SCSI or other type, that control hard disk,
floppy disk, CD-ROM and other drives; the controllers sit directly on the
motherboard (on-board) or on expansion cards
• Video display controller that produces the output for the computer display. This
will either be built into the motherboard or attached in its own separate slot (PCI,
PCI-E or AGP), requiring a Graphics Card.
• Computer bus controllers (parallel, serial, USB, FireWire) to connect the
computer to external peripheral devices such as printers or scanners
• Some type of a removable media writer:
o CD - the most common type of removable media, cheap but fragile.
CD-ROM Drive
CD Writer
o DVD
DVD-ROM Drive
DVD Writer
DVD-RAM Drive
o Floppy disk
o Zip drive
o USB flash drive, also known as a pen drive
o Tape drive - mainly for backup and long-term storage
• Internal storage - keeps data inside the computer for later use.
o Hard disk - for medium-term storage of data.
o Disk array controller
• Sound card - translates signals from the system board into analog voltage levels,
and has terminals to plug in speakers.
• Networking - to connect the computer to the Internet and/or other computers
o Modem - for dial-up connections
o Network card - for DSL/Cable internet, and/or connecting to other
computers.
• Other peripherals
• Input devices
o Text input devices
Keyboard
o Pointing devices
Mouse
Trackball
o Gaming devices
Joystick
Game pad
Game controller
o Image, Video input devices
Image scanner
Webcam
o Audio input devices
Microphone
• Output devices
o Image, Video output devices
Printer: Peripheral device that produces a hard copy. (Inkjet, Laser)
Monitor: Device that takes signals and displays them. (CRT,
LCD)
o Audio output devices
Speakers: A device that converts analog audio signals into the
equivalent air vibrations in order to make audible sound.
Headset: A device similar in functionality to that of a regular
telephone handset but is worn on the head to keep the hands free.
Student Activity
Software-Hardware Interaction layers in Computer
Architecture
In computer engineering, computer architecture is the conceptual design
and fundamental operational structure of a computer system. It is a
blueprint and functional description of requirements (especially speeds
and interconnections) and design implementations for the various parts
of a computer — focusing largely on the way by which the central
processing unit (CPU) performs internally and accesses addresses in
memory.
Fig : A typical vision of a computer architecture as a series of
abstraction layers: hardware, firmware, assembler, kernel, operating
system and applications
Abstraction Layer
Firmware
In computing, firmware is software that is embedded in a hardware
device. It is often provided on flash ROMs or as a binary image file
that can be uploaded onto existing hardware by a user.
Assembler
An assembly language program is translated into the target computer's
machine code by a utility program called an assembler. Typically a
modern assembler creates object code by translating assembly
instruction mnemonics into opcodes, and by resolving symbolic names for
memory locations and other entities. The use of symbolic references is a
key feature of assemblers, saving tedious calculations and manual
address updates after program modifications.
Kernel
In computing, the kernel is the central component of most computer
operating systems (OSs). Its responsibilities include managing the
system's resources and the communication between hardware and software
components. As a basic component of an operating system, a kernel
provides the lowest-level abstraction layer for the resources
(especially memory, processor and I/O devices) that applications must
control to perform their function. It typically makes these facilities
available to application processes through inter-process communication
mechanisms and system calls.
Kernel designs range between two extremes: a monolithic kernel, which
favours runtime performance, and a microkernel, which favours
modularity of the code base. A range of possibilities exists between
these two extremes.
Operating System
An operating system (OS) is a computer program that manages the
hardware and software resources of a computer. At the foundation of all
system software, an operating system performs basic tasks such as
controlling and allocating memory, prioritizing system requests,
controlling input and output devices, facilitating networking, and
managing files. It also may provide a graphical user interface for
higher level functions. It forms a platform for other software.
Application Software
Application software is a subclass of computer software that employs
the capabilities of a computer directly to a task that the user wishes
to perform. This should be contrasted with system software which is
involved in integrating a computer's various capabilities, but typically
does not directly apply them in the performance of tasks that benefit
the user. In this context the term application refers to both the
application software and its implementation.
Student Activity
Addressing Modes
Copy from page-66 to page-69, upto student activity, BSIT-301, PTU
Instruction Types
The type of an instruction is recognized by the computer control from
the four bits in positions 12 through 15 of the instruction. If the
three opcode bits in positions 12 through 14 are not equal to 111, the
instruction is a memory-reference type and the bit in position 15 is
taken as the addressing mode. If the opcode bits are equal to 111, bit
15 distinguishes the remaining types: if the bit is 0, the instruction
is a register-reference instruction; if the bit is 1, the instruction
is an input-output instruction.
Only three bits of the instruction are used for the operation code, so
it may seem that the computer is restricted to a maximum of eight
distinct operations. However, since register-reference and input-output
instructions use the remaining 12 bits as part of the operation code,
the total number of instructions chosen for the basic computer is 25.
Instruction:          Example:      Meaning:      Machine Language Instruction:
LOAD [REG] [MEM]      LOAD R2 13    R2 = M[13]    1 000 0001 0 RR MMMMM
STORE [MEM] [REG]     STORE 8 R3    M[8] = R3     1 000 0010 0 RR MMMMM
MOVE [REG1] [REG2]    MOVE R2 R0    R2 = R0       1 001 0001 0000 RR RR
Branching Instructions:
Instruction:      Example:     Meaning:                           Machine Language Instruction:
BRANCH [MEM]      BRANCH 10    PC = 10                            0 000 0001 000 MMMMM
BZERO [MEM]       BZERO 2      PC = 2 IF ALU RESULT IS ZERO       0 000 0010 000 MMMMM
BNEG [MEM]        BNEG 7       PC = 7 IF ALU RESULT IS NEGATIVE   0 000 0011 000 MMMMM
Other Instructions:
Student Activity
1. How will you express machine level instructions?
2. How does the computer control recognize the type of instruction?
3. Describe various types of machine level instructions.
4. Describe the addressing mode of computer instructions.
The optimal selection of instructions is more complex on x86 than it is on Alpha for the
following reasons:
• Combined memory/register operands: for example, the instruction selection
algorithm always picks an add instruction that adds a memory and a register
location.
• Limited set of registers per instruction: When picking the next instruction, the
code generator always checks in which registers the current values are in and
chooses the instruction appropriately. If the current register allocation doesn't
fit the instruction at all, values need to be moved. This scheme could be improved
with more global analysis, but at the expense of a larger compile-time cost.
• Efficient 64-bit operations: The Java bytecode contains 64-bit integer and
floating-point operations that the x86 platform needs to support. For each of these
bytecode operations the number of temporary registers and the amount of memory
accesses need to be minimized. For example, the following code is one possible
implementation of the add (64-bit integer addition) bytecode instruction.
mov 0x0(%esp,1),%eax     # load low 32 bits of the first operand
add 0x8(%esp,1),%eax     # add low 32 bits of the second operand
mov 0x4(%esp,1),%ecx     # load high 32 bits of the first operand
adc 0x10(%esp,1),%ecx    # add high 32 bits plus the carry
Instruction Cycle
Copy from page-63-64, Instruction Cycle, MCA-301, PTU
Execution Cycle
Copy from page-50 to 54, (Instruction Execution), BSIT-301, PTU
Student Activity
1. Define an instruction cycle. Describe its various parts.
2. While designing the instruction set of a computer, what are the important things to
be kept in mind? When is a set of instructions said to be complete?
3. Describe the execution cycle of an instruction.
Summary
• Our computer system consists of software and hardware. Software, or a program,
enables a computer to perform specific tasks, as opposed to the physical
components of the system (hardware).
• Computer hardware is the physical part of a computer, including the digital
circuitry, as distinguished from the computer software that executes within the
hardware.
• Computer architecture is defined as the science and art of selecting and
interconnecting hardware components to create computers that meet functional,
performance and cost goals.
• A computer architecture can be considered as a series of abstraction layers:
hardware, firmware, assembler, kernel, operating system and applications.
• (copy summary from page-33, MCA-204, GJU)
• A program is a sequence of instructions, each of which is executed through a
cycle called the instruction cycle. The basic parts of an instruction cycle are
fetch, decode, read the effective address from memory, and execute.
Keywords
Copy the following from page-33, MCA-204, GJU
• Instruction code
• Computer register
• System bus
• External bus
• Input/Output
• Interrupt
Copy the following from page-69, MCA-301, PTU
• Common Bus
• Fetch
• Control Flow Chart
Review Questions
1. Define software and hardware.
2. Describe the function of firmware.
3. What is the role of the kernel?
4. What are applications?
5. Describe various machine language instructions.
6. Give different phases of instruction cycle.
7. What is the significance of instruction?
8. How are computer instructions identified?
9. What do you understand by the term instruction code?
10. Describe the time and control of instructions.
Further Readings
Copy from page-34, MCA-204, GJU
Unit-2
Control Unit and Microprogramming
Learning Objectives
After completion of this unit, you should be able to:
• describe the control unit
• describe data path and control path design
• describe microprogramming
• compare microprogramming and hardwired control
• compare RISC and CISC architectures
• describe pipelining in CPU design
• describe superscalar processors
Introduction
Copy from page-97-98, MCA-301, PTU
Control Unit
As the name suggests, a control unit is used to control something. In
this case, the control unit provides instructions to the other CPU
devices (previously listed) in a way that causes them to operate
coherently to achieve some goal as shown in Fig. (2.1). Basically, there
is one control unit, because two control units may cause conflict. The
control unit of a simple CPU performs the FETCH / DECODE / EXECUTE /
WRITEBACK von Neumann sequence.
To describe how the CPU works we may describe what signals the control
unit issues and when. Clearly these instructions are more complicated
than those that the control unit receives as input. Thus the control
unit must store the instructions within itself, perhaps using a memory
or perhaps in the form of a complicated network.
In either case, let us describe what the control unit does in terms of a
program (for ease of understanding) called the micro-program, consisting
naturally of micro-instructions. Let the micro-program be stored in a
micro-memory.
The control unit may not be micro-programmed, however we can still use
micro-instructions to indicate what the control unit is doing. In this
case we take a logical view of the control unit. The possible
instructions are dictated by the architecture of the CPU. Different
architectures allow for different instructions and this is a major
concept to consider when examining CPU design and operation. We are not
interested in design in this subject, but we concentrate on operation.
• Program Counter (PC): Stores the number that represents the address of the next
instruction to execute (found in memory).
• General Purpose Registers: used to store intermediate results of execution.
Modern computers are more complex, but the operation is essentially the same. First
consider the micro-program executed by the Control Unit to provide the FETCH:
1. MAR <- PC
2. MBR <- M[MAR]
3. IR <- MBR, PC <- PC + 1
Note the first line may be written with the data path given explicitly, but
in this case the path is inferred from the diagram. The second line instructs the memory
sub-system to retrieve the contents of the memory at the address given by the MAR; the
contents are put into the MBR. The third line does two things at the same time: it moves
the MBR into the IR for DECODING and EXECUTING, and it instructs the PC to
increase by 1, so as to point to the next instruction for execution. This increment
operation is available for some registers such as the PC. The ALU does not have to be
used in this case. Consider some more small examples of micro-programs which use the
PDP8 micro-architecture.
Example: add the contents of one register to another and put the result in a
destination register.
Example: increment the PC to obtain the next instruction.
A stored program computer consists of a processing unit and an attached memory system.
Commands that instruct the processor to perform certain operations are placed in the
memory along with the data items to be operated on. The processing unit consists of
data-path and control. The data-path contains registers to hold data and functional units,
such as arithmetic logic units and shifters, to operate on data. The control unit is little
more than a finite state machine that sequences through its states to fetch, decode and
execute the instructions placed in memory.
The critical design issues for a data-path are how to "wire" the various components
together to minimize hardware complexity and the number of control states to complete a
typical operation. For control, the issue is how to organize the relatively complex
"instruction interpretation" finite state machine.
Microprogramming
Copy from page-102 to page-111, upto student activity, MCA-301, PTU
RISC Characteristics
Copy from page-112, MCA-301, PTU
RISC Design Philosophy
Copy section-7.4, page-60 to 62, upto student activity-2, MCA-204, GJU
CISC, which stands for Complex Instruction Set Computer, is a philosophy for designing chips
that are easy to program and which make efficient use of memory. Each instruction in a CISC
instruction set might perform a series of operations inside the processor. This reduces the number
of instructions required to implement a given program, and allows the programmer to learn a small
but flexible set of instructions.
Since the earliest machines were programmed in assembly language and memory was slow and
expensive, the CISC philosophy made sense, and was commonly implemented in such large
computers as the PDP-11 and the DECsystem 10 and 20 machines.
Most common microprocessor designs --- including the Intel(R) 80x86 and Motorola 68K series ---
also follow the CISC philosophy.
Microprogramming is as easy as assembly language to implement, and much less expensive than
hardwiring a control unit.
The ease of microcoding new instructions allowed designers to make CISC machines upwardly
compatible: a new computer could run the same programs as earlier computers because the new
computer would contain a superset of the instructions of the earlier computers.
As each instruction became more capable, fewer instructions could be used to implement a given
task. This made more efficient use of the relatively slow main memory.
Because microprogram instruction sets can be written to match the constructs of high-level
languages, the compiler does not have to be as complicated.
Earlier generations of a processor family generally were contained as a subset in every new
version --- so instruction set & chip hardware become more complex with each generation of
computers.
So that as many instructions as possible could be stored in memory with the least possible
wasted space, individual instructions could be of almost any length---this means that different
instructions will take different amounts of clock time to execute, slowing down the overall
performance of the machine.
Many specialized instructions aren't used frequently enough to justify their existence ---
approximately 20% of the available instructions are used in a typical program.
CISC instructions typically set the condition codes as a side effect of the instruction. Not only
does setting the condition codes take time, but programmers have to remember to examine the
condition code bits before a subsequent instruction changes them.
Pipelining in CPU Design
Copy section-6.2, page-50 to page-53, before student activity-1, MCA-204, GJU
A Typical Pipeline
Consider the steps necessary to do a generic operation:
• Fetch opcode.
• Decode opcode and (in parallel) prefetch a possible displacement or constant
operand (or both)
• Compute complex addressing mode (e.g., [ebx+xxxx]), if applicable.
• Fetch the source value from memory (if a memory operand) and the destination
register value (if applicable).
• Compute the result.
• Store result into destination register.
Assuming you're willing to pay for some extra silicon, you can build a little
"mini-processor" to handle each of the above steps. The organization would look
something like Figure 2.5.
Note how we've combined some stages from the previous section. For example, in stage
four of Figure 2.5 the CPU fetches the source and destination operands in the same step.
You can do this by putting multiple data paths inside the CPU (e.g., from the registers to
the ALU) and ensuring that no two operands ever compete for simultaneous use of the
data bus (i.e., no memory-to-memory operations).
If you design a separate piece of hardware for each stage in the pipeline above, almost all
these steps can take place in parallel. Of course, you cannot fetch and decode the opcode
for more than one instruction at the same time, but you can fetch one opcode while
decoding the previous instruction. If you have an n-stage pipeline, you will usually have
n instructions executing concurrently.
Figure 2.6 shows pipelining in operation. T1, T2, T3, etc., represent consecutive "ticks"
of the system clock. At T=T1 the CPU fetches the opcode byte for the first instruction.
At T=T2, the CPU begins decoding the opcode for the first instruction. In parallel, it
fetches a block of bytes from the prefetch queue in the event the instruction has an
operand. Since the first instruction no longer needs the opcode fetching circuitry, the
CPU instructs it to fetch the opcode of the second instruction in parallel with the
decoding of the first instruction. Note there is a minor conflict here: the CPU is
attempting to fetch the next byte from the prefetch queue for use as an operand,
and at the same time it is fetching a byte from the prefetch queue for use as the
next opcode. How can it do both at once? You'll see the solution in a few moments.
At T=T3 the CPU computes an operand address for the first instruction, if any. The CPU
does nothing on the first instruction if it does not use an addressing mode requiring such
computation. During T3, the CPU also decodes the opcode of the second instruction and
fetches any necessary operand. Finally the CPU also fetches the opcode for the third
instruction. With each advancing tick of the clock, another step in the execution of each
instruction in the pipeline completes, and the CPU fetches yet another instruction from
memory.
This process continues until at T=T6 the CPU completes the execution of the first
instruction, computes the result for the second, etc., and, finally, fetches the opcode for
the sixth instruction in the pipeline. The important thing to see is that after T=T5 the CPU
completes an instruction on every clock cycle. Once the CPU fills the pipeline, it
completes one instruction on each cycle. Note that this is true even if there are complex
addressing modes to be computed, memory operands to fetch, or other operations which
use cycles on a non-pipelined processor. All you need to do is add more stages to the
pipeline, and you can still effectively process each instruction in one clock cycle.
A bit earlier you saw a small conflict in the pipeline organization. At T=T2, for example,
the CPU is attempting to prefetch a block of bytes for an operand and at the same time it
is trying to fetch the next opcode byte. Until the CPU decodes the first instruction it
doesn't know how many operands the instruction requires nor does it know their length.
However, the CPU needs to know this information to determine the length of the
instruction so it knows what byte to fetch as the opcode of the next instruction. So how
can the pipeline fetch an instruction opcode in parallel with an address operand?
One solution is to throw (a lot) more hardware at the problem. Operand and
constant sizes usually come in one, two, and four-byte lengths. Therefore, if we actually
fetch three bytes from memory, at offsets one, three, and five, beyond the current opcode
we are decoding, we know that one of these bytes will probably contain the opcode of the
next instruction. Once we are through decoding the current instruction we know how long
it will be and, therefore, we know the offset of the next opcode. We can use a simple data
selector circuit to choose which of the three opcode bytes we want to use.
In actual practice, we have to select the next opcode byte from more than three candidates
because 80x86 instructions take many different lengths. For example, an instruction that
moves a 32-bit constant to a memory location can be ten or more bytes long. And there
are instruction lengths for nearly every value between one and fifteen bytes. Also, some
opcodes on the 80x86 are longer than one byte, so the CPU may have to fetch multiple
bytes in order to properly decode the current instruction. Nevertheless, by throwing more
hardware at the problem we can decode the current opcode at the same time we're
fetching the next.
Stalls in a Pipeline
Unfortunately, the scenario presented in the previous section is a little too simplistic.
There are two drawbacks to that simple pipeline: bus contention among instructions and
non-sequential program execution. Both problems may increase the average execution
time of the instructions in the pipeline.
Bus contention occurs whenever an instruction needs to access some item in memory. For
example, if a "mov( reg, mem);" instruction needs to store data in memory and a "mov(
mem, reg);" instruction is reading data from memory, contention for the address and data
bus may develop since the CPU will be trying to simultaneously fetch data and write data
in memory.
One simplistic way to handle bus contention is through a pipeline stall. The CPU, when
faced with contention for the bus, gives priority to the instruction furthest along in the
pipeline. The CPU suspends fetching opcodes until the current instruction fetches (or
stores) its operand. This causes the new instruction in the pipeline to take two cycles to
execute rather than one (see Figure 2.7).
This example is but one case of bus contention. There are many others. For example, as
noted earlier, fetching instruction operands requires access to the prefetch queue at the
same time the CPU needs to fetch an opcode. Given the simple scheme above, it's
unlikely that most instructions would execute at one clock per instruction (CPI).
Fortunately, the intelligent use of a cache system can eliminate many pipeline stalls like
the ones discussed above. The next section on caching will describe how this is done.
However, it is not always possible, even with a cache, to avoid stalling the pipeline. What
you cannot fix in hardware, you can take care of with software. If you avoid using
memory, you can reduce bus contention and your programs will execute faster. Likewise,
using shorter instructions also reduces bus contention and the possibility of a pipeline
stall.
What happens when an instruction modifies the EIP register? This, of course, implies that
the next set of instructions to execute do not immediately follow the instruction that
modifies EIP. By the time the instruction
JNZ Label;
completes execution (assuming the zero flag is clear so the branch is taken), we've
already started five other instructions and we're only one clock cycle away from the
completion of the first of these. Obviously, the CPU must not execute those instructions
or it will compute improper results.
The only reasonable solution is to flush the entire pipeline and begin fetching opcodes
anew. However, doing so causes a severe execution time penalty. It will take six clock
cycles (the length of the pipeline in our examples) before the next instruction completes
execution. Clearly, you should avoid the use of instructions which interrupt the sequential
execution of a program. This also shows another problem - pipeline length. The longer
the pipeline is, the more you can accomplish per cycle in the system. However,
lengthening a pipeline may slow a program if it jumps around quite a bit. Unfortunately,
you cannot control the number of stages in the pipeline. You can, however, control the
number of transfer instructions which appear in your programs. Obviously you should
keep these to a minimum in a pipelined system.
Student Activity
Copy student activity from page-53, MCA-204, GJU
5. Describe various stalls of pipelining.
Superscalar processors
Superscalar is a term coined in the late 1980s. Superscalar processors
arrived as the RISC movement gained widespread acceptance, and RISC
processors are particularly suited to superscalar techniques. However,
the approach can be used on non-RISC processors (e.g. Intel's P6-based
processors, the Pentium 4, and AMD's IA-32 clones), though with
considerable effort. All current desktop and server market processors
are now superscalar.
The development of superscalar from 'vanilla' pipelining came about via
more sophisticated, though still non-superscalar, pipelining
technologies, most of which first appeared on supercomputers from the
mid-1960s onwards. However, true superscalar processors are a concept
unique to microprocessor-based systems.
In a superscalar processor, instructions may also finish in a different
order from the one in which they were issued. This is known as
out-of-order completion, and brings potential performance benefits as
well as big problems, as we will see.
Superscalar Concept
Fig.2.12. Basic Superscalar Structure
Note, however, that we typically cannot fetch, decode, etc. as many
instructions as there are execution units; for example, in Fig. 2.12
most of the pipeline stages handle three instructions, but there are
five execution units.
Because instructions may depend on one another, there is no guarantee
that a superscalar processor will succeed in executing as many
instructions as is, in principle, possible.
Detecting such dependencies is easy when an operand is a register, but
difficult when it is a memory word, because of the possibility of pointers
creating aliases.
• Retirement or Completion - finally, an instruction finishes and leaves the
pipeline. Typically this happens immediately after write back and we say the
instruction is completed or retired.
Classification
We can divide superscalar processors into a number of classes of
varying complexity.
A superscalar CPU has, essentially, several execution units (see Figure 2.13). If it
encounters two or more instructions in the instruction stream (i.e., the prefetch queue)
that can execute independently, it will do so.
Figure 2.13 A CPU that Supports Superscalar Operation
There are a couple of advantages to going superscalar. Suppose two adjacent instructions
in the instruction stream are independent of one another. If there are no other problems
or hazards in the surrounding code, and all the bytes for both instructions are currently
in the prefetch queue, there is no reason why the CPU cannot fetch and execute the two
instructions in parallel. All it takes is extra silicon on the CPU chip to implement two
execution units.
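The pairing idea can be sketched in a few lines of Python. This is a hypothetical two-way-issue model, not any real CPU's issue logic: it greedily pairs adjacent instructions that have no register dependencies (instructions are modeled simply as a destination plus a list of source registers):

```python
def dual_issue_groups(instructions):
    """Greedily pair adjacent instructions that are independent: the
    second must not read or write the first's destination, and the
    first must not read the second's destination."""
    groups, i = [], 0
    while i < len(instructions):
        if i + 1 < len(instructions):
            (d1, s1), (d2, s2) = instructions[i], instructions[i + 1]
            if d1 not in s2 and d1 != d2 and d2 not in s1:
                groups.append([instructions[i], instructions[i + 1]])
                i += 2
                continue
        groups.append([instructions[i]])
        i += 1
    return groups

# Two independent register loads can share an issue slot; an instruction
# that consumes both results must wait for the next slot.
prog = [("eax", []),              # load a constant into EAX
        ("ebx", []),              # load a constant into EBX
        ("ecx", ["eax", "ebx"])]  # depends on both loads
print(len(dual_issue_groups(prog)))  # 2 issue slots instead of 3
```

The dependency test here is deliberately minimal; real issue logic must also consider memory operands, flags, and structural limits on the execution units.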
As an assembly language programmer, the way you write software for a superscalar CPU
can dramatically affect its performance. First and foremost is that rule you're probably
sick of by now: use short instructions. The shorter your instructions are, the more
instructions the CPU can fetch in a single operation and, therefore, the more likely the
CPU will sustain an average of less than one clock cycle per instruction (CPI). Most
superscalar CPUs do not completely duplicate the execution unit; there might be multiple
ALUs, floating-point units, etc. This means that certain instruction sequences can execute
very quickly while others won't. You have to study the exact composition of your CPU to
decide which instruction sequences produce the best performance.
Consider the following three-instruction sequence:
mov( SomeVar, ebx );
mov( [ebx], eax );
mov( 2000, ecx );
A data hazard exists between the first and second instructions: the second instruction
must delay until the first instruction completes execution. This introduces a pipeline
stall and increases the running time of the program. Typically, the stall affects every
instruction that follows. However, note that the third instruction's execution does not
depend on the result from either of the first two instructions. Therefore, there is no
reason to stall the execution of the "mov( 2000, ecx );" instruction. It may continue
executing while the second instruction waits for the first to complete. This technique,
appearing in later members of the Pentium line, is called "out of order execution" because
the CPU may complete the execution of some instructions prior to the execution of
previous instructions appearing in the code stream.
Clearly, the CPU may only execute instructions out of sequence if doing so produces
exactly the same results as in-order execution. While there are lots of little technical
issues that make this problem more difficult than it seems, with enough engineering
effort it is quite possible to implement this feature.
Although you might think that this extra effort is not worth it (why not make it the
programmer's or compiler's responsibility to schedule the instructions?), there are some
situations where out-of-order execution improves performance that static scheduling
could not handle.
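A minimal Python sketch of the idea follows. This is a toy scoreboard under idealized assumptions (unlimited issue width, register dependencies only), not the Pentium's actual mechanism: each cycle, every waiting instruction whose source registers are ready is started, so an independent instruction can begin before an earlier, stalled one.

```python
def ooo_schedule(instructions, latency):
    """instructions: list of (name, dest, srcs). Each cycle, start every
    not-yet-started instruction whose source registers are ready (an
    idealized, unlimited-width issue model); a result becomes available
    latency[name] cycles after its instruction starts. Returns the
    start cycle of each instruction."""
    ready_at = {}          # register -> cycle its value becomes available
    start, cycle = {}, 0
    pending = list(instructions)
    while pending:
        issued = []
        for ins in pending:
            name, dest, srcs = ins
            if all(ready_at.get(r, 0) <= cycle for r in srcs):
                start[name] = cycle
                ready_at[dest] = cycle + latency[name]
                issued.append(ins)
        pending = [ins for ins in pending if ins not in issued]
        cycle += 1
    return start

lat = {"load": 3, "use": 1, "indep": 1}
prog = [("load", "ebx", []),        # slow load into EBX
        ("use", "eax", ["ebx"]),    # depends on the load's result
        ("indep", "ecx", [])]       # independent of both
s = ooo_schedule(prog, lat)
print(s["indep"] < s["use"])  # True: the independent instruction ran early
```

In the printed schedule the independent instruction starts in cycle 0, while the dependent one waits for the load's three-cycle latency, which is exactly the benefit out-of-order execution delivers.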
Register Renaming
One problem that hampers the effectiveness of superscalar operation on the 80x86 CPU
is the 80x86's limited number of general purpose registers. Suppose, for example, that the
CPU had four different pipelines and, therefore, was capable of executing four
instructions simultaneously. Actually achieving four instructions per clock cycle would
be very difficult because most instructions (that can execute simultaneously with other
instructions) operate on two register operands. For four instructions to execute
concurrently, you'd need four separate destination registers and four source registers (and
the two sets of registers must be disjoint, that is, a destination register for one instruction
cannot be the source of another). CPUs that have lots of registers can handle this task
quite easily, but the limited register set of the 80x86 makes this difficult. Fortunately,
there is a way to alleviate part of the problem: through register renaming.
Register renaming is a sneaky way to give a CPU more registers than it actually has.
Programmers will not have direct access to these extra registers, but the CPU can use
these additional registers to prevent hazards in certain cases. For example, consider the
following short instruction sequence:
mov( 0, eax );
mov( eax, i );
mov( 50, eax );
mov( eax, j );
Clearly a data hazard exists between the first and second instructions and, likewise, a data
hazard exists between the third and fourth instructions in this sequence. Out of order
execution in a superscalar CPU would normally allow the first and third instructions to
execute concurrently and then the second and fourth instructions could also execute
concurrently. However, a data hazard, of sorts, also exists between the first and third
instructions since they use the same register. The programmer could have easily solved
this problem by using a different register (say EBX) for the third and fourth instructions.
However, let's assume that the programmer was unable to do this because the other
registers are all holding important values. Is this sequence doomed to execute in four
cycles on a superscalar CPU when it should require only two?
One advanced trick a CPU can employ is to create a bank of registers for each of the
general purpose registers on the CPU. That is, rather than having a single EAX register,
the CPU could support an array of EAX registers; let's call these registers EAX[0],
EAX[1], EAX[2], etc. Similarly, you could have an array of each of the registers, so we
could also have EBX[0]..EBX[n], ECX[0]..ECX[n], etc. Now the instruction set does not
give the programmer the ability to select one of these specific register array elements for
a given instruction, but the CPU can automatically choose a different register array
element if doing so would not change the overall computation and doing so could speed
up the execution of the program. For example, consider the following sequence (with
register array elements automatically chosen by the CPU):
mov( 0, eax[0] );
mov( eax[0], i );
mov( 50, eax[1] );
mov( eax[1], j );
Since EAX[0] and EAX[1] are different registers, the CPU can execute the first and third
instructions concurrently. Likewise, the CPU can execute the second and fourth
instructions concurrently.
The code above provides an example of register renaming. The CPU dynamically selects
one of several different elements from a register array in order to prevent data hazards.
Although this is a simple example, and different CPUs implement register renaming in
many different ways, it demonstrates how the CPU can improve performance in certain
instances through the use of this technique.
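Register renaming can be sketched as a simple table lookup: each write to an architectural register allocates the next element of that register's array, and each read uses the most recently allocated element. A toy Python model follows (the instruction encoding as (destination, sources) tuples is a hypothetical simplification):

```python
def rename(instructions):
    """instructions: list of (dest, srcs) using architectural register
    names. Each write to a register is given a fresh array element
    (eax[0], eax[1], ...); each read uses the latest element."""
    current = {}   # architectural register -> its current renamed name
    counter = {}   # architectural register -> next free array index
    out = []
    for dest, srcs in instructions:
        new_srcs = [current.get(r, f"{r}[0]") for r in srcs]
        n = counter.get(dest, 0)
        counter[dest] = n + 1
        current[dest] = f"{dest}[{n}]"
        out.append((current[dest], new_srcs))
    return out

# A sequence like the one discussed in the text, where EAX is written twice:
prog = [("eax", []),        # mov( 0, eax );
        ("i", ["eax"]),     # mov( eax, i );
        ("eax", []),        # second write to EAX
        ("j", ["eax"])]     # mov( eax, j );
renamed = rename(prog)
print(renamed[2][0])  # eax[1] -- the false dependency on eax[0] is gone
```

After renaming, the two writes to EAX target different array elements, so the first/third and second/fourth instruction pairs can run concurrently.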
The Intel IA-64 architecture is not the only computer system to employ a VLIW (very
long instruction word) architecture. Transmeta's Crusoe processor family also uses a
VLIW architecture. The Crusoe processor differs from the IA-64 architecture insofar as
it does not support native execution of IA-32 instructions. Instead, the Crusoe processor
dynamically translates 80x86 instructions to Crusoe's VLIW instructions. This "code
morphing" technology results in code running about 50% slower than native code, though
the Crusoe processor has other advantages.
We will not consider VLIW computing any further since the IA-32 architecture does not
support it. But keep this architectural advance in mind if you move towards the IA-64
family or the Crusoe family.
Student Activity
Summary
Keywords
Copy from page-48-49, MCA-204, GJU
Copy Pipelining Processing from page-57, MCA-204, GJU
Review Questions
Copy Q-1 to 5 from page-49, MCA-204, GJU
6. Describe pipeline processing.
7. Describe Superscalar processors, their functioning and classification
Further Readings
Copy from page-58, MCA-204, GJU
Unit-3
Memory Organization
Learning Objectives
Introduction
Copy from page-149, MCA-301, PTU
Memory Subsystem
Copy from page-150 (Memory System), MCA-301, PTU
Storage Technologies
Copy from page-132-133 (upto Student Activity)
Of course, the first question you should ask is, "What exactly is a memory
location?" The 80x86 supports byte addressable memory. Therefore, the
basic memory unit is a byte. So with 20, 24, 32, and 36 address lines, the
80x86 processors can address one megabyte, 16 megabytes, four
gigabytes, and 64 gigabytes of memory, respectively.
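The arithmetic behind these figures is simply 2 raised to the number of address lines; a quick Python check:

```python
KB, MB, GB = 2**10, 2**20, 2**30

def addressable(lines):
    """Bytes addressable by a byte-addressable CPU with `lines` address lines."""
    return 2 ** lines

for lines in (20, 24, 32, 36):
    size = addressable(lines)
    unit, div = ("GB", GB) if size >= GB else ("MB", MB)
    print(f"{lines} address lines -> {size // div} {unit}")
# 20 -> 1 MB, 24 -> 16 MB, 32 -> 4 GB, 36 -> 64 GB
```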
Think of memory as a linear array of bytes. The address of the first byte is
zero and the address of the last byte is 2^n - 1, where n is the number of
address lines. For an 8088 with a 20-bit address bus, the following
pseudo-Pascal array declaration is a good approximation of memory:
Memory: array [0..1048575] of byte;
Figure 1.2 Memory Write Operation
To execute the equivalent of "CPU := Memory [125];" the CPU places the
address 125 on the address bus, asserts the read line (since the CPU is
reading data from memory), and then reads the resulting data from the
data bus (see Figure 1.3).
The above discussion applies only when accessing a single byte in
memory. So what happens when the processor accesses a word or a double
word? Since memory consists of an array of bytes, how can we possibly
deal with values larger than eight bits?
Figure 1.4 Byte, Word, and DWord Storage in Memory
Note that it is quite possible for byte, word, and double word values to
overlap in memory. For example, in Figure 1.4 you could have a word
variable beginning at address 193, a byte variable at address 194, and a
double word value beginning at address 192. These variables would all
overlap.
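This byte-addressable, overlapping layout is easy to model in Python with a bytearray, remembering that the 80x86 stores multi-byte values least-significant byte first (little-endian); the specific value and addresses below are illustrative:

```python
# Model memory as a linear array of bytes; the 80x86 stores multi-byte
# values little-endian (least significant byte at the lowest address).
memory = bytearray(256)

def write_dword(addr, value):
    memory[addr:addr + 4] = value.to_bytes(4, "little")

def read_word(addr):
    return int.from_bytes(memory[addr:addr + 2], "little")

def read_byte(addr):
    return memory[addr]

write_dword(192, 0x11223344)   # dword occupies addresses 192..195
print(hex(read_word(193)))     # word at 193 overlaps it: 0x2233
print(hex(read_byte(194)))     # byte at 194 overlaps both: 0x22
```

Reading the word at 193 picks up the two middle bytes of the double word, which is exactly the overlap the figure describes.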
By carefully arranging how you use memory, you can improve the speed
of your program on these CPUs.
Memory Management
Copy section 8.6 from page 74 to 87, upto student activity-2, MCA-204, GJU
Memory Hierarchy
Copy section 8.10 onwards (from page-91 to page-105), upto student activity-5, MCA-
204, GJU
Cache Memory
Copy from page – 159 to 161, upto set associative mapping, MCA-301, PTU
Summary
Copy from page-105 (delete first two points), MCA-204, GJU
Key Words
Copy from page-105-106 (delete first two definitions), MCA-204, GJU
Review questions
copy question 1 to 8, from page-165, MCA-301, PTU
Further Readings
Copy from page-106, MCA-204, GJU
Unit-4
Input-Output Devices and Characteristics
Learning Objectives
After completion of this unit, you should be able to:
• describe input-output processing
• describe bus interface
• describe I/O interrupt channels
• describe performance evaluation (SPECmarks)
• describe various benchmarks of transaction processing
Introduction
Copy from page-117 to 118 (Introduction and I/O and their brief description), MCA-301,
PTU
Input-Output Processing
Copy from page-118 (I/O Processing), MCA-301, PTU
Input-Output Processor
Copy from page-139 to page-144 (upto figure-5.23), MCA-301, PTU
Bus Interface
Copy from page-118 (Bus Interface) to Student Activity, Page-129, MCA-301, PTU
I/O Interrupts
Copy from page-130 to 136 (before DMA), MCA-301, PTU
Student Activity
1. What is an Interrupt?
2. What is the purpose of having a channel?
3. The CPU and the channel are usually in a master-slave relationship. Explain.
4. Explain the CPU I/O instructions executed by the CPU.
5. (Copy question-1 to 5 of student activity, page-144, MCA-301, PTU)
Student Activity
1. Describe the following :
(a) SPECmark
(b) TPC-H
(c) TPC-R
(d) TPC-W
Summary
Copy from page-145, Summary (copy para 1,2 and 4 only), Page-145, MCA-301, PTU
Copy from page-118, summary, MCA-204, GJU
Keywords
Copy keywords from page-145, (Delete DMS), MCA-301, PTU
Copy keywords from page-118, MCA-204, GJU
Review Questions
Copy Q-1,2,3,5,6,7 from page-145-146, MCA-301, PTU
Copy Q-1to5 from page-118-119, MCA-204, GJU
Further Readings
Copy from page-146, MCA-301, PTU and page-119, MCA-204, GJU