Principles of Compiler Design Q&A
Contents
Preface
1. Introduction to Compilers
2. Lexical Analysis
3. Specification of Programming Languages
4. Basic Parsing Techniques
5. LR Parsers
6. Syntax-directed Translations
7. Intermediate Code Generation
8. Type Checking
9. Runtime Administration
10. Symbol Table
11. Code Optimization and Code Generation
Index
Preface
A compiler is a program that translates high-level languages such as C, C++ and Java into lower-level
languages like equivalent machine codes. These machine codes can be understood and directly executed
by the computer system to perform various tasks. Given its importance, Compiler Design is a compul-
sory course for B.Tech. (CSE and IT) students in most universities. The book in your hands, Principles
of Compiler Design, with its unique easy-to-understand question-and-answer format, directly addresses
the needs of students enrolled in these courses.
The questions and corresponding answers in this book have been designed and selected to cover all
the basic and advanced level concepts of Compiler Design including lexical analysis, syntax analysis,
code optimization and generation, and error handling and recovery. This book is specifically designed to
help those who are attempting to learn Compiler Design by themselves. The organized and accessible format
allows students to quickly find the questions on specific topics.
The book Principles of Compiler Design forms a part of a series called the Express Learning Series,
which has a number of books designed as quick reference guides.
Unique Features
1. Designed as a student-friendly self-learning guide. The book is written in a clear, concise and lucid
manner.
2. Easy-to-understand question-and-answer format.
3. Includes previously asked as well as new questions organized in chapters.
4. All types of questions including multiple-choice questions, short and long questions are covered.
5. Solutions to the numerical questions asked in the examinations are provided.
6. All ideas and concepts are presented with clear examples.
7. Text is well structured and well supported with suitable diagrams.
8. Inter-chapter dependencies are kept to a minimum.
Chapter Organization
All the questions and answers are organized into 11 chapters. The outline of the chapters is as follows:
q Chapter 1 provides an overview of compilers. It discusses the difference between interpreter and
compiler, various phases in the compilation process with the help of an example, error-handling in
compilers and the concept of cross compiler and bootstrapping. This chapter forms the basis for the
rest of the book.
q Chapter 2 details the lexical analysis phase including lexical analyzer, tokens, patterns and lex-
emes, strings and languages and the role of input buffering. It also explains regular expressions,
transition diagrams, finite automata and the design of lexical analyzer generator (LEX).
q Chapter 3 describes context-free grammars (CFGs) along with their ambiguities, advantages and
capabilities. It also discusses the difference between regular expressions and CFGs, and introduces
context-free languages.
q Chapter 4 spells out the syntax analysis phase including the role of the parser, categories of parsing
techniques and parse trees. It elaborates the top–down parsing techniques, which include backtracking
and non-backtracking parsing techniques.
q Chapter 5 deals with bottom up parsing techniques, which include simple LR (SLR) parsing,
canonical LR (CLR) parsing and lookahead LR (LALR) parsing. The chapter also introduces the
tool yacc to show the automatic generation of LALR parsers.
q Chapter 6 explains the concept of syntax-directed translations (SDT) and syntax-directed defini-
tions (SDD).
q Chapter 7 expounds on how to generate an intermediate code for a typical programming language.
It discusses different representations of the intermediate code and also introduces the concept of
backpatching.
q Chapter 8 throws light on type checking process and its rules. It also explains type expressions,
static and dynamic type checking, design process of a type checker, type equivalence and type
conversions.
q Chapter 9 familiarizes the reader with runtime environment, its important elements and various
issues it deals with. It also discusses static and dynamic allocation, control stack, activation records
and register allocation.
q Chapter 10 explores the usage of symbol table in a compiler. It also discusses the operations per-
formed on the symbol table and various data structures used for implementing the symbol table.
q Chapter 11 familiarizes the reader with code optimization and the code generation process.
Acknowledgements
q Our publisher Pearson Education, their editorial team and panel reviewers for their valuable con-
tributions toward content enrichment.
q Our technical and editorial consultants for devoting their precious time to improve the quality of
the book.
q Our entire research and development team who have put in their sincere efforts to bring out a high-
quality book.
Feedback
For any suggestions and comments about this book, please contact us at [email protected]. We hope
you enjoy reading this book as much as we have enjoyed writing it.
Rohit Khurana
Founder and CEO
ITL ESL
1. Introduction to Compilers
[Figure 1.1: A compiler reads a source program as input and translates it into an equivalent target program.]
Execution of the target program: During execution, the target program is first loaded into the
main memory and then the user interacts with the target program to generate the output. The exe-
cution phase is shown in Figure 1.2.
[Figure: An interpreter takes the source program together with its inputs and directly produces the output.]
[Figure: A preprocessor converts the source program into a new source program, which the compiler then translates into machine language code.]
Assemblers: In some cases, the compiler generates the target program in assembly language. In that
case, the assembly language program is given to the assembler as input. The assembler then translates
the assembly language program, which is written in mnemonics, into a machine language program in
the form of relocatable machine code.
[Figure: The compiler translates the source program into an assembly language program (mnemonics), which the assembler then translates into machine language code.]
Loaders and link editors: Large source programs are often compiled in small pieces by the com-
piler. To run the target machine code of any source program successfully, the relocatable machine
language code must be linked with library files and other relocatable object files. So, loader and
link editor programs are used for the link editing and loading of the relocatable code. Link editors
create a single program from several files of relocatable machine code. Loaders read the relocatable
machine code and alter the relocatable addresses. To run the machine language program, the code
with altered data and commands is placed at the correct location in the memory.
5. Discuss the steps involved in the analysis of a source program with the help of a block
diagram.
Ans: The steps involved in the analysis of a source program are given below.
The source program acts as an input to the preprocessor. The preprocessor modifies the source code
by replacing the header files with the suitable content. The output (modified source program) of the
preprocessor acts as an input for the compiler.
The compiler translates the modified source program written in a high-level language into the target
program. If the target program is in machine language, it can be executed directly. If the target program
is in assembly language, that code is given to the assembler for translation. The assembler translates
the assembly language code into relocatable machine language code.
The relocatable machine language code acts as an input for the linker and loader. The linker links the
relocatable code with the library files and the relocatable object files, and the loader loads the integrated
code into memory for execution. The output of the linker and loader is the equivalent machine language
code for the source code.
[Figure 1.6: Block diagram of source program analysis — the Source Program passes in turn through the Preprocessor, Compiler, Assembler, and Linker/Loader (the last of these also taking Library Files and Relocatable Object Files), yielding the Target Machine Code.]
6. Explain the various phases of a compiler.
Ans: The compilation process is carried out in a sequence of phases, as shown in Figure 1.7.
[Figure 1.7: Phases of a compiler — the character stream of the source program passes through lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization and code generation, producing in turn the token stream, parse tree, intermediate code and target code; the symbol table and error handler interact with every phase.]
Lexical analysis phase: Lexical analysis (also known as scanning) is the first phase of a compiler.
Lexical analyzer or scanner reads the source program in the form of character stream and groups
the logically related characters together that are known as lexemes. For each lexeme, a token is
generated by the lexical analyzer. A stream of tokens is generated as the output of the lexical analy-
sis phase, which acts as an input for the syntax analysis phase. Tokens can be of different types,
namely, keywords, identifiers, constants, punctuation symbols, operator symbols, etc. The syntax
for any token is:
(token_name, value)
where token_name is the name or symbol which is used during the syntax analysis phase and
value is the location of that token's entry in the symbol table.
Syntax analysis phase: Syntax analysis phase is also known as parsing. Syntax analysis phase
can be further divided into two parts, namely, syntax analysis and semantic analysis.
Syntax analysis: The parser uses the token names from the token stream to generate the
output in the form of a tree-like structure known as a syntax tree or parse tree. The parse tree
illustrates the grammatical structure of the token stream.
Semantic analysis: Semantic analyzer uses the parse tree and symbol table for checking the
semantic consistency of the language definition of the source program. The main function of
the semantic analysis is type checking in which semantic analyzer checks whether the oper-
ator has the operands of matching type. Semantic analyzer gathers the type information and
saves it either in the symbol table or in the parse tree.
Intermediate code generation phase: In intermediate code generation phase, the parse tree rep-
resentation of the source code is converted into low-level or machine-like intermediate representa-
tion. The intermediate code should be easy to generate and easy to translate into machine language.
There are several forms for representing the intermediate code. Three address code is the most
popular form for representing intermediate code. An example of three address code, for the
assignment id1 = id2 + id3 * 5, is given below.
t1 = id3 * 5
t2 = id2 + t1
id1 = t2
Code optimization phase: Code optimization phase, which is an optional phase, performs the
optimization of the intermediate code. Optimization means making the code shorter and less com-
plex, so that it can execute faster and take less space. The output of the code optimization phase is
also an intermediate code, which performs the same task as the input code, but requires less time
and space.
Code generation phase: Code generation phase translates the intermediate code representation of
the source program into the target language program. If the target program is in machine language,
the code generator produces the target code by assigning registers or memory locations to store
variables defined in the program and to hold the intermediate computation results. The machine
code produced by the code generation phase can be executed directly on the machine.
Symbol table management: A symbol table is a data structure that is used by the compiler to
record and collect information about source program constructs like variable names and all of its
attributes, which provide information about the storage space occupied by a variable (name, type,
and scope of the variables). A symbol table should be designed in an efficient way so that it permits
the compiler to locate the record for each token name quickly and to allow rapid transfer of data
from the records.
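For illustration, a minimal symbol table can be sketched in Python as a hash table from names to attribute records; the field names used here (name, type, scope) are illustrative assumptions rather than a fixed format:

# A minimal symbol table sketch: one dictionary keyed by the identifier name.
class SymbolTable:
    def __init__(self):
        self.entries = {}          # name -> attribute record

    def insert(self, name, type_, scope):
        # Record a new identifier together with its attributes.
        self.entries[name] = {"name": name, "type": type_, "scope": scope}

    def lookup(self, name):
        # Hashing gives near-constant-time retrieval of a record.
        return self.entries.get(name)

table = SymbolTable()
table.insert("Total", "float", "global")
print(table.lookup("Total"))       # {'name': 'Total', 'type': 'float', 'scope': 'global'}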
Error handler: Error handler is invoked whenever any fault occurs in the compilation process of
source program.
Both the symbol table management and error handling mechanisms are associated with all phases of
the compiler.
7. Discuss the action taken by every phase of the compiler on the following instruction of the
source program during compilation.
Total = number1 + number2 * 5
Ans: Consider the source program as a stream of characters:
Total = number1 + number2 * 5
Lexical analysis phase: The stream of characters (source program) acts as an input for the
lexical analyzer, which produces the following token stream as output (see Figure 1.8):
<id, 1> <=> <id, 2> <+> <id, 3> <*> <5>
Syntax analysis phase: The token stream acts as the input for the syntax analyzer. The output
of the syntax analyzer is a parse tree (see Figure 1.9(a)) that acts as the input for the semantic
analyzer; the output of the semantic analyzer is also a parse tree, obtained after type checking
(see Figure 1.9(b)). Since * is applied to a floating-point identifier and the integer constant 5,
the semantic analyzer wraps the constant in an inttofloat conversion node.
[Figure 1.9: (a) The parse tree produced by the syntax analyzer, rooted at =, with <id, 1> as the left child and <id, 2> + (<id, 3> * 5) as the right subtree; (b) the tree produced by the semantic analyzer, in which the leaf 5 is replaced by inttofloat(5).]
Intermediate code generation phase: The checked parse tree is translated into three address code:
t3 = inttofloat(5)
t2 = id3 * t3
t1 = id2 + t2
id1 = t1
Code optimization phase: The code optimizer removes the run-time conversion and the
redundant copies (see Figure 1.11):
t3 = id3 * 5.0
id1 = id2 + t3
Code generation phase: Finally, the code generator translates the optimized intermediate code
into the target machine code (see Figure 1.12).
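To make the lexical phase concrete, here is a rough Python sketch that produces a token stream of this shape for the statement above; the token classes and regular expressions are simplified assumptions, and identifiers are represented by their positions in a symbol table:

import re

# Illustrative token patterns; the alternation order matters.
spec = [("id", r"[A-Za-z_][A-Za-z0-9_]*"),
        ("num", r"[0-9]+"),
        ("op", r"[=+*]")]
pattern = "|".join("(?P<%s>%s)" % p for p in spec)

symtab = []                        # identifiers get symbol table indices
def tokenize(source):
    for m in re.finditer(pattern, source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "id":
            if lexeme not in symtab:
                symtab.append(lexeme)
            yield ("id", symtab.index(lexeme) + 1)
        elif kind == "num":
            yield ("num", lexeme)
        else:
            yield (lexeme, None)

print(list(tokenize("Total = number1 + number2 * 5")))
# [('id', 1), ('=', None), ('id', 2), ('+', None), ('id', 3), ('*', None), ('num', '5')]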
8. What is a pass in the compilation process? Compare and contrast the features of a
single-pass compiler with multi-pass compiler.
Ans: In an implementation of a compiler, the activities of one or more phases are combined into a
single module known as a pass. A pass reads the input, either as a source program file or as the output of
the previous pass, transforms the input and writes the output into an intermediate file. The intermediate
file acts as either the input for the next pass or the final machine code.
When all the phases of a compiler are grouped together into a single pass, then that compiler is known
as single-pass compiler. On the other hand, when different phases of a compiler are grouped together
into two or more passes, then that compiler is known as multi-pass compiler.
A single-pass compiler is faster than the multi-pass compiler because in multi-pass compiler each
pass reads and writes an intermediate file, which makes the compilation process time consuming.
Hence, time required for compilation increases with the increase in the number of passes in a compiler.
A single-pass compiler takes more space than the multi-pass compiler because in multi-pass compiler
the space used by the compiler during one pass can be reused by the subsequent pass. So, for comput-
ers having small memory, multi-pass compilers are preferred. On the other hand, for computers having
large memory, single-pass compiler or compiler with fewer number of passes can be used.
In a single-pass compiler, the complicated optimizations required for high quality code generation are
not possible. To count the exact number of passes for an optimizing compiler is a difficult task.
[Figure 1.13: A T-diagram representing a compiler, with the source language S and target language T across the top and the implementation language I at the base. Figure 1.14: T-diagrams for (a) the new compiler with source S, target T and implementation A; (b) the existing compiler for A, implemented in M with target M; and (c) the resulting compiler with source S, target T and implementation M.]
Bootstrapping: Bootstrapping is an important concept for building a new compiler. A simple compiler
is first used to translate a more capable one, which in turn can handle still more complicated programs.
The process of bootstrapping can be better understood with the help of the example given here.
Suppose we want to create a cross compiler for the new source language S that generates target code
in language T, with A as the implementation language of this compiler. Writing C(source, target,
implementation) for a compiler, we can represent this compiler as C(S,T,A) (see Figure 1.14(a)).
Further, suppose we already have a compiler written for language A with both target and implementation
language M; this compiler can be represented as C(A,M,M) (see Figure 1.14(b)). Now, if we run
C(S,T,A) through C(A,M,M), then we get a compiler C(S,T,M) (see Figure 1.14(c)). This compiler
compiles a source program written in language S and generates the target code in T, and it runs on
machine M (that is, the implementation language for this compiler is M).
11. Explain error handling in compiler.
Ans: Error detection and reporting of errors are important functions of the compiler. Whenever
an error is encountered during the compilation of the source program, an error handler is invoked.
Error handler generates a suitable error reporting message regarding the error encountered. The error
reporting message allows the programmer to find out the exact location of the error. Errors can be
encountered at any phase of the compiler during compilation of the source program for several rea-
sons such as:
In lexical analysis phase, errors can occur due to misspelled tokens, unrecognized characters, etc.
These errors are mostly the typing errors.
In syntax analysis phase, errors can occur due to the syntactic violation of the language.
In the intermediate code generation phase, errors can occur due to incompatible operand types for
an operator.
In code optimization phase, errors can occur during the control flow analysis due to some unreach-
able statements.
In code generation phase, errors can occur due to the incompatibility with the computer architec-
ture during the generation of machine code. For example, a constant created by the compiler may be
too large to fit in the word of the target machine.
In symbol table, errors can occur during the bookkeeping routine, due to the multiple declaration
of an identifier with ambiguous attributes.
Multiple-Choice Questions
1. A translator that takes as input a high-level language program and translates into machine language
in one step is known as —————.
(a) Compiler (b) Interpreter
(c) Preprocessor (d) Assembler
2. ————— create a single program from several files of relocated machine code.
(a) Loaders (b) Assemblers
(c) Link editors (d) Preprocessors
3. A group of logically related characters in the source program is known as —————.
(a) Token (b) Lexeme
(c) Parse tree (d) Buffer
4. The ————— uses the parse tree and symbol table checking the semantic consistency of the
source program.
(a) Lexical analyzer (b) Intermediate code generator
(c) Syntax translator (d) Semantic analyzer
5. The ————— phase converts an intermediate code into an optimized code that takes lesser space
and lesser time to execute.
(a) Code optimization (b) Syntax directed translation
(c) Code generation (d) Intermediate code generation
6. ————— is invoked whenever any fault occurs in the compilation process of source program.
(a) Syntax analyzer (b) Code generator
(c) Error handler (d) Lexical analyzer
7. In compiler, the activities of one or more phases are combined into a single module known as a
—————.
(a) Phase (b) Pass
(c) Token (d) Macro
8. For the construction of a compiler, the compiler writer uses different types of software tools that are
known as —————.
(a) Compiler writer tools (b) Programming tools
(c) Compiler construction tools (d) None of these
9. A compiler that runs on one machine and produces the target code for another machine is known
as —————.
(a) Cross compiler (b) Linker
(c) Preprocessor (d) Assembler
10. If we run a compiler C(S,T,A) with the help of another compiler C(A,M,M), then we get a new
compiler that is —————.
(a) C(S,M,M) (b) C(S,T,A)
(c) C(S,T,M) (d) C(A,M,M)
Answers
1. (a) 2. (c) 3. (b) 4. (d) 5. (a) 6. (c) 7. (b) 8. (c) 9. (a) 10. (c)
2. Lexical Analysis
[Figure 2.1: The lexical analyzer sits between the source program and the parser — it supplies tokens to the parser on demand, the parser produces intermediate code, and both consult the symbol table.]
Besides generation of tokens, the lexical analyzer also performs certain other tasks such as:
Stripping out comments and whitespace (tab, newline, blank, and other characters that are used to
separate tokens in the input).
Correlating error messages that are generated by the compiler with the source program. For
example, it can keep track of all newline characters so that it can associate a line number with
each error message.
Performing the expansion of macros, in case macro preprocessors are used in the source program.
2. What do you understand by the terms tokens, patterns, and lexemes?
Ans: Tokens: The lexical analyzer separates the characters of the source language into groups
that logically belong together, commonly known as tokens. A token consists of a token name and an
optional attribute value. The token name is an abstract symbol that represents a kind of lexical unit and
the optional attribute value is commonly referred to as token value. Each token represents a sequence
of characters that can be treated as a single entity. Tokens can be identifiers, keywords, constants,
operators, and punctuation symbols such as commas and parenthesis. In general, the tokens are broadly
classified into two types:
Specific strings such as if, else, comma, or a semicolon.
Classes of strings such as identifiers, constants, or labels.
3. What is input buffering? How are buffer pairs used to recognize lexemes?
Ans: To speed up the reading of source characters, the lexical analyzer uses an input buffer divided
into two halves (buffer pairs), as shown in Figure 2.2.
[Figure 2.2: An input buffer holding the characters X = total * 5, with the lexemeBegin pointer and the forward pointer marking out the current lexeme.]
Each buffer is of the same size N, where N is the size of a disk block, for example 1024 bytes. Thus,
instead of one character, N characters can be read at a time. The pointers used in the input buffer for
recognizing the lexeme are as follows:
Pointer lexemeBegin points the beginning of the current lexeme being discovered.
Pointer forward scans ahead until a pattern match is found for lexeme.
Initially, both pointers point to the first character of the next lexeme to be found. The forward
pointer is scanned ahead until a match for a pattern is found. After the lexeme is processed, both
the pointers are set to the character following that processed lexeme. For example, in Figure 2.2 the
lexemeBegin pointer is at character t and forward pointer is at character a. The forward pointer
is scanned until the lexeme total is found. Once it is found, both these pointers point to *, which is
the next lexeme to be discovered.
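The pointer movement can be sketched as follows in Python. The buffer is modelled as an ordinary string and the pattern match is simplified to runs of alphanumeric characters, so this is only an illustration of the two-pointer idea, not of the two-half buffering scheme itself:

buffer = "X = total * 5"

def next_lexeme(lexeme_begin):
    # Skip characters that merely separate tokens.
    while lexeme_begin < len(buffer) and buffer[lexeme_begin].isspace():
        lexeme_begin += 1
    forward = lexeme_begin
    # forward scans ahead while the simplified pattern still matches.
    while forward < len(buffer) and buffer[forward].isalnum():
        forward += 1
    if forward == lexeme_begin and lexeme_begin < len(buffer):
        forward += 1               # single-character token such as = or *
    # Both pointers then move past the processed lexeme.
    return buffer[lexeme_begin:forward], forward

pos = 0
while pos < len(buffer):
    lexeme, pos = next_lexeme(pos)
    print(lexeme)                  # prints X, =, total, *, 5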
4. What are strings and languages in lexical analysis? What are the operations performed
on the languages?
Ans: Before defining the terms strings and languages, it is necessary to understand the term
alphabet. An alphabet (or character class) denotes any finite set of symbols. Symbols include letters,
digits, punctuation, etc. The ASCII, Unicode, and EBCDIC are the most important examples of
alphabet. The set {0, 1} is the binary alphabet.
A string (also termed as sentence or word) is defined as a finite sequence of symbols drawn from
an alphabet. The length of a string s is measured as the number of occurrences of symbols in s and is
denoted by |s|. For example, the word ‘orange’ is a string of length six. The empty string (Î) is the
string of length zero.
A language is any finite set of strings over some specific alphabet. This is an enormously broad def-
inition. Simple sets such as f, the empty set, or {Î}, the set containing only the empty string, are also
the languages under this definition.
In lexical analysis, there are several important operations like union, concatenation, and closure that
can be applied to languages. Union operation means taking all the strings of both the set of languages
and creating a new set of language containing all the strings. The concatenation of languages is done
by concatenating a string from the first language and a string from the second language, in all possible
ways, to form the new strings. The (Kleene) closure of a language P, denoted by P*, is the set of strings
achieved by concatenating P zero or more times. P^0, 'the concatenation of P zero times,' is defined to
be {Î}. The positive closure, denoted by P+, is the same as the Kleene closure but without the term P^0;
precisely, P+ = P^1 È P^2 È P^3 È . . . Î will not be in P+, unless it is in P itself. These operations
are listed in Table 2.2.
Table 2.2 Operations on languages, with P = {A, B, . . . , Z, a, b, . . . , z} and Q = {0, 1, 2, . . . , 9}:
Union of P and Q: P È Q = {s | s is in P or s is in Q}. Here P È Q is the set of letters and digits, with 62 strings of length one.
Concatenation of P and Q: PQ = {st | s is in P and t is in Q}. Here PQ is the set of 520 strings of length two, consisting of one letter followed by one digit.
Kleene closure of P: P* = P^0 È P^1 È P^2 È . . . Here P* is the set of all strings of letters, including Î, the empty string; P(P È Q)* is the set of all strings of letters and digits beginning with a letter.
Positive closure of Q: Q+ = Q^1 È Q^2 È Q^3 È . . . Here Q+ is the set of all strings of one or more digits.
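Since these languages are sets of strings, the operations are easy to demonstrate on small finite examples; the following Python sketch bounds the (infinite) closure at a fixed number of repetitions:

def union(p, q):
    return p | q

def concat(p, q):
    # All strings st with s in P and t in Q.
    return {s + t for s in p for t in q}

def closure(p, limit):
    # P* is infinite; build P^0 U P^1 U ... up to `limit` repetitions only.
    result, power = {""}, {""}     # P^0 = {empty string}
    for _ in range(limit):
        power = concat(power, p)
        result |= power
    return result

P, Q = {"a", "b"}, {"0", "1"}
print(union(P, Q))                 # {'a', 'b', '0', '1'}
print(concat(P, Q))                # {'a0', 'a1', 'b0', 'b1'}
print(sorted(closure(P, 2)))       # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']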
5. Define the following terms in context of a string: prefix, suffix, substring, and
subsequence.
Ans: Prefix: If zero or more symbols are removed from the end of any string s, a new string is
obtained known as a prefix of string s. For example, app, apple, and Î are prefixes of apple.
Suffix: If zero or more symbols are removed from the beginning of any string s, a new string is
obtained known as suffix of string s. For example, ple, apple, and Î are suffixes of apple.
Substring: If we delete any prefix and any suffix from a string s, we obtain a new string known as
substring of s. For example, pp, apple, and Î are substrings of apple.
Subsequence: If we delete zero or more not necessarily consecutive positions of a string s, a new
string is formed known as subsequence of s. For example, ale is a subsequence of apple.
6. What do you mean by a regular expression? Write a regular expression over alphabet
S = {x, y, z} that represents all strings of length three.
Ans: A regular expression is a compact notation that is used to represent the patterns
corresponding to a token. It is used to describe all the languages that can be built by applying union,
concatenation, and closure operations to the symbols of some alphabet. The regular expression
represents pattern to define the language which includes a set of strings. The strings are considered
to be in the said language if they match the pattern; otherwise, they are not in the said language.
For example, consider the identifiers in a programming language, where an identifier may consist
of a letter or more followed by any number of digits or an underscore (_). Thus, the language for C
identifiers can be described as:
letter_(letter_|digit)*
Here, the vertical bar indicates union and the star indicates zero or more instances. The parentheses
are used to group subexpressions.
There exist some primitive regular expressions which are of universal type, over some alphabet S,
which are defined as follows:
x (for each x Î S), the primitive regular expression x defines the language {x}, that is, the only
string is ‘x’ in this particular language which is of length one.
l (empty string), the primitive regular expression l defines the language {l}, that is, the only
string is the empty string in this particular language. The language denoted by l is of universal
type.
f (indicates no string at all), the primitive regular expression f denotes the language {}, that is, no
string at all in this particular language. The language denoted by f is also of universal type.
Thus, it must be noted that if |S| = the number of symbols present in it = n, then there are n + 2 primi-
tive regular expressions defined over it.
For the given alphabet S = {x, y, z}, every string of length three is obtained by choosing one symbol
three times, so the required regular expression is (x|y|z)(x|y|z)(x|y|z).
7. List the rules for constructing regular expressions. Write some properties to compose
additional regular expressions. What is a regular definition? Give a suitable example.
Ans: The rules for constructing regular expressions over some alphabet S are divided into two major
classifications which are as follows:
(i) Basic rules (ii) Induction rules
Basic rules: There are two rules that form the basis:
1. Î is a regular expression, and L(Î) is {Î}, that is, its language contains only an empty string.
2. If a is a symbol in S, then a is a regular expression, and L(a) = {a}, which implies the language
with one string, of length one, with a in its one position.
Induction rules: There are four induction rules that build larger regular expressions recursively from
smaller regular expressions. Suppose R and S are regular expressions with languages L(R) and
L(S), respectively.
1. (R)(S) is a regular expression representing the language L(R).L(S).
2. (R)|(S) is a regular expression representing the language L(R) È L(S).
3. (R)* is a regular expression representing the language (L(R))*.
4. (R) is a regular expression representing L(R). This rule states that additional pairs of parentheses
can be added around expressions without modifying the language.
Properties of Regular Expression: To compose additional regular expressions, the following prop-
erties are to be considered, a finite number of times:
1. If a1 is a regular expression, then (a1) is also a regular expression.
2. If a1 is a regular expression, then a1* is also a regular expression.
3. If a1 and a2 are two regular expressions, then a1a2 is also a regular expression.
4. If a1 and a2 are two regular expressions, then a1 + a2 is also a regular expression.
Regular Definition: If S = alphabet set, then a regular definition is a sequence of definitions of the
form:
D1 ® R1
D2 ® R2
. . .
Dn ® Rn
where
Di is a new symbol, not in S and not the same as any of the other D’s.
Ri is a regular expression over the alphabet S È {D1, D2, . . . , Di-1}.
For example, let us consider the C identifiers that are strings of letters, digits, and underscores. Here,
we give a regular definition for the language of C identifiers.
letter_ ® A| B | . . . | Z | a | b | . . . | z | _
digit ® 0 | 1 | . . . | 9
id ® letter_(letter_|digit)*
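This regular definition carries over almost directly to practical notation; for instance, in Python's re module the classes letter_ and digit expand into character classes. A minimal sketch, with the pattern anchored so that the whole string must match:

import re

# letter_(letter_|digit)* expanded into character classes.
identifier = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")

for s in ["count", "_tmp1", "2fast", "total_sum"]:
    print(s, bool(identifier.match(s)))
# count True, _tmp1 True, 2fast False, total_sum True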
8. What is a transition diagram? Draw a transition diagram to identify the keywords IF,
THEN, ELSE, DO, WHILE, BEGIN, END.
Ans: While constructing a lexical analyzer, we represent patterns in the form of flowcharts, called
transition diagrams. A transition diagram consists of a set of nodes and edges that connect one state
to another. A node (or a circle) in a transition diagram represents a state and each edge (or an arrow)
represents the transition from one state to another. Each edge is labeled with one or more symbols.
A state is basically a condition that could occur while scanning the input to find out a lexeme
that matches one of the several patterns. We can also think of a state as summarizing all we need
to know what characters have been seen between the lexemeBegin pointer and the forward
pointer. Suppose, currently we are at state q, and the next input symbol is a, then we look for an
edge e coming out of the current state q that is having the label a. If such an edge is found, then we
move ahead the forward pointer and enter the state of the transition diagram to which this edge is
connected.
Among all the states, one state, say q0, is termed as initial or start state. The transition diagram
always begins in the start state before any input symbols have been read. One or more states are said to
be final or accepting states and are represented by double circles. We may also attach actions to the final
states to indicate that a token and an attribute value are being returned to the parser. In some cases, it
is also necessary to move the forward pointer backward by certain number of positions, then we can
place that many number of *’s near the final state. For example, if we want to retract the pointer by one
position, then we can place a single *, for two positions, ** can be placed, and so on.
The transition diagram to identify the keywords BEGIN, END, IF, THEN, ELSE, DO, and WHILE is
shown in Figure 2.3.
[Figure 2.3: Transition diagram for keywords — for example, from the start state q0 the characters B, E, G, I, N lead through states q1 to q5, and a blank or newline then leads to the final state q6 (marked * to retract the extra character read); similar chains of states recognize END, IF, THEN, ELSE, DO and WHILE.]
9. Draw the transition diagram for identifiers, constants, and relational operators (relops).
Ans: Transition diagram for identifiers is shown in Figure 2.4.
[Figure 2.4: Transition diagram for identifiers — a letter moves q0 to q1, which loops on letter or digit; any other character leads to a retracting final state that returns the identifier token.]
The transition diagram for constants is shown in Figure 2.5.
[Figure 2.5: Transition diagram for constants — a digit moves q0 to q1, which loops on digit; a non-digit leads to the retracting final state q2, with the action return (2, INSTALL()).]
The transition diagram for relational operators (relops) is shown in Figure 2.6.
[Figure 2.6: Transition diagram for relops — from q0, the character < leads to q1; from q1, = gives return (relop, LE), > gives return (relop, NE), and any other character gives the retracting final state return (relop, LT). From q0, = gives return (relop, EQ), and > leads to q6; from q6, = gives return (relop, GE), and any other character gives the retracting final state return (relop, GT).]
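A transition diagram like Figure 2.6 maps naturally onto code in which each state becomes a branch point. In the Python sketch below the explicit states are collapsed into ordered prefix tests, and retraction is modelled by reporting how many characters were actually consumed:

def relop(s):
    # Returns ((token, attribute), characters consumed); (None, 0) if no relop.
    if s.startswith("<="): return ("relop", "LE"), 2
    if s.startswith("<>"): return ("relop", "NE"), 2
    if s.startswith("<"):  return ("relop", "LT"), 1   # retract: only '<' consumed
    if s.startswith(">="): return ("relop", "GE"), 2
    if s.startswith(">"):  return ("relop", "GT"), 1   # retract
    if s.startswith("="):  return ("relop", "EQ"), 1
    return None, 0

print(relop("<= 5"))    # (('relop', 'LE'), 2)
print(relop("<5"))      # (('relop', 'LT'), 1)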
[Figure 2.7: Transition diagram for unsigned numbers — digits loop at q1; an optional fraction is read through q2 and q3, an optional exponent E through q4 to q7, and the retracting final states q8 and q9 return the number token.]
In the transition diagram for unsigned numbers, we begin with the start state q0, if we see a digit,
we move to state q1. In that state, we can read any number of additional digits.
In case we see anything except a digit, dot, or E from state q1, it implies that we have seen an inte-
ger number, for example 789. In such case, we enter the state q8, where we return token number
and a pointer to a table of constants where lexeme is entered.
If we see a dot from state q1, then we have an ‘optional fraction,’ and we enter the state q2. Now if
we look for one or more additional digits, we move to the state q3 for this purpose.
In case we see an E in state q3, then we have an ‘optional exponent,’ which is recognized by the
states q4 through q7, and return the lexeme at final state q7.
In state q3, if we have come to an end of the fraction, and we have not seen any exponent E, we
move to the state q9, and return the lexeme found.
d is the transition function, which takes two arguments, a state and an input symbol, and returns a
single state (represented by d: Q × S ® Q). Let q be a state and a be the input symbol passed
to the transition function; then d(q, a) = q’, where q’ is the resulting state, which may
be the same as q.
Graphically, the transition function can be represented as follows:
d (q, a) ® q’
DFA is a special case of an NFA where
There are no moves on input Î and
For each state q and input symbol a, there is exactly one edge out of q labeled a.
12. What do you mean by an NFA with Î-transitions?
Ans: NFA with Î-transition is defined as a modified finite automata that permits transition with-
out input symbols, along with zero, one or more transitions on input symbols. Let us take an example,
where we have to design an NFA with Î-transition for the following accepting language:
L = {ab È aab*}
To solve this problem, first we divide the language as follows:
L = L1 È L2, where L1 = ab and L2 = aab*
Now, we construct NFA for L1.
[NFA for L1 = ab: the start state q1 moves to q2 on a, and q2 moves to the final state q3 on b.]
[NFA for L2 = aab*: the start state q4 moves to q5 on a, q5 moves to the final state q6 on a, and q6 loops on b.]
Finally, we combine the transition diagrams of L1 and L2 to construct the NFA with Î-transitions for the
given input language, as shown in Figure 2.8. In this NFA, we use Î-transitions from a new start state q0
to reach the states q1 and q4.
[Figure 2.8: The combined NFA — q0 has Î-transitions to q1 and q4, and the two branches then recognize ab and aab* respectively.]
13. What is the Î-closure of a state? Explain with an example.
Ans: The Î-closure of a state q is the set of states (including q itself) reachable from q using
Î-transitions only. Consider an NFA in which q0 has an Î-transition to q1 and q1 has an Î-transition
to q2, with an a-loop at q0, a b-loop at q1 and an a-loop at q2.
In this NFA,
Î-closure(q0) = {q0, q1, q2}
Î-closure(q1) = {q1, q2}
Î-closure(q2) = {q2}
14. Write an algorithm to convert a given NFA into an equivalent DFA.
Or
Give the algorithm for subset construction and the computation of Î-closure.
Ans: The basic idea behind constructing a DFA from an NFA is to merge two or more states of the
NFA into one DFA state. To convert a given NFA into an equivalent DFA, we note that a set of states in
the NFA corresponds to a single state in the DFA. All the NFA states in such a set are reachable from at
least one state of the same set using Î-transitions only, without consuming any further input. Moreover,
on some input symbol, this set of states leads to another set of states. In the DFA, we take these sets as
unique states. We define two sets that are as follows:
Î-closure(q): In an NFA, Î-closure of a state q defined to be the set of states (including q)
that are reachable from q using Î-transitions only.
Î-closure(Q): Î-closure of a set of states Q of an NFA is defined to be the set of states reach-
able from any state in Q using Î-transitions only.
The algorithm for computing Î-closure of a set of states Q is given in Figure 2.9.
Î-closure(Q) = Q
Set all the states of Î-closure(Q) unmarked
For each unmarked state q in Î-closure(Q) do
Begin
Mark q
For each state q’ having an edge from q to q’ labeled Î do
Begin
If q’ is not in Î-closure(Q) then
Begin
add q’ to Î-closure(Q)
Set q’ unmarked
End
End
End
Now, to convert an NFA to the corresponding DFA, we consider the algorithm shown in Figure 2.10.
Input: An NFA with set of states Q, start state q0, set of final states F
Output: Corresponding DFA with start state d0, set of states QD, set of final states FD
Begin
d0 = Î-closure(q0)
QD = {d0}
If d0 contains a state from F then FD = {d0} else FD = f
Set d0 unmarked
While there are unmarked states in QD do
Begin
Let d be such a state
For each input symbol x do
Begin
Let S be the set of states in Q having transitions on x from
any state of the NFA corresponding to the DFA state d
d’ = Î-closure(S)
If d’ is already present in QD then
add the transition d ® d’ labeled x
else
Begin
QD = QD È {d’}
add the transition d ® d’ labeled x
Set d’ unmarked
If d’ contains a state of F then FD = FD È {d’}
End
End
End
End
Figure 2.10 Algorithm to Convert NFA to DFA
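The algorithms of Figures 2.9 and 2.10 can be sketched together in Python. Here an NFA is assumed to be a dictionary mapping (state, symbol) pairs to sets of successor states, with None standing for the Î label; note that the empty set appears as an explicit dead state of the DFA:

def e_closure(states, nfa):
    # Figure 2.9: all states reachable from `states` via epsilon-moves alone.
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for q2 in nfa.get((q, None), set()) - closure:
            closure.add(q2)
            stack.append(q2)
    return frozenset(closure)

def subset_construction(nfa, start, finals, symbols):
    # Figure 2.10: each DFA state is a set of NFA states.
    d0 = e_closure({start}, nfa)
    dstates, dtran, unmarked = {d0}, {}, [d0]
    while unmarked:
        d = unmarked.pop()
        for x in symbols:
            s = set().union(*(nfa.get((q, x), set()) for q in d))
            d2 = e_closure(s, nfa)
            dtran[(d, x)] = d2
            if d2 not in dstates:
                dstates.add(d2)
                unmarked.append(d2)
    dfinals = {d for d in dstates if d & finals}
    return dstates, dtran, dfinals

# The NFA of the worked example later in this chapter:
nfa = {("q0", "0"): {"q0", "q1"}, ("q0", "1"): {"q1"}, ("q1", "1"): {"q0", "q1"}}
print(subset_construction(nfa, "q0", {"q1"}, "01")[0])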
15. Give Thompson’s construction algorithm. Explain the process of constructing an NFA
from a regular expression.
Ans: To construct an NFA from a regular expression, we present a technique that can be used as a
recognizer for the tokens corresponding to a regular expression. In this technique, a regular expression
is first broken into simpler subexpressions, then the corresponding NFA are constructed and finally,
these small NFAs are combined with the help of regular expression operations. This construction is
known as Thompson’s construction.
Thompson’s construction algorithm: The brief description of Thompson’s construction algorithm
is as follows:
Step 1: Find the alphabet set S from the given regular expression. For example, for the regular
expression a (a | b) * ab, S = {a,b}. Now, determine all primitive regular expressions.
Step 2: Construct equivalent NFAs for all primitive regular expressions. For example, an equivalent
NFA for the primitive regular expression ‘a’ is shown below:
[Two states joined by an edge labeled a: the start state moves to the final state on input a.]
Step 3: Apply the rules for union, concatenation, grouping, and (Kleene)* to get the equivalent NFA
of the given regular expression.
While constructing an NFA from a regular expression using Thompson’s construction, these rules are
followed:
For Î or any alphabet symbol x in the alphabet set S, the NFA consists of two states—a start state
and a final state. The transition is labeled by Î or x as shown below:
[Two states: the start state moves to the final state on an edge labeled Î or x.]
If we are given NFAs of two regular expressions r1 and r2 as N(r1) and N(r2), then we can
construct a composite NFA for the regular expression (r1|r2) as follows:
Add a new initial state q0 and a new final state qf.
Introduce Î-transitions from q0 to the start states of N(r1) and N(r2). Similarly, introduce
Î-transitions from the final states of N(r1) and N(r2) to the new final state qf (see Figure 2.11).
Note that the final states of N(r1) and N(r2) are no longer final states in the composite NFA N(r1|r2).
[Figure 2.11: NFA for r1|r2 — q0 branches by Î-edges into N(r1) and N(r2), whose old final states lead by Î-edges to qf.]
The NFA N(r1r2) for the regular expression r1r2 can be constructed by merging the final state
of N(r1) with the start state of N(r2). The start state of N(r1) becomes the start state of the new
NFA and the final state of N(r2) becomes the final state of the new NFA, as shown in Figure 2.12.
[Figure 2.12: NFA for r1r2 — N(r1) and N(r2) joined in series.]
Given the NFA N(r) of a regular expression r, we construct the NFA N(r*) for the regular
expression r* as follows:
Add a new start state q0 and a new final state qf.
Introduce Î-transitions from q0 to the start state of N(r), from the final state of N(r) to qf,
from the final state of N(r) back to the start state of N(r) (corresponding to repeated occurrences
of r), and from q0 to qf (corresponding to zero occurrences of r), as shown in Figure 2.13.
If N(r) is the NFA for a regular expression r, it is also the NFA for the parenthesized
expression (r).
[Figure 2.13: NFA for r* — q0 and qf wrap N(r), with the four Î-edges described above.]
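These rules compose mechanically, as the following Python sketch shows. An NFA is a (start, final, transitions) triple and None again labels Î-edges; for simplicity the concatenation rule below glues N(r1) to N(r2) with an Î-edge rather than merging states, which accepts the same language. Parsing a regular expression into these calls is omitted:

count = 0
def new_state():
    global count
    count += 1
    return count

def symbol(x):                     # NFA for a single symbol (x=None for epsilon)
    s, f = new_state(), new_state()
    return (s, f, {(s, x): {f}})

def merge(*tables):                # combine several transition tables
    t = {}
    for table in tables:
        for k, v in table.items():
            t.setdefault(k, set()).update(v)
    return t

def union(n1, n2):                 # rule for r1|r2 (Figure 2.11)
    s1, f1, t1 = n1; s2, f2, t2 = n2
    s, f = new_state(), new_state()
    return (s, f, merge(t1, t2, {(s, None): {s1, s2},
                                 (f1, None): {f}, (f2, None): {f}}))

def concat(n1, n2):                # rule for r1r2 (Figure 2.12), via an epsilon-edge
    s1, f1, t1 = n1; s2, f2, t2 = n2
    return (s1, f2, merge(t1, t2, {(f1, None): {s2}}))

def star(n):                       # rule for r* (Figure 2.13)
    s1, f1, t1 = n
    s, f = new_state(), new_state()
    return (s, f, merge(t1, {(s, None): {s1, f}, (f1, None): {s1, f}}))

# NFA for a(a|b)*b, built bottom-up:
nfa = concat(symbol("a"), concat(star(union(symbol("a"), symbol("b"))), symbol("b")))
print(nfa[0], nfa[1])              # start and final state numbers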
16. Explain the functions nullable, firstpos, lastpos and followpos defined on the syntax
tree of a regular expression.
Ans: These functions are defined on the nodes of the syntax tree of an augmented regular expression
(r)#. A node n is nullable if the subexpression rooted at n can generate the empty string Î.
firstpos(n): It is the set of positions in the subtree rooted at n corresponding to the first
symbol of at least one string in the language of the subexpression rooted at n. The rules to compute
firstpos(n) for any node n are as follows:
For a leaf labeled Î, firstpos(n) will be f.
For a leaf with position i, firstpos(n) will be i itself.
For an or-node n = c1|c2, we take the union of the firstpos of left child and right child.
For a cat-node n = c1c2, if the left child c1 is nullable, then we take the union of firstpos
of the left child c1 as well as the right child c2, otherwise only firstpos of the left child c1 is
possible.
For star-node n = c1*, we take the value of firstpos of the left child c1.
lastpos(n): It is the set of positions in the subtree rooted at n corresponding to the last symbol
of at least one string in the language of the subexpression rooted at n. The rules to compute lastpos
are the same as that of firstpos, except the rule for the cat-node, where the roles of its children are
interchanged. That is, for a cat-node n = c1c2, we consider whether the right child c2 is nullable.
If yes, then we take the union of lastpos(c1)and lastpos(c2), otherwise only lastpos(c2)
is possible.
followpos(p): It is set of positions q, for a position p, in the syntax tree such that there exist
some string s = x1x2 . . . xn in L((r)#) such that for some i, there is a way to explain the member-
ship of s in L((r)#) by matching xi to position p of the syntax tree and xi+1 to position q. To com-
pute followpos, there are only two ways given as follows:
If n = c1c2, then for every position i in lastpos(c1), followpos(i) will be all positions
in firstpos(c2).
If n is a star-node, and i is a position in lastpos(n), then followspos(i) will be all posi-
tions in firstpos(n).
To understand how to compute these functions, consider the syntax tree for the expression
(x|y) * xyy# shown in Figure 2.14. The numeric value associated with each leaf node indicates the
position of the leaf and also the position of its symbol.
In this syntax tree, only the star-node is nullable because every star-node is nullable. All the leaf nodes
correspond to non-Î operands; thus, none of them is nullable. The or-node is also not nullable because
neither of its child nodes is nullable. Finally, the cat-nodes also have non-nullable child nodes, and hence
none of them is nullable. The firstpos and lastpos of all the nodes are shown in Figure 2.15.
[Figure 2.14: Syntax tree for (x|y)*xyy# — the star node over x|y contains positions 1 (x) and 2 (y); it is followed in concatenation by x at position 3, y at position 4, y at position 5 and the endmarker # at position 6.]
[Figure 2.15: The same tree annotated with firstpos and lastpos at every node — the root has firstpos {1, 2, 3} and lastpos {6}, and the star node has firstpos and lastpos {1, 2}.]
The followpos values for each position n are as follows:
n = 1: followpos(n) = {1, 2, 3}
n = 2: followpos(n) = {1, 2, 3}
n = 3: followpos(n) = {4}
n = 4: followpos(n) = {5}
n = 5: followpos(n) = {6}
n = 6: followpos(n) = f
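All four functions can be computed in one bottom-up pass over the syntax tree. The Python sketch below hard-codes the tree of Figure 2.14 as nested tuples; running it reproduces the firstpos, lastpos and followpos values shown above:

from collections import defaultdict

followpos = defaultdict(set)

def visit(node):
    # Returns (nullable, firstpos, lastpos) for `node` and fills followpos.
    kind = node[0]
    if kind == "leaf":
        return False, {node[1]}, {node[1]}
    if kind == "star":
        _, f, l = visit(node[1])
        for i in l:                       # star rule for followpos
            followpos[i] |= f
        return True, f, l
    n1, f1, l1 = visit(node[1])
    n2, f2, l2 = visit(node[2])
    if kind == "or":
        return n1 or n2, f1 | f2, l1 | l2
    # cat-node
    for i in l1:                          # cat rule for followpos
        followpos[i] |= f2
    first = f1 | f2 if n1 else f1
    last = l1 | l2 if n2 else l2
    return n1 and n2, first, last

# Syntax tree for (x|y)*xyy#, with leaf positions 1 to 6 as in Figure 2.14.
tree = ("cat", ("cat", ("cat", ("cat",
        ("star", ("or", ("leaf", 1), ("leaf", 2))),
        ("leaf", 3)), ("leaf", 4)), ("leaf", 5)), ("leaf", 6))
print(visit(tree)[1:], dict(followpos))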
17. Describe the process of constructing a DFA directly from a regular expression.
Ans: The process for constructing a DFA directly from a regular expression consists of the following
steps:
From the augmented regular expression (r)#, construct a syntax tree T rooted at node n0.
For syntax tree T, compute nullable, firstpos, lastpos, and followpos.
Construct Dstates, the set of states of DFA D, and Dtran, the transition function for D, by using
the algorithm given in Figure 2.16.
The states of D are sets of position in T. Initially, all the states are unmarked, and a state becomes
marked just before its out-transitions. firstpos(n0) is set as the start state of D, and the states con-
taining the position for the endmarker symbol # are considered as the accepting states.
[Figure: The lex compiler transforms the lex source program lex.l into a C program lex.yy.c.]
The lex source program, lex.l, is passed through the lex compiler to produce the C program file
lex.yy.c. The file lex.l basically contains a set of regular expressions along with the routines for
each regular expression. The routines contain a set of instructions that need to be executed whenever
a token specified in the regular expression is recognized. The file lex.yy.c is then compiled using
a C compiler to produce the lexical analyzer a.out. This lexical analyzer can now take a stream of
input characters and produce a stream of tokens.
The lexical analyzer a.out is basically a function that is used as a subroutine of the parser. It returns
an integer code for one of the possible token names. The attribute value for the token is stored in a global
variable yylval. This variable is shared by both lexical analyzer and parser. This enables to return
both the name and the attribute value of a token.
The symbol table then contains entries such as the following (the numbers are symbol table locations):
. . .
231: constant, integer, value = 20
. . .
642: label, value = 100
. . .
782: identifier, integer, value = i
. . .
After finding the required tokens and storing them into the symbol table, code is rewritten as
follows:
If([identifier, 782] = [constant, 231]) Then GOTO [label, 642]
22. Design a Finite Automata that accepts set of strings such that every string ends with 00,
over alphabets {0,1}.
Ans: Here, we have to construct a finite automata that will accept all strings like {00, 01100,
110100, . . .}. The finite automata for the given problem is given below:
[Transition diagram: q0 loops on 1; on 0, q0 moves to q1; on 0, q1 moves to the final state q2; q2 loops on 0; on 1, both q1 and q2 return to q0.]
The transition table for this DFA is:
States 0 1
® q0 q1 q0
q1 q2 q0
*q2 q2 q0
The symbol ® in the table indicates that q0 is the start state, and * indicates that q2 is the final state.
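Running such a DFA is a single loop over the input; a Python sketch, with the table above encoded as a dictionary:

# Transition table of the DFA above; q2 is the only final state.
dtran = {("q0", "0"): "q1", ("q0", "1"): "q0",
         ("q1", "0"): "q2", ("q1", "1"): "q0",
         ("q2", "0"): "q2", ("q2", "1"): "q0"}

def accepts(w):
    state = "q0"
    for c in w:
        state = dtran[(state, c)]
    return state == "q2"           # accept iff w ends with 00

for w in ["00", "01100", "110100", "10"]:
    print(w, accepts(w))           # True, True, True, False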
23. Design a finite automata which will accept the language
L = {w Î (0,1)*/second symbol of w is ‘0’ and fourth input is ‘1’}.
Ans: Here, we have to construct finite automata that will accept all the strings of which second
symbol is 0 and fourth is 1. The finite automata for the given problem is shown below:
[Transition diagram: q0 moves to q1 on 0 or 1; q1 moves to q2 on 0 and to the dead state q5 on 1; q2 moves to q3 on 0 or 1; q3 moves to the final state q4 on 1 and to q5 on 0; q4 loops on 0 and 1, as does q5.]
The corresponding transition table is:
d 0 1
® q0 q1 q1
q1 q2 q5
q2 q3 q3
q3 q5 q4
*q4 q4 q4
q5 q5 q5
24. Construct a DFA for language over alphabet S = {a,b}that will accept all strings
beginning with ‘ab’.
Ans: Here, we have to construct a DFA that will accept all strings beginning with ab like {ab, abb,
abaab, ababb, abba, . . .}.
[Transition diagram: q0 moves to q1 on a and to the dead state q3 on b; q1 moves to the final state q2 on b and to q3 on a; q2 loops on a and b, as does q3.]
25. Convert the following NFA into an equivalent DFA.
Inputs
States 0 1
® q0 {q0, q1} {q1}
q1 f {q0, q1}
Ans: We will first draw the NFA according to the given transition table, as shown below:
[NFA: on input 0, q0 moves to both q0 and q1; on input 1, q0 moves to q1; on input 1, q1 moves to both q0 and q1; q1 has no move on 0.]
Now, we convert the NFA into DFA by following the given steps:
Step 1: Find all the transitions from initial state q0 for every input symbol, that is, S = {0,1}. If we
get a set having more than one state for a particular input, then we consider that set as new
single state. From the given transition table, it is clear that
d(q0,0) ® {q0,q1}, that is, q0 transits to both q0 and q1 for input 0. (1)
d(q0,1) ® {q1}, that is, for input 1, q0 transits to q1. (2)
d(q1,0) ® f, that is, for input 0, there is no transition from q1. (3)
d(q1,1) ® {q0,q1}, that is, q1 transits to both q0 and q1 for input 1. (4)
Step 2: In step 1, we have got a new state {q0,q1}. Now step 1 is repeated for this new state only,
that is,
d({q0,q1},0) ® d(q0,0)È d(q1,0) (A)
Since d(q0,0) ® {q0,q1} (from equation (1))
And d(q1,0) ® f (from equation (3))
we get d({q0,q1},0) ® {q0,q1} È f = {q0,q1}. Similarly, for input 1, d({q0,q1},1) ®
d(q0,1) È d(q1,1) = {q1} È {q0,q1} = {q0,q1}. No new state is generated, so the
construction stops. The transition table of the resulting DFA is:
Inputs
States 0 1
®{q0} {q0, q1} {q1}
{q1} f {q0, q1}
{q0, q1} {q0, q1} {q0, q1}
Since the starting state of given NFA is q0, it will also be the starting state for DFA. Moreover, q1 is
the final state of NFA; therefore, we have to consider all those set of states containing q1 as the member.
All such sets will become the final states of DFA. Thus, F for the resultant DFA is:
F = {{q1},{q0,q1}}
The equivalent DFA for the given NFA is as follows:
[DFA: the start state {q0} moves to {q0,q1} on 0 and to {q1} on 1; {q1} moves to {q0,q1} on 1 (and to f on 0); {q0,q1} loops on both 0 and 1; the final states {q1} and {q0,q1} are drawn with double circles.]
Renaming the DFA states {q0} as A, {q1} as B and {q0, q1} as C, the transition table becomes:
Inputs
States 0 1
® A C B
*B - C
*C C C
[Transition diagram: A moves to C on 0 and to B on 1; B moves to C on 1 and has no move on 0; C loops on 0 and 1; B and C are the final states.]
[A Thompson-construction NFA with states q0 through q13: Î-transitions fan out from the start state into parallel a and b branches, which rejoin through Î-edges, with a final a-transition into the accepting state q13.]
Multiple-Choice Questions
1. A ————— acts as an interface between the source program and the rest of the phases of
compiler.
(a) Semantic analyzer (b) Parser
(c) Lexical analyzer (d) Syntax analyzer
2. Which of these tasks are performed by the lexical analyzer?
(a) Stripping out comments and whitespace
(b) Correlating error messages with the source program
(c) Performing the expansion of macros
(d) All of these
3. A ————— is any finite set of strings over some specific alphabet.
(a) Sentence (b) Word
(c) Language (d) Character class
4. If zero or more symbols are removed from the end of any string s, a new string is obtained known
as a ————— of string s.
(a) Prefix (b) Suffix
(c) Substring (d) Subsequence
5. If we have more than one possible transition on the same input symbol from some state, then the
recognizer is said to be —————.
(a) Non-deterministic finite automata (b) Deterministic finite automata
(c) Finite automata (d) None of these
6. A tool for automatically generating a lexical analyzer for a language is defined as —————.
(a) Lex (b) YACC
(c) Handler (d) All of these
7. For A = 10 to 50 do, in the given code, A is defined as a/an —————.
(a) Constant (b) Identifier
(c) Keyword (d) Operator
8. The language for C identifiers can be described as: letter_(letter_|digit)*, here *
indicates —————.
(a) Union (b) Zero or more instances
(c) Group of subexpressions (d) Intersection
9. The operation P* = P^0 È P^1 È P^2 È . . . represents —————.
Answers
1. (c) 2. (d) 3. (c) 4. (a) 5. (a) 6. (a) 7. (b) 8. (b) 9. (a) 10. (b)
3. Specification of Programming Languages
1. Explain context-free grammar (CFG) and its four components with the help of an example.
Ans: The context-free grammar (CFG) was introduced by Noam Chomsky in 1956. A CFG is used to
specify the syntactic structure of programming language constructs like expressions and statements.
The notation commonly used to write CFGs is known as Backus-Naur Form (BNF). A CFG comprises
four components, namely, non-
terminals, terminals, productions, and start symbol.
The non-terminals (also known as syntactic variables) represent the set of strings in a language.
The terminals (also known as tokens) represent the symbols of the language.
The productions or the rewriting rules represent the way in which the terminals and non-terminals
can be joined to form a string. A production is represented in the form of A ® a. This production
includes a single non-terminal A, known as the left hand side or head of the production, an arrow,
and a string of terminals and/or non-terminals a, known as the right hand side or body of the pro-
duction. The components of the body represent the way in which the strings of the non-terminal at
the head can be constructed. Productions of the start symbol are always listed first.
A single non-terminal is chosen as the start symbol which represents the language that is gener-
ated from the grammar.
Formally, CFG can be represented as:
G = {V, T, P, S}
where
V is a finite set of non-terminals,
T is a finite set of terminals,
P is a finite set of productions,
S is the start symbol.
For example, consider an if-else conditional statement, which can be represented as:
statement ® if (expression) statement else statement
Here, the keywords if and else and the parentheses are terminals; expression and statement are
non-terminals; the rule itself is a production; and statement can be taken as the start symbol.
2. Consider the following grammar for arithmetic expressions and write the precise form
of CFG using the shorthand notations.
statement ® statement + term
statement ® term
term ® term * factor
term ® factor
factor ® (statement)
factor ® id
Ans: The various shorthand notations used in grammars are as follows:
The symbols used as non-terminals include uppercase letters that occur early in the alphabet
(A, B, C, . . .). Lowercase names like expression, term and factor are usually abbreviated as E, T
and F, respectively, and the letter S is mostly used as the start symbol.
The symbols used as terminals include lowercase letters that occur early in the alphabet
(a, b, c, . . .), arithmetic operators (/, *, +, -), punctuation symbols (parentheses, comma), and
digits (0, 1, . . . , 9).
Lowercase alphabets like u, v, . . . , z are considered as strings of terminals. The boldface strings
like id or if are also considered as terminals.
Uppercase letters that occur late in the alphabet, like X, Y, Z, are used to represent either terminals or non-terminals.
Lowercase Greek letters like a, b, g are considered as strings of terminals and non-terminals. A
generic production can hence be represented as A ® a, where A represents the left hand side of
the production and a represents a string of grammar symbols (the right hand side of the produc-
tion). A set of productions A ® a1, A ® a2, . . . , A ® an can be represented as A ® a1 | a2
| . . . | an. The symbol ‘|’ represents ‘or’.
Considering these notations, the grammar can be written as follows:
S ® S + T | T
T ® T * F | F
F ® (S) | id
3. What do you mean by derivation? What are its types? What are canonical derivations?
Ans: Derivation is defined as the replacement of non-terminal symbols in a particular string of ter-
minals and non-terminals. The basic idea behind derivation is to apply productions repeatedly to expand
the non-terminal symbols in that string. If S Þ* α for some string α of grammar symbols, where S is
the start symbol of a grammar G, then α is known as a sentential form of G. The symbol Þ+ is used to
denote derivation in one or more steps.
Based on the order of replacement of the non-terminals, derivation can be classified into two types,
namely, leftmost derivation and rightmost derivation. In leftmost derivation, the leftmost non-terminal
in each sentential form is replaced with the corresponding production’s right hand side. The leftmost
derivation for α Þ β is represented as α Þlm β.
In rightmost derivation, the rightmost non-terminal in each sentential form is replaced with the
corresponding production’s right hand side. The rightmost derivation for α Þ β is represented as
α Þrm β.
For example, consider the following grammar:
S ® XY
X ® xxX
Y ® Yy
X ® Î
Y ® Î
The leftmost derivation can be written as:
S Þlm XY Þlm xxXY Þlm xxY Þlm xxYy Þlm xxy
Ans: An ambiguous grammar is a grammar that generates more than one leftmost or rightmost
derivation for some sentences. For example, consider the following grammar to produce the string
id - id/id.
E ® E - E | E/E
E ® id
This grammar is ambiguous since it generates more than one leftmost derivation.
One leftmost derivation is as follows:
E Þ E - E
Þ id - E
Þ id - E/E
Þ id - id/E
Þ id - id/id
Another leftmost derivation is as follows:
E Þ E/E
Þ E - E/E
Þ id - E/E
Þ id - id/E
Þ id - id/id
The demerit of an ambiguous grammar is that it generates more than one parse tree for a sentence and,
hence, it is difficult to choose the parse tree to be evaluated.
Ambiguity in grammars can be removed by rewriting the grammar. While rewriting the grammar, two
concepts must be considered, namely, operator precedence and associativity.
Operator precedence: Operator precedence indicates the priority given to the arithmetic opera-
tors like /, *, +, -. The operators, * and /, have higher precedence than + and -. Hence, a string
id - id/id is interpreted as id - (id/id).
Associativity of operators: The associativity of operators involves choosing the order in which the
arithmetic operators having the same precedence occur in a string. The arithmetic operators follow
left to right associativity. Hence, a string id + id - id is interpreted as (id + id) - id.
Some other operators like exponentiation and assignment operator = follow right to left associativ-
ity. Hence, a string id↑id↑id is interpreted as id↑(id↑id).
7. Discuss dangling else ambiguity.
Ans: Dangling else ambiguity is a form of ambiguity that occurs in grammar while representing
conditional constructs of programming language. For example, consider the following grammar for the
conditional statements:
statement ® if condition then statement
statement ® if condition then statement else statement
statement ® other statement
Now, consider the following string:
if C1 then if C2 then S1 else S2
Since this string generates two parse trees as shown in Figure 3.1, the grammar is said to be ambiguous.
This ambiguity can be eliminated by matching each else with its just preceding unmatched then.
It generates a parse tree for the string that relates each else with its closest previous unmatched then.
The unambiguous grammar is written as follows:
statement ® matched_stmt | unmatched_stmt
matched_stmt ® if condition then matched_stmt else matched_stmt
matched_stmt ® other statement
unmatched_stmt ® if condition then statement
unmatched_stmt ® if condition then matched_stmt else unmatched_stmt
[Figure 3.1: The two parse trees for if C1 then if C2 then S1 else S2 under the ambiguous grammar — in one, else S2 attaches to the outer if; in the other, to the inner if.]
constructed. Suppose D is a DFA with n finite states, which accepts the strings of this language. For any string of L with more than n leading x's, DFA D must enter some state, say Si, more than once, since the DFA has only n states. Further, assume that DFA D reaches Si after consuming the first j x's (with j < m) and consumes all the remaining x's of the input string at this state. Since the DFA accepts strings of the form x^m y^m, there must be a path from Si to the final state F that accepts y^m. But then there is also a path from the start state S0 to F through Si for strings of the form x^j y^m, which are not strings in the language L. Hence, our assumption that DFA D accepts the strings of the language L is wrong.
Figure: Paths in DFA D from the start state S0 through Si to the final state F, labeled by the x's and the y's
The context-free grammars are also useful in representing nested structures, such as nested if-
then-else, matching begin-end’s and matching parentheses, and so on. These constructs cannot
be represented using regular expressions.
10. Why is the use of CFG not preferred over regular expressions for defining the lexical
syntax of a language?
Ans: Regular expressions are preferred over CFG to describe the lexical syntax of a language due to
the following reasons:
Regular expressions provide a simple notation for tokens as compared to grammars.
The lexical rules provided by regular expressions are quite simple, and hence, a powerful notation
like CFG is not required.
Regular expressions are used to construct more efficient lexical analyzers.
The syntactic structure of a language when divided into lexical and non-lexical parts provides an
easy way to modularize the front end of a compiler.
The lexical constructs like identifiers, constants, keywords, etc., can be easily described using
regular expressions.
11. What do you mean by a left recursive grammar? Write an algorithm to eliminate left
recursion.
Ans: For a grammar G, if there exists a derivation A Þ+ Aα for some string α, then the grammar is said to be left recursive. Left recursion causes a problem while designing parsers (parsers are discussed in the next chapter). When a top-down parser constructs the parse tree for a left recursive grammar, the process gets into an infinite loop: the leftmost non-terminal is expanded again and again without consuming any input.
Left recursion can be eliminated by rewriting the offending production. Consider the production E ® E + T │ T, where the non-terminal on the left hand side of the production is the same as the leftmost symbol on the right hand side. Now, if we try to expand E, the expansion will eventually result in expanding E again without consuming any input. The left recursion can be eliminated by replacing E ® E + T │ T with E ® TE’ and E’ ® + TE’ │ Î. This process eliminates the immediate left recursion; however, it cannot eliminate left recursion involving derivations of two or more steps. Hence, an algorithm is designed for such derivations as shown in Figure 3.3. This algorithm is suitable for grammars with no cycles or Î-productions.
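The following is a minimal Python sketch of this general algorithm, under the stated assumption that the grammar has no cycles or Î-productions; the dictionary encoding of the grammar is an illustrative choice of this sketch, not part of the original algorithm.

def eliminate_left_recursion(grammar):
    # grammar: dict mapping each non-terminal to a list of alternatives,
    # where an alternative is a list of symbols; [] denotes an Î-alternative
    nts = list(grammar)
    for i, Ai in enumerate(nts):
        # substitute the alternatives of earlier non-terminals, so that any
        # left recursion through two or more steps becomes immediate
        for Aj in nts[:i]:
            new_alts = []
            for alt in grammar[Ai]:
                if alt and alt[0] == Aj:
                    new_alts.extend(delta + alt[1:] for delta in grammar[Aj])
                else:
                    new_alts.append(alt)
            grammar[Ai] = new_alts
        # eliminate the immediate left recursion among the Ai-alternatives
        rec = [alt[1:] for alt in grammar[Ai] if alt and alt[0] == Ai]
        nonrec = [alt for alt in grammar[Ai] if not (alt and alt[0] == Ai)]
        if rec:
            new_nt = Ai + "'"
            grammar[Ai] = [alt + [new_nt] for alt in nonrec]
            grammar[new_nt] = [alpha + [new_nt] for alpha in rec] + [[]]
    return grammar

g = {'E': [['E', '+', 'T'], ['T']],
     'T': [['T', '*', 'F'], ['F']],
     'F': [['(', 'E', ')'], ['id']]}
print(eliminate_left_recursion(g))
# E -> T E', E' -> + T E' | Î, T -> F T', T' -> * F T' | Î, F unchanged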
A related transformation is left factoring, which removes a common prefix shared by two or more alternatives of a non-terminal:
begin
for a grammar G with a non-terminal X, find the longest prefix α
common to two or more of its alternatives.
If α ≠ Î, then replace all of the X-productions
X ® αβ1 │ αβ2 │ . . . │ αβn │ γ, with
X ® αX’ │ γ
X’ ® β1 │ β2 │ . . . │ βn
where
γ specifies the alternatives that do not begin with α and X’ is
a new non-terminal.
Repeat this process until no two alternatives for a non-
terminal have a common prefix.
end
Consider the following grammar:
S ® A │ ∧ │ (T)
T ® T,S │ S
In the above grammar, find the leftmost and rightmost derivations for
(a) (A,(A,A))
(b) (((A,A),∧,(A)),A).
Ans: (a) The leftmost derivation for the string (A,(A,A)) can be written as follows:
S Þlm (T) Þlm (T,S) Þlm (S,S) Þlm (A,S) Þlm (A,(T)) Þlm (A,(T,S)) Þlm (A,(S,S))
Þlm (A,(A,S)) Þlm (A,(A,A))
The rightmost derivation for the string (A,(A,A)) can be written as follows:
S Þrm (T) Þrm (T,S) Þrm (T,(T)) Þrm (T,(T,S)) Þrm (T,(T,A)) Þrm (T,(S,A))
Þrm (T,(A,A)) Þrm (S,(A,A)) Þrm (A,(A,A))
(b) The leftmost derivation for the string (((A,A),∧,(A)),A) can be written as follows:
S Þlm (T) Þlm (T,S) Þlm (S,S) Þlm ((T),S) Þlm ((T,S),S) Þlm ((T,S,S),S)
Þlm ((S,S,S),S) Þlm (((T),S,S),S) Þlm (((T,S),S,S),S) Þlm (((S,S),S,S),S)
Þlm (((A,S),S,S),S) Þlm (((A,A),S,S),S) Þlm (((A,A),∧,S),S) Þlm (((A,A),∧,(T)),S)
Þlm (((A,A),∧,(S)),S) Þlm (((A,A),∧,(A)),S) Þlm (((A,A),∧,(A)),A)
The rightmost derivation for the string (((A,A),∧,(A)),A) can be written as follows:
S Þrm (T) Þrm (T,S) Þrm (T,A) Þrm (S,A) Þrm ((T),A) Þrm ((T,S),A)
Þrm ((T,(T)),A) Þrm ((T,(S)),A) Þrm ((T,(A)),A) Þrm ((T,S,(A)),A)
Þrm ((T,∧,(A)),A) Þrm ((S,∧,(A)),A) Þrm (((T),∧,(A)),A) Þrm (((T,S),∧,(A)),A)
Þrm (((T,A),∧,(A)),A) Þrm (((S,A),∧,(A)),A) Þrm (((A,A),∧,(A)),A)
Apply left factoring to the following grammar:
A ® aBcC │ aBb │ aB │ a
B ® Î
C ® Î
Ans: Applying left factoring (the common prefix of all the A-alternatives is a), the grammar can be written as:
A ® aA’
A’ ® BcC │ Bb │ B │ Î
B ® Î
C ® Î
Multiple-Choice Questions
1. Which of the following grammar is also known as Backus-Naur form?
(a) Regular (b) Context-free
(c) Context-sensitive (d) None of these
2. In the G = (V, T, P, S) representation of context-free grammar, ‘V’ stands for —————.
(a) A finite set of terminals (b) A finite set of non-terminals
(c) A finite set of productions (d) Is the start symbol
3. Which of these statements are correct for the productions in context-free grammar?
(a) Productions represent the way in which the terminals and non-terminals can be joined to form
a string.
(b) The left hand side of the production contains a single non-terminal.
(c) The right hand side of the production contains a string of terminals and/or non-terminals.
(d) All of these
4. ————— is defined as the replacement of non-terminal symbols in a particular string of termi-
nals and non-terminals.
(a) Production (b) Derivation
(c) Sentential form (d) Left factoring
5. In a derivation, ————— are the intermediate strings that consist of terminals and non-terminals.
(a) Sententials (b) Context-free language
(c) Context-sensitive language (d) None of these
6. A grammar generating more than one leftmost or rightmost derivation for some sentence is known as —————.
(a) Regular (b) Context-free
(c) Context-sensitive (d) Ambiguous
7. A grammar contains —————.
(a) A non-terminal V that can be present in any sentential form
(b) A non-terminal V that cannot derive any string of terminals
(c) e as the only symbol in the left hand side of production
(d) None of these
8. Which of these are also known as canonical derivations?
(a) Leftmost derivations (b) Rightmost derivations
(c) Sentential form (d) None of these
9. Which of these statements is correct?
(a) Sentence of a grammar is a sentential form without any terminals.
(b) Sentence of a grammar should be derivable from the start state.
(c) Sentence of a grammar is a sentential form with no non-terminals.
(d) All of these
10. Consider a grammar: A ® αS1 │ αS2. The left factored productions for this grammar are:
(a) A’ ® αA              (b) A ® αA’
    A ® S1 │ S2              A’ ® αS1 │ αS2
(c) A ® αA’              (d) None of these
    A’ ® S1 │ S2
Answers
1. (b) 2. (b) 3. (d) 4. (b) 5. (a) 6. (d) 7. (a) 8. (b) 9. (c) 10. (c)
4
Basic Parsing Techniques
Figure 4.1 Interaction between the Lexical Analyzer and the Syntax Analyzer (the lexical analyzer reads the source program and supplies tokens to the syntax analyzer, which reports success or failure)
Depending upon how the parse tree is built, parsing techniques are classified into three general
categories, namely, universal parsing, top-down parsing, and bottom-up parsing. The most com-
monly used parsing techniques are top-down parsing and bottom-up parsing. Universal parsing is
not used as it is not an efficient technique. The hierarchical classification of parsing techniques is
shown in Figure 4.2.
Role of a parser: A parser receives a string of tokens from lexical analyzer and constructs a parse
tree if the string of tokens can be generated by the grammar of the source language; otherwise, it reports
the syntax errors present in the source string. The generated parse tree is passed to the next phase of the
compiler, as shown in Figure 4.3.
The role of parser is summarized as follows:
Performs context-free syntax analysis.
Guides context-sensitive analysis.
Generates an intermediate code.
Reports syntax errors in an intelligible manner.
Attempts error correction.
Figure 4.2 Hierarchical Classification of Parsing Techniques (backtracking parsing, non-backtracking (predictive) parsing, operator precedence parsing, table-driven parsing, and LR parsing)
Figure 4.3 Position of the Parser in a Compiler (the lexical analyzer reads the source program and supplies tokens to the parser on a get-next-token request; the parser passes the parse tree to the intermediate code generator, which produces the intermediate code; both phases consult the symbol table)
3. What is top-down parsing? Explain with the help of an example. Name the different parsing techniques used for top-down parsing.
Ans: Top-down parsing is a strategy to find the leftmost derivation of an input string. In top-down
parsing, the parse tree is constructed starting from the root and proceeding toward the leaves (similar to
a derivation), generating the nodes of the tree in preorder.
For example, consider the following grammar:
E ® cDe
D ® ab|a
For the input string cae, the leftmost derivation is specified as:
E Þ cDe Þ cae
The derivation tree for the input string cae is shown in Figure 4.5.
Figure 4.5 Steps in the Top-down Parse of the Input String cae
Non-backtracking parsing (Predictive parsing): Predictive parsing does not require backtracking in order to derive the input string. Predictive parsing is possible only for the class of LL(k) grammars (context-free grammars). The grammar should be free from left recursion and should be left factored. Each non-terminal is combined with the next input symbol to guide the parser in selecting the correct production rule that will lead the parser to match the complete input string.
There are two techniques of implementing top-down predictive parsers, namely, recursive-descent and table-driven predictive parsing.
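To make the backtracking idea concrete, the following is a minimal Python sketch of a backtracking top-down parse for the grammar E ® cDe, D ® ab │ a used in the example above; encoding each non-terminal as a generator that yields every position where it can finish matching is an implementation choice of this sketch, not the book's notation.

def D(s, i):
    # D -> ab | a : yield every position where D can finish matching
    if s[i:i + 2] == 'ab':
        yield i + 2
    if s[i:i + 1] == 'a':
        yield i + 1

def E(s, i):
    # E -> cDe : match c, try each way D can match, then match e
    if s[i:i + 1] == 'c':
        for j in D(s, i + 1):            # backtracking point
            if s[j:j + 1] == 'e':
                yield j + 1

def parses(s):
    return any(j == len(s) for j in E(s, 0))

print(parses('cae'))    # True: D -> ab fails, the parser backtracks to D -> a
print(parses('cabe'))   # True
print(parses('cbe'))    # False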
4. Define recursive predictive parsing or predictive parsing.
Ans: It is a top-down parsing method, which consists of a set of mutually recursive procedures to
process the input and handles a stack of activation records explicitly. The algorithm for predictive pars-
ing is given in Figure 4.6.
Repeat
    Set A to the top of the stack and a to the next input symbol
    If (A is a terminal or $) then
        If (A = a) then
            pop A from the stack and remove a from the input
        else
            /* error occurred */
    else /* A is a non-terminal */
        If (M[A,a] = A ® B1B2B3 . . . Bk) then
            /* M is the parsing table for grammar G */
            Begin
                Pop A from the stack
                Push Bk, Bk-1, . . . , B1 onto the stack, with B1 on top
            End
        else
            /* error occurred */
Until (A = $) /* stack becomes empty */
In predictive parsing, a parsing table is constructed. To construct the table, we need two functions,
namely, FIRST() and FOLLOW(), that are associated with the grammar G. These two functions are
used to fill the proper entries in the table for G, if such a parsing table for G exists. The algorithm to
construct the predictive parsing table is given in Figure 4.7.
If β Þ* Î, then α does not derive any string that starts with a terminal in FOLLOW(X). Similarly, if α Þ* Î, then β does not derive any string that starts with a terminal in FOLLOW(X).
For example, consider the following grammar to construct a LL (1) parse table:
S ® aABb
A ® c|Î
B ® d|Î
First, find out the FIRST and FOLLOW sets of all nonterminals.
FIRST(S) = {a}
FIRST(A) = {c,Î}
FIRST(B) = {d,Î}
FOLLOW(S) = {$}, since S is the start symbol.
FOLLOW(A) = FIRST(Bb)
= FIRST(B) – {Î} È FIRST (b)
= {d,Î} – {Î} È {b}
= {d, b}
FOLLOW(B)
= {b}
Now,
1. Considering the production S ® aABb
FIRST (S) = {a}
Since it does not contain any Î.
So, parse table [S, a] = S ® aABb (1)
2. Considering the production A ® c
FIRST(A) = FIRST(c) = {c}.
Since it does not contain any Î.
So, parse table [A, c] = A ® c (2)
3. Considering the production A ® Î
FIRST(A) = FIRST(Î) = {Î}.
Since it contains Î. Thus, we have to find out FOLLOW(A).
FOLLOW(A) = {d, b}
So, parse table [A, d] = A ® Î
Also, parse table [A, b] = A ® Î (3)
4. Considering the production B ® d
FIRST(B) = FIRST(d) = {d}.
Since it does not contain any Î.
So, parse table [B, d] = B ® d (4)
5. Considering the production B ® Î
FIRST(B) = FIRST(Î) = {Î}.
Since it contains Î, we have to find out FOLLOW(B).
FOLLOW(B) = {b}
So, parse table [B, b] = B ® Î (5)
Thus, the resultant parse table, from (1), (2), (3), (4), and (5), is shown in Table 4.1.
Table 4.1 Predictive Parsing Table
            a            b          c         d         $
S      S ® aABb
A                   A ® Î     A ® c     A ® Î
B                   B ® Î               B ® d
7. Write down the algorithm for recursive-descent parsing. Explain with an example.
Ans: A recursive-descent parser is a collection of procedures, one for each non-terminal. Starting from the start symbol, the parser continues its scanning and announces success if it scans the entire input string. The algorithm for recursive-descent parsing is shown in Figure 4.8.
void X()
Begin
    Select an X-production, X ® A1 A2 . . . An
    For (i = 1 to n) do
    Begin
        If (Ai is a non-terminal)
            call procedure Ai()
        else if (Ai = a) /* a is the current input symbol */
            advance the input to the next symbol
        else
            /* presence of some error */
    End
End
For example, consider the grammar X ® aMb, M ® cd │ c and the input string acb. We first expand X, match the first input symbol a, and then expand M using its first alternative M ® cd, as shown in the first two parse trees of Figure 4.9.
Figure 4.9 Parse Trees Constructed During Recursive-descent Parsing with Backtracking
Now, we get a match for the second symbol c, but as we proceed toward the third symbol a failure
occurs because symbol b does not match with d. Now we go back to M and select second production of
M ® c, and prepare its parse tree shown in Figure 4.9(c). The leaf c matches with the second symbol
of input string and leaf b with the third symbol of input string. Hence, we declare a successful parsing
for the input string.
Figure 4.10 Model of a Non-recursive Predictive Parser (an input buffer holding a * b $, a stack with $ at the bottom, the parsing table, and the parsing program)
        /* error occurred */
    else if (M[A, a] is an error entry) then
        /* error occurred */
    else if (M[A, a] = A ® B1 B2 B3 . . . Bk) then
    Begin
        output the production A ® B1 B2 B3 . . . Bk
        pop the stack
        push Bk, Bk-1, . . . , B1 onto the stack, with B1 on the top
    End
    set A to the top of stack
End
Figure 4.11 Non-recursive Predictive Parsing Algorithm
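As an illustration of the algorithm of Figure 4.11, the following is a minimal Python sketch of a table-driven predictive parser, using the grammar S ® aABb, A ® c │ Î, B ® d │ Î and the parse table computed above; the dictionary encoding of the table is this sketch's own convention.

TABLE = {
    ('S', 'a'): ['a', 'A', 'B', 'b'],
    ('A', 'c'): ['c'],
    ('A', 'd'): [],                 # A -> Î
    ('A', 'b'): [],                 # A -> Î
    ('B', 'd'): ['d'],
    ('B', 'b'): [],                 # B -> Î
}
NONTERMINALS = {'S', 'A', 'B'}

def ll1_parse(tokens):
    stack = ['$', 'S']              # start symbol on top of $
    tokens = list(tokens) + ['$']
    pos = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top not in NONTERMINALS:             # terminal or $
            if top == lookahead:
                pos += 1                        # match and advance
            else:
                return False                    # error: mismatch
        else:
            body = TABLE.get((top, lookahead))
            if body is None:
                return False                    # error: empty table entry
            stack.extend(reversed(body))        # push body, leftmost on top
    return pos == len(tokens)

print(ll1_parse('acdb'))  # True
print(ll1_parse('ab'))    # True: both A and B derive Î
print(ll1_parse('ac'))    # False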
9. What are the advantages and disadvantages of table-driven predictive parsing?
Ans: Advantages:
A table-driven parser can be easily generated from a given grammar. The parsing program is independent of the grammar, but the parsing table depends on it. The parsing table can be generated using the FIRST and FOLLOW computation algorithms.
Some entries in the parsing table point to error recovery and reporting routines, which makes error recovery and reporting an easier task.
Disadvantage:
Such parsers can work only on LL(1) grammars. Sometimes elimination of left factoring and left recursion may not be sufficient to transform a grammar into an LL(1) grammar.
10. Explain the error recovery strategies in predictive parsing.
Ans: A top-down predictive parser can be implemented by recursive-descent parsing or by table-driven parsing. The table-driven parser predicts which terminals and non-terminals are expected from the rest of the input. An error can occur in the following situations:
If the terminal on the top of a stack does not match the next input symbol.
If a non-terminal A is on the top of stack, x is the next input symbol and the parsing table entry M
[A, x] is empty.
Two commonly used error recovery schemes are panic mode recovery and phrase-level recovery.
Panic mode recovery is based on the idea that when an error occurs, the parser skips the input symbols until it finds a synchronizing token (a semicolon, }, or any other token with an unambiguous and clear role) in the input. The set of all synchronizing tokens is known as the synchronizing set. The effectiveness of this scheme depends on the choice of the synchronizing set. Some guidelines for constructing a synchronizing set are as follows:
For a non-terminal A, place all the elements of FOLLOW(A) into the synchronizing set of A. Skip
the tokens until an element of FOLLOW(A) is found and then pop A from the stack.
For a non-terminal A, all the elements of FIRST(A) can also be added to the synchronizing set
of A. This will help the parser in resuming the parsing according to A, if a symbol in FIRST(A)
appears in the input.
The production that derives Î can be employed as a default production, if a non-terminal can pro-
duce the empty string. It may delay the error detection for some time but cannot cause an error to
be missed during error recovery. This approach is useful in reducing the number of non-terminals
to be parsed.
A terminal on the top of stack, which cannot be matched, is popped from the top of stack and a
warning message indicating that the terminal was inserted is issued. After issuing the warning mes-
sage, the parser can continue parsing as if the missing symbol is a part of the input. In effect, this
approach takes all other tokens into the synchronizing set for a token.
Phrase level recovery is based on the idea of filling the blank entries in the predictive parsing table
with pointers to error handling routines. The error handling routines can do the following:
They can insert, modify, or delete any symbols in the input.
They can also issue appropriate error messages.
They can pop elements from the stack.
Pushing a new symbol onto the stack or altering an existing symbol in the stack is problematic for the following reasons:
The steps performed by the parser may result in the derivation of a word that does not correspond
to the derivation of any word in the language at all.
There is a possibility of infinite loop formation during the alteration of stack symbol.
11. Explain bottom-up parsing with an example. Also discuss reduction in bottom-up parsing.
Ans: Bottom-up parsing is a parsing method to construct a parse tree for an input string beginning
at the leaves and growing towards the root. Bottom-up parsing can be considered as a process of reduc-
ing the input string to the start symbol of the grammar. At each reduction step, a particular substring
matching the right hand side of a production is replaced by the non-terminal symbol on the left hand side
of the production. Bottom-up parser can handle a large class of grammar.
For example, the steps to construct a parse tree of the token stream a + b with respect to a grammar
G, S ® S * S│S + S│S - S│a│b are shown in Figure 4.12.
Figure 4.12 Steps in the Bottom-up Parse of a + b (a + b is reduced to S + b, then to S + S, and finally to S)
Formally, if
S Þ*rm αAγ Þrm αβγ
then the production A ® β in the position following α is a handle of the right sentential form αβγ.
The parser performs a left-to-right scan of the input string, shifting zero or more symbols onto the stack until a handle appears on the top of the stack that matches the right hand side of a grammar rule. Then, the parser reduces the handle on the top of the stack to the non-terminal occurring on the left hand side of the grammar rule. The parser repeats the process until it reports an error or a success message. The parsing is said to be successful if the stack contains only the start symbol and the input is empty, as shown below:
Figure: Model of an Operator Precedence Parser (a stack with $ at the bottom, an input buffer holding a * b $, the operator precedence parsing program producing output, and the operator precedence relation table)
There are three disjoint precedence relations that can exist between pairs of terminals:
a <· b    a yields precedence to b (b has higher precedence than a).
a ≐ b     b has the same precedence as a.
a ·> b    a takes precedence over b (b has lower precedence than a).
Table 4.4 Operator Precedence Relations
        +     -     *     /     ↑     id    (     )     $
+       ·>    ·>    <·    <·    <·    <·    <·    ·>    ·>
-       ·>    ·>    <·    <·    <·    <·    <·    ·>    ·>
*       ·>    ·>    ·>    ·>    <·    <·    <·    ·>    ·>
/       ·>    ·>    ·>    ·>    <·    <·    <·    ·>    ·>
↑       ·>    ·>    ·>    ·>    <·    <·    <·    ·>    ·>
id      ·>    ·>    ·>    ·>    ·>                ·>    ·>
(       <·    <·    <·    <·    <·    <·    <·    ≐
)       ·>    ·>    ·>    ·>    ·>                ·>    ·>
$       <·    <·    <·    <·    <·    <·    <·
Basic Parsing Techniques 59
Input: An input string w$, a table holding precedence relations and the stack with initial symbol $.
Output: Parse tree.
Algorithm:
Set p to point to the first symbol of w$
Repeat forever
    If ($ is on the top of the stack and p points to $) then
        accept and exit the loop
    else
    Begin
        let a be the terminal on the top of the stack and let
        b be the symbol pointed to by p
        If (a <· b or a ≐ b) then /* Shift */
        Begin
            push b onto the stack
            advance p to the next input symbol
        End
        else if (a ·> b) then /* Reduce */
            Repeat
                pop the stack
            Until (the top of stack terminal is related by <·
                   to the terminal most recently popped)
        else
            /* error occurred */
    End
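The following is a minimal Python sketch of this algorithm, restricted to the terminals id, +, * and $ with relations taken from Table 4.4; keeping only terminals on the stack is a simplification the precedence decisions permit, since they never depend on non-terminals, and the ≐ relation does not arise for this operator subset.

LT, GT = '<', '>'                   # stand for <· and ·>
REL = {
    '+':  {'+': GT, '*': LT, 'id': LT, '$': GT},
    '*':  {'+': GT, '*': GT, 'id': LT, '$': GT},
    'id': {'+': GT, '*': GT, '$': GT},
    '$':  {'+': LT, '*': LT, 'id': LT},
}

def op_precedence_parse(tokens):
    stack = ['$']
    tokens = list(tokens) + ['$']
    p = 0
    while True:
        a, b = stack[-1], tokens[p]
        if a == '$' and b == '$':
            return True                        # accept
        rel = REL.get(a, {}).get(b)
        if rel == LT:                          # shift
            stack.append(b)
            p += 1
        elif rel == GT:                        # reduce
            while True:
                popped = stack.pop()
                if REL[stack[-1]].get(popped) == LT:
                    break                      # handle fully popped
        else:
            return False                       # error entry

print(op_precedence_parse(['id', '*', 'id', '+', 'id']))  # True
print(op_precedence_parse(['id', 'id']))                  # False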
15. What are the advantages and disadvantages of operator precedence parsing?
Ans: Advantages:
Operator precedence parsing is simple and easy to implement.
The parser can be constructed by hand once the grammar is known.
Debugging is easy.
Disadvantages:
Tokens like minus (-) are difficult to handle, as the token has two different precedence values depending on whether it is being used as a binary or a unary operator.
It does not take the grammar as an input while generating the parser. This results in rewriting of the parser in case of any additions or deletions in the production rules, which is a very cumbersome and time-consuming process.
Only a small class of grammars, the operator grammars, can be parsed by this parsing technique.
17. Consider the following grammar and show the handle of each right sentential form for
the string (b, (b, b)).
E ® (A)│b
A ® A,E│E
Ans: The following sentential forms will occur in the reduction of (b,(b, b)) to E.
1. (b,(b, b)) (first b is the handle)
2. (E,(b, b)) (E is the handle)
3. (A,(b, b)) (first b is the handle)
4. (A,(E, b)) (E is the handle again)
5. (A,(A, b)) (b is the handle)
6. (A,(A, E)) (A,E is the handle)
7. (A,(A)) ((A) is the handle)
8. (A, E) (again A,E is the handle)
9. (A) ((A) is the handle)
10. E (finally string is reduced to starting non-terminal)
18. Consider the following grammar:
S ® SAS
S ® num
A ® +
A ® -
A ® *
A ® /
Explain why this grammar is not suitable to form the basis for a recursive-descent parser.
Use left-factoring and left-recursion removal to obtain an equivalent grammar which
can be used as the basis for a recursive-descent parser.
Ans: Consider the following production:
S ® SAS
If we put the value of S in place of first S at the right hand side in this production, the new produc-
tion will be
S ® SASAS
If we again put the value of S in place of first S at the right hand side, the new production will be
S ® SASASAS
Thus, putting the value of S in place of the first S on the right hand side again and again will result in an infinite loop. It shows that the given grammar suffers from the problem of left recursion. Hence, it cannot be the basis for a recursive-descent parser.
If we put the value of A in the above production, we get
S ® S + S
S ® S - S
S ® S * S
S ® S/S
S ® num
It results in the following production:
S ® S + S│S - S│S * S│S/S│num
It still suffers from left recursion, which can be removed by following the algorithm discussed in Chapter 3. Now, we have the following productions:
S ® num S’
S’ ® +S S’ │ -S S’ │ *S S’ │ /S S’ │ Î
This grammar does not suffer from left recursion, and hence, can form the basis for a recursive-descent parser.
19. Show that the given grammar is not LL(1).
E ® iAcE│iAcEeE│a
A ® b
Ans: Step 1: This grammar requires left factoring; after left factoring, we obtain
E ® iAcEE’│a
E’ ® eE│Î
A ® b
Step 2: Compute FIRST and FOLLOW of all non-terminals.
FIRST (E) = {i, a}
FIRST (E’) = {e, Î}
FIRST (A) = {b}
FOLLOW (E) = {$, e}
FOLLOW (E’) = {$, e}
FOLLOW (A) = {c}
Now, to generate the parser table entries, follow these steps:
1. Considering the production E ® iAcEE’
FIRST (E) = FIRST (iAcEE’) = {i}
Since it does not contain any Î.
So, parse table [E, i] = E ® iAcEE’
2. Considering the production E ® a
FIRST (E) = FIRST (a) = {a}
Since it does not contain any Î.
So, parse table [E, a] = E ® a
3. Considering the production E’ ® eE
FIRST (E’) = FIRST (eE) = {e}
Since it does not contain any Î.
So, parse table [E’, e] = E’®eE
4. Considering the production E’ ® Î
FIRST (E’) = FIRST (Î) = {Î}
Since it contains an Î, we have to find out FOLLOW (E’).
FOLLOW (E’) = {$, e}
So, parse table [E’, $] = E’ ® Î, and parse table [E’, e] = E’ ® Î.
But the entry M[E’, e] already contains E’ ® eE, so M[E’, e] now holds multiple entries.
The multiple entries in the M[E’, e] field show that the grammar is ambiguous and not LL(1).
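The FIRST and FOLLOW sets above can also be computed mechanically. The following is a minimal Python sketch of the usual fixed-point computation, applied to the left-factored grammar of this question; the tuple-based grammar encoding, with the empty tuple standing for an Î-alternative, is this sketch's own convention.

EPS = 'Î'

def compute_first_follow(grammar, start):
    FIRST = {A: set() for A in grammar}
    def first_seq(seq):
        # FIRST of a string of grammar symbols
        out = set()
        for X in seq:
            fx = FIRST[X] if X in grammar else {X}
            out |= fx - {EPS}
            if EPS not in fx:
                return out
        out.add(EPS)                  # every symbol can derive Î
        return out
    changed = True
    while changed:
        changed = False
        for A, alts in grammar.items():
            for alt in alts:
                f = first_seq(alt)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    FOLLOW = {A: set() for A in grammar}
    FOLLOW[start].add('$')
    changed = True
    while changed:
        changed = False
        for A, alts in grammar.items():
            for alt in alts:
                for i, X in enumerate(alt):
                    if X not in grammar:
                        continue
                    trailer = first_seq(alt[i + 1:])
                    add = trailer - {EPS}
                    if EPS in trailer:
                        add |= FOLLOW[A]
                    if not add <= FOLLOW[X]:
                        FOLLOW[X] |= add
                        changed = True
    return FIRST, FOLLOW

grammar = {
    'E':  [('i', 'A', 'c', 'E', "E'"), ('a',)],
    "E'": [('e', 'E'), ()],           # () denotes the Î-alternative
    'A':  [('b',)],
}
FIRST, FOLLOW = compute_first_follow(grammar, 'E')
print(FIRST["E'"], FOLLOW["E'"])      # {'e', 'Î'} {'e', '$'}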
Multiple-Choice Questions
1. Top-down parsing is a technique to find —————.
(a) Leftmost derivation (b) Rightmost derivation
(c) Leftmost derivation in reverse (d) Rightmost derivation in reverse
2. Predictive parsing is possible only for —————.
(a) LR(k) grammar (b) LALR(1) grammar
(c) LL(k) grammar (d) CLR(1) grammar
3. Which two functions are required to construct a parsing table in predictive parsing technique?
(a) CLOSURE() and GOTO () (b) FIRST() and FOLLOW()
(c) ACTION() and GOTO() (d) None of these
4. Non-recursive predictive parser contains —————.
(a) An input buffer (b) A parsing table
(c) An output stream (d) All of these
5. Which of these parsing techniques is a kind of bottom-up parsing?
(a) Shift-reduce parsing (b) Reduce-reduce parsing
(c) Predictive parsing (d) Recursive-descent parsing
6. Which of the following methods is used by the bottom-up parser to generate a parse tree?
(a) Leftmost derivation (b) Rightmost derivation
(c) Leftmost derivation in reverse (d) Rightmost derivation in reverse
7. Handle pruning forms the basis of —————.
(a) Bottom-up parsing (b) Top-down parsing
(c) Both (a) and (b) (d) None of these
Answers
1. (a) 2. (c) 3. (b) 4. (d) 5. (a) 6. (d) 7. (a) 8. (c) 9. (b) 10. (a)
5
LR Parsers
Figure 5.1 Model of an LR Parser (a stack of states s0 s1 . . . sm with $ at the bottom, an input buffer, the LR parsing program producing output, and the parser table consisting of the ACTION and GOTO tables)
2. Why is LR parsing good and attractive? Also explain its demerits, if any.
Ans: LR parsing method is good and attractive due to the following reasons:
q LR parsing is the most common non-backtracking shift-reduce parsing.
q It is possible to construct the LR parsers for recognition of almost all programming language con-
structs for which CFG can be written.
q The class of grammars that can be parsed with predictive parsers can also be parsed using LR pars-
ers. That is, the class of grammars that can be parsed with predictive parsers is a proper subset of
those that can be parsed using LR parsers.
q An LR parser scans the input from left to right and while scanning it can detect the syntax errors as
quickly as possible.
The main drawback of LR parsing is that for complex programming language grammars, the con-
struction of LR parsing tables requires too much manual work. To reduce this manual work, we require
an automated tool, known as LR parser generator that can generate an LR parser from a given gram-
mar. Some available generators are YACC, bison, etc. These generators take context-free grammars as
input and generate a parser for the input CFG. These generators also help in locating errors in the gram-
mar, if any and generate error messages.
3. Explain ACTION and GOTO function in LR parsing.
Ans: While constructing a parsing table, we consider two types of functions: a parsing-action func-
tion ACTION and a goto function GOTO.
q ACTION function: The ACTION function takes a state sm (the state on the top of stack) and a ter-
minal bi (the current input symbol) as input to take an action. The ACTION [sm, bi] can have one
of the four values:
l Shift s: The action of the parser is to shift the input symbol b onto the stack; the parser pushes the state s, which represents b.
l Reduce X ® α: The action of the parser is to reduce α on the top of the stack to head X.
l Accept: The parser accepts the input and announces successful parsing for the input string.
l Error: The parser finds an error and calls an error handling routine.
q GOTO function: The function GOTO can be defined as a set of states that takes a state and a
grammar symbol as arguments and produces a new state. If GOTO [si, B] = sj, then GOTO maps
a state si and a non-terminal B to state sj.
The combination of sm (the state on the top of the stack) and bi (the current input symbol) decides the parser action by consulting the parsing action table. The initial stack contains only s0. A configuration of an LR parser is a pair (s0 s1 . . . sm, bi bi+1 . . . bn $), which represents the right sentential form
X1 . . . Xm bi bi+1 . . . bn
in the same way as that of a shift-reduce parser. The only difference is that instead of grammar symbols, the stack contains states from which the grammar symbols can be recovered. That is, the grammar symbol Xj in the right sentential form corresponds to the state sj in the configuration.
The configuration resulting after each of the four types of move is as follows:
q If ACTION[sm, bi] = shift s, the parser performs shift operation, that is, it shifts the s state (next
state) onto the stack. The configuration now becomes:
(s0 s1 . . . sm s, bi+1 . . . bn$)
Note that the current input symbol is bi+1 and there is no need to hold the symbol bi on the stack,
as it can be recovered from S if required.
q If ACTION[sm, bi] = reduce X ® α, the parser performs a reduce operation. The new configura-
tion is:
(s0 s1 . . . sm-p s, bi bi+1 . . . bn$)
where
p is the length of α (the body of the reducing production)
s = GOTO [sm-p, X]
The parser first pops the p state symbols from the stack, which exposes the state sm-p. Then, it
pushes the state s, which is the entry for GOTO [sm-p, X], onto the stack. Note that bi is still the
current input symbol, that is, the reduce operation does not alter the current input symbol.
q If ACTION[sm, bi] = accept, it indicates the completion of parsing and the string is accepted.
q If ACTION[sm, bi] = error, an error is encountered by the parser and an error recovery routine is
called.
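The four moves can be summarized in a short driver loop. The following Python sketch is an illustrative implementation of the LR parsing program, with ACTION and GOTO hard-coded for the grammar S ® CC, C ® cC │ d, whose canonical LR table is worked out in a later question; the tuple encoding of the table entries is this sketch's own.

ACTION = {
    (0, 'c'): ('s', 3), (0, 'd'): ('s', 4),
    (1, '$'): ('acc',),
    (2, 'c'): ('s', 6), (2, 'd'): ('s', 7),
    (3, 'c'): ('s', 3), (3, 'd'): ('s', 4),
    (4, 'c'): ('r', 'C', 1), (4, 'd'): ('r', 'C', 1),   # C -> d
    (5, '$'): ('r', 'S', 2),                            # S -> CC
    (6, 'c'): ('s', 6), (6, 'd'): ('s', 7),
    (7, '$'): ('r', 'C', 1),                            # C -> d
    (8, 'c'): ('r', 'C', 2), (8, 'd'): ('r', 'C', 2),   # C -> cC
    (9, '$'): ('r', 'C', 2),                            # C -> cC
}
GOTO = {(0, 'S'): 1, (0, 'C'): 2, (2, 'C'): 5, (3, 'C'): 8, (6, 'C'): 9}

def lr_parse(tokens):
    stack = [0]                       # the stack s0 s1 ... sm of states
    tokens = list(tokens) + ['$']
    i = 0
    while True:
        move = ACTION.get((stack[-1], tokens[i]))
        if move is None:
            return False              # error entry
        if move[0] == 's':            # shift: push the next state
            stack.append(move[1])
            i += 1
        elif move[0] == 'r':          # reduce by X -> α with |α| = p
            _, X, p = move
            del stack[len(stack) - p:]
            stack.append(GOTO[(stack[-1], X)])
        else:
            return True               # accept

print(lr_parse('cdd'))   # True
print(lr_parse('cd'))    # False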
6. Define LR(0) items and LR(0) automaton. Also explain the canonical LR(0) collection of
items.
Ans: An LR(0) item (in short item) is defined as a production of grammar G having a dot at some
position of the right hand side of the production. Formally, an item describes how much of a production
has already been seen on the input at some point in the parsing process. For example, consider the fol-
lowing four items created by the production X ® ABC:
q X ® .ABC, which indicates that a string derivable from ABC is expected next on the input.
q X ® A.BC, which indicates that a string derivable from A has already been seen on the input and
now a string derivable from BC is expected.
q X ® AB.C, which indicates that a string derivable from AB has already been seen and now a
string derivable from C is expected on the input.
q X ® ABC., which indicates that the body of the production has already been seen, and now it is
time to reduce it to X.
The canonical LR(0) collection is a collection of sets of LR(0) items, which provides the basis
for constructing a DFA that is used to make parsing decisions. Such an automaton is called an LR(0)
automaton. The states in the LR(0) automaton correspond to the sets of items in the canonical LR(0)
collection. The canonical LR(0) collection for a grammar can be constructed by defining its augmented
grammar and two functions, CLOSURE and GOTO.
For a grammar G with start symbol S, G’ is the augmented grammar of G, with a new start symbol S’ and the production S’ ® S. This new production indicates when the parser should stop parsing and announce acceptance of the input. Thus, we can say that acceptance occurs only when the parser is about to reduce by S’ ® S.
q CLOSURE: Let I be a set of items for a grammar G; then we construct the set of items CLOSURE(I) from I by the following steps:
l Add every item in I to CLOSURE(I).
l If A ® α.Bβ is in CLOSURE(I), where B is a non-terminal and B ® γ is a production in G, then add the item B ® .γ to CLOSURE(I), if it is not already present in it.
l Repeat step 2 until there are no more items to be added to CLOSURE(I).
In step 2, A ® α.Bβ in CLOSURE(I) represents that a substring derivable from Bβ is expected to be seen in the input at some point in the parsing process. The substring derivable from Bβ will have a prefix derivable from B by applying one of the productions of B. Thus, we include items for all productions of B in CLOSURE(I); for this reason, we include B ® .γ in CLOSURE(I).
q GOTO: If I is a set of items, X is a grammar symbol, and an item (A ® α.Xβ) is in I, then the function GOTO(I, X) is defined as the closure of the set of all items (A ® αX.β). The GOTO function basically defines the transitions in the LR(0) automaton. The states in the LR(0) automaton are represented as sets of items, and GOTO(I, X) defines the transition from the state for I on the given input X.
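The following is a minimal Python sketch of the CLOSURE and GOTO functions for LR(0) items, where an item is encoded as a (head, body, dot-position) triple; the encoding and the example grammar S ® CC, C ® cC │ d are illustrative choices of this sketch.

def closure(items, grammar):
    items = set(items)
    work = list(items)
    while work:
        head, body, dot = work.pop()
        if dot < len(body) and body[dot] in grammar:   # dot before a non-terminal B
            B = body[dot]
            for prod in grammar[B]:
                item = (B, prod, 0)                    # add B -> .γ
                if item not in items:
                    items.add(item)
                    work.append(item)
    return frozenset(items)

def goto(items, X, grammar):
    # advance the dot over X in every item of the form A -> α.Xβ
    moved = {(head, body, dot + 1)
             for (head, body, dot) in items
             if dot < len(body) and body[dot] == X}
    return closure(moved, grammar)

grammar = {"S'": [('S',)], 'S': [('C', 'C')], 'C': [('c', 'C'), ('d',)]}
I0 = closure({("S'", ('S',), 0)}, grammar)
print(sorted(I0))                     # every item X -> .γ reachable from S' -> .S
print(sorted(goto(I0, 'C', grammar)))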
7. What is a Simple LR parser or SLR parser?
Ans: The SLR parser is the simplest LR parser technique, generating its parsing tables from the same LR(0) items as the LR(0) parser. But unlike the LR(0) parser, it performs a reduction with a grammar rule A ® w only if the next input symbol is in FOLLOW(A). This parser can prevent some shift-reduce and reduce-reduce conflicts occurring in LR(0) parsers. Therefore, it can deal with more grammars. A grammar that can be parsed by an SLR parser is called an SLR grammar. For example, a grammar that can be parsed by an SLR parser but not by an LR(0) parser is given below:
E ® 1 E
E ® 1
8. Explain how to construct an LR(0) parser.
Ans: The various steps for constructing an LR(0) parser are given below:
1. For a grammar G with a start symbol S, construct an augmented grammar G’with a new start
symbol S’ and production S’® S.
2. Compute the canonical collection of LR(0) items of grammar G.
3. Find the state transitions from each state Ii for all grammar symbols X using the GOTO function: Ij = GOTO(Ii, X).
4. Construct new states with the help of CLOSURE(I) and GOTO(I, X) functions, and for each
state construct new LR(0) items.
5. Repeat step 4 until there are no more transitions left on input for which state transitions can be
constructed.
6. With all computed states I0,I1,. . . ,In, construct the transition graph by keeping LR(0) items
of each Ii in single node and linking these nodes with suitable transitions evaluated with GOTO.
7. Constitute a parse table using SLR table construction algorithm.
8. Apply the shift-reduce action to verify whether the input string is accepted or any conflict has
occurred.
We can deduce the information about whether to take a shift action or a reduce action from the fact that X ® β1.β2 is valid for αβ1 as follows:
q If β2 ≠ Î, it indicates that we have not yet shifted the handle onto the stack, so we need to perform a shift action.
q If β2 = Î, it indicates that X ® β1 is the handle, and we should reduce by this production.
Thus, it is clear that for the same viable prefix, two valid items may indicate different actions. Such conflicts can be resolved by looking ahead at the next input symbol.
Generally, an item can be valid for many viable prefixes. The set of items valid for a viable prefix γ can be computed by determining the set of items that can be reached from the initial state along the path labeled γ in the LR(0) automaton for the grammar.
12. What is the canonical LR parser?
Ans: A canonical LR (CLR) parser is more powerful than the SLR parser as it makes full use of one or more lookahead symbols. It contains some extra information in the form of a terminal symbol, as a second component in each item of a state. Thus, in a CLR parser, an item can be described as follows:
[A ® α.β, a]
where
A ® αβ is a production, and
a is a terminal symbol or the right end marker $.
Such an item is defined as an LR(1) item, where 1 refers to the length of the second component, called the lookahead of the item. If β ≠ Î, then the lookahead does not affect the item [A ® α.β, a]. However, if the item has the form [A ® α., a], then it calls for a reduction by A ® α only if the next input symbol is a. That is, we are compelled to reduce A ® α only on those input symbols a for which [A ® α., a] is an LR(1) item in the state on the top of the stack.
13. Write the algorithm for computation of sets of LR(1) items.
Or
Define CLOSURE(I) and GOTO(I, X) functions for LR(1) grammar.
Ans: The algorithm for computing the sets of LR(1) items is basically the same as that of the canoni-
cal sets of LR(0) items—only the procedures for computing the CLOSURE and GOTO need to be modi-
fied as shown in Figure 5.3.
In Figure 5.3, the function items() is the main function that calls the CLOSURE and GOTO func-
tions for constructing the sets of LR(1) items for grammar G’.
procedure CLOSURE(I)
Begin
    Do
        For (each item [A ® α.Bβ, a] in I)
            For (each production B ® γ in G’)
                For (each terminal b in FIRST(βa))
                    add [B ® .γ, b] to I
    While there are more items to be added to set I
    return I
End
procedure GOTO(I, X)
Begin
    Initialize J to be the empty set
    For (each item [A ® α.Xβ, a] in I)
        add the item [A ® αX.β, a] to set J
    return CLOSURE(J)
End
void items(G’)
Begin
    C = CLOSURE({[S’ ® .S, $]})
    Do
        For (each set of items I in C)
            For (each grammar symbol X)
                If (GOTO(I, X) is not empty and not in C)
                    add GOTO(I, X) to C
    While there are more sets of items to be added to C
End
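The following is a minimal Python sketch of the CLOSURE procedure of Figure 5.3, with an LR(1) item encoded as a (head, body, dot-position, lookahead) tuple. The helper first_seq is a simplified FIRST computation that assumes the grammar has no Î-productions and no left recursion, which holds for the grammar S ® CC, C ® cC │ d used in a later question.

def first_seq(seq, grammar):
    # FIRST of a symbol string, for grammars without Î-productions
    for X in seq:
        if X not in grammar:              # terminal (or $)
            return {X}
        out = set()
        for prod in grammar[X]:
            out |= first_seq(prod, grammar)
        return out
    return set()

def closure1(items, grammar):
    items = set(items)
    work = list(items)
    while work:
        head, body, dot, a = work.pop()
        if dot < len(body) and body[dot] in grammar:
            B = body[dot]
            beta_a = body[dot + 1:] + (a,)            # the string βa
            for prod in grammar[B]:
                for b in first_seq(beta_a, grammar):  # each b in FIRST(βa)
                    item = (B, prod, 0, b)            # add [B -> .γ, b]
                    if item not in items:
                        items.add(item)
                        work.append(item)
    return frozenset(items)

grammar = {"S'": [('S',)], 'S': [('C', 'C')], 'C': [('c', 'C'), ('d',)]}
I0 = closure1({("S'", ('S',), 0, '$')}, grammar)
# I0 contains [S -> .CC, $], [C -> .cC, c], [C -> .cC, d], [C -> .d, c], [C -> .d, d]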
14. Give the algorithm for the construction of canonical LR parsing table.
Ans: Canonical LR parsing tables are constructed by the LR(1) ACTION and GOTO functions from
the set of LR(1) items. The ACTION and GOTO entries are constructed in the parsing table using the
following algorithm:
Step 1: For the augmented grammar G’, construct the collection of sets of LR(1) items
C = {I0, I1, . . . , In}.
Step 2: For each state in C, construct a row in the CLR table and name the rows from 0 to n. Partition the columns into ACTION and GOTO, where ACTION will have all the terminal symbols of grammar G along with the symbol $, and GOTO will have all the non-terminals of G.
Step 3: For each Ii construct a state i. The action entries for state i in the CLR parsing table are determined using the following rules:
q If [A ® α.aβ, b] is in Ii and GOTO(Ii, a) = Ij, where a is a terminal, then ACTION[i, a] = “shift j”.
q If [A ® α., a] is in Ii and A ≠ S’, then ACTION[i, a] = “reduce A ® α”.
q If [S’ ® S., $] is in Ii, then ACTION[i, $] = “accept”.
If any conflicting actions result from the above rules, the grammar is not LR(1). In that case, this algorithm will not be able to produce a valid parser.
Step 4: The goto entries for state i in the CLR parsing table can be determined using the following rule: If GOTO(Ii, A) = Ij, then GOTO[i, A] = j.
q The LALR parser provides a good trade-off between power and efficiency.
q Merging of items never introduces shift-reduce conflict unless the conflict is already present in
LR(1) configuration sets.
The demerits of LALR parser are as follows:
q The construction of parser table from the collection of LR(1) items requires too much space and time.
q Merging of items may introduce reduce-reduce conflict. In case reduce-reduce conflict arises, the
grammar is not considered as LALR(1).
17. Differentiate between SLR and LALR.
Or
Why LALR parser is considered over SLR?
Ans: In LALR parsing, the reduce entries are made using lookahead sets whereas in SLR, reduce
entries are made using succeed sets. The lookahead set for LR(0) item I consists of only those symbols
that are expected to be appeared after I’s right hand side has been parsed. On the other hand, the suc-
ceed set consists of all those symbols that are supposed to appear after I’s left hand side non-terminal.
The lookahead set is more specific to the parsing context and provides a finer distinction than the succeed set. A shift-reduce conflict may arise in SLR parsing that does not arise in LALR parsing; merging of items in LALR parsing never introduces a shift-reduce conflict, although a reduce-reduce conflict may occur in LALR parsing.
18. Discuss how YACC can be used to generate a parser?
Ans: YACC stands for yet another compiler-compiler. It is an LALR parser generator which is
basically available as a command on UNIX system. The first version of YACC was created by S.C.
Johnson in early 1970s. It is a tool that compiles a source program and translates it into a C program
that implements the parser.
For example, consider a file translate.y. The YACC compiler converts this file into a C program
y.tab.c using the LALR algorithm. The program y.tab.c is basically a representation of LALR
parser written in C language. This program is then compiled along with the ly library to generate the
object code a.out, which then performs the translation specified by the original YACC program. An
input/output translator constructed using YACC is given in Figure 5.4.
q Declaration section: This section consists of two optional sections. The first section contains
ordinary C declarations delimited by %{ and %}. For example, it may contain #include prepro-
cessors as given below:
%{
#include<conio.h>
%}
The second section contains the declarations of grammar tokens. For example, the following
statement declares DIGIT as a token:
%token DIGIT
q Translation rules section: This section includes the grammar productions along with their semantic actions. A semantic action computes the value associated with the non-terminal at the head of a production from the values associated with the symbols of its body; the symbol $$ refers to the value of the head, and $i refers to the value of the ith symbol of the body. For example, consider the productions:
E ® E * F │ F
The YACC specification for the above productions can be written as follows:
expr : expr '*' factor {$$ = $1 * $3;}
     | factor
     ;
q Supporting C-routines section: This section includes the lexical analyzer yylex() that produces pairs consisting of a token name and its associated value. The attribute values are communicated to the parser through the variable yylval already defined in YACC.
Stack                                       Input
. . . if condition then statement           else . . . $
Depending on what follows the else on the input, the parser may choose to reduce if con-
dition then statement to statement, or it may choose to shift else and then look
for another statement to complete the alternative if condition then statement else
statement. This gives rise to shift-reduce conflict since the parser cannot decide whether to
shift else onto the stack or to reduce if condition then statement.
This ambiguity can be resolved by matching each else with its just preceding unmatched
then. Thus, in our case the next action would be shift else onto the stack because it is associated
with the previous then.
q Reduce-reduce conflict: This conflict occurs when we know we have a handle but the next input
symbol and the stack’s contents are not enough to determine which production is to be used in a
reduction. For example, consider a language in which a procedure can be invoked by giving the
procedure name along with the parameters surrounded by parentheses, and array references are
also made using the same syntax. Some of the productions for the grammar of our language would
be as follows:
statement ® id(parameter_list)                    (1)
statement ® expression := expression              (2)
parameter_list ® parameter_list, parameter        (3)
parameter_list ® parameter                        (4)
parameter ® id                                    (5)
expression ® id(expression_list)                  (6)
expression_list ® expression_list, expression     (7)
expression_list ® expression                      (8)
expression ® id                                   (9)
Let us consider an input string A(X, Y). The token stream for the given input string for the
parser is id(id, id). Now, the configuration of the parser after shifting the initial three tokens
onto the stack is as follows:
Stack               Input
. . . id(id         ,id) . . . $
It is clear that we need to reduce the id that is on top of the stack, but it is not clear that
which production needs to be used for reduction. If A is a procedure name then production (5)
needs to be used, and if A is an array then production (9) needs to be followed. Thus, reduce-
reduce conflict occurs.
This conflict can be resolved by changing the token id in production (1) by procid, and by
using a more sophisticated lexical analyzer that returns token procid for an identifier which is a
procedure name, and id for an array name. Before returning a token, the lexical analyzer needs to
consult the symbol table. Now, if A is a procedure then after this modification of token stream the
configuration of the parser would be:
Stack                   Input
. . . procid(id         ,id) . . . $
Action Goto
Item Set ) ' ( { Num } $ P Q R
0 S3 1 2
1 acc
2 S4 S5
3 S7 6
4 r1 r1 r1 r1 r1 r1 r1
5 S7 8
6 S9
7 S10
8 r2 r2 r2 r2 r2 r2 r2
9 S7 11
10 S12
11 r3 r3 r3 r3 r3 r3 r3
12 S13
13 S14
14 r4 r4 r4 r4 r4 r4 r4
I4: E ® E + .E
E ® .E + E
E ® .E * E
E ® .(E)
E ® .id
I5: E ® E * .E
E ® .E + E
E ® .E * E
E ® .(E)
E ® .id
I6: E ® (E.)
E ® E. + E
E ® E. * E
I7: E ® E + E.
E ® E. + E
E ® E. * E
I8: E ® E * E.
E ® E. + E
E ® E. * E
I9: E ® (E).
Now, the parsing table for the above set of LR(0) items will be:
Action                                              Goto
State    id     +      *      (      )      $       E
0        S3                   S2                    1
1               S4     S5                   acc
2        S3                   S2                    6
3               r4     r4            r4     r4
4        S3                   S2                    7
5        S3                   S2                    8
6               S4     S5            S9
7               r1     S5            r1     r1
8               r2     r2            r2     r2
9               r3     r3            r3     r3
23. Every SLR(1) is unambiguous but there are few unambiguous grammars that are not
SLR(1). Verify this for the following productions.
S ® L = R
S ® R
L ® * R
L ® id
R ® L
Ans: The augmented grammar G’ for the above productions is as follows:
S’ ® S
S ® L = R
S ® R
L ® *R
L ® id
R ® L
The canonical collection of LR(0) items for G’ is as follows:
Starting with CLOSURE(S’ ® .S), we get I0:
S’ ® .S
S ® .L = R
S ® .R
L ® .*R
L ® .id
R ® .L
(the items after the first are added by the closure rule: for every item A ® α.Bβ already in the set and every production B ® γ, add B ® .γ)
I1 = GOTO (I0, S)
= Closure (S’ ® S.), we obtain
S’® S.
I2 = GOTO (I0, L)
= Closure (S ® L .= R) È Closure (R ® L.), we obtain
S ® L. = R
R ® L.
I3 = GOTO (I0, R)
= Closure (S ® R.), we obtain
S ® R.
I4 = GOTO (I0, *)
= Closure (L ® *.R), we obtain
L ® *.R
R ® .L
L ® .*R
L ® .id
I5 = GOTO (I0, id)
= Closure (L ® id.), we obtain
L ® id.
I6 = GOTO (I2, =)
= Closure (S ® L = .R), we obtain
S ® L = .R
R ® .L
L ® .*R
L ® .id
I7 = GOTO (I4, R)
= Closure (L ® *R.), we obtain
L ® *R.
I8 = GOTO (I4, L)
= Closure (R ® L.), we obtain
R ® L.
I9 = GOTO (I6, R)
= Closure (S ® L = R.), we obtain
S ® L = R.
Thus, we get the canonical LR(0) items. Now, we verify whether this grammar is SLR(1) or not by applying rule (3) of the SLR parsing table algorithm.
q Consider the item S ® L. = R in I2.
q Comparing it with A ® α.aβ, we obtain that a is =.
q We know that GOTO(I2, =) = I6; therefore, by applying rule 3(a) of the SLR parser table algorithm, we obtain ACTION[2, =] = S6.
q But one more item exists in I2, that is, R ® L. By applying rule 3(b) of the SLR table algorithm, comparing R ® L. with A ® α., and noting that FOLLOW(R) contains =, we obtain ACTION[2, =] = r5, that is, reduce by R ® L.
q Thus, we have two actions, shift and reduce, for [2, =] in the SLR table, which means a shift-reduce conflict occurs. Therefore, the grammar is not SLR(1), even though the grammar is unambiguous.
The parsing table for the above grammar is designed below:
Action                                Goto
State    id     *      =         $       S    L    R
0        S5     S4                       1    2    3
1                                acc
2                      S6, r5    r5
3                                r2
4        S5     S4                            8    7
5                      r4        r4
6        S5     S4                            8    9
7                      r3        r3
8                      r5        r5
9                                r1
(i) The item sets for the new grammar G’ will be determined as follows:
Item set number 0:
E’® .E
+E ® .E + T
+E ® .T
+T ® .T * F
+T ® .F
+F ® .(E)
+F ® .id
In I0, symbols just after the dot are E, T, F, (, id.
Item set number 1, I1 (for the symbol E of I0), we have:
E’® E.
E ® E. + T
Item set number 2, I2 (for the symbol T of I0), we have:
E ® T.
T ® T. * F
Item set number 3, I3 (for the symbol F of I0), we have:
T ® F.
Item set number 4, I4 (for the symbol ( of I0), we have:
F ® (.E)
+E ® .E + T
+E ® .T
+T ® .T * F
+T ® .F
+F ® .(E)
+F ® .id
Item set number 5, I5 (for the symbol id of I0), we have:
F ® id.
In I1, symbol just after the dot is +.
Thus, item set number 6, I6 (for the symbol ‘+’ of I1), we have:
E ® E + .T
+T ® .T * F
+T ® .F
+F ® .(E)
+F ® .id
Item set number 9, I9 (for the symbol T of I6), we have:
E ® E + T.
T ® T. * F
Item Set   id    +     *     (     )     E     T     F
0          5                 4           1     2     3
1                6
2                      7
3
4          5                 4           8     2     3
5
6          5                 4                 9     3
7          5                 4                       10
8                6                 11
9
10
11
Action                                               Goto
State    id     +      *       (      )      $      E    T    F
0        S5                    S4                   1    2    3
1               S6                           acc
2        r2     r2     S7, r2  r2     r2     r2
3        r4     r4     r4      r4     r4     r4
4        S5                    S4                   8    2    3
5        r6     r6     r6      r6     r6     r6
6        S5                    S4                        9    3
7        S5                    S4                             10
8               S6                    S11
9        r1     r1     r1      r1     r1     r1
10       r3     r3     r3      r3     r3     r3
11       r5     r5     r5      r5     r5     r5
It is clear from the table that action entry for state 2 contains shift-reduce conflict; thus, it is not LR(0).
(ii) For the SLR parsing table:
FOLLOW(E) contains {$}, as E is the start symbol
FOLLOW(E) also contains FIRST(+T) = {+}, from E ® E + T
FOLLOW(E) also contains FIRST( ) ) = { ) }, from F ® (E)
Therefore, FOLLOW(E) = {+, ), $}
And in state 2, reduction r2 is valid only in the columns {+, ), $}
Now, FOLLOW(T) contains FOLLOW(E) = {+, ), $}
FOLLOW(T) also contains FIRST(*F) = {*}, from T ® T * F
Therefore, FOLLOW(T) = {+, *, ), $}
And in state 3, reduction r4 is valid only in the columns {+, *, ), $}
Now, FOLLOW(F) = FOLLOW(T) = {+, *, ), $}
Therefore, in state 5 reduction r6 is valid only in the columns {+, *,), $}, in state 9 r1 is valid only
in the columns {+,), $}, in state 10 r3 is valid only in the columns {+, *,), $} and in state 11 r5
is valid only in the columns {+, *,), $}.
Now, after solving shift-reduce conflict, the SLR parsing table will be:
Action                                       Goto
State    id     +      *      (      )      $      E    T    F
0        S5                   S4                   1    2    3
1               S6                           acc
2               r2     S7            r2      r2
3               r4     r4            r4      r4
4        S5                   S4                   8    2    3
5               r6     r6            r6      r6
6        S5                   S4                        9    3
7        S5                   S4                             10
8               S6                   S11
9               r1     S7            r1      r1
10              r3     r3            r3      r3
11              r5     r5            r5      r5
(iii) The moves of the parser to accept the input string id * id + id are shown below:
Stack               Input             Action
0                   id * id + id $    shift 5
0 id 5              * id + id $       reduce by F ® id
0 F 3               * id + id $       reduce by T ® F
0 T 2               * id + id $       shift 7
0 T 2 * 7           id + id $         shift 5
0 T 2 * 7 id 5      + id $            reduce by F ® id
0 T 2 * 7 F 10      + id $            reduce by T ® T * F
0 T 2               + id $            reduce by E ® T
0 E 1               + id $            shift 6
0 E 1 + 6           id $              shift 5
0 E 1 + 6 id 5      $                 reduce by F ® id
0 E 1 + 6 F 3       $                 reduce by T ® F
0 E 1 + 6 T 9       $                 reduce by E ® E + T
0 E 1               $                 accept
25. Construct the LR(1) items and the CLR parsing table for the following grammar:
S ® CC
C ® cC
C ® d
Ans: The augmented grammar G’ for the above grammar will be:
S’® S
S ® CC
C ® cC
C ® d
Apply the ITEMS(G’) procedure to construct the LR(1) items.
For I0, compute CLOSURE({[S’ ® .S, $]}):
Comparing [S’ ® .S, $] with the general item [A ® α.Bβ, a], we have B = S, β = Î, and a = $.
FIRST(βa) = FIRST($) = {$}, and we have the production S ® CC, which is of the form B ® γ; so we add [B ® .γ, b] for each b in FIRST(βa), that is, we add [S ® .CC, $], and then again compute the closure.
Comparing [S ® .CC, $] with [A ® α.Bβ, a], we have B = C, β = C, and a = $.
FIRST(βa) = FIRST(C$) = {c, d}, and the productions of the form B ® γ are C ® cC and C ® d; so we add the following items for each terminal in FIRST(βa):
C ® .cC, c
C ® .cC, d
C ® .d, c
C ® .d, d
Now, we can write the LR(1) items for I0 as:
S’ ® .S, $
S ® .CC, $
C ® .cC, c/d
C ® .d, c/d
I1 = GOTO (I0, S)
= Closure (S’ ® S., $)
Thus, I1 will have: S’ ® S., $
I2 = GOTO (I0, C)
= Closure (S ® C.C, $)
I7 = GOTO (I2, d)
= Closure (C ® d., $)
Thus, I7 will have: C ® d., $
I8 = GOTO (I3, C)
= Closure (C ® cC., c/d)
Thus, I8 will have:
C ® cC., c/d
And GOTO (I3, c) = Closure (C ® c.C, c/d)
= I3
And GOTO (I3, d) = Closure (C ® d., c/d)
= I4
No transition on I4 and I5.
So, I9 = GOTO (I6, C)
= Closure (C ® cC., $)
Thus, I9 will have:
C ® cC., $
Action                  GOTO
State    c      d      $       S    C
0        S3     S4             1    2
1                      acc
2        S6     S7                  5
3        S3     S4                  8
4        r3     r3
5                      r1
6        S6     S7                  9
7                      r3
8        r2     r2
9                      r2
26. Discuss the algorithm for computation of the sets of LR(1) items. Also, show that the following grammar is LR(1) but not LALR(1).
G: S ® Aa │ bAc │ Bc │ bBa
A ® d
B ® d
Ans: The augmented grammar G’ for the above grammar will be:
S’® S
S ® Aa
S ® bAc
S ® Bc
S ® bBa
A ® d
B ® d
The LR(1) items for I0 are:
S’ ® .S, $
+S ® .Aa, $
+S ® .bAc, $
+S ® .Bc, $
+S ® .bBa, $
+A ® .d, a
+B ® .d, c
Action                                        Goto
State    a         b      c         d      $      S    A    B
0                  S3               S5            1    2    4
1                                          acc
2        S6
3                                   S9                 7    8
4                         S10
5, 9     r5, r6           r6, r5
6                                          r1
7                         S11
8        S12
10                                         r3
11                                         r2
12                                         r4
In the LR(1) parser, the reductions A ® d and B ® d occur in two different states with disjoint lookaheads, so the grammar is LR(1). However, when the LR(1) states with the same core are merged to construct the LALR(1) parser, the two states containing A ® d. and B ® d. combine into the single state 5, 9 shown above, which contains both reductions on the lookaheads a and c. This reduce-reduce conflict shows that the grammar is not LALR(1).
Multiple-Choice Questions
1. The most common non-backtracking shift-reduce parsing technique is known as —————.
(a) LL parsing (b) LR parsing
(c) Top-down parsing (d) Bottom-up parsing
Answers
1. (b) 2. (b) 3. (c) 4. (a) 5. (c) 6. (a) 7. (a) 8. (d)
6
Syntax-directed Translations
Figure: An input string is parsed into a parse tree, from which a dependency graph is built and an evaluation order for the semantic rules is obtained
its properties, and the rules are associated with productions. If A is a symbol and i is one of its attributes,
then we can write A.i to denote the value of i at a particular parse tree node. An attribute has a name
and an associated value which can be a string, a number, a type, a memory location, or an assigned
register. The attribute value may depend on its child nodes or siblings or its parent node information.
The syntax-directed definition is partitioned into two subsets, namely, synthesized and inherited
attributes. Semantic rules are used to set up dependencies between attributes that will be represented
by a graph. An evaluation order for the semantic rules can be derived from the dependency graph. The
values of the attributes at the nodes in parse tree are defined by the evaluation of the semantic rules.
Inherited attributes: An inherited attribute for a non-terminal A at a parse tree node I is defined by
a semantic rule associated with the production at the parent of I, and the production must have A as a
symbol in its body. The value of an inherited attribute at node I can only be defined in terms of attribute
values of I’s parents, I’s siblings, and I itself. Inherited attributes are convenient for expressing the
dependence of a programming language construct on the context on which it appears. For example, an
inherited attribute can be used to keep track of whether an identifier appears on the left side or right
side of an assignment operator in order to determine whether the address or the value of the identifier
is required.
For example, consider the following grammar:
E ® AB
A ® int
B ® B1,id
B ® id
The syntax-directed definition that uses inherited attributes can be written as:
Production        Semantic Rules
E ® AB            B.inh = A.type
A ® int           A.type = integer
B ® B1,id         B1.inh = B.inh; enter(id.entry, B.inh)
B ® id            enter(id.entry, B.inh)
The non-terminal symbol A in the productions has the synthesized attribute type, whose value is determined by the keyword in the declaration. The semantic rule B.inh = A.type sets the inherited attribute B.inh to the type in the declaration.
The parse tree with the attributes values at the parse tree nodes, for an input string int id1,id2,id3
is shown in Figure 6.4. The type of identifiers id1, id2, and id3 is determined by the value of B.inh
at the three B nodes. These values are obtained by computing the value of the attribute A.type at the
left child of the root and then evaluating B.inh in top-down at the three B nodes in the right subtree
of the root. We also call the procedure enter at each B node to insert into the symbol table, where the
identifier at the right child of this node is of type int.
Figure 6.4 Parse Tree for the String int id1, id2, id3 with Inherited Attributes
5. Define annotated parse tree with example.
Ans: An annotated parse tree is a parse tree that displays the values of the attributes at each node.
It is used to visualize the SDD specified translations. To construct an annotated parse tree, first the
SDD rules are applied to construct a parse tree and then the same SDD rules are applied to evaluate the
attributes at each node of the parse tree. For example, if all the attributes are synthesized then we must
evaluate the attribute values of all the children of a node before evaluating the attribute value of the node
itself.
For example, an annotated parse tree for an expression 3 + 5 * 2 by considering the productions
and semantic rules of Figure 6.2, is shown in Figure 6.5.
Figure 6.5 Annotated Parse Tree for the Expression 3 + 5 * 2 (the root is annotated with E.val = 13)
6. What is dependency graph? Write the algorithm to construct a dependency graph for a
given parse tree.
Ans: A dependency graph represents the flow of information between the attribute instances in a
parse tree. It is used to depict the interdependencies among the inherited and synthesized attributes at
the nodes in a parse tree. When the value of one attribute is needed to compute the value of another, an edge from the first attribute instance to the second is created to indicate the dependency among the attribute instances. That is, if an attribute x at a node in a parse tree depends on an attribute y, then the semantic rule for y at that node must be evaluated before the semantic rule that defines x.
To construct a dependency graph for a parse tree, we first make each semantic rule in the form
x: = f(y1,y2,y3, . . . ,yk) by introducing a dummy synthesized attribute x for each semantic rule
that consists of a procedure call. We then create a node for each attribute in the graph and an edge from
node y to the node x, if attribute x depends on attribute y.
The algorithm for constructing the dependency graph for a given parse tree is shown in Figure 6.6.
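The following is a minimal Python sketch of this construction: each semantic rule is given as a pair (target attribute, attributes it depends on), an edge is drawn from every y to its dependent x, and a topological sort yields an evaluation order. The attribute names below are taken from the 7 + 8 example of the next question; the list-of-pairs encoding is this sketch's own.

from collections import defaultdict, deque

rules = [                         # (target, attributes the target depends on)
    ("B.val",  ["digit.lexval"]),
    ("A'.inh", ["B.val"]),
    ("A'.syn", ["A'.inh"]),
    ("A.val",  ["A'.syn"]),
]

def evaluation_order(rules):
    edges = defaultdict(list)     # edge y -> x whenever x depends on y
    indegree = defaultdict(int)
    nodes = set()
    for x, deps in rules:
        nodes.add(x)
        for y in deps:
            nodes.add(y)
            edges[y].append(x)
            indegree[x] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in edges[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("circular dependency among attributes")
    return order

print(evaluation_order(rules))
# ['digit.lexval', 'B.val', "A'.inh", "A'.syn", 'A.val']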
7. Construct a dependency graph for the input string 7 + 8, by considering the following
grammar:
A ® BA’
A’® + BA1’| Î
B ® digit
Ans: The semantic rules for the given grammar productions can be written as (this is the SDD referred to below as Figure 6.7):
Production        Semantic Rules
A ® BA’           A’.inh = B.val; A.val = A’.syn
A’ ® + BA1’       A1’.inh = A’.inh + B.val; A’.syn = A1’.syn
A’ ® Î            A’.syn = A’.inh
B ® digit         B.val = digit.lexval
The SDD in Figure 6.7 is used to compute 7 + 8, and the parsing begins with the production A ® BA’. Here, B generates the digit 7, but the operator + is generated by A’. As the left operand 7 appears in a different subtree of the parse tree from +, an inherited attribute is used to pass the operand to the operator.
Figure 6.8 Annotated Parse Tree for the Expression 7 + 8
Figure 6.9 Dependency Graph for the Annotated Parse Tree of Figure 6.8 (the attribute instances are numbered 1 to 9: nodes 1 and 2 for the two digit.lexval attributes, nodes 3 and 4 for the two B.val attributes, nodes 5 and 6 for the inh attributes of A’ and A1’, nodes 7 and 8 for their syn attributes, and node 9 for A.val)
The two leaves digit are associated with attribute lexval and are represented by nodes 1 and 2.
The two nodes labeled B are associated with the attribute val and are represented by the nodes 3 and 4.
The edges from node 1 to node 3 and from node 2 to node 4 use the semantic rule that defines B.val
in terms of digit.lexval.
Each occurrence of the non-terminal A’ is associated with the inherited attribute inh, represented by nodes 5 and 6. The edge from node 3 to node 5 is due to the rule A’.inh = B.val. The edges from node 5 to node 6 (for A’.inh) and from node 4 to node 6 (for B.val) exist because these values are added to calculate the attribute inh at node 6.
The synthesized attribute syn associated with the occurrences of A' is represented by nodes 7 and 8. The edge from node 6 to node 7 is due to the semantic rule A'.syn = A'.inh associated with production 3. The edge from node 7 to node 8 is due to the semantic rule A'.syn = A1'.syn associated with production 2. Node 9 represents the attribute A.val, and the edge from node 8 to node 9 is due to the semantic rule A.val = A'.syn associated with production 1.
8. What are S-attributed definitions and L-attributed definitions?
Ans: S-attributed definitions: A syntax-directed definition is called S-attributed if all its attributes are synthesized. The attributes of an S-attributed SDD can be evaluated bottom-up, by performing a post-order traversal of the parse tree. In a post-order traversal, we evaluate the attributes at a node N when the traversal leaves N for the last time; that is, we apply the postorder function given in Figure 6.10 to the root of the parse tree.
postorder(N)
Begin
For each child C of N, from left to right
postorder(C);
Evaluate the attributes associated with node N
End
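The following is a minimal C sketch of this post-order evaluation; the Node layout and the particular semantic rule (a parent's val is the sum of its children's val attributes) are illustrative assumptions, not a specific SDD from the text.

    #include <stdio.h>

    #define MAXC 4

    typedef struct Node {
        int nchildren;
        struct Node *child[MAXC];
        int val;                 /* the synthesized attribute */
    } Node;

    void postorder(Node *n)
    {
        for (int i = 0; i < n->nchildren; i++)
            postorder(n->child[i]);          /* children first ...       */
        if (n->nchildren > 0) {              /* ... then the node itself */
            n->val = 0;
            for (int i = 0; i < n->nchildren; i++)
                n->val += n->child[i]->val;
        }
    }

    int main(void)
    {
        Node leaf1 = {0, {0}, 3}, leaf2 = {0, {0}, 5};
        Node root  = {2, {&leaf1, &leaf2}, 0};
        postorder(&root);
        printf("%d\n", root.val);   /* prints 8 */
        return 0;
    }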
L-attributed definition: An L-attributed definition is another class of SDD in which the dependency-graph edges can only go from left to right and not vice versa. Each attribute in an L-attributed definition must be either synthesized or inherited, and the inherited attributes must follow certain rules. Assume we have a production A → Y1Y2 . . . Yn, and Yi.a is an inherited attribute evaluated by a rule associated with this production. Then the rule may only use:
q Inherited attributes associated with the head A.
q The attributes (either synthesized or inherited) associated with the occurrences of the symbols Y1, Y2, . . . , Yi−1 (that is, the symbols to the left of Yi in the production).
q Synthesized or inherited attributes associated with this occurrence of Yi itself, in such a way that there are no cycles in the dependency graph formed by the attributes of Yi.
For example, consider the syntax-directed definitions given in Figure 6.7. To prove that the SDD in
Figure 6.7 is L-attributed, consider the semantic rules for inherited attributes as shown in Figure 6.11.
The syntax-directed definition above is an example of an L-attributed definition, because the inherited attribute A'.inh is defined using only B.val, and B appears to the left of A' in the production A → BA'. Similarly, the inherited attribute A1'.inh in the second rule is defined using the inherited attribute A'.inh associated with the head, and B.val, where B appears to the left of A1' in the production A' → + BA1'. In both cases, the rules for L-attributed definitions are followed, and the remaining attributes are synthesized (as shown in Figure 6.7). Therefore, this SDD is L-attributed.
9. Discuss the applications of syntax-directed translation.
Ans: Syntax-directed translations are applied in the following techniques:
q Construction of syntax tree: Syntax tree is used as an intermediate representation in some
compilers and, hence, a common form of SDD converts its input string into the syntax tree. To
construct the syntax tree for expressions, we can use either an S-attributed or an L-attributed
definition. The S-attributed definitions are suitable to use in bottom-up parsing, whereas the
L-attributed definitions are suitable to use in top-down parsing.
q Type checking: Type checking is used to catch errors in the program by checking the type of each variable, constant, function, and expression. Static type checking eliminates the need for dynamic checking for type errors.
q Intermediate code generation: Intermediate codes are closer to the machine instructions than the source program is. Syntax-directed translation can be used to produce intermediate forms such as postfix notation, syntax trees, and three-address code.
10. What is a syntax tree? Explain the procedure for constructing a syntax tree with the
help of an example.
Ans: A syntax tree or an abstract syntax tree (AST) is a tree representation showing the syntactic structure of the source program. It is a compressed form of a parse tree representing the hierarchical construction of a program, where the nodes represent operators and the children of any node represent the operands on which that operator operates. For example, the syntax tree for the expression p * (q + r)/s is shown in Figure 6.12.
[Figure 6.12 A Simple (Abstract) Syntax Tree: the root / has children * and s; the * node has children p and +; the + node has children q and r.]
The construction of a syntax tree for an expression can be considered as the translation of the expression into postfix form. The subtrees are constructed for the subexpressions by creating a node for each operator and operand. The children of an operator node are the roots of the nodes representing the subexpressions constituting the operands of that operator.
The nodes of a syntax tree are implemented as objects having several fields. Each node is labeled by
the op field, which is often called the label of the node. When used for translation, the nodes in a syntax
tree may have additional fields to hold the values of attributes attached to the node, which are as follows:
q For a leaf node, an additional field is required to hold the lexical value of the leaf. A constructor
function Make-leaf(num, val) or Make-leaf(id, entry) is used to create a leaf
object.
q If the node is an interior node, a constructor function Make-node(op,left,right) is used
to create an object with first field op and two additional fields for its left and right children.
For example, consider the expression x - 7 + z. To create the nodes of syntax trees for expressions with binary operators, we need the following functions:
q Make-node (op, left, right) creates an operator node with label op and two fields containing
pointers to left and right children.
q Make-leaf(id, entry) creates an identifier node with label id and a field containing entry,
a pointer to the symbol table entry for the identifier.
q Make-leaf(num, val) creates a number node with label num and a field containing val, the
value of the number.
Consider the S-attributed definition shown in Figure 6.13, which constructs the syntax tree for expressions involving only the binary operators + and -. All the non-terminals have only one synthesized attribute, node, which represents a node of the syntax tree.
[Figure: the syntax tree for x - 7 + z. The root + has children - and id (pointing to the entry for z); the - node has children id (pointing to the entry for x) and num 7. Dotted lines show the underlying parse tree, and dashed lines show the A.node and B.node attribute values pointing into the syntax tree.]
To create the syntax tree for the expression x - 7 + z, we need a sequence of function calls,
where p1, p2, p3, p4, p5 are the pointers to nodes, and entry x and entry z are pointers to the symbol
table entries for identifiers x and z, respectively.
1. p1 := Make-leaf(id, entry x);
2. p2 := Make-leaf(num, 7);
3. p3 := Make-node('-', p1, p2);
4. p4 := Make-leaf(id, entry z);
5. p5 := Make-node('+', p3, p4);
The tree is constructed in a bottom-up fashion. The function calls Make-leaf(id, entry x) and Make-leaf(num, 7) construct the leaves for x and 7; the pointers to these nodes are saved in p1 and p2.
The function call Make-node('-', p1, p2) constructs an interior node with the leaves for x and 7 as children. The same procedure is followed for pointers p4 and p5, which finally results in p5 pointing to the root of the constructed syntax tree.
In the figure, the edges of the syntax tree are shown as solid lines, the underlying parse tree is shown with dotted lines, and the dashed lines represent the values of A.node and B.node; each dashed line points to the appropriate node of the syntax tree.
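The constructor functions above can be sketched in C as follows; the SyntaxNode layout and the single-character labels are illustrative assumptions, not the book's exact implementation.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct SyntaxNode {
        char op;                 /* label: an operator, or 'i' (id) / 'n' (num) */
        struct SyntaxNode *left, *right;
        void *entry;             /* symbol-table entry (for id leaves)  */
        int val;                 /* lexical value (for num leaves)      */
    } SyntaxNode;

    SyntaxNode *make_node(char op, SyntaxNode *l, SyntaxNode *r)
    {
        SyntaxNode *n = malloc(sizeof *n);
        n->op = op; n->left = l; n->right = r;
        n->entry = NULL; n->val = 0;
        return n;
    }

    SyntaxNode *make_leaf_id(void *entry)
    {
        SyntaxNode *n = make_node('i', NULL, NULL);
        n->entry = entry;
        return n;
    }

    SyntaxNode *make_leaf_num(int val)
    {
        SyntaxNode *n = make_node('n', NULL, NULL);
        n->val = val;
        return n;
    }

    int main(void)
    {
        /* Build the tree for x - 7 + z, mirroring the calls p1..p5 above;
           the symbol-table entries are stand-ins. */
        int entry_x, entry_z;
        SyntaxNode *p1 = make_leaf_id(&entry_x);
        SyntaxNode *p2 = make_leaf_num(7);
        SyntaxNode *p3 = make_node('-', p1, p2);
        SyntaxNode *p4 = make_leaf_id(&entry_z);
        SyntaxNode *p5 = make_node('+', p3, p4);
        printf("root operator: %c\n", p5->op);   /* prints '+' */
        return 0;
    }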
Multiple-Choice Questions
1. Which of the following is not true for SDT?
(a) It is an extension of CFG.
(b) Parsing process is used to do the translation.
(c) It does not permit the subroutines to be attached to the production of a CFG.
(d) It generates the intermediate code.
2. A parse tree with attribute ————— at each node is known as an annotated parse tree.
(a) Name (b) Value
(c) Label (d) None of these
3. Which of the following is true for a dependency graph?
(a) The dependency graph helps to determine how the attribute values are computed.
(b) It depicts the flow of information among the attribute instances in a parse tree.
(c) Both (a) and (b)
(d) None of these
4. An SDD is S-attributed if every attribute is —————.
(a) Inherited (b) Synthesized
(c) Dependent (d) None of these
5. In L-attributed definitions, the dependency graph edges can go from ————— to —————.
(a) Left to right (b) Right to left
(c) Top to bottom (d) Bottom to top
6. Which of the following is not true for an abstract syntax tree?
(a) It is a compressed form of a parse tree.
(b) It represents the syntactic structure of the source program.
(c) The nodes of the tree represent the operands.
(d) None of these
7. Which of the following is not true for syntax-directed translation schemes?
(a) It is a CFG with program fragments embedded within production bodies.
(b) The semantic actions appear at a fixed position within a production body.
(c) They can be considered as a complementary notation to syntax-directed definitions.
(d) None of these
Answers
1. (c) 2. (b) 3. (c) 4. (b) 5. (a) 6. (c) 7. (b)
7
Intermediate Code Generation
l It retains the program structure as it is nearer to the source program.
l It can be constructed easily from the source program.
l It is not possible to break the source program to extract the levels of code sharing, due to which code optimization in this representation becomes a bit complex.
q Low-level intermediate representation: This representation is closer to the target machine where
it represents the low-level structure of a program. It is appropriate for machine-dependent tasks
like register allocation and instruction selection. A typical example of this representation is three-
address code. The critical features of low-level representation are given as follows:
l It is near to the target machine.
l It makes it easier to generate the object code.
l Considerable effort is required to translate the source program into the low-level representation.
Source program → High-level intermediate representation → . . . → Low-level intermediate representation → Target (object) code
7. What is a three-address code? What are its types? How is it implemented?
Ans: A string of the form X := Y op Z, in which op is a binary operator, Y and Z are the addresses of the operands, and X is the address of the result of the operation, is known as a three-address statement. The operator op can be a fixed- or floating-point arithmetic operator, or a logical operator. X, Y, and Z can be constants, names predefined by the programmer, or temporary names generated by the compiler. This statement is named the "three-address statement" because of the
usage of three addresses, one for the result and two for the operands. The sequence of such three-address
statements is known as three-address code. The complicated arithmetic expressions are not allowed
in three-address code, because only a single operation is allowed per statement. For example, consider the expression A + B * C. This expression contains more than one operator, so it cannot be represented as a single three-address statement. Hence, the three-address code of the given expression is as follows:
T1 := B * C
T2 := A + T1
where, T1 and T2 are the temporary names generated by the compiler.
q Address and pointer assignment statements: These statements can be represented in the
following forms:
l X := addr Y defines that X is assigned the address of Y.
l X := *Y defines that X is assigned the content of the location pointed to by Y.
l *X := Y sets the r-value of the object pointed to by X to the r-value of Y.
q Jump statements: Jump statements are of two types, conditional and unconditional; the conditional jumps work with relational operators. They are represented in the following forms:
l The unconditional jump is represented as goto L, where L is a label. This instruction means that the three-address statement labeled L is the next to be executed.
l The conditional jump is represented as if X relop Y goto L, where relop signifies a relational operator (≤, =, >, etc.) applied between X and Y. This instruction implies that if the result of the expression X relop Y is true, then the statement labeled L is executed; otherwise, the statement immediately following if X relop Y goto L is executed.
q Procedure call/return statements: These statements can be defined in the following forms:
l param X and call P, n, which are typically used in a three-address statement sequence as follows:
param X1
param X2
.
.
.
param Xn
call P, n
Here, the sequence of three-address statements is generated as a part of call of the procedure P(X1,
X2, . . . , Xn), and n in call P, n is defined as an integer specifying the total number of actual
parameters in the call.
l Y = call P, n represents a function call.
l return Y represents the return statement, where Y is the returned value.
8. What are quadruples? Give the quadruple representation of the statement S := -z/(a * (x + y)).
Ans: The three-address code for the given statement is as follows:
t1 := x + y
t2 := a * t1
t3 := -z
t4 := t3/t2
S := t4
The quadruple representation of this three-address code is shown in Figure 7.3.
9. Define triples and indirect triples. Give suitable examples for each.
Ans: Triples: A triple is a record structure used to represent a three-address statement. In triples, three fields are used to represent any three-address statement, namely, operator, operand 1, and operand 2, where operand 1 and operand 2 are pointers either to the symbol table or to records (for temporary variables) within the triple representation itself. In this representation, the result field is removed to eliminate the use of temporary names referring to symbol-table entries; instead, we refer to results by their positions. Pointers into the triple structure are represented by parenthesized numbers, whereas symbol-table pointers are represented by the names themselves. The triple representation of the expression (in the previous question) is shown in Figure 7.4.
In the triple representation, the ternary operations X[I] := Y and X := Y[I] are represented by using two entries in the triple structure, as shown in Figure 7.5(a) and (b), respectively. For the operation X[I] := Y, the names X and I are put in one triple, and Y is put in another. Similarly, for the operation X := Y[I], we can write two instructions, t := Y[I] and X := t. Note that instead of referring to the temporary t by its name, we refer to it by its position in the triple structure.
Indirect triples: An indirect triple representation uses an additional array that lists pointers to the triples in the desired execution order. Let us define an array A that contains these pointers. The indirect triple representation for the statement S given in the previous question is shown in Figure 7.6.
The main advantage of indirect triple representation is that an optimizing compiler can move an
instruction by simply reordering the array A, without affecting the triples themselves.
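A minimal C sketch of the two representations is given below; the record layout is an illustrative assumption. The point to note is that, in the indirect scheme, the array A is all an optimizer needs to reorder.

    #define MAXT 100

    typedef struct {
        const char *op;     /* operator: "+", "*", "uminus", "=", . . .   */
        const char *arg1;   /* a symbol-table name, or NULL when the      */
        const char *arg2;   /* operand is the result of an earlier triple */
        int pos1, pos2;     /* positions of those earlier triples, or -1  */
    } Triple;

    Triple triples[MAXT];   /* results are referred to by position, so the
                               triples themselves are hard to reorder     */
    int A[MAXT];            /* indirect triples: the additional array A of
                               pointers (indices); an optimizing compiler
                               reorders this array without touching the
                               triples themselves                         */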
10. Explain Boolean expressions. What are the different methods available to translate
Boolean expression?
Ans: Boolean operators, like AND(&&), OR(||) and NOT(!), play an important role in
constructing the Boolean expressions. These operators are applied to either relational expressions or
Boolean variables. In programming languages, Boolean expressions serve two main functions, given
as follows:
q Boolean expressions can be used as conditional expressions in the statements that alter the flow of
control, such as in while- or if-then-else statements.
q Boolean expressions are also used to compute the logical values.
There are two main methods for translating Boolean expressions into three-address code.
q Numerical representation: In the first method, Boolean values are encoded numerically: a nonzero number (typically 1) indicates true and zero indicates false. Expressions are evaluated from left to right, like arithmetic expressions. For example, consider a Boolean expression X and Y or Z; the translation of this expression into three-address code is as follows:
t1 := X and Y
t2 := t1 or Z
Now, consider a relational expression if X > Y then 1 else 0; the three-address code translation for this expression is as follows:
1. if X > Y goto (4)
2. t1 := 0
3. goto (5)
4. t1 := 1
5. Next
Here, t1 is a temporary variable that can have the value 1 or 0 depending on whether the
condition is evaluated to true or false. The label Next represents the statement immediately
following the else part.
q Control-flow representation: In the second method, the Boolean expression is translated into
three-address code based on the flow of control. In this method, the value of a Boolean expression
is represented by a position reached in a program. In case of evaluating the Boolean expressions
by their positions in program, we can avoid calculating the entire expression. For example, if a
Boolean expression is X and Y, and if X is false, then we can conclude that the entire expression is
false without having to evaluate Y. This method is useful in implementing the Boolean expressions
in control-flow statements such as if-then-else and while-do statements. For example, we
consider the Boolean expressions in context of conditional statements such as
l If X then S1 else S2
l while X do S
In the first statement, if X is true, the control jumps to the first statement of the code for S1, and if X is false, the control jumps to the first statement of S2, as shown in Figure 7.7(a). In the case of the second statement, when X is false, the control jumps to the statement immediately following the while statement, and if X is true, the control jumps to the first statement of the code for S, as shown in Figure 7.7(b).
[Figure 7.7 Control-flow code layouts: (a) code for X with a true exit to the code for S1 and a false exit to the code for S2; (b) code for X with a true exit to the code for S and a false exit past the loop.]
[Figure: layouts of a two-dimensional array X with two rows and three columns. In row-major form, the elements are stored row by row (X[0][0], X[0][1], X[0][2], X[1][0], . . .); in column-major form, they are stored column by column (X[0][0], X[1][0], X[0][1], X[1][1], . . .).]
If the elements of a two-dimensional array X[n1][n2] are stored in a row-major form, the relative
address of an array element X[i1][i2] is calculated as follows:
base + (i1 * n2 + i2) * w
On the other hand, if the elements are stored in a column-major form, the relative address of X[i1]
[i2] is calculated as follows:
base + (i1 + i2 * n1) * w
The row-major and column-major forms can be generalized to k dimensions. If we generalize the row-major form, the elements are stored in such a way that, as we scan down a block of storage, the rightmost subscripts appear to vary fastest. In the column-major form, the leftmost subscripts appear to vary fastest.
In general, the elements of a one-dimensional array need not be numbered 0, 1, . . . , n - 1; they can be numbered low, low + 1, . . . , high. The address of an array reference X[i] can then be calculated as base + (i - low) * w, where base is the relative address of X[low].
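The two formulas above can be checked with a small C sketch; the function names are illustrative.

    #include <stdio.h>

    /* Row-major: base + (i1 * n2 + i2) * w */
    int addr_row_major(int base, int i1, int i2, int n2, int w)
    {
        return base + (i1 * n2 + i2) * w;
    }

    /* Column-major: base + (i1 + i2 * n1) * w */
    int addr_col_major(int base, int i1, int i2, int n1, int w)
    {
        return base + (i1 + i2 * n1) * w;
    }

    int main(void)
    {
        /* A 2 x 3 array of 4-byte elements with base address 100: X[1][1] */
        printf("%d\n", addr_row_major(100, 1, 1, 3, 4)); /* 100 + 4*4 = 116 */
        printf("%d\n", addr_col_major(100, 1, 1, 2, 4)); /* 100 + 3*4 = 112 */
        return 0;
    }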
13. Explain the translation of array references.
Ans: The main problem in translating and generating intermediate code for array references is to relate the address-calculation formulas to a grammar for array references. Consider the following grammar, where the non-terminal M generates an array name followed by a sequence of index expressions:
M → M[A] | digit[A]
Figure 7.9 shows the translation scheme that generates three-address code for expressions with array
references. It comprises the productions and semantic routines for generating three-address code for
expressions incrementally. In addition, it comprises the productions involving the non-terminal M. We
have also assumed that the addresses are calculated using the formula (1), which is based on the width
of the array elements.
S → digit = A ;  {gen(top.get(digit.lexeme) '=' A.addr);}
  | M = A ;      {gen(M.array.base '[' M.addr ']' '=' A.addr);}
A → A1 + A2      {A.addr = new Temp();
                  gen(A.addr '=' A1.addr '+' A2.addr);}
  | digit        {A.addr = top.get(digit.lexeme);}
  | M            {A.addr = new Temp();
                  gen(A.addr '=' M.array.base '[' M.addr ']');}
M → digit[A]     {M.array = top.get(digit.lexeme);
                  M.type = M.array.type.elem;
                  M.addr = new Temp();
                  gen(M.addr '=' A.addr '*' M.type.width);}
  | M1[A]        {M.array = M1.array;
                  M.type = M1.type.elem;
                  t = new Temp();
                  M.addr = new Temp();
                  gen(t '=' A.addr '*' M.type.width);
                  gen(M.addr '=' M1.addr '+' t);}
In Figure 7.9, the non-terminal M has three synthesized attributes: M.addr, M.array, and M.type. Here, M.addr denotes a temporary used while computing the offset terms ij * wj, M.array denotes a pointer to the symbol-table entry for the array name, and M.type is the type of the subarray generated by M.
In the semantic actions of Figure 7.9, the first production S → digit = A shows an assignment to a non-array variable. In the second production S → M = A, an indexed copy instruction is generated by the semantic action to assign the value of expression A to the location of the array reference M. As discussed earlier, the symbol-table entry for the array is obtained through the attribute M.array. The attribute M.array.base gives the base address of the array, which is the address of its 0th element, and the attribute M.addr is a temporary variable that holds the offset for the array reference generated by M. Thus, M.array.base[M.addr] gives the location for the array reference. The r-value from address A.addr is copied into M's location by the generated instruction.
The production A → M has a semantic rule associated with it that generates code copying the value at the location M.array.base[M.addr] into the temporary A.addr.
S → double doublelist
  | float floatlist
doublelist → id , doublelist
  | id
floatlist → id , floatlist
  | id
This approach is based on the assumption that the LR parser would be able to decide whether
to reduce the first id to doublelist or floatlist. This approach is not desirable for large
numbers of attributes because as the number of attributes increases, the number of productions also
increases. This would create a decision problem for the parser.
q In the second approach, we can simply rewrite the grammar rules by considering the translation of
names just as a list of names. Now, the above rules can be rewritten as follows:
S → S , id
  | double id
  | float id
Now, we can define the semantic actions for these rules as follows:
S → double id   {Enter(id.place, double); S.attr := double}
S → float id    {Enter(id.place, float); S.attr := float}
S → S1 , id     {Enter(id.place, S1.attr); S.attr := S1.attr}
These semantic actions enter the appropriate attribute into the symbol table for each name on the list. Here, S.attr is the translation attribute for the non-terminal S, the procedure Enter(p, x) associates the attribute x with the symbol-table entry pointed to by p, and id.place points to the symbol-table entry for the name represented by the token id.
switch E
begin
case V1: S1
case V2: S2
.
.
.
case Vn-1: Sn-1
default: Sn
end
where
E is an expression to be evaluated;
V1, V2, . . . , Vn-1 are distinct constant values known as case labels;
S1, S2, . . . , Sn-1 are the statements executed when the corresponding case value is matched, and Sn is the default statement.
The case values are constants against which the selector expression E is matched. First, E is evaluated, and the resultant value is compared with the case values; the statement sequence associated with the matched case value is then executed. The default statement Sn is executed if no other value matches.
Syntax directed translation of case statements: A simple syntax directed translation scheme
translates the case statements into an intermediate code as shown in Figure 7.10.
When the switch keyword is encountered, two labels SAMPLE and REF and a temporary variable t are generated. As we parse, we find the expression E and generate code to evaluate this expression into the temporary t. When E has been processed, we generate the jump goto REF.
On the occurrence of each case keyword, a new label Ai is created and entered into the symbol table.
The cases are stored on a stack, which contains a pointer to this symbol entry along with the value Vi
of each case constant. The evaluated expression in temporary t is matched with the available values V1,
V2, . . . , Vn-1 and if a value match occurs then the corresponding statements are executed. If no value is
matched, then the default case An is executed.
Note that all the test expressions appear at the end. This enables a simple code generator to recognize the
multiway branch and to generate an efficient code for it. If the branching conditions are placed at the beginning
then the compiler would have to perform extensive analysis to generate the most efficient implementation.
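A sketch of the generated layout consistent with the description above is given below; the label names SAMPLE and REF come from the text, and treating SAMPLE as the label of the statement following the switch is an assumption.

            code to evaluate E into t
            goto REF
    A1:     code for S1
            goto SAMPLE
    A2:     code for S2
            goto SAMPLE
            . . .
    An:     code for Sn (the default)
            goto SAMPLE
    REF:    if t = V1 goto A1
            if t = V2 goto A2
            . . .
            if t = Vn-1 goto An-1
            goto An
    SAMPLE: (first statement after the switch)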
17. What is backpatching? Explain.
Ans: Syntax-directed definitions can be easily implemented by using two passes: in the first pass, we construct a syntax tree for the input, and in the second pass, we traverse the tree in depth-first order to complete the translations given in the definition. However, generating code for flow-of-control statements and Boolean expressions in a single pass is difficult, because we may not yet know the labels that control must go to at the time the jump statements are generated. Thus, the generated code would be a series of branching statements in which the targets of the jumps are temporarily left unspecified. To overcome this problem, we use backpatching, a technique for replacing the symbolic names in goto statements with the actual target addresses. Some languages do not permit symbolic names in branches; for these, we maintain a list of branches that have the same target label and fill in the target once it is defined. To manipulate the lists of jumps, the following three functions are used (a C sketch of these functions follows the list):
q makelist(i): This function creates a new list containing an index i into the array of statements
and then returns a pointer pointing to the newly created list.
q merge(p1, p2): This function concatenates the two lists pointed to by the pointers p1 and p2
respectively, and then returns a pointer to the concatenated list.
q backpatch(p, i): This function inserts i as the target labels for each of the statements on the
list pointed to by p.
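The following is a minimal C sketch of these three functions, assuming the jump instructions live in an array and that a list of unfilled jumps is chained through the not-yet-filled target field; this encoding and the Jump record are illustrative assumptions.

    #define NIL (-1)

    typedef struct {
        int target;              /* jump target; NIL while unfilled */
    } Jump;

    Jump code[1000];             /* the array of generated instructions */

    /* makelist(i): create a list containing only the index i. */
    int makelist(int i)
    {
        code[i].target = NIL;
        return i;
    }

    /* merge(p1, p2): concatenate the lists p1 and p2; either may be NIL. */
    int merge(int p1, int p2)
    {
        if (p1 == NIL)
            return p2;
        int i = p1;
        while (code[i].target != NIL)
            i = code[i].target;  /* walk to the end of the first list */
        code[i].target = p2;     /* splice the second list on         */
        return p1;
    }

    /* backpatch(p, i): fill i in as the target of every jump on list p. */
    void backpatch(int p, int i)
    {
        while (p != NIL) {
            int next = code[p].target;
            code[p].target = i;
            p = next;
        }
    }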
18. Translate the expression a: = -b * (c + d)/e into quadruples and triple
representation.
Ans: The three-address code for the given expression is given below:
t1 := -b (here, '-' represents unary minus)
t2 := c + d
t3 := t1 * t2
t4 := t3/e
a := t4
The quadruple representation of this three-address code is shown in Figure 7.11.
The triple representation for the given expression is shown in Figure 7.14.
20. Generate the three-address code for the following program segment.
main()
{
int k = 1;
int a[5];
while (k <= 5)
{
a[k] = 0;
k++;
}
}
Ans: The three-address code for the given program segment is given below:
1. k := 1
2. if k <= 5 goto (4)
3. goto (10)
4. t1 := k * width
5. t2 := addr(a) - width
6. t2[t1] := 0
7. t3 := k + 1
8. k := t3
9. goto (2)
10. Next
21. Generate the three-address code for the following program segment
while(x < z and y > s) do
if x = 1 then
z = z + 1
else
while x <= s do
x = x + 10;
Ans: The three-address code for the given program segment is given below:
1. if x < z goto (3)
2. goto (16)
3. if y > s goto (5)
4. goto (16)
5. if x = 1 goto (7)
6. goto (10)
7. t1 := z + 1
8. z := t1
9. goto (1)
10. if x <= s goto (12)
11. goto (15)
12. t2 := x + 10
13. x := t2
14. goto (10)
15. goto (1)
16. Next
22. Consider the following code segment and generate the three-address code for it.
for (k = 1; k <= 12; k++)
if x < y then a = b + c;
Ans: The three-address code for the given program segment is given below:
1. k := 1
2. if k <= 12 goto (4)
3. goto (11)
4. if x < y goto (6)
5. goto (8)
6. t1 := b + c
7. a := t1
8. t2 := k + 1
9. k := t2
10. goto (2)
11. Next
23. Translate the following statement, which alters the flow of control of expressions, and
generate the three-address code for it.
while(P < Q)do
if(R < S) then a = b + c;
Ans: The three-address code for the given statement is as follows:
1. if P < Q goto (3)
2. goto (8)
3. if R < S goto (5)
4. goto (1)
5. t1 := b + c
6. a := t1
7. goto (1)
8. Next
24. Generate the three-address code for the following program segment where, x and y are
arrays of size 10 * 10, and there are 4 bytes/word.
begin
add = 0
a = 1
b = 1
do
begin
add = add + x[a,b] * y[a,b]
a = a + 1
b = b + 1
end
while a <= 10 and b <= 10
end
Ans: The three-address code for the given program segment is given below (both arrays are stored in row-major form, so the offset of x[a,b] is (a * 10 + b) * 4 relative to addr(x) - 44 for the 1-based bounds):
1. add := 0
2. a := 1
3. b := 1
4. t1 := a * 10
5. t1 := t1 + b
6. t1 := t1 * 4
7. t2 := addr(x) - 44
8. t3 := t2[t1]
9. t4 := a * 10
10. t4 := t4 + b
11. t4 := t4 * 4
12. t5 := addr(y) - 44
13. t6 := t5[t4]
14. t7 := t3 * t6
15. add := add + t7
16. t8 := a + 1
17. a := t8
18. t9 := b + 1
19. b := t9
20. if a <= 10 goto (22)
21. goto (23)
22. if b <= 10 goto (4)
23. Next
Multiple-Choice Questions
1. Which of the following is not true for the intermediate code?
(a) It can be represented as postfix notation.
(b) It can be represented as a syntax tree or a DAG.
(c) It can be represented as target code.
(d) It can be represented as three-address code, quadruples, and triples.
2. Which of the following is true for intermediate code generation?
(a) It is machine dependent.
(b) It is nearer to the target machine.
(c) Both (a) and (b)
(d) None of these
3. Which of the following is true in the context of high-level representation of intermediate languages?
(a) It is suitable for static type checking.
(b) It does not depict the natural hierarchical structure of the source program.
(c) It is nearer to the target program.
(d) All of these
4. Which of the following is true for the low-level representation of intermediate languages?
(a) It requires very few efforts by the source program to generate the low-level representation.
(b) It is appropriate for machine-dependent tasks like register allocation and instruction selection.
(c) It does not depict the natural hierarchical structure of the source program.
(d) All of these
5. The reverse polish notation or suffix notation is also known as —————.
(a) Infix notation
(b) Prefix notation
(c) Postfix notation
(d) None of these
6. In a two-dimensional array A[i][j], where w1 is the width of a row and w2 is the width of a single element, the relative address of A[i][j] can be calculated by the formula —————.
(a) i * w1 + j * w2
(b) base + i * w1 + j * w2
(c) base + i * w2 + j * w1
(d) base + (i + j) * (w1 + w2)
Answers
1. (c) 2. (c) 3. (a) 4. (b) 5. (c) 6. (b)
8
Type Checking
1. What is a type system? List the major functions performed by the type systems.
Ans: A type system is a tractable syntactic framework to categorize different phrases according to
their behaviors and the kind of values they compute. It uses logical rules to understand the behavior
of a program and associates types with each compound value and then it tries to prove that no type
errors can occur by analyzing the flow of these values. A type system attempts to guarantee that only
value-specific operations (that can match with the type of value used) are performed on the values. For
example, the floating-point numbers in C use floating-point-specific operations, such as floating-point addition, subtraction, and multiplication.
A language design principle ensures that every expression has a type that is known (at the latest, at run time), and a type system has a set of rules for associating a type with an expression. A type system allows one to determine whether the operators in an expression are used appropriately. An implementation of a type system is called a type checker.
There are two type systems, namely, basic type system and constructed type system.
q Basic type system: A basic type system contains atomic types, which have no internal structure. They include integer, real, character, and Boolean. However, in some languages like Pascal, they can also include subranges like 1 . . . 10 and enumeration types like orange, green, yellow, amber, etc.
q Constructed type system: Constructed type system contains arrays, records, sets, and structure
types constructed from basic types and/or from other constructed types. They also include pointers
and functions.
Type system provides some functions that include:
q Safety: A type system allows a compiler to detect meaningless or invalid code that does not make sense; by doing this, it offers stronger type safety. For example, an expression 5/"Hi John" is treated as invalid because the arithmetic rules do not specify how to divide an integer by a string.
q Optimization: For optimization, a type system can use static and/or dynamic type checking, where
static type checking provides useful compile-time information and dynamic type checking verifies
and enforces the constraints at runtime.
q Documentation: The more expressive type systems can use types as a form of documentation to
show the intention of the programmer.
q Abstraction (or modularity): Types can help programmers to consider programs as a higher level
of representation than bit or byte by hiding lower level implementation.
2. Define type checking. Also explain the rules used for type checking.
Ans: Type checking is a process of verifying the type correctness of the input program by using
logical rules to check the behavior of a program either at compile time or at runtime. It allows the
programmers to limit the types that can be used for semantic aspects of compilation. It assigns types to
values and also verifies whether these values are consistent with their definition in the program.
Type checking can also be used for detecting errors in programs. Though errors can be checked dynamically (at runtime) if the target program carries both the type of an element and its value, a sound type system eliminates the need for dynamic checking of type errors by ensuring that these errors cannot arise when the target program runs.
If the rules for type checking are applied strongly (that is, allowing only those automatic type
conversions which do not result in loss of information), then the implementation of the language is said
to be strongly typed; otherwise, it is said to be weakly typed. A strongly typed language implementation
guarantees that the target program will run without any type errors.
Rules for type checking: Type checking uses syntax-directed definitions to compute the type of
the derived object from the types of its syntactical components. It can be in two forms, namely, type
synthesis and type inference.
q Type synthesis: Type synthesis is used to build the type of an expression from the types of its sub-
expressions. For type synthesis, the names must be declared before they are used. For example, the
type of the expression E1 * E2 depends on the types of its sub-expressions E1 and E2. A typical rule used to perform type synthesis has the following form:
if expression f has a type s → t and expression x has a type s,
then expression f(x) will be of type t
Here, s → t represents a function from s to t. This rule can be applied to all functions with
one or more arguments. This rule considers the expression E1 * E2 as a function, mul(E1,E2)
and uses E1 and E2 to build the type of E1 * E2.
q Type inference: Type inference is the analysis of a program to determine the types of some or all
of the expressions from the way they are used. For example,
public int mul(int E1, int E2)
return E1 * E2;
Here, E1 and E2 are declared as integers. Since the expression E1 * E2 applies the * operation to the two integers E1 and E2, the result is also an integer; therefore, the return type of mul must be an integer. A typical rule used to perform type inference has the following form:
if f(x) is an expression,
then for some type variables a and b, f is of type a → b and
x is of type a
4. Define these terms: static type checking, dynamic type checking, and strong typing.
Ans: Static type checking: In static type checking, most of the properties are verified at compile
time before the execution of the program. The languages C, C++, Pascal, Java, Ada, FORTRAN, and
many more allow static type checking. It is preferred because of the following reasons:
q As the compiler uses type declarations and determines all types at compile time, it catches most of the common errors at compile time.
q The execution of the compiled program is faster because it does not require any type checking during execution.
The main disadvantage of this method is that it does not provide flexibility to perform type conversions
at runtime. Moreover, the static type checking is conservative, that is, it will reject some programs that
may behave properly at runtime, but that cannot be statically determined to be well-typed.
Dynamic type checking: Dynamic type checking is performed during the execution of the program
and it checks the type at runtime before the operations are performed on data. Some languages that
support dynamic type checking are Lisp, JavaScript, Smalltalk, PHP, etc. Some advantages of the
dynamic type checking are as follows:
q It can determine the type of any data at runtime.
q It gives some freedom to the programmer as it is less concerned about types.
q In dynamic typing, the variables do not have any types associated with them, that is, they can refer
to a value of any type at runtime.
q It checks the values of all the data types during execution, which results in more robust code.
q It is more flexible and can support union types, where the user can convert one type to another at
runtime.
The main disadvantage of this method is that it makes the execution of the program slower by
performing repeated type checks.
Strong typing: Type checking that guarantees that no type errors can occur at runtime is called strong type checking, and such a system is called strongly typed. Strong typing has certain disadvantages:
q There are some checks, like array-bounds checking, which still require dynamic checking.
q It can result in performance degradation.
q Such type systems generally have holes; for example, variant records in Pascal.
2. Identifying the language constructs with associated types: Each programming language consists
of some constructs and each of them is associated with a type as discussed below:
l Constants: Every constant has an associated type. A scanner identifies the types and associated
lexemes of a constant.
l Variables: A variable can be global, local, or an instance of a class. Each of these variables
must have a declared type, which can either be one of the base types or the supported compound
types.
l Functions: The functions have a return type, and the formal parameters in function defini-
tion as well as the actual arguments in the function call also have a type associated with
them.
l Expressions: An expression can contain a constant, a variable, a function call, or operators (unary or binary) applied to subexpressions. Hence, the type of an expression depends on the types of its constants, variables, operands, function return types, and operators.
3. Identifying the language semantic rules: The production rules to parse variable declarations reduce an identifier together with its type.
The parser stores the name of an identifier lexeme as an attribute attached to the token. The name
associated with the identifier symbol, and the type associated with the identifier and type symbol are
used to reduce the variable production. A new variable declaration is created by declaring an identifier
of that type and that variable is stored in the symbol table for further lookup.
Name equivalence: If type names are treated as standing for themselves, then the first two conditions
of structural equivalence lead to another equivalence of type expressions called name equivalence. In
other words, name equivalence considers types to be equal if and only if the same type names are used
and one of the first two conditions of structure equivalence holds.
For example, consider the following few types and variable declarations.
typedef double Value
. . .
. . .
Value var1, var2
Sum var3, var4
In these statements, var1 and var2 are name-equivalent, as are var3 and var4, because their type names are the same. However, var1 and var3 are not name-equivalent, because their type names are different.
7. Explain type conversion.
Ans: Type conversion refers to the conversion of a certain type into another by using some semantic
rules. Consider an expression a + b, where a is of type int and b is of float. The representations of
floats and integers are different within a computer, and an operation on integers and floats uses different
machine instructions. Now, the primary task of the compiler is to convert one of the operands of + so that both operands are of the same type. For example, the expression 5 * 7.14 mixes two types: one operand is of type float and the other of type int. To convert the integer constant into float type, we use a unary operator (float) as shown here:
x = (float)5
y = x * 7.14
The type conversion can be done implicitly or explicitly. A conversion from one type to another is called implicit if it is done automatically by the compiler. Usually, implicit conversions of constants can be done at compile time, which improves the execution time of the object program. Implicit type conversion is also known as coercion. A conversion is said to be explicit if the programmer must write something to cause it. For example, all conversions in Ada are explicit. Explicit conversions can be treated by the type checker as function applications, and explicit conversion is also known as casting.
Conversion in languages can be considered as widening conversions and narrowing conversions, as
shown in Figure 8.2(a) and (b), respectively.
[Figure 8.2 (a) The widening hierarchy, from byte at the bottom up through int, long, and float to double; (b) the narrowing hierarchy, from double down toward byte.]
The rules used for widening are given by the hierarchy in Figure 8.2(a) and preserve information. In the widening hierarchy, any lower type can be widened into a higher type: a byte can be widened to a short, to an int, or to a float, but a short cannot be widened to a char.
The narrowing conversions, on the other hand, may result in loss of information. The rules used for narrowing are given by the hierarchy in Figure 8.2(b), in which a type x can be narrowed to a type y if and only if there exists a path from x to y. Note that char, short, and byte are pairwise convertible to each other.
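The following small C program illustrates the difference between implicit coercion and explicit casting, using the 5 * 7.14 example above.

    #include <stdio.h>

    int main(void)
    {
        double y1 = 5 * 7.14;         /* implicit (coercion): 5 is widened to 5.0 */
        double y2 = (double)5 * 7.14; /* explicit (cast): same result             */
        int z = (int)7.14;            /* narrowing: the fraction is lost, z = 7   */
        printf("%f %f %d\n", y1, y2, z);
        return 0;
    }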
Multiple-Choice Questions
1. Which of the following is true for type system?
(a) It is a tractable syntactic framework.
(b) It uses logical rules to determine the behavior of a program.
(c) It guarantees that only value specific operations are allowed.
(d) All of these
2. A type system can be ————— type system or ————— type system.
(a) Basic, constructed (b) Static, dynamic
(c) Simple, compound (d) None of these
Answers
1. (a) 2. (a) 3. (a) 4. (c) 5. (b) 6. (d) 7. (d) 8. (b) 9. (d)
9
Runtime Administration
1. Define runtime environment. What are the issues in runtime environment?
Ans: The source language definition contains various abstractions such as names, data types, scopes,
bindings, operators, parameters, procedures, and flow of control constructs. These abstractions must
be implemented by the compiler. To implement these abstractions on target machine, compiler needs
to cooperate with the operating system and other system software. For the successful execution of the
program, compiler needs to create and manage a runtime environment, which broadly describes all the
runtime settings for the execution of programs.
When a program is compiled, the runtime environment is indirectly controlled by generating code to maintain it, whereas when a program is interpreted, the runtime environment is directly maintained by the data structures of the interpreter.
The runtime environment deals with several issues which are as follows:
q The allocation and layout of storage locations for the objects used in the source program.
q The mechanisms for accessing the variables used by the target program.
q The linkages among procedures.
q The parameter passing mechanism.
q The interface to input/output devices, operating systems and other programs.
block of memory known as an activation record. An activation record can be created statically or dynamically. Statically, a single activation record is constructed, which is common to any number of activations. Dynamically, a number of activation records can be constructed, one for each activation. The activation record contains the memory for all the local variables of the procedure. Depending on the way the activation record is created, the target code has to be generated accordingly to access the local variables.
q Procedure calling and return sequence: Whenever a procedure is invoked or called, a certain sequence of operations needs to be performed, which includes evaluating the function arguments, storing them at specified memory locations, transferring control to the called procedure, etc. This sequence of operations is known as the calling sequence. Similarly, when the activated procedure terminates, some other operations need to be performed, such as fetching the return value from a specified memory location and transferring control back to the calling procedure. This sequence of operations is known as the return sequence. The calling sequence and return sequence differ from one language to another, and in some cases even from one compiler to another for the same language.
q Parameter passing: The functions used in a program may accept one or more parameters. The
values of these parameters may or may not be modified inside the function definition. Moreover,
the modified values may or may not be reflected in the calling procedure depending on the language
used. In some languages like PASCAL and C++, some rules are specified which determine
whether the modified value should be reflected in the calling procedure. In certain languages like
FORTRAN77 the modified values are always reflected in the calling procedure. There are several
techniques by which parameters can be passed to functions. Depending on the parameter passing
technique used, the target code has to be generated.
3. Give subdivision of runtime memory.
Or
What is storage organization?
Or
Explain stack allocation and heap allocation?
Or
What is dynamic allocation? Explain the techniques used for dynamic allocation (stack
and heap allocation).
Ans: The target program (already compiled) is executed in the runtime environment within its own
logical address space known as runtime memory, which has a storage location for each program
value. The compiler, operating system, and the target machine share the organization and management
of this logical address space. The runtime representation of the target program in the logical address space
comprises data and program areas as shown in Figure 9.1. These areas consist of the following
information:
q The generated target code
q Data objects
q Information to keep track of procedure activations.
Since the size of the target code is fixed at compile time, it can be placed in a statically determined
area named Code (see Figure 9.1), which is usually placed in the low end of memory. Similarly, the
memory occupied by some program data objects such as global constants can also be determined at
compile time. Hence, the compiler can place them in another statically determined area of memory named Static. The main reason behind the static allocation of as many data objects as possible is that the compiler can compile the addresses of these objects into the target code. For example, all the data objects in FORTRAN are statically allocated.
The other two areas, namely, Stack and Heap, are used to maximize the utilization of runtime space. The size of these areas is not fixed, that is, it can change as the program executes. Hence, these areas are dynamic in nature.
[Figure 9.1 Subdivision of Runtime Memory: from the low end of memory upward, the areas are Code, Static, Heap, Free Memory, and Stack.]
Stack allocation: The stack (also known as the control stack or runtime stack) is used to store activation records that are generated during procedure calls. Whenever a procedure is invoked, the activation record corresponding to that procedure is pushed onto the stack, and all local items of the procedure are stored in the activation record. When the execution of the procedure is completed, the corresponding activation record is popped from the stack and the values of locals are deleted.
The stack is used to manage and allocate storage for the active procedure such that
q On the occurrence of a procedure call, the execution of the calling procedure is interrupted, and
the activation record for the called procedure is constructed. This activation record stores the
information about the status of the machine.
q When control returns from the procedure call, the values in the relevant registers are restored, the suspended activation of the calling procedure is resumed, and the program counter is updated to the point immediately after the call. The stack area of runtime storage is used to store all this information.
q Some data objects which are contained in this activation and their relevant information are also
stored in the stack.
q The size of the stack is not fixed. It can be increased or decreased according to the requirement
during program execution.
Runtime stack is used in C and Pascal.
Heap allocation: The main limitation of the stack area is that it is not possible to retain the values of local names once the activation ends; this is because of the last-in-first-out nature of stack allocation. To retain the values of such variables, heap allocation is used. The heap allocates contiguous memory locations as and when required for storing activation records and other data elements. When an activation ends, its memory is deallocated, and this free space can be further
used by the heap manager. The heap management can be made efficient by creating a linked list of
free blocks. Whenever some memory is deallocated, the freed block is appended to the linked list, and when memory needs to be allocated, the most suitable (best-fit) memory block is used for allocation.
The heap manager dynamically allocates the memory, which results into a runtime overhead of taking
care of defragmentation and garbage collection. The garbage collection enables the runtime system to
automatically detect unused data elements and reuse their storage.
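The best-fit free list described above can be sketched in C as follows; the Block layout and function names are illustrative assumptions.

    #include <stddef.h>

    typedef struct Block {
        size_t size;             /* usable size of this free block */
        struct Block *next;
    } Block;

    Block *free_list = NULL;     /* linked list of free blocks */

    /* On deallocation, append the freed block to the free list. */
    void release(Block *b)
    {
        b->next = free_list;
        free_list = b;
    }

    /* On allocation, choose the smallest free block that fits (best fit)
       and unlink it from the list; returns NULL if nothing fits. */
    Block *acquire(size_t size)
    {
        Block *best = NULL, **bestp = NULL;
        for (Block **p = &free_list; *p != NULL; p = &(*p)->next)
            if ((*p)->size >= size && (best == NULL || (*p)->size < best->size)) {
                best = *p;
                bestp = p;
            }
        if (best != NULL)
            *bestp = best->next;
        return best;
    }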
q The binding of names is performed during compilation and no runtime support package is required.
q The binding remains same at runtime as well as compile time.
q Each time a procedure is invoked, the names are bound to the same storage locations. The values of local variables remain unchanged before and after the transfer of control.
q The storage requirement is determined by the type of a name.
Limitations of static allocation are given as follows:
q The information like size of data objects and constraints on their memory position needs to be
present during compilation.
q Static allocation does not support any dynamic data structures, because there is no mechanism to
support run-time storage allocation.
q Since all the activations of a given procedure use the same bindings for local names, recursion is
not possible in static allocation.
5. Explain in brief about control stack.
Ans: A stack representing procedure calls, return, and flow of control is called a control stack or
runtime stack. Control stack manages and keeps track of the activations that are currently in progress.
When the activation begins, the corresponding activation node is pushed onto the stack and popped out
when the activation ends. The control stack can be nested as the procedure calls or activations nest in
time such that if p calls q, then the activation of q is nested within the activation of p.
6. Define activation tree.
Ans: During the execution of a program, the activations of the procedures can be represented by a tree known as an activation tree. It is used to depict the flow of control between the activations. Activations are represented by the nodes of the activation tree, where each node corresponds to one activation, and the root node represents the activation of the main procedure that initiates the program execution. Figure 9.2 shows that main() activates two procedures P1 and P2. The activations of procedures P1 and P2 are represented in the order in which they are called, that is, from left to right. It is important to note that the left child node must finish its execution before the activation of the right node can begin. The activation of P2 further activates two procedures P3 and P4. The flow of control between the activations can be depicted by performing a depth-first traversal of the activation tree. We start with the root of the tree. Each node is visited before its child nodes, and the child nodes are visited from left to right. When all the child nodes of a particular node have been visited, the procedure activation corresponding to that node is completed.
[Figure 9.2 Activation Tree: main() has children P1 and P2, and P2 has children P3 and P4.]
7. Discuss in detail about activation records.
Ans: The activation record is a block of memory on the control stack used to manage information
for every single execution of a procedure. Each activation has its own activation record with the root of
activation tree at the bottom. The path from one activation to another in the activation tree is determined
by the corresponding sequence of activation records on the control stack.
Different languages have different activation record contents. In FORTRAN, the activation records
are stored in the static data area while in C and Pascal, the activation records are stored in stack area.
The contents of activation records are shown in Figure 9.3.
Here, operand a is the multiplicand and is in the odd register of an even/odd register pair, and b is the multiplier, which can be stored anywhere. After multiplication, the entire even/odd register pair is occupied by the product.
The division instruction is written as
D a, b
Here, dividend a is stored in the even register of an even/odd register pair and the divisor b can be stored anywhere. After division, the remainder is stored in the even register and the quotient is stored in the odd register.
Now, consider the two three-address code sequences given in Figure 9.4(a) and (b):
(a) x = a - b        (b) x = a - b
    x = x * c            x = x - c
    x = x/d              x = x/d
Figure 9.4 Two Three-address Code Sequences
These three-address code sequences are almost the same; the only difference is the operator in the second statement. The assembly code sequences for these three-address code sequences are given in Figure 9.5(a) and (b):
(a) L  R1, a         (b) L    R0, a
    S  R1, b             S    R0, b
    M  R0, c             S    R0, c
    D  R0, d             SRDA R0, 32
    ST R1, x             D    R0, d
                         ST   R1, x
Figure 9.5 Assembly Code (Machine Code) Sequences
Here, L, ST, S, M, and D stand for load, store, subtract, multiply, and divide, respectively. R0 and R1 are machine registers, and SRDA stands for Shift-Right-Double-Arithmetic; SRDA R0, 32 shifts the dividend into R1 and clears R0 to make all its bits equal to the sign bit.
9. Explain the various parameter passing mechanisms of a high-level language.
Or
What are the various ways to pass parameters in a function?
Ans: When one procedure calls another, the communication between the procedures occurs through
non-local names and through parameters of the called procedure. All the programming languages have
two types of parameters, namely, actual parameters and formal parameters. The actual parameters
are those parameters which are used in the call of a procedure; however, formal parameters are those
which are used in the procedure definition. There are various parameter-passing methods, but most recent programming languages use call by value, call by reference, or both; some older programming languages also use another method, call by name. (A short C illustration of the first two follows this list.)
q Call by value: It is the simplest and most commonly used method of parameter passing. The actual
parameters are evaluated (if expression) or copied (if variable) and then their r-values are passed
to the called procedure. r-value refers to the value contained in the storage. The values of actual
parameters are placed in the locations which belong to the corresponding formal parameters of the
called procedure. Since the formal and actual parameters are stored in different memory locations,
and formal parameters are local to the called procedure, the changes made in the values of formal
parameters are not reflected in the actual parameters. The languages C, C++, Java, and many more
use call by value method for passing parameters to the procedures.
q Call by reference: In call by reference method, parameters are passed by reference (also known
as call by address or call by location). The caller passes a pointer to the called procedure, which
points to the storage address of each actual parameter. If the actual parameter is a name or an
expression having an l-value, then the l-value itself is passed (here, l-value represents the address
of the actual parameter). However, if the actual parameter is an expression like a + b or 2, having no l-value, then that expression is evaluated in a new location, and the address of that new location is passed. Thus, the changes made in the called procedure are reflected in the calling procedure.
q Call by name: It is a traditional approach and was used in early programming languages, such
as ALGOL 60. In this approach, the procedure is considered as a macro, and the body of the
procedure is substituted for the call in the caller and the formals are literally substituted by the
actual parameters. This literal substitution is called macro expansion or in-line expansion.
The names of the calling procedure are kept distinct from the local names of the called
procedure. That is, each local name of the called procedure is systematically renamed into a
distinct new name before the macro expansion is done. If necessary, the actual parameters are
surrounded by parentheses to maintain their integrity.
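The difference between the first two methods can be illustrated with a short C fragment (a minimal sketch; C itself passes every parameter by value, so call by reference is simulated here by passing an address):
#include <stdio.h>

/* Call by value: the r-value of the actual parameter is copied into n,
   so the change made to n is not visible in the caller. */
void inc_by_value(int n)
{
    n = n + 1;
}

/* Call by reference: the address of the actual parameter is passed,
   so the called function works on the caller's storage. */
void inc_by_reference(int *n)
{
    *n = *n + 1;
}

int main(void)
{
    int a = 5;
    inc_by_value(a);
    printf("%d\n", a);    /* prints 5: a is unchanged          */
    inc_by_reference(&a);
    printf("%d\n", a);    /* prints 6: the change is reflected */
    return 0;
}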
10. What is the output of this program, if the compiler uses the following parameter-passing methods?
Ans: Call by reference: In call by reference, both formal parameters y and z refer to the same actual parameter, that is, a. Thus, in function A the following values are passed.
x = 5
y = 2
z = 2
After the execution of y = y + 1, the value of y becomes
y = 2 + 1 = 3
Since y and z are referring to the same memory location, z also becomes 3. Now after the execution
of statement z = z + x, the value of z becomes
z = 3 + 5 = 8
When control returns to main(), the value of a will now become 8. Hence, output will be 8.
Call by name: In this method, the procedure is treated as a macro. So, after the execution of the
function:
x = 5
y = y + 1 = 2 + 1 = 3
z = z + x = 2 + 5 = 7
When control returns to main(), the value of a becomes 7. Hence, output will be 7.
Multiple-Choice Questions
1. What are the issues that the runtime environment deals with?
(a) The linkages among procedures
(b) The parameter passing mechanism
(c) Both (a) and (b)
(d) None of these
2. The elements of runtime environment include —————.
(a) Memory organization
(b) Activation records
(c) Procedure calling, return sequences, and parameter passing
(d) All of these
3. Which of the following area in the memory is used to store activation records that are generated
during procedure calls?
(a) Heap
(b) Runtime stack
(c) Both (a) and (b)
(d) None of these
4. ————— are used to depict the flow of control between the activations of procedures.
(a) Binary trees
(b) Data flow diagrams
(c) Activation trees
(d) Transition diagram
5. The ————— is a block of memory on the control stack used to manage information for every
single execution of a procedure.
(a) Procedure control block
(b) Activation record
(c) Activation tree
(d) None of these
6. ————— is the process of selecting a set of variables that will reside in CPU registers.
(a) Register allocation
(b) Register assignment
(c) Instruction selection
(d) Variable selection
Answers
1. (c) 2. (d) 3. (b) 4. (c) 5. (b) 6. (a) 7. (c)
10
Symbol Table
1. What is a symbol table and what kind of information does it store? Discuss its capabilities and
also explain the uses of a symbol table.
Ans: A symbol table is a compile time data structure that is used by the compiler to collect and
use information about the source program constructs, such as variables, constants, functions, etc. The
symbol table helps the compiler in determining and verifying the semantics of given source program.
The information in the symbol table is entered during the lexical analysis and syntax analysis phases; however, it is used in the later phases of the compiler (semantic analysis, intermediate code generation, code optimization,
and code generation). Intuitively, a symbol table maps names into declarations (called attributes), for
example, mapping a variable name a to its data type char.
Each time a name is encountered in the source program, the compiler searches it in the symbol table.
If the compiler finds a new name or new information about an existing name, it modifies the symbol
table. Thus, an efficient mechanism must be provided for retrieving the information stored in the table
as well as for adding new information to the table. Each entry in the symbol table consists of a (name, information) pair. For example, for the following variable declaration statement,
char a;
The symbol table entry contains the name of the variable along with its data type.
More specifically, the symbol table contains the following information:
q The character string (or lexeme) for the name. If the same name is assigned to two or more
identifiers which are used in different blocks or procedures, then an identification of the block or
procedure to which this name belongs must also be stored in the symbol table.
q For each type name, the type definition is stored in the symbol table.
q For each variable name, its type (int, char, or real), its form (label, simple variable, or array),
and its location in the memory must also be stored. If the variable is an array, then some other
attributes such as its dimensions, and its upper and lower limits along each dimension are also stored.
Other attributes such as storage class specifier, offset in activation record, etc. can also be stored.
q For each function and procedure, the symbol table contains its formal parameter list and its return
type.
q For each formal parameter, its name, type, and mode of passing (by value or by reference) are also stored.
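For illustration, a (name, information) entry of the kind described above might be declared in C as follows (a minimal sketch; the field names and sizes are hypothetical):
struct sym_entry {
    char name[32];    /* lexeme (character string) of the identifier  */
    char type[8];     /* data type, e.g., "int", "char", or "real"    */
    int  form;        /* form: label, simple variable, or array       */
    int  offset;      /* location (offset) of the name in memory      */
    int  block_no;    /* block or procedure to which the name belongs */
};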
2. What are the requirements of a symbol table? What are the demerits of a uniform structure
of symbol table?
Ans: The basic requirements of a symbol table are as follows:
q Structural flexibility: Based on the usage of identifier, the symbol table entries must contain all
the necessary information.
q Fast lookup/search: The speed of table lookup/search depends on the implementation of the symbol table, and the search should be as fast as possible.
q Efficient utilization of space: The symbol table must be able to grow or shrink dynamically for an efficient usage of space.
q Ability to handle language characteristics: Language characteristics such as scoping and implicit declarations need to be handled.
q Lookup: This operation searches for a given name in the symbol table. For example, the function object_lookup(s) returns an index of the entry for the string s; if s is not found, it returns 0.
q Search/Insert: This operation searches for a given name in the symbol table, and if not found, it
inserts it into the table.
q begin_scope () and end_scope (): The begin_scope() begins a new scope, when a new block
starts, that is, when the token { is encountered. The end_scope() removes the scope when
the scope terminates, that is, when the token } is encountered. After removing a scope, all the
declarations inside this scope are also removed.
q Handling reserved keywords: Reserved keywords like ‘PLUS’, ‘MINUS’, ‘MUL’, etc., are
handled by the symbol table in the following manner.
insert (“PLUS”, PLUS);
insert (“MINUS”, MINUS);
insert (“MUL”, MUL);
The first ‘PLUS’, ‘MINUS’, and ‘MUL’ in each insert operation indicates the lexeme, and the second one indicates the token.
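A minimal C sketch of how begin_scope() and end_scope() might be realized is given below, assuming the entries are kept on a stack so that a whole scope can be discarded by restoring the stack top (all names and sizes are hypothetical):
#define MAX_SCOPES 64

static int top = 0;                  /* index of the next free entry slot     */
static int scope_mark[MAX_SCOPES];   /* saved values of top, one per scope    */
static int scope_top = 0;

void begin_scope(void)               /* called when the token { is seen       */
{
    scope_mark[scope_top++] = top;   /* remember where this scope begins      */
}

void end_scope(void)                 /* called when the token } is seen       */
{
    top = scope_mark[--scope_top];   /* discard all declarations of the scope */
}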
4. Explain symbol table implementation.
Ans: The implementation of a symbol table needs a particular data structure, depending upon the
symbol table specifications. Figure 10.1 shows the data structure for implementation of a symbol table.
The character string forming an identifier is stored in a separate array arr_lexeme. Each string is terminated by an EOS (end-of-string) character, which is not a part of the identifier. Each entry in the symbol table array arr_symbol_table is a record having two or more fields, where the first field, lexeme_pointer, points to the beginning of the lexeme, and the second field, Token, consists of the token name. The symbol table also contains two more fields, namely attribute, which holds the attribute values, and position, which indicates the position of a lexeme in the symbol table.
Note that the 0th entry in the symbol table is left empty, as a lookup operation returns 0, if the symbol
table does not have an entry for a particular string. The 1st, 3rd, 5th and 7th entries are for x, y, m, and n, respectively. The 2nd, 4th and 6th entries are reserved keyword entries for MINUS, AND and PLUS
respectively.
Whenever the lexical analyzer encounters a letter in an input string, it starts storing the subsequent
letters or digits in a buffer named lex_buffer. It then scans the symbol table using the object_
lookup() operation to determine whether the collected string is in the symbol table. If the lookup
operation returns 0, that is, there is no entry for the string in lex_buffer, a new entry for the identifier
is created using insert(). After the insert operation, the index n of the symbol table entry for the entered string is passed to the parser by setting tokenval to n, and the token stored in the Token field of the entry is returned.
5. Explain the various data structures used for implementing a symbol table.
Ans: The following data structures are commonly used for implementing a symbol table:
q Linear list: The simplest way to implement a symbol table is as a linear array of records, for example:
Variable   Type    Size
a          int     2
b          char    1
c          float   4
d          long    4
To access a particular name, the whole table is searched sequentially from its beginning until it
is found. For a symbol table having n entries, it will take on average n/2 comparisons to find a
particular name.
q Self-organizing list: We can reduce the time of searching the symbol table at the cost of a little
extra space by adding an additional LINK field to each record or to each array index. Now, we
search the list in the order indicated by links. A new name is inserted at a location pointed to by
space pointer, and then all other existing links are managed accordingly. A self-organizing list is
shown in Figure 10.3, where the records for id1, id2, and id3 are linked together by the LINK pointers.
Variable   Information
id1        Info 1
id2        Info 2
id3        Info 3
Figure 10.3 Self-organizing List
The main reason for using the self-organizing list is that if a small set of names is heavily used
in a section of program, then these names can be placed at the top while that section is being
processed by the compiler. However, if references are random, then the self-organizing list will
cost more time and space.
Demerits of self-organizing list are as follows:
l It is difficult to maintain the list if a large set of names is frequently used.
l It occupies more memory as it has a LINK field for each record.
l Since a self-organizing list reorganizes itself, the pointer movements involved may cause problems.
q Hash Table: A hash table is a data structure that associates keys with values. The basic hashing
scheme has two parts:
l A hash table consisting of a fixed array of k pointers to table entries.
l A storage table with the table entries organized into k separate linked lists and each record in
the symbol table appears on exactly one of these lists.
To add a name in the symbol table, we need to determine the hash value of that name with the
help of a hash function, which maps the name to the symbol table by assigning an integer between
0 to k - 1. To search for a given name in the symbol table, the hash function is applied to that name. Thus, we need to search only one list to determine whether that name exists in the symbol table or not.
There is no need to search the entire symbol table. If the name is not present in the list, we create a
new record for that name and then insert that record at the head of the list whose index is computed
by applying the hash function to the name.
A hash function should be chosen in such a way that it distributes the names uniformly among
the k lists, and it can be computed easily for the names comprising character strings. The main
advantage of using a hash table is that we can insert or delete any name in O(1) time and search any name in O(1) time on average; however, in the worst case, searching can be as bad as O(n). A small C sketch of this scheme is given below.
The organization is as follows: the hash table is a fixed array of pointers, each heading one linked list in the storage table; every record on a list has Name, Data, and Link fields, and an Available pointer marks the next free record in the storage table.
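The two-part scheme described above might be sketched in C as follows (a minimal sketch; the string hash function and all names are only illustrative):
#include <stdlib.h>
#include <string.h>

#define K 211                        /* number of hash lists                 */

struct entry {
    char         *name;              /* lexeme of the identifier             */
    struct entry *link;              /* next record on the same list         */
};

static struct entry *hash_table[K];  /* fixed array of K pointers to entries */

static unsigned hash(const char *s)  /* maps a name to an integer 0 .. K-1   */
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % K;
}

struct entry *lookup(const char *name)
{
    struct entry *e;
    for (e = hash_table[hash(name)]; e != NULL; e = e->link)
        if (strcmp(e->name, name) == 0)
            return e;                /* name found on its list               */
    return NULL;                     /* name not present in the symbol table */
}

struct entry *insert(const char *name)
{
    unsigned h = hash(name);
    struct entry *e = malloc(sizeof *e);
    e->name = strdup(name);
    e->link = hash_table[h];         /* insert at the head of the list       */
    hash_table[h] = e;
    return e;
}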
q Search tree: A search tree is an approach to organize the symbol table by adding two link fields, LEFT and RIGHT, to each record. These two fields are used to link the records into a binary search tree, which has the following properties:
l The name in each node acts as a key value, that is, no two nodes can have identical names.
l The names in the nodes of the left subtree, if it exists, are smaller than the name in the root node.
l The names in the nodes of the right subtree, if it exists, are greater than the name in the root node.
l The left and right subtrees, if they exist, are also binary search trees.
That is, if name < name_i, then name must be in the left subtree of name_i; and if name > name_i, then name must be in the right subtree of name_i. To insert, search, and delete in a search tree, the standard binary search tree insertion, search, and deletion algorithms are followed, respectively.
6. Create a list, search tree, and hash table for the given program.
int i, j, k;
int mul (int a, int b)
{
i = a * b;
return (i);
}
main ()
{
int x;
x = mul (2, 3);
}
Ans: The identifiers appearing in the program are i, j, k, mul, x, a, and b.
List: In a linear list organization, the names are stored sequentially: i, j, k, mul, x, a, b.
Hash table: The names are distributed among the hash lists, with the names on the same list linked together: i, j, and k on one list; mul on its own list; x on its own list; and a and b on another list.
Search tree: The names are linked into a binary search tree, as shown in Figure 10.7.
Figure 10.7 Search Tree Symbol Table for the Given Program
q Scope of a declaration: The portion of the program in which a declaration applies is called the scope of that declaration. In a procedure, a name is said to be local to the procedure if it is in the scope of a declaration within the procedure; otherwise, the name is said to be non-local.
9. Explain error detection and recovery in the lexical, syntactic, and semantic phases.
Ans: The classification of errors is given in Figure 10.8.
These errors should be detected during different phases of compiler. Error detection and recovery is
one of the main tasks of a compiler. The compiler scans and compiles the entire program, and errors
detected during scanning need to be recovered as soon as they are detected.
Usually, most of the errors are encountered during the syntax and semantic analysis phases. Every phase of a compiler expects its input to be in a particular format, and an error is returned by the compiler whenever the input is not in the required format. On detection of an error, the compiler scans some of the tokens ahead of the point of error occurrence. A compiler is said to have better error-detection capability if it needs to scan only a small number of tokens ahead of the point of error occurrence.
A good error detection scheme reports errors in an intelligible manner and should possess the
following properties.
q The error message should be easily understandable.
q The error message should be produced in terms of original source program and not in any internal
representation of the source program. For example, each error message should have a line number
of the source program associated with it.
q The error message should be specific and properly localize the error. For example, an error message
should be like, “A is not declared in function sum” and not just “missing declaration”.
q The same error message should not be produced again and again, that is, there is no redundancy in
the error messages.
Error detection and recovery in the lexical phase: The errors in which the remaining input characters do not form any token of the language are detected by the lexical phase of the compiler. Typical lexical phase errors are spelling errors, the appearance of illegal characters, and identifiers or numeric constants exceeding their length limits.
Once an error is detected, the lexical analyzer calls an error recovery routine. The simplest
error recovery routine skips the erroneous characters in the input until the lexical analyzer finds a
synchronizing token. But this scheme causes the parser to have a deletion error, which would result in
several difficulties for the syntax analysis and for the rest of the phases.
The ability of lexical analyzer to recover from errors can be improved by making a list of legitimate
tokens (in the current context) which are accessible to the error recovery routine. With the help of
this list, the error recovery routine can decide whether the remaining input characters match with a
synchronizing token and can be treated as that token.
Error detection and recovery in syntactic phase: The errors where the token stream violates the
syntax of the language and the parser does not find any valid move from its current configuration are
detected during the syntactic phase of the compiler. The LL(1) and LR(1) parsers have the valid prefix property, that is, they report an error as soon as they read an input character that is not a valid continuation of the previous input prefix. In this way, these parsers reduce the amount of erroneous output to be passed to the next phases of the compiler.
To recover from these errors, the panic mode recovery scheme or the phrase level recovery scheme (discussed in Chapter 4) can be used.
Error detection and recovery in semantic phase: The language constructs that have the right
syntactic structure but have no meaning to the operation involved are detected during semantic
analysis phase. Undeclared names, type incompatibilities and mismatching of actual arguments with
formal arguments are the main causes of semantic errors. When an undeclared name is encountered for the first time, a symbol table entry is created for that name with an attribute that is suitable to the current context.
For example, if semantic phase detects an error like “missing declaration of A in function sum”, then
a symbol table entry is created for A with an attribute that is suitable to the current context. To indicate
that an attribute has been added to recover from an error and not in response to the declaration of A, a
flag is set in the A symbol table record.
Multiple-Choice Questions
1. Which of the following is not true in context of a symbol table?
(a) It is a compile time data structure.
(b) It maps name into declarations.
(c) It does not help in error detection and recovery.
(d) It contains formal parameter list and return type of each function and procedure.
2. The information in the symbol table is entered during —————.
(a) Lexical analysis
(b) Syntax analysis
(c) Both (a) and (b)
(d) None of these
3. Which of these operations can be performed on a symbol table?
(a) Insert
(b) Lookup
(c) begin_scope and end_scope
(d) All of these
4. Which of the following data structures is not used to implement symbol tables?
(a) Linear list
(b) Hash table
(c) Binary search tree
(d) AVL tree
5. Which of the following is not true for scope representation in symbol table?
(a) Declarations have same scope in different languages.
(b) The scope of a name is a single subroutine in FORTRAN.
(c) Symbol table keeps different declaration of the same identifier distinct.
(d) In ALGOL, the scope of a name is the section or procedure in which it is declared.
6. Which of the following is not true for error detection and recovery?
(a) Error detection and recovery is the main task of the compiler.
(b) Most of the errors are detected during lexical phase.
(c) A compiler returns an error, if the input is not in the required format.
(d) None of these
Answers
1. (c) 2. (b) 3. (c) 4. (b) 5. (a) 6. (c) 7. (b)
11
Code Optimization and Code Generation
1. What are the various factors that affect the code generation process?
Ans: The various factors that affect the code generation process are as follows:
q Input: The intermediate code produced by the intermediate code generator or code optimizer of
the compiler is given as input to the code generator. At the time of code generation, the source
program is assumed to be scanned, parsed, and translated into a relatively low-level intermediate
representation. Type conversion operators are assumed to have been inserted wherever required, and semantic errors are assumed to have already been detected. The code generation phase, therefore, proceeds on
the assumption that the input to the code generator is free from errors. We also assume that the
operators, data types, and the addressing modes appearing in the intermediate representation can be
directly mapped to the target machine representation. If such straightforward mappings exist, then
the code generation is simple, otherwise a significant amount of translation effort is required.
q Structure of target code: The efficient construction of a code generator depends mainly on the
structure of the target code which further depends on the instruction-set architecture of the target
machine. RISC (reduced instruction set computer) and CISC (complex instruction set computer)
are the two most common target machine architectures. The target program code may be absolute
machine language code, relocatable machine language code, or assembly language code.
l If the target program code is absolute machine language code, then it can be placed in a fixed
memory location and can be executed immediately. The fixed location of program variables
and code makes the absolute code generation relatively easier.
l If the target program code is relocatable machine language code (also known as object
module), then the code generation becomes a bit difficult as relocatable code may or may
not be supported by the underlying hardware. In case the target machine does not support
relocation automatically, it is the responsibility of compiler to explicitly insert the code for
ensuring smooth relocation. However, producing a relocatable code requires subprograms to
152 Principles of Compiler Design
be compiled separately. After compilation, all the relocatable object modules can be linked
together and loaded for execution by a linking loader.
l If the output is assembly language program, then it can be converted into an executable version
by an assembler. In this case, the code generation can be made simpler by utilizing the features
of assembler. That is, we can generate symbolic instruction code and use the macro facilities of
the assembler to help the code generation process.
q Selection of instruction: The nature of the instruction set of the target machine is an important
factor to determine the complexity of instruction selection. The uniformity and completeness of
the instruction set, instruction speed, and machine idioms are the important factors that are to be
considered. If we are not concerned with the efficiency of the target program, then instruction
selection becomes easier and straightforward. The two important factors that determine the quality
of the generated code are its speed and size.
For example, the three-address statements of the form,
A = B + C
X = A + Y
can be translated into a code sequence as given below:
LD R0,B
ADD R0,R0,C
ST A,R0
LD R0,A
ADD R0,R0,Y
ST X,R0
The main drawback of this statement-by-statement code generation is that it produces redundant load and store instructions. For example, the fourth instruction in the above code is redundant, as the value that was stored just before is loaded again. If the target machine provides a rich instruction set, then there will be several ways of implementing a given operation. For example, if the target machine has an increment instruction, then for X = X + 1, instead of a load, add, and store sequence, we can generate the single instruction INC X. Note that deciding which machine-code
sequence is suitable for a given set of three-address instructions may require knowledge about the
context in which those instructions appear.
q Allocation of registers: Assigning the values to the registers is the key problem during code
generation. So, generation of a good code requires the efficient utilization of registers. In general,
the utilization of registers is subdivided into two phases, namely, register allocation and register
assignment. Register allocation is the process of selecting a set of variables that will reside in
CPU registers. Register assignment refers to the assignment of a variable to a specific register.
Determining the optimal assignment of registers to variables even with single register values is
difficult because the allocation problem is NP-complete. In certain machines, even/odd register
pairs are required for some operands and results which make the problem further complicated.
In integer multiplication, the multiplicand is placed in the odd register, however, the multiplier
can be placed in any other single register, and the product (result) is placed in the entire even/odd
register pair. Register allocation becomes a nontrivial task because of these architecture-specific
issues.
q Evaluation order: The performance of the target code is greatly affected by the order in which computations are performed. For some computation orders, fewer registers are needed to hold the intermediate results. Deciding the optimal computation order is again difficult, since the problem is NP-complete. The problem can be avoided initially by generating the code for the three-address statements in the same order in which they are produced by the intermediate code generator.
2. Define basic block.
Ans: A basic block is a sequence of consecutive three-address statements in which the flow of
control enters only from the first statement of the basic block and once entered, the statements of
the block are executed without branching, halting, looping or jumping except at the last statement.
The control will leave the block only from the last statement of the block. For example, consider the
following statements.
t1 := X * Y
t2 := 5 * t1
t3 := t1 * t2
In the above sequence of statements, the control enters only from the first statement, t1: = X * Y.
The second and third statements are executed sequentially without any looping or branching and the
control leaves the block from the last statement. Hence, the above statements form a basic block.
3. Write the steps for constructing leaders in a basic block.
Or
How can you find leaders in basic blocks?
Ans: The first statement in the basic block is known as the leader. The rules for finding leaders are
as follows:
(i) The first statement in the intermediate code is a leader.
(ii) The target statement of a conditional and unconditional jump is a leader.
(iii) The immediate statement following an unconditional or conditional jump is a leader.
4. Write an algorithm for partitioning of three-address instructions into a basic block.
Give an example also.
Ans: A sequence of three-address instructions is taken as input and the following steps are performed
to partition the three-address instructions into basic blocks:
Step 1: Determine the set of leaders.
Step 2: Construct the basic block for each leader that consists of the leader and all the instructions till
the next leader (excluding the next leader) or the end of the program.
The instructions that are not included in any block can never be executed and may be removed, if desired.
For example, consider the following code segment that computes a dot product between
two integer arrays X and Y.
begin
PRODUCT: = 0
j: = 1
do
begin
PRODUCT: = PRODUCT + X[j] * Y[j]
j: = j + 1
end
while j <= 20
end
The corresponding three-address code for the above code segment is given as follows:
1.  PRODUCT := 0
2.  j := 1
3.  t1 := 4 * j       /* assuming that the elements of an integer array take 4 bytes */
4.  t2 := X[t1]       /* computing X[j] */
5.  t3 := 4 * j
6.  t4 := Y[t3]       /* computing Y[j] */
7.  t5 := t2 * t4     /* computing X[j] * Y[j] */
8.  t6 := PRODUCT + t5
9.  PRODUCT := t6
10. t7 := j + 1
11. j := t7
12. if j <= 20 goto (3)
By rule (i), statement 1 is a leader; by rule (ii), statement 3, the target of the conditional jump in statement 12, is also a leader. Hence, block B1 consists of statements 1 and 2, and block B2 consists of statements 3 through 12, as shown in Figure 11.2.
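The partitioning steps can be sketched in C as follows (a minimal sketch; the representation of the three-address instructions and all names are hypothetical):
struct tac {
    int is_jump;   /* nonzero for a conditional or unconditional jump   */
    int target;    /* index of the jump target, valid when is_jump != 0 */
};

/* Mark the leaders among the instructions code[0..n-1]. Each basic
   block then runs from one leader up to, but not including, the next
   leader or the end of the program. */
void find_leaders(const struct tac *code, int n, int *leader)
{
    int i;
    for (i = 0; i < n; i++)
        leader[i] = 0;
    if (n > 0)
        leader[0] = 1;                    /* rule (i): first statement          */
    for (i = 0; i < n; i++) {
        if (code[i].is_jump) {
            leader[code[i].target] = 1;   /* rule (ii): jump target             */
            if (i + 1 < n)
                leader[i + 1] = 1;        /* rule (iii): statement after a jump */
        }
    }
}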
6. Explain code optimization. What are the objectives of code optimization?
Ans: Code optimization is an attempt by the compiler to produce object (target) code with higher execution efficiency than a straightforward translation of the input source program would give. In some cases, code optimization is so simple that it can be carried out without much difficulty; in other cases, it may require a complete analysis of the program. Code optimization may require various transformations of the source program. These transformations must be carried out in such a way that all translations of the source program remain semantically equivalent, and the underlying algorithm is not modified in any case.
The efficiency and effectiveness of a code optimization technique is determined by the time and space
required by the compiler to produce the target program. Code optimization can be machine dependent
or machine independent (discussed in Question 15).
q Loop-invariant expression elimination: An expression that produces the same value in every iteration of a loop is called loop invariant, and it can be computed once before the loop is entered. For example, consider the following code segment appearing inside a loop:
if (i > Min + 2)
{
sum = sum + x[i];
}
In this code segment, the expression Min + 2 is evaluated each time the loop is executed,
however, it always produces the same result irrespective of the iteration of the loop. Thus, we can
place this expression at the entry point of the loop as follows:
n = Min + 2
if (i > n)
{
sum = sum + x[i];
}
Since in loop-invariant expression elimination, the expression from inside the loop is moved
outside the loop, this method is also known as loop-invariant code motion.
q Induction variable elimination: A variable is said to be induction variable if its value gets
incremented or decremented by some constant every time the loop is executed. For example,
consider the following for loop statement:
for (j = 1;j <= 10; j++)
Here, the value of j is incremented every time the loop is executed. Hence, j is an induction
variable. If there is more than one induction variable in a loop, it is possible to get rid of all but one. This process is known as induction variable elimination.
q Strength reduction: Replacing an expensive operation with an equivalent cheaper operation
is called strength reduction. For example, the * operator can be replaced by a lower strength
operator +. Consider the following code segment:
for (j = 1; j <= 10; j++)
{
. . .
cnt = j * 5;
. . .
}
After strength reduction, the code can be written as follows:
temp = 5;
for (j = 1; j <= 10; j++)
{
. . .
cnt = temp;
temp = temp + 5;
. . .
}
q Loop unrolling: The number of jumps can be reduced by replicating the body of the loop if the
number of iterations is found to be constant (that is, the number of iterations is known at compile
time). For example, consider the following code segment:
Code Optimization and Code Generation 157
int j = 1;
while (j <= 50)
{
X[j] = 0;
j = j + 1;
}
This code segment performs the test 50 times. The number of tests can be reduced to 25 by
replicating the code inside the body of the loop as follows:
int j = 1;
while (j <= 50)
{
X[j] = 0;
j = j + 1;
X[j] = 0;
j = j + 1;
}
The main problem with loop unrolling is that if the body of the loop is big, then unrolling may
increase the code size, which in turn, may affect the system performance.
q Loop fusion: It is also known as loop jamming in which the bodies of the two loops are merged
together to form a single loop provided that they do not make any references to each other. For
example, consider the following statements:
int i,j;
for(i = 1;i <= n; i++)
A[i] = B[i];
for(j = 1;j <= n; j++)
C[j] = A[j];
These two loops can be fused into the following single loop:
int k;
for(k = 1; k <= n; k++)
{
A[k] = B[k];
C[k] = A[k];
}
10. Define DAG (directed acyclic graph). Discuss the construction of a DAG for a given basic
block.
Ans: A DAG is a directed acyclic graph that is used to represent the basic blocks and to implement
transformations on them. It represents the way in which the value computed by each statement in
a basic block is used in the subsequent statements in the block. Every node in a flow graph can be
represented by a DAG. Each node of a DAG is associated with a label. The labels are assigned by using
these rules:
q The leaf nodes are labeled by unique identifiers which can be either constants or variable names.
The initial values of names are represented by the leaf nodes, and hence they are subscripted with
0 in order to avoid confusion with labels denoting current values of names.
q An operator symbol is used to label the interior nodes.
q Interior nodes are also labeled with an extra set of identifiers: an interior node represents a computed value, and the identifiers labeling it are the names that currently hold that value.
The main difference between a flow graph and a DAG is that a flow graph consists of several nodes
where each node stands for a basic block, whereas a DAG can be constructed for each node (or the basic
block) in the flow graph.
Construction of DAG
While constructing a DAG, we consider a function node(identifier) which returns the most
recently created node associated with an identifier. Assume that there are no nodes initially and node( )
is undefined for all arguments. Let the three-address statements be either
(i) X: = Y op Z or
(ii) X: = op Y or
(iii) X: = Y
The steps followed for the construction of DAG are as follows:
1. Create a leaf labeled Y if node(Y) is undefined and let that node be node(Y). If node(Z) is
undefined for three-address statement (i) then, create a leaf labeled Z and let it be node(Z).
2. Determine if there is any node labeled op, where node(Y) as its left child and node(Z) as its
right child for three-address statement (i). If such a node is not found, then create such a node and
let it be n. For three-address statement (ii), determine whether there is any node labeled op whose only child is node(Y). If such a node is not found, then create such a node and let it be n. For three-address statement (iii), let n be node(Y).
3. Delete X from the list of attached identifiers for node(X). Append X to the list of attached identifiers
for node n created or found in step 2 and set node(X) to n.
For example, consider the block B2 shown in Figure 11.2. For the first statement, t1: = 4 * j,
leaves labeled 4 and j0 are created. In the next step, a node labeled * is created and t1 is attached to it
as an identifier. The DAG representation is shown in Figure 11.3(a).
For the second statement t2 := X[t1], a new leaf labeled X is created. Since we have already created node(t1) in the previous step, we do not create a new node for t1. However, we create a new node for [], attach node(X) and node(t1) as its children, and give it the label t2. Now, the third statement t3 := 4 * j is the same as the first statement; therefore, we do not create any new node but give the existing * node the additional label t3. The DAG representation for this is shown in Figure 11.3(b).
For the fourth statement t4 := Y[t3], we create another node [] and attach the leaf labeled Y and the node labeled t1,t3 as its children. The corresponding DAG representation is shown in Figure 11.3(c). For the fifth statement t5 := t2 * t4, we create a new node * and attach the already created nodes labeled t2 and t4 as its left and right child, respectively.
For the sixth statement t6 := PRODUCT + t5, we create a new node labeled + and attach a leaf labeled PRODUCT0 as its left child. The already created node labeled t5 is attached as its right child. For the seventh statement, PRODUCT := t6, we assign the additional label PRODUCT to this + node. The resultant DAG is shown in Figure 11.3(e).
Figure 11.3 Step-by-step Construction of the DAG for Block B2: (a) after t1 := 4 * j; (b) after t2 := X[t1] and t3 := 4 * j; (c) after t4 := Y[t3]; (d) after t5 := t2 * t4; (e) after t6 := PRODUCT + t5 and PRODUCT := t6; (f) the final DAG for the block
For the eighth statement t7 := j + 1, we create a new node labeled + and make j0 its left child. Now, we create a new leaf labeled 1 and make this leaf its right child. For the ninth statement, j := t7, we do not create any new node; rather, we give this + node the additional label j. Finally, for the last statement, we create a new node labeled <= and attach the jump target label (1), the first statement of the block, to it. Then we create a new leaf labeled 20 and make this node the right child of node(<=). The left child of this node is node(+). The final DAG is shown in Figure 11.3(f).
11. What are the advantages of DAG?
Or
Discuss the applications of DAG.
Ans: The construction of DAG from three-address statements, serves the following purposes:
q It helps in determining the common subexpressions (expressions computed more than once).
q It helps in determining the instructions that compute a value which is never used. It is referred to
as dead code elimination.
q It provides a way to determine those names which are evaluated outside the block but used inside it.
q It helps in determining those statements of the block which could have their computed values used
outside the block.
q It helps in determining those statements which are independent of one another and hence, can be
reordered.
12. Give the primary structure-preserving transformations on basic blocks.
Ans: The primary structure-preserving transformations on basic blocks are as follows:
q Common subexpression elimination: Transformations are performed on basic blocks by
eliminating the common subexpressions. For example, consider the following basic block:
X: = Y * Z
Y: = X + A
Z: = Y * Z
A: = X + A
In the given basic block, the right sides of the first and third statements appear to be the same; however, Y * Z is not a common subexpression, because the value of Y is modified by the second statement. The right sides of the second and fourth statements are also the same, and the value of X is not modified in between, so we can replace X + A by Y in the fourth statement. Now, the equivalent
transformed block can be written as follows:
X: = Y * Z
Y: = X + A
Z: = Y * Z
A: = Y
q Dead code elimination: A variable is said to be dead (useless) at a point in a program if its value cannot be used subsequently in the program. Similarly, a piece of code is said to be dead if the values computed by its statements are never used. Elimination of dead code does not affect the program behavior. For example, consider the following statements:
flag: = false
If (flag) print some information
Here, the print statement is dead, as the value of flag is always false and hence the control never reaches the print statement. Thus, the complete if statement (the test and the print
operation) can be eliminated easily from the object code.
q Algebraic transformations: The statements in a basic block can sometimes be simplified by applying algebraic identities. For example, consider the following statements:
Y: = X + 0
Y: = X * 1
Y: = X * 0
After algebraic transformations, the expensive addition and multiplication operations involved
in these statements can be replaced by cheaper assignment operations as given below:
Y: = X
Y: = X
Y: = 0
q Induction variables and strength reduction: Refer to Question 9.
13. Discuss in detail about a simple code generator with the appropriate algorithm.
Or
Explain code generation phase with simple code generation algorithm.
Ans: A simple code generator generates the target code for the three-address statements. The main
issue during code generation is the utilization of registers since the number of registers available is
limited. The code generation algorithm takes the sequence of three-address statements as input and
assumes that for each operator, there exists a corresponding operator in the target language. The machine
code instruction takes the required operands in registers, performs the operation and stores the result
in a register. Register and address descriptors are used to keep track of register contents and addresses.
q Register descriptors are used to keep track of the contents of each register at a given point of time.
Initially, we assume that a register descriptor shows that all registers are empty and as the code
generation proceeds, each register holds the value of zero or more names at some point.
q Address descriptors are used to trace the location of the current value of the name at run time. The
location may be memory address, register, or a stack location, and this information can be stored
in the symbol table to determine the accessing method for a name.
The code generation algorithm for a three-address statement X: = Y op Z is given below:
1. Call getreg() to obtain the location L where the result of Y op Z is to be stored. L can be a
register or a memory location.
2. Determine the current location of Y by consulting the address descriptor for Y and let it be Y’.
If both the memory and register contains the value of Y, then prefer the register for Y’. If the
value is not present in L, then generate an instruction MOV Y’, L.
3. Determine the current location of Z, say, Z’ and generate the instruction OP Z’,L. In this case
also, if both the memory and the register hold the value of Z, then prefer the register. Update the
address descriptor of X to indicate that X is in L and if L is a register then, update its descriptor
indicating that it holds the value of X. Delete X from other register descriptors.
4. If the current values of Y and/or Z are in registers, and if they have no further uses and are not
live at the end of the block, then alter the register descriptor. This alteration indicates that Y and/
or Z will no longer be present in those registers after the execution of X: = Y op Z.
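As a small illustration (a hypothetical run of the above steps, not taken from any particular machine), consider the statement t := a - b with all registers initially empty. In step 1, getreg() returns, say, R0 as the location L. In step 2, a resides only in memory, so the instruction MOV a, R0 is generated. In step 3, b also resides only in memory, so the instruction SUB b, R0 is generated; the address descriptor of t is updated to show that t is in R0, and the register descriptor of R0 is updated to show that it holds t. In step 4, since neither a nor b was in a register, the register descriptors need no further change.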
For the three-address statement X: = OP Y, the steps are analogous to the above steps. However,
for the three-address statement of the form X: = Y, some modifications are required as discussed
here.
q If Y is in a register then the register and address descriptors are altered to record that from now
onwards the value of X is found only in the register that holds the value of Y.
q If Y is in the memory, the getreg() function is used to determine a register in which the value
of Y is to be loaded, and that register is now made as the location of X.
Thus, the instruction of the form X: = Y could cause the register to hold the value of two or more
variables simultaneously.
q Redundant-instruction elimination: Redundant loads and stores can be eliminated. For example, consider the following instruction sequence:
MOV R0, X
MOV X, R0
The second instruction can be deleted, since the first instruction ensures that the value of X is already in register R0. However, it cannot be deleted if it has a label, as in that case there is no guarantee that the first instruction is always executed before the second. To ensure that this kind of transformation in the target code is safe, the two instructions must be in the same basic block.
q Unreachable code elimination: An unlabeled instruction that immediately follows an unconditional jump can be removed. When repeated, this process eliminates a whole sequence of unreachable instructions.
Consider the following intermediate code representation:
if error == 1 goto L1
goto L2
Figure 11.4(a) shows an example DAG for the statement P = P + 5, and Figure 11.4(b) shows its array representation:
1   id, to entry for P
2   num 5
3   +, (1), (2)
4   =, (1), (3)
In this array, the nodes are referred to by giving the integer index (called the value number) of the
record for that node within the array. For instance, in Figure 11.4(b), the node labeled = has value
number 4.
The value number method can also be used to implement certain optimizations based on algebraic laws (like the commutative, associative, and distributive laws). For example, if we want to create a DAG
node with its left child p and right child q, and operator *, we first check whether such a node exists
by using value number method. As multiplication is commutative in nature, therefore, we also need to
check the existence of a node labeled *, with its left child q and right child p.
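A minimal C sketch of this search, including the commutativity check for *, is given below (the array representation follows Figure 11.4; all names are hypothetical):
struct dag_node {
    char op;           /* operator label, e.g., '+', '*', or 'L' for a leaf */
    int  left, right;  /* value numbers of the children                     */
};

static struct dag_node nodes[1024];   /* the array of DAG records           */
static int n_nodes = 0;

/* Return the value number of the node (op, l, r), creating a new
   record only if no matching node already exists in the array. */
int value_number(char op, int l, int r)
{
    int i;
    for (i = 0; i < n_nodes; i++) {
        if (nodes[i].op != op)
            continue;
        if (nodes[i].left == l && nodes[i].right == r)
            return i;                 /* exact match found                  */
        if (op == '*' && nodes[i].left == r && nodes[i].right == l)
            return i;                 /* commutative match: p * q == q * p  */
    }
    nodes[n_nodes].op = op;           /* no match: create a new node        */
    nodes[n_nodes].left = l;
    nodes[n_nodes].right = r;
    return n_nodes++;
}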
The associative law can also be applied to improve the already generated code from a DAG. For
example, consider the following statements:
P: = Q + R
S: = R + T + Q
Figure 11.5 Use of the Associative Law: (a) DAG without the Associative Law; (b) DAG after Applying the Associative Law
17. What is global data flow analysis?
Ans: Global data flow analysis is the process of analyzing how data flows through the program as a whole and how this information can be used in optimizations. Basically, the data flow analysis process collects information about the program as a whole and then distributes this information to each block of the flow graph. Data flow information is defined in terms of data flow equations, and solving those equations yields the data flow information for each block.
Ud-chaining: A global data flow analysis of the flow graph is performed in order to compute
ud-chaining information. It answers the following question:
If a given identifier X is used at point y, then at which points could the value of X used at y have been defined?
Here, the use of X means that X occurs as an operand, and definition of X means either an assignment
to X or the reading of a value for X. A point refers to a position before and after any intermediate code
statement. Assuming that all edges in the flow graph are traversable, we say that a definition of a variable X reaches a point y if there exists a path in the flow graph from the definition of X to y along which no other definition of X appears.
Data flow equations: A data flow equation has the following form:
out[BB] = (in[BB] - Kill[BB]) ∪ Gen[BB]    (1)
where,
BB = basic block
Gen[BB] = the set of all definitions generated in basic block BB
Kill[BB] = the set of all definitions outside basic block BB that define the same variables as are defined in basic block BB
in[BB] = ∪ out[P]    (2)
where P ranges over the predecessors of BB.
The algorithm for finding the solution of these data flow equations is shown in Figure 11.6.
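A minimal C sketch of such an iterative solver is given below, with each definition set represented as a bit vector held in an unsigned integer; the predecessor lists and all names are hypothetical. The inner union implements equation (2) and the update implements equation (1):
#define NBLOCKS 5

unsigned gen[NBLOCKS], kill[NBLOCKS];   /* one bit per definition d1..d5     */
unsigned in[NBLOCKS], out[NBLOCKS];
int npred[NBLOCKS];                     /* number of predecessors of a block */
int pred[NBLOCKS][NBLOCKS];             /* predecessor lists                 */

void reaching_definitions(void)
{
    int b, p, flag = 1;
    for (b = 0; b < NBLOCKS; b++) {     /* initial iteration                 */
        in[b]  = 0;
        out[b] = gen[b];
    }
    while (flag) {                      /* iterate until no in[] changes     */
        flag = 0;
        for (b = 0; b < NBLOCKS; b++) {
            unsigned innew = 0;
            for (p = 0; p < npred[b]; p++)
                innew |= out[pred[b][p]];          /* in[BB] = U out[P]      */
            if (innew != in[b])
                flag = 1;
            in[b]  = innew;
            out[b] = (in[b] & ~kill[b]) | gen[b];  /* equation (1)           */
        }
    }
}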
18. Consider the following graph, and compute in and out of each block by using global data
flow analysis.
Here, d1, d2, d3, d4, and d5 are the definitions, and BB1, BB2, BB3, BB4, and BB5 are the
basic blocks.
BB1:  d1: a := 2
      d2: b := a + 1
BB2:  d3: a := 1
BB3:  d4: b := b + 1
BB4:  d5: b := j + 1
BB5:  (no definitions)
Ans: First, we need to compute in and out of each block, and for this we begin by computing Gen
and Kill in BB1. Both a and b are defined in block BB1 hence, Kill contains all definitions of a and
b outside the block BB1.
Kill[BB1] = {d3, d4, d5}
As d1 and d2 are the last definitions of their respective variables in BB1, we have
Gen[BB1] = {d1, d2}
In BB2, d3 kills all definitions of a outside BB2. Hence,
Kill[BB2] = {d1}
Gen[BB2] = {d3}
The complete list of Gen’s and Kill’s including their bit-vector representation is as follows:
Basic Block    Gen[BB]       Bit Vector    Kill[BB]         Bit Vector
BB1 {d1, d2} 11000 {d3, d4, d5} 00111
BB2 {d3} 00100 {d1} 10000
BB3 {d4} 00010 {d2, d5} 01001
BB4 {d5} 00001 {d2, d4} 01010
BB5 Ø 00000 Ø 00000
Now, after performing steps 1–3 of algorithm given in Figure 11.6, we get the following initial
iteration:
Basic Block in [BB] out [BB]
BB1 00000 11000
BB2 00000 00100
BB3 00000 00010
BB4 00000 00001
BB5 00000 00000
Pass 1. For basic block BB1:
innew = out[BB2]
      = 00100
Flag = true
in[BB1] = innew = 00100
out[BB1] = (in[BB1] - Kill[BB1]) ∪ Gen[BB1]
         = (00100 - 00111) ∪ 11000
         = (00100 ∧ ¬00111) ∪ 11000
         = (00100 ∧ 11000) ∪ 11000
         = 00000 ∪ 11000
         = 11000
For basic block BB2:
innew = out[BB1] ∪ out[BB5]
      = 11000 ∪ 00000
      = 11000
Flag = true
in[BB2] = innew = 11000
out[BB2] = (in[BB2] - Kill[BB2]) ∪ Gen[BB2]
         = (11000 - 10000) ∪ 00100
         = (11000 ∧ ¬10000) ∪ 00100
         = (11000 ∧ 01111) ∪ 00100
         = 01000 ∪ 00100
         = 01100
For basic block BB3:
innew = out[BB2]
      = 01100
Flag = true
in[BB3] = innew = 01100
out[BB3] = (in[BB3] - Kill[BB3]) ∪ Gen[BB3]
         = (01100 - 01001) ∪ 00010
         = (01100 ∧ ¬01001) ∪ 00010
         = (01100 ∧ 10110) ∪ 00010
         = 00100 ∪ 00010
         = 00110
For basic block BB4:
innew = out[BB3]
      = 00110
Flag = true
in[BB4] = innew = 00110
out[BB4] = (in[BB4] - Kill[BB4]) ∪ Gen[BB4]
         = (00110 - 01010) ∪ 00001
         = (00110 ∧ ¬01010) ∪ 00001
         = (00110 ∧ 10101) ∪ 00001
         = 00100 ∪ 00001
         = 00101
For basic block BB5:
innew = out[BB3] ∪ out[BB4]
      = 00110 ∪ 00101
      = 00111
Flag = true
in[BB5] = innew = 00111
out[BB5] = (in[BB5] - Kill[BB5]) ∪ Gen[BB5]
         = (00111 - 00000) ∪ 00000
         = (00111 ∧ ¬00000) ∪ 00000
         = (00111 ∧ 11111) ∪ 00000
         = 00111 ∪ 00000
         = 00111
Since Flag = true after pass 1, we reset Flag to false and perform pass 2.
For basic block BB1:
innew = out[BB2]
      = 01100
Flag = true
in[BB1] = innew = 01100
out[BB1] = (in[BB1] - Kill[BB1]) ∪ Gen[BB1]
         = (01100 - 00111) ∪ 11000
         = (01100 ∧ ¬00111) ∪ 11000
         = (01100 ∧ 11000) ∪ 11000
         = 01000 ∪ 11000
         = 11000
For basic block BB2:
innew = out[BB5] ∪ out[BB1]
      = 00111 ∪ 11000
      = 11111
Flag = true
in[BB2] = innew = 11111
out[BB2] = (in[BB2] - Kill[BB2]) ∪ Gen[BB2]
         = (11111 - 10000) ∪ 00100
         = (11111 ∧ ¬10000) ∪ 00100
         = (11111 ∧ 01111) ∪ 00100
         = 01111 ∪ 00100
         = 01111
For basic block BB3:
innew = out[BB2]
      = 01111
Flag = true
in[BB3] = innew = 01111
out[BB3] = (in[BB3] - Kill[BB3]) ∪ Gen[BB3]
         = (01111 - 01001) ∪ 00010
         = (01111 ∧ ¬01001) ∪ 00010
         = (01111 ∧ 10110) ∪ 00010
         = 00110 ∪ 00010
         = 00110
For basic block BB4:
innew = out[BB3]
      = 00110
in[BB4] = innew = 00110
out[BB4] = (in[BB4] - Kill[BB4]) ∪ Gen[BB4]
         = (00110 - 01010) ∪ 00001
         = (00110 ∧ ¬01010) ∪ 00001
         = (00110 ∧ 10101) ∪ 00001
         = 00100 ∪ 00001
         = 00101
For basic block BB5:
innew = out[BB3] ∪ out[BB4]
      = 00110 ∪ 00101
      = 00111
in[BB5] = innew = 00111
out[BB5] = (in[BB5] - Kill[BB5]) ∪ Gen[BB5]
         = (00111 - 00000) ∪ 00000
         = (00111 ∧ ¬00000) ∪ 00000
         = (00111 ∧ 11111) ∪ 00000
         = 00111 ∪ 00000
         = 00111
Therefore, after pass 2 we again have Flag = true, so we reset Flag to false and perform pass 3.
For basic block BB1:
innew = out[BB2]
      = 01111
Flag = true
in[BB1] = innew = 01111
out[BB1] = (in[BB1] - Kill[BB1]) ∪ Gen[BB1]
         = (01111 - 00111) ∪ 11000
         = (01111 ∧ ¬00111) ∪ 11000
         = (01111 ∧ 11000) ∪ 11000
         = 01000 ∪ 11000
         = 11000
For basic block BB2:
innew = out[BB1] ∪ out[BB5]
      = 11000 ∪ 00111
      = 11111
in[BB2] = innew = 11111
out[BB2] = (in[BB2] - Kill[BB2]) ∪ Gen[BB2]
         = (11111 - 10000) ∪ 00100
         = (11111 ∧ ¬10000) ∪ 00100
         = (11111 ∧ 01111) ∪ 00100
         = 01111 ∪ 00100
         = 01111
For basic block BB3:
innew = out[BB2]
      = 01111
in[BB3] = innew = 01111
out[BB3] = (in[BB3] - Kill[BB3]) ∪ Gen[BB3]
         = (01111 - 01001) ∪ 00010
         = (01111 ∧ ¬01001) ∪ 00010
         = (01111 ∧ 10110) ∪ 00010
         = 00110 ∪ 00010
         = 00110
For basic block BB4:
innew = out[BB3]
      = 00110
in[BB4] = innew = 00110
out[BB4] = (in[BB4] - Kill[BB4]) ∪ Gen[BB4]
         = (00110 - 01010) ∪ 00001
         = (00110 ∧ ¬01010) ∪ 00001
         = (00110 ∧ 10101) ∪ 00001
         = 00100 ∪ 00001
         = 00101
For basic block BB5:
innew = out[BB3] ∪ out[BB4]
      = 00110 ∪ 00101
      = 00111
in[BB5] = innew = 00111
out[BB5] = (in[BB5] - Kill[BB5]) ∪ Gen[BB5]
         = (00111 - 00000) ∪ 00000
         = (00111 ∧ ¬00000) ∪ 00000
         = (00111 ∧ 11111) ∪ 00000
         = 00111 ∪ 00000
         = 00111
In the next pass, the values of in and out will not change; hence, these in and out values are final and correct.
Multiple-Choice Questions
1. An optimizing compiler —————.
(a) is optimized to occupy less space
(b) is optimized to take less time for execution
(c) optimizes the code
(d) None of these
2. A basic block can be analyzed by a —————.
(a) DAG
(b) Flow graph
(c) Graph which may involve cycles
(d) All of these
3. Reduction in strength means —————.
(a) Replacing runtime computation
(b) Removing loop-invariant computation
(c) Removing common subexpressions
(d) Replacing a costly operation by a cheaper one
4. Which of the following is not true for a DAG?
(a) DAG cannot implement transformations on basic blocks.
(b) The nodes of DAG correspond to the operations in the basic block
(c) Each node of a DAG is associated with a label.
(d) None of these
5. Which of the following comments about peephole optimization is/are true?
(a) It is applied to a small part of the code.
(b) It can be used to optimize intermediate code.
(c) It can be applied to a portion of the code that is not contiguous.
(d) All of these
6. A variable is said to be ————— if its value gets incremented or decremented by some constant
every time the loop is executed.
(a) Induction variable
(b) Dead
(c) Live
(d) None of the above
7. ————— is the process of selecting a set of variables that will reside in CPU registers.
(a) Register assignment
(b) Register allocation
(c) Instruction selection
(d) None of these
8. Which of the following outputs can be converted into executable version by an assembler?
(a) Absolute machine language
(b) Relocatable machine language
(c) Assembly language
(d) None of the above
9. In ————— the bodies of the two loops are merged to form a single loop.
(a) Loop unrolling
(b) Strength reduction
(c) Loop concatenation
(d) Loop fusion
10. ————— are used to trace the location of the current value of the name at runtime.
(a) Register descriptors
(b) Address descriptors
(c) Both (a) and (b)
(d) None of these
Answers
1. (c) 2. (a) 3. (d) 4. (a) 5. (d) 6. (a) 7. (b) 8. (c) 9. (d) 10. (b)
Index
LR parsing, 66
  ambiguity in, 75–76
  configurations in, 67
  error recovery in, 77
LR(0) automaton, 68
LR(0) item, 68
LR(0) parser, 69
  construction of, 69
LR(1) parsing, 65

M
machine-dependent optimizations, 165
machine-independent optimizations, 165
macro definition, 2
macro name, 2
macros, 2
memory organization, 131
multi-pass compiler, 7–8

N
name equivalence, 128
NFA. See non-deterministic finite automata (NFA)
non-backtracking parsing, 49
non-deterministic finite automata (NFA), 20
non-recursive predictive parsing, 53
nullable(n), 25
numerical representation, 110–111

O
object (or target) program, 1
  execution of, 2f
operator grammar, 58–59
operator precedence parsing, 59
operator precedence, 38
optimization, 131
optimizing transformations, 155

P
panic mode error recovery, 77
panic mode recovery, 149
parameter passing, 140
parse tree, 50–51
  derivation of, 50–51
parse tree. See syntax tree, 5
parser generators, 8
parsing, 5
pass, 7
patterns, 14
peephole optimization, 163–164
phase, 4
phrase level error recovery, 77
phrase level recovery, 54
postfix notation, 106
  process of evaluation of, 106
postfix translation, 112
predictive parsing, 49
  error recovery strategies in, 54–55
prefix, 16
preprocessors, 2
  role of, 2f
procedure call/return statements, 108
  translation of, 114

Q
quadruple, 108–109

R
recursive predictive parsing, 49
recursive-descent parser, 52
redundant-instruction elimination, 163
register allocation, 135–136
register assignment, 152
register descriptors, 162
regular definition, 16
regular expression, 17
  construction of, 16
  properties of, 17
renaming temporary variables, 161
return sequence, 114, 132
runtime administration, 131–138
runtime environment, 131
  elements of, 131–132
runtime memory, 132–133

S
S-attributed definitions, 100
scanner generators, 8
scanning. See lexical analysis phase
SDD. See syntax-directed definition (SDD)
SDT. See syntax-directed translations (SDT)
self-organizing list, 144
semantic actions, 94
semantic analysis, 5

T
T-diagram representation, 8f
table-driven predictive parsing, 49
  advantages of, 49–50
  disadvantages of, 50

Y
YACC. See yet another compiler-compiler (YACC)
yet another compiler-compiler (YACC), 74–75