BCS - Compiler Construction - Notes

These notes cover the key phases of a compiler: lexical analysis breaks the source code into tokens such as identifiers, keywords, and punctuation; syntax analysis verifies that the tokens form valid statements based on the language's grammar rules, constructing a parse tree and then a condensed syntax tree; semantic analysis checks that statements are meaningful, chiefly by type checking. The compiler then generates intermediate code such as Three Address Code (TAC) before final code optimization and generation of assembly language code.


Chapter 1
Introduction to Compiler Construction


What is a Compiler?
DEF: "A compiler is a program which takes a HLL (High Level
Language) program as input and translates it into an
equivalent program in Assembly Language."
Objective: To translate a HLL program into Assembly Language.
Source Program (HLL)  --->  Compiler  --->  Target Program (Assembly Language)
(the Source Program is also called the Source Code)

During this process of translation the compiler also reports errors in
the source program.

Q1 – Why do we write programs in HLL instead of machine language
or Assembly language?
A computer can understand machine language only, in which program
statements consist of bit patterns (strings of 0's and 1's). For obvious
reasons, in practice programs are written in a HLL, but the computer
doesn't understand a HLL program. To make the computer understand a
HLL program it must be translated into machine language. This job is
partially done by a compiler.

Q2 – Why does the compiler stop at Assembly language?
Or why does a compiler not produce machine language directly?
There are two reasons, i.e.,
1 – Translation from any language to machine language is very difficult. To
save this effort the compiler stops at Assembly language.
2 – Well-established and reliable assemblers are available that have been in
use since the early 1950's.


A Typical Language Processing System


Jobs of the Pre-Processor:
1 – It strips off the comments.
2 – It processes directives like the #include directive.
3 – It expands macros.

Source Program
      |
Pre-Processor   (produces a neat, clean and detailed source program)
      |
  Compiler
      |
Assembly Language
      |
  Assembler
      |
 Object Code
      |
   Linker
      |
Executable Program
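As an illustration (not part of the original notes), these stages can be observed with a typical UNIX C toolchain; the gcc flags below are standard, the file names hypothetical:
$ gcc -E hello.c -o hello.i     (pre-processor only: strips comments, expands #include and macros)
$ gcc -S hello.i -o hello.s     (compiler proper: produces assembly language)
$ gcc -c hello.s -o hello.o     (assembler: produces object code)
$ gcc hello.o -o hello          (linker: produces the executable program)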

Large Programs are developed by breaking them into components.

Development of a large program is based on a Divide & Conquer strategy.
In the case of a compiler its components are called phases.
A compiler is a large program; to conquer its complexity a compiler is
divided into components called phases.
Phases of a Compiler:
Source Program (HLL) → Lexical Analysis → Syntax Analysis → Semantic Analysis
→ Intermediate Code Generation → Code Optimization → Code Generation
→ Assembly Language

1 – Lexical Analysis:
This phase reads the source program character by character and
combines the characters to form tokens.
Token: "A token is a meaningful word of a language."
A token is of four types:
1 – Punctuation Symbols (like {, }, [, ").
2 – Reserved Words (like void, for).
3 – Identifiers (like num1, temp).
4 – Numbers (like 1, 2423, -234).

Lexeme: "The string of characters constituting a token."
A lexeme is used to differentiate between the tokens of
the same type.


There is a 1-1 (one to one) relationship between tokens and lexemes in
the case of punctuation symbols and reserved words, and there is a 1-M
(one to many) relationship between tokens and lexemes in the case of
identifiers and numbers.

Token         Lexeme(s)
{             "{"
;             ";"
for           "for"
identifier    num1, temp, …
number        0, 1, 2, …

Example 1
Consider the following statement. A relation between centigrade and
Fahrenheit temperature is:
F.H = 1.8 * centigrade + 32
A lexical analyzer breaks this statement into the following tokens:
Identifier:              F.H
Assignment Symbol:       =
Number:                  1.8
Multiplication Symbol:   *
Identifier:              centigrade
Plus Symbol:             +
Number:                  32

Usually blanks and punctuation symbols act as separators between
the tokens.
After recognizing a token the Lexical Analyzer passes it on to the next
phase of the compiler, i.e. the Syntax Analysis phase.


2 – Syntax Analyzer:
DEF "A program which performs syntax analysis."
In this phase tokens are collected from the lexical analyzer one by one
until a statement is complete. Then it is verified that the statement is
correct, i.e. that the statement has been built according to the
rules / syntax / grammar of the language. During this process usually a tree
called a parse tree is constructed.
To illustrate the construction of the parse tree consider the following
simple rules of a language.
Rule 1 – An identifier is an expression.
Rule 2 – A number is an expression.
Rule 3 – The sum of two expressions is an expression.
Rule 4 – The product of two expressions is an expression.
Rule 5 – expression = expression is an assignment statement.
expr → id | num | expr + expr | expr * expr
asgn → expr = expr
The parse tree is saved in main memory for subsequent processing.
To save space this tree is compressed, i.e. redundant information is removed.
Syntax Tree: The compressed form of a parse tree is called a Syntax Tree. In a
syntax tree an operator appears as an interior node and its operands appear as
its children (as shown in Fig II).
The parse tree of the statement F.H = 1.8 * centigrade + 32
is illustrated in Fig I and the syntax tree is shown in Fig II.

                asgn
             /    |    \
         expr     =     expr
           |          /   |   \
           id      expr   +   expr
           |      /  |  \       |
          F.H  expr  *  expr   num
                |         |     |
               num        id   32
                |         |
               1.8   centigrade

           Fig I  Parse Tree

                =
               / \
            F.H   +
                 / \
                *   32
               / \
            1.8   centigrade

           Fig II  Syntax Tree


As shown in the previous figures, the parse tree consumes more
space than the syntax tree. Therefore, in practice we usually use the syntax
tree instead of the parse tree.
Yield: the yield of a tree is the string read off its leaves; when finding the
yield of a syntax tree the interior nodes are also considered. The yield of both
the parse tree and the syntax tree above is:
yield: F.H = 1.8 * centigrade + 32

3 – Semantic Analysis:
In this phase it is verified that a grammatically correct statement is
meaningful.
In programming, 'meaningful' is here confined to type checking
only.

4 – Intermediate Code Generation:

Q1 – Is it possible to translate HLL directly to Assembly language?
Yes, it is possible.
Q2 – Why then is the indirect approach used?
There are some advantages of using this approach (see Front End &
Back End below).

There are many intermediate languages or intermediate notations,
such as the syntax tree, postfix notation, the Directed Acyclic Graph (DAG), and
Three Address Code (TAC).
The most commonly used intermediate language is Three Address Code
(TAC). In TAC a statement can have the following forms:
form 1: x = y b-op z
form 2: x = u-op y

where b-op is a binary operator, i.e. an operator which requires two
operands, and u-op is a unary operator, i.e. an operator which requires one
operand.
In TAC there can be only one operator (other than assignment)
and at most three addresses (variables). One or two variables act as
operands and one is used to store the result. As there can be at most three
addresses, this code is called Three Address Code (TAC). While
generating TAC, the compiler creates and uses some temporary variables to
store the results of intermediate evaluations. The following is the TAC of the
statement:
Statement: F.H = 1.8 * centigrade + 32
TAC: temp1 = 1.8 * centigrade
temp2 = temp1 + 32
F.H = temp2
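As a small illustration (a minimal C sketch with assumed helper names, not from the notes), the way the compiler manufactures temporaries while emitting TAC can be mimicked like this:

#include <stdio.h>

static int temp_count = 0;

/* manufacture a fresh temporary name: temp1, temp2, ... */
static void newtemp(char *out) {
    sprintf(out, "temp%d", ++temp_count);
}

int main(void) {
    char t1[16], t2[16];
    /* F.H = 1.8 * centigrade + 32, one operator per TAC statement */
    newtemp(t1);
    printf("%s = 1.8 * centigrade\n", t1);
    newtemp(t2);
    printf("%s = %s + 32\n", t2, t1);
    printf("F.H = %s\n", t2);
    return 0;
}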
Exercise Generate TAC for the following statement: a = b * -c + d * c.


5 – Code Optimization:
In this phase the number of statements in the TAC is reduced. For example:
Statement: a = b * c + c
TAC:                     Optimized Code:
temp1 = b * c            temp1 = b * c
temp2 = temp1 + c        a = temp1 + c
a = temp2

6 – Code Generation:
In this phase the optimized TAC is translated into Assembly language.
Exercise Transform the following input statements through the different phases
of the compiler:
1 – a = b * c + c
2 – F.H = 1.8 * centigrade + 32
3 – v = 49 – 9.8 * t
4 – s = 49 * t – 4.9 * t * t
5 – s = vi * t + ½ * a * t * t
6 – rad = deg / 180 * π

Front End & Back End of the Compiler:


The phases of the compiler are grouped together to form front and
back end of the compiler.
The first five phases of the compiler are included in the Front End of
the compiler. The last two phases are included in the Back End of the
compiler.

Q What are the advantages of partitioning a compiler into front and back
ends?
1 – Portability: if the machine changes we need to change only the
back end of the compiler, and if the language changes we need to change
only the front end of the compiler.
2 – Machine independent code optimization can be done.

The six phases of the compiler, lexical analysis to code generation, are
called Formal Phases. In addition to these there are two Informal Phases
which interact with all the formal phases of the compiler, i.e.


Informal Phases:
I Symbol Table Manager
II Error Handler

I – Symbol Table Manager:

The symbol table is a data structure where tokens and their attributes are
stored. The symbol table interacts with all the phases: information is stored
into it during the different phases.
II – Error Handler:
Errors can occur in any phase, so the error handler, which handles the
errors, also interacts with all the phases of the compiler.


Chapter 2
Lexical Analysis


In this phase the source program is tokenized.


Basic Definitions
Alphabet: The alphabet or character class of a language consists of a finite set of
symbols. OR
The set of basic building blocks used to build the structure of a language is
called an Alphabet. It is symbolized by Σ.
For example, {0,1} is the binary alphabet; {A,B,C,…,Z} is the alphabet of
the English language.
Word: The strings of characters in a language are called Words of that
language.
Length: The total number of characters in a string is termed the Length of that
string. It is denoted by |s|.
Empty String: A string of length zero is called the Null String or Empty String and
is symbolized by the є sign.
String: A string of a language is a finite sequence of symbols (characters) of that
language. A string is denoted by s, e.g.
s = COLLEGE
|s| = 7
If s is any string then єs = sє = s.
Concatenation: Placing two strings adjacent to each other is called
Concatenation of the strings.
Let, s = FARM
     t = HOUSE
then the concatenation of s and t is denoted by st and is given by:
st = FARMHOUSE
Language: A specified set of strings of characters of an alphabet is called a
Language. It is symbolized by L. OR
A language is a set of strings. According to this definition { } is a
language and { є } is also a language.
Note: { } ≠ { є }
Languages can be combined to produce more languages; for this
purpose operators are used. There are many operators available, but we
shall consider four only.
Suppose L and M are two languages. Then,
Sr.#  Operation          Meaning
1     Union              L U M = { s | s is in L or s is in M }
2     Concatenation      LM = { st | s is in L and t is in M },
                         where s and t are strings.
3     Kleene Closure     L* = U (i = 0 to ∞) L^i, i.e. zero or more
                         repetitions of strings from L.
4     Positive Closure   L+ = U (i = 1 to ∞) L^i, i.e. one or more
                         repetitions of strings from L.


Example Consider the following sets:

L = { A, B, C, …, Z }
D = { 0, 1, 2, …, 9 }
These sets can be interpreted in two ways:
I – L is a set of letters, D is a set of digits.
II – L is a set of letters in which each letter can be regarded as a string of
length 1,
∴ L is a set of strings, and a set of strings represents a language,
∴ L represents a language.
And the same is the case with D (digits are also characters).

The languages can be combined to produce the following languages:

1 – L U D = { A, B, C, …, Z, 0, 1, 2, …, 9 }, which represents the set of letters
and digits.
2 – LD = { A0, A1, …, A9, B0, B1, …, B9, ……, Z0, Z1, …, Z9 },
which represents the set of strings of two characters in
which the first character is a letter and the second is a digit.
3 – L^4 represents the set of strings each of length 4, e.g. THIS, THAT, …
4 – L* represents all the strings obtained by different combinations of the
strings of language L (including the empty string).
5 – D+ is the set of all strings of one or more digits, i.e. it contains all the
natural numbers.

Q – How to specify tokens?

Methods of Specifying Tokens:
Tokens are of four types, i.e.
I – Punctuation Symbols: a list of the punctuation symbols is provided to
the lexical analyzer.
II – Reserved Words: a list of the reserved words is provided to the
lexical analyzer.
III – Identifiers: a rule for making identifiers is provided (e.g. in most
languages an identifier is a string of characters starting with a letter,
other than the reserved words and punctuation symbols).
IV – Numbers: a rule for making numbers is provided.

Regular Expressions are an important notation for specifying patterns. For
example, using this notation an identifier (in many languages) may be defined as
letter(letter|digit)*, where '|' is used as logical OR.
A regular expression represents a set of strings, i.e. a regular
expression represents a language. A language represented by a regular
expression is called a Regular Language and it is denoted by L(r).


Rules For Defining Regular Expressions (over an alphabet Σ):

1 – є is a regular expression denoting { є }.
2 – If 'a' is a symbol in the alphabet Σ then 'a' is a regular expression
denoting { a }. ('a' can also stand for a fixed string: 'for' is
a regular expression denoting { for }.)
3 – Suppose r and s are two regular expressions denoting the languages L(r) and
L(s). Then,
I   (r) | (s) is a regular expression denoting L(r) U L(s).
II  (r)(s) is a regular expression denoting L(r)L(s).
    (Concatenation)
III r* or (r)* is a regular expression denoting (L(r))*.
As we have seen, regular expressions can be combined to produce
more regular expressions. For this purpose the commonly used operators are
*, concatenation, and |. The precedence of the three operators is as follows:
1 – The unary operator * is left associative and has the highest precedence.
2 – Concatenation is also left associative and has the 2nd highest
precedence.
3 – | (OR) is also left associative and has the lowest precedence.
Example Let Σ = { a, b }
I –   The regular expression a | b denotes the set { a, b }.
II –  (a | b)(a | b) denotes { a, b }{ a, b } = { aa, ab, ba, bb }.
III – The regular expression a* denotes the set { є, a, aa, aaa, … }.
IV –  The regular expression (a | b)* denotes
      { є, a, b, aa, ab, ba, bb, aab, aba, … }, i.e. all strings of a's and b's.
V –   a | a*b denotes { a, b, ab, aab, aaab, … }.

If two regular expressions r and s denote the same language then r = s, for
example a | b = b | a.


How To Form Regular Expressions:

Regular expressions are formed by combining characters with or
without operators. The following operators are used for this purpose:
Operator   Meaning
+          The preceding expression is repeated one or more times. For
           example, a+ matches a, aa, aaa, …
*          The preceding expression is repeated zero or more times. For
           example, a* matches є, a, aa, aaa, …
?          The preceding expression is repeated zero or one time. For
           example, to match 7 with an optional + sign we use the
           pattern +?7.
|          Logical OR, e.g. ab | cd matches either ab or cd at a
           time, but not none or both at the same time.
{}         Repetition, e.g. a{1,3} matches a, aa and aaa only.
[]         Matches any one character from the set of characters
           specified within [ ], e.g. [abcd] matches either a or b or c or d, a
           single character at a time.
           [0123456789] matches any digit, i.e. it defines a digit.
-          Represents a range, i.e. [A-Z] matches any upper case letter.
           Similarly [0-9] defines a digit, and [A-Za-z] defines a letter that can
           be uppercase or lowercase.
[^x]       Matches any character other than x.
\          Stops the operator behaviour of a character. If we want to
           represent \, we use \\ (the first \ is used to stop the operator
           behaviour of the second \).

How To Develop A Lexical Analyzer:

There are two methods:
1 – Write your own program in C or C++.
2 – Generate the lexical analyzer using a tool (lex in UNIX and FLEX in a DOS
environment).

How To Generate A Lexical Analyzer Using A Tool:

There are three steps, i.e.
1 – Write a specification for the lexical analyzer using the lex language and
save it in a file having extension .l (say first.l).
2 – Process the lexical specification using the lex tool; for this purpose
use the following command in the UNIX environment:
$ lex first.l
where,
$ is the prompt in the UNIX environment.
lex is an executable program converting the lexical specification into C code,


having a function named yylex(); by default it saves it in a
file lex.yy.c.
3 – Compile this C program to generate an executable file. For this
purpose use the following command:
$ cc lex.yy.c -ll
where,
cc stands for C compiler.
-ll is a switch which links the lex library.
This produces an executable program and by default places it in the
file a.out. (You can rename it for convenience, i.e. $ mv a.out lex1.)
To invoke the lexical analyzer type $ lex1
To come out of this executable program, press Ctrl + d.

How To Write Lex Specification:


A lex specification has three components, i.e.
I– Definition & Declarations. -- Optional
%%
II – Rules. -- Compulsory
%%
III – User defined routines. -- Optional

I – Definition & Declarations: This part contains:
– include directives,
– definitions of constants and declarations of variables.
II – Rules: consists of rules; each rule has two components, i.e.
– The first component is a regular expression or pattern.
– The second is an action: one or more statements in the C language,
enclosed within { and }.
This part looks as follows:
pattern 1 { action 1 }
pattern 2 { action 2 }
: :
pattern n { action n }
III – User defined routines: It consists of the definitions of the functions used in the
rules part.
Example(a) Write lex specification (using rule part only) to
recognize:
1 – Positive digit.
2 – Positive Number.
3 – Negative digit.
4 – Negative Number.
5 – Reserved Word.
6 – Identifier.


 //lex specification
%%
 /* rules part */
\+?[1-9]             { printf("Positive digit"); }
\+?[1-9][0-9]*       { printf("Positive Number"); }
-[1-9]               { printf("Negative digit"); }
-[1-9][0-9]*         { printf("Negative Number"); }
[Ff][Oo][Rr]         { printf("Reserved Word for"); }
[A-Za-z][A-Za-z0-9]* { printf("Identifier"); }
%%
Q – Why is the identifier defined at the end of a lex specification?
Identifiers must be defined at the end; otherwise reserved words would
also be recognized as identifiers.
Example(b) Write lex specification (using definition part also) to
recognize:
1 – Positive digit.
2 – Positive Number.
3 – Negative digit.
4 – Negative Number.
5 – Reserved Word.
6 – Identifier.

 //lex specification
 //definition part
L   [A-Za-z]
NZD [1-9]
D   [0-9]
%%
 /* rules part */
\+?{NZD}         { printf("Positive digit"); }
\+?{NZD}{D}*     { printf("Positive Number"); }
-{NZD}           { printf("Negative digit"); }
-{NZD}{D}*       { printf("Negative Number"); }
[Ff][Oo][Rr]     { printf("Reserved Word for"); }
{L}({L}|{D})*    { printf("Identifier"); }

Representations of Tokens:
A token is a meaningful word of the language.
In practice a token is represented by a numeric constant associated
with it. The value of this numeric constant should not be less than 257,
since 0 – 255 are used for the ASCII codes of the characters and 256 is used
for the error condition.


Example Consider the following specification or prototype of a
function written in Ada:
function Add ( x : Complex; y: Complex ) return Complex;
Identify the tokens in this function and write a lex specification to
recognize them.

Tokens                 Token Type
function, return       Reserved words
(, :, ;, )             Punctuation symbols
Add, x, Complex, y     Identifiers

%{
/* Lex Specification */
#define FN_T      257
#define RET_T     258
#define OPENB_T   259
#define CLOSEB_T  260
#define IS_T      261
#define SEMICOL_T 262
#define ID_T      263
%}
%%
 /* Rules Part */
[Ff][Uu][Nn][Cc][Tt][Ii][Oo][Nn]  { printf("FN_T"); return FN_T; }
[Rr][Ee][Tt][Uu][Rr][Nn]          { printf("RET_T"); return RET_T; }
"("                               { printf("OPENB_T"); return OPENB_T; }
":"                               { printf("IS_T"); return IS_T; }
";"                               { printf("SEMICOL_T"); return SEMICOL_T; }
")"                               { printf("CLOSEB_T"); return CLOSEB_T; }
[A-Za-z][A-Za-z0-9]*              { printf("ID_T"); return ID_T; }
%%
 /* printf is used just for testing purposes */


Chapter 3
Syntax Analysis


DEF "Tokens are collected one by one to form a statement, which is checked
against the grammar."
Q – How to specify the syntax of the language?
For this purpose a specialized notation called Context Free Grammar
(CFG) is used. CFG can be used to specify the hierarchical structure of
programming constructs, e.g. using this notation the 'if' statement in C can be
defined as:
if_stmt → if (expr1) stmt1 else stmt2
where,
→ means 'is defined by' or 'can have the form'.
The above form of representation is called a Production. In this definition
if, (, ), else are indivisible lexical units and are called Terminals.
if_stmt, expr1, stmt1, stmt2 are composite entities and are called Non-
Terminals. A non-terminal is defined in terms of its
components, and these components can be terminals as well as non-
terminals. The symbol on the LHS of a production must be a non-terminal.
Backus-Naur Form (BNF) is used to specify the syntax/grammar of
a language. This grammar is used in the front end of the compiler.
Definition Of Context Free Grammar (CFG):
CFG is a notation used to specify the syntax of a language. A CFG has
four components, i.e.
1 – A set of terminals.
2 – A set of non-terminals.
3 – A set of productions.
4 – A starting symbol (one of the non-terminals is designated as the starting
symbol).
The syntax of a language is specified by defining all the non-terminals
of the language, i.e. by writing productions for all the non-terminals.
The production for the starting symbol is listed first.
Example Write a CFG for expressions consisting of digits separated
by + and/or -.
expr → expr + digit
expr → expr - digit
expr → digit
digit → 0
digit → 1
digit → 2
:       :
digit → 9

An equivalent compact notation, which is widely used, is:

expr → expr + digit
     | expr - digit
     | digit
digit → 0 | 1 | 2 | 3 … | 9
Using this grammar it can be proved that 8 – 6 + 2 is an expression.
Proof(a)
To prove: expr = 8 – 6 + 2 (bottom-up)
8 – 6 + 2
⇒ 8 – 6 + digit
⇒ 8 – digit + digit
⇒ digit – digit + digit
⇒ expr – digit + digit
⇒ expr + digit          (reducing expr – digit by expr → expr - digit)
⇒ expr                  (reducing by expr → expr + digit)
yield: 8 – 6 + 2 = input = 8 – 6 + 2
The yield matches the input, so the input is correct.
(Fig I: the corresponding parse tree, built from the leaves up to the root.)
In this case construction of the parse tree started from the leaves and
went up to the root; this type of construction is called Bottom-Up
construction of the tree, and this type of parsing is called Bottom-Up Parsing.
Proof(b)
To prove: expr = 8 – 6 + 2 (top-down)
expr
⇒ expr + digit
⇒ expr – digit + digit
⇒ digit – digit + digit
⇒ 8 – digit + digit
⇒ 8 – 6 + digit
⇒ 8 – 6 + 2
(Fig II: the same parse tree, built from the root down to the leaves.)

In this case the input string has been derived from the starting symbol, so the
input string is correct. While deriving this string the non-terminals are processed
from left to right, i.e. non-terminals on the left side are processed first. The
steps involved in this derivation can also be represented in the form of the tree
shown in Fig II.
Note: Every symbol on the left side of a production in a CFG is a non-terminal.


Ambiguous & Unambiguous Grammars:

Unambiguous Grammar:
A grammar is unambiguous if one and only one parse tree can be
constructed for each expression/statement using this grammar. E.g. the
following grammar is unambiguous, as shown in Fig I and Fig II (above),
i.e.
expr → expr + digit
     | expr - digit
     | digit
digit → 0 | 1 | 2 | 3 … | 9
Ambiguous Grammar:
DEF "A grammar is ambiguous if more than one parse tree can be
constructed for some expression/statement using this grammar."
(Heuristically, this often happens when the same non-terminal appears
more than once on the RHS of a production, as below.)

The following grammar is ambiguous:

expr → expr + expr
     | expr - expr
     | digit
digit → 0 | 1 | 2 | 3 … | 9
        expr                             expr
      /   |   \                        /   |   \
   expr   +   expr                  expr   -   expr
  /  |  \       |                     |       /  |  \
expr -  expr  digit                 digit  expr  +  expr
 |        |     |                     |      |        |
digit   digit   2                     8    digit    digit
 |        |                                  |        |
 8        6                                  6        2

      Fig I                              Fig II
yield: (8 – 6) + 2                 yield: 8 – (6 + 2)

Both trees yield the same input string 8 – 6 + 2, so the grammar is
ambiguous. Fig I corresponds to the conventional left-to-right evaluation,
which shows from the yields that Fig I is the correct interpretation.

Associativity of Operator:
Suppose an operand has the same operators on its left and right side
then:
I- If the operator on the left applies first, the operator is left
associative.
II - If the operator on the right applies first, the operator is right
associative.

For e.g.


1 – All four basic arithmetic operators +, -, *, / are left associative, e.g.
8 - 6 - 2 means (8 – 6) – 2.

2 – The assignment operator = is right associative, e.g.
a = b = c means a = (b = c)

Example Write down grammars for left and right associative operators.

Grammar for a left associative operator:
expr → expr – digit
     | digit
digit → 0 | 1 | 2 | 3 … | 9
Grammar for a right associative operator:
right → letter = right
      | letter
letter → a | b | c | d … | z
Using this grammar, the parse tree of the
statement a = b = c is shown in the following figure:

        right
      /   |   \
 letter   =   right
    |       /   |   \
    a   letter  =   right
           |          |
           b        letter
                      |
                      c

Right Associative Parse Tree (yield: a = b = c)

Note: The tree of the grammar for a left associative operator grows towards the
left, and the tree of the grammar for a right associative operator grows towards
the right.

Syntax Analysis (Parsing):

Repeat: "Tokens are collected one by one to form a statement, which is checked
against the grammar."
There are two types of parsing, i.e.
1 – Top Down Parsing,
2 – Bottom-Up Parsing.

Comparison between Top Down and Bottom-Up Parsing

Sr.#  Top Down Parsing                     Bottom-Up Parsing
1     Construction of the tree starts      Construction of the tree starts
      from the root and proceeds down      from the leaves and proceeds up
      to the leaves.                       to the root.
2     Suitable for hand-written parsers.   Suitable for parsers generated by tools.
3     Left recursion is dangerous.         Left recursion is suitable.

Top Down Parsing:

To illustrate Top Down parsing consider the following grammar:
type → simple
     | ^ id

| array[simple] of type
simple  integer
| char
| num..num assume that num is a positive integer
This grammar defines subset of types in Pascal language.
C Pascal
int x x: integer
int *ptr = &x ptr: integer
ptr: ^x
int A[20] A: array[0..19] of integer

Non-terminals: type, simple


Terminals: ^, id, array, [, ], of, integer, char, num, .. (dotdot).
Starting Symbol: type
The parsing process starts from the root, labeled with the starting symbol. The
following two steps are repeatedly applied:
1 – At a node labeled with a non-terminal A, select one of the productions
of A from the grammar and construct children for all the symbols on the
RHS of this production. Which production is selected is guided by the
look-ahead symbol: that RHS is selected which matches the current
input token.
2 – Process the children from left to right. If a child is a terminal and it
doesn't match the current input token, report a syntax error. If a
child is a terminal and it matches the current input token, advance the
input and advance in the tree. If a child is a non-terminal, construct
and process its subtree.
Example Consider the following input:
type: array [ num .. num ] of integer
The parser constructs the tree

           type
      /  /   |   \  \
 array [ simple ] of type
         /  |  \       |
      num dotdot num simple
                       |
                    integer

and parses the input successfully; the input is accepted as correct.
The defect in this method of parsing is that if the
look-ahead symbol doesn't guide us about the production to be selected, the
parser is lost and tries all the possibilities, which is time consuming. This
method of trying all the possibilities is called the Hit & Trial Method.
Problem: If a non-terminal has many RHS, which of them should be used?

In general, the current input token guides us about the production to be
applied. But the parser gets confused if the current input token can't guide
us.


Predictive Parsing / Recursive Descent Parsing:

This is a top-down parsing method in which a recursive procedure is executed
whenever we come across a non-terminal. Predictive parsing is a special
case of recursive descent parsing in which the current input token
unambiguously decides which production is to be applied; the
scheme is not applicable if the look-ahead symbol can't guide us. The parsing
process starts with a call to the procedure for the starting symbol.
If the input is processed without encountering any error,
the input is correct.
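The notes give no code for this, so here is a minimal sketch in C (an illustration under assumed names, not the notes' code) of a recursive descent parser, one procedure per non-terminal, for the right-recursive grammar expr → digit rest, rest → + digit rest | - digit rest | є:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static int lookahead;                 /* current input token (a character) */

static void error(void) { printf("syntax error\n"); exit(1); }

static void match(int t) {            /* consume one expected token */
    if (lookahead == t) lookahead = getchar();
    else error();
}

static void digit(void) {             /* digit -> 0 | 1 | ... | 9 */
    if (isdigit(lookahead)) match(lookahead);
    else error();
}

static void rest(void) {              /* rest -> + digit rest | - digit rest | є */
    if (lookahead == '+')      { match('+'); digit(); rest(); }
    else if (lookahead == '-') { match('-'); digit(); rest(); }
    /* else є: do nothing */
}

static void expr(void) { digit(); rest(); }   /* expr -> digit rest */

int main(void) {
    lookahead = getchar();
    expr();                            /* start with the starting symbol */
    if (lookahead == '\n') printf("input is correct\n");
    else error();
    return 0;
}

On the input 8-6+2 this prints "input is correct"; on 8-+2 it reports a syntax error.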

Recursive Grammar:
DEF "A production is recursive if it defines a non-terminal in terms of
itself." OR
"A grammar is recursive if at least one of its productions is recursive."
For e.g.
expr → expr + digit          --- left recursive
expr → digit + expr          --- right recursive
expr → digit + expr - digit  --- ____________ ?
Note: A recursive production which is not left recursive is right recursive.
Left Recursion:
DEF "A production is left recursive if the non-terminal being defined occurs at
the extreme left of the RHS of the production."
Left recursion is dangerous for top-down parsing. This can be shown as follows:
Proof Prove that left recursion is dangerous for top-down parsing.
(a) For this purpose consider a left recursive grammar:
expr → expr + digit
     | digit
In this case the parser can select either of the two productions; which
production is selected depends upon the look-ahead symbol.
expr ⇒ digit
expr ⇒ expr + digit
     ⇒ expr + digit + digit
     ⇒ expr + digit + digit + digit
     ⇒ expr + digit + digit + digit + digit ……………
:
:
In this way the parser can enter an infinite loop. This is why left
recursion is dangerous for top-down parsing: the expansion can go on forever
and the parser can enter an infinite loop.


(b) General Form:

Any left recursive production can be expressed in the following
standard form:    A → Aα | β                        -------- I
Consider the production: expr → expr + digit | digit  -------- 1
Eq. 1 can be expressed in the form of eq. I, i.e.
A = expr, α = + digit, β = digit
Now consider the form A → Aα | β and derive strings from it:
A → Aα | β
A ⇒ β                      (parsing terminates)
A ⇒ Aα
  ⇒ βα                     (parsing terminates)
A ⇒ Aα
  ⇒ Aαα
  ⇒ βαα                    (parsing terminates) ………
:
Output: The strings that can be derived from this grammar are a β followed by
zero or more α's, i.e. β, βα, βαα, βααα, …

Now consider the following productions:

A → βA'                    -------- II
A' → αA' | є
Let us find the output of (the strings that can be derived from) the above
grammar:
A ⇒ βA'
  ⇒ βє = β                 (parsing terminates)
A ⇒ βA'
  ⇒ βαA'
  ⇒ βαє = βα               (parsing terminates)
A ⇒ βA'
  ⇒ βαA'
  ⇒ βααA'
  ⇒ βααє = βαα             (terminates) ………
:


Output: The strings that can be derived from this grammar are a β followed by
zero or more α's, i.e.
βє, βαє, βααє, βαααє, … = β, βα, βαα, βααα, …

Hence the outputs of grammars I and II are equivalent, but grammar I is left
recursive while grammar II is right recursive.

     A                         A
    / \                       / \
   A   α                     β   A'
  / \                           / \
 A   α    ……                   α   A'   ……
 |                                 |
 β                                 є
 Grammar I                    Grammar II
(grows towards the left)     (grows towards the right)

We know that left recursion is dangerous for top-down parsing. So, for top-
down parsing a left recursive grammar must be converted into an equivalent
right recursive grammar. Converting a left recursive grammar into an equivalent
right recursive grammar is called elimination of left recursion from a
grammar.
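For instance, applying form II to production 1 above (a worked step, with
A = expr, α = + digit, β = digit):
expr → digit expr'
expr' → + digit expr' | є
Both grammars generate digit, digit + digit, digit + digit + digit, …, but the
second one is safe for top-down parsing.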
Exercise Eliminate left recursion from the following grammars:
1 – E → E + T | T
    T → T * F | F
    F → (E) | id
2 – expr → expr + digit
         | expr - digit
         | digit
    digit → 0 | 1 | 2 | 3 … | 9

Predictive Parsing / Non-Recursive Descent Parsing:

This method uses a stack instead of recursive procedure calls. If a non-
terminal has many RHS, which RHS should be selected is the main problem.
For this purpose a table called the parsing table is used in this method. This is
why this type of parser is also called a Table Driven Parser. A table driven
parser has four components:
1 – An input buffer: it contains the input string appended by a $ sign. $
is the end-of-input marker.

2 – Stack: the stack contains grammar symbols on top of $. Initially the stack
contains the starting symbol on top of $.


3 – Parsing Table (M): it is a two dimensional array in which the row indices
are non-terminals and the column indices are terminals. The contents are
either productions or blanks. A blank represents an error.
4 – Output: the selected production.

Input buffer:  d + b $
Stack (top to bottom): X u v w … $  <-->  Parsing Program  -->  Output
                                               |
                                         Parsing Table M

The parsing process depends upon two symbols:
1 – The current input token 'a'.
2 – The symbol 'X' on top of the stack.
There are three possibilities:
1 – X = a = $: parsing is complete, accept the input.
2 – X = a ≠ $: pop X off the stack, throw it away and advance the input.
3 – If X is a non-terminal, consult the entry M[X, a] of the parsing table.
If this slot is blank it is an error: report a syntax error. If it is an X
production of the form X → uvw, then pop X off the stack, throw it
away and push the symbols of the RHS onto the stack in reverse order, so
that u comes on top of the stack.
4 – X → uvw is the output.
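A minimal sketch of this driver loop in C follows (an illustration, not the notes' code). It assumes single-character grammar symbols with upper case letters as non-terminals (primed names such as E' would have to be renamed, e.g. E' → P, T' → Q), an є-production represented by an empty RHS string, and a hypothetical table-lookup function M(X, a) that returns the RHS of the selected production, or NULL for a blank slot:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

static char stack[100];
static int top = -1;

static void push(char c) { stack[++top] = c; }
static void pop(void)    { top--; }

/* parse input (ending in '$') against starting symbol start; 1 = accept */
static int parse(const char *input, char start, const char *(*M)(char, char)) {
    int i = 0;
    push('$'); push(start);
    for (;;) {
        char X = stack[top], a = input[i];
        if (X == '$' && a == '$') return 1;        /* X = a = $: accept */
        if (!isupper(X)) {                          /* X is a terminal */
            if (X != a) return 0;                   /* mismatch: error */
            pop(); i++;                             /* pop X, advance input */
        } else {
            const char *rhs = M(X, a);              /* consult M[X, a] */
            if (rhs == NULL) return 0;              /* blank slot: error */
            pop();                                  /* replace X by its RHS */
            for (int k = strlen(rhs) - 1; k >= 0; k--)
                push(rhs[k]);                       /* push RHS in reverse */
        }
    }
}

The caller supplies M encoding a parsing table like the one constructed below.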
Example Consider the following grammar:
E → TE'
E' → +TE' | є
T → FT'
T' → *FT' | є
F → (E) | id
Non-terminals: E, E', T, T', F
Terminals: +, *, (, ), id, $
Starting Symbol: E

Illustration of the steps performed (moves made) by the non-recursive
predictive parser in processing the following input:
id + id * id   (such as a + b * c)

The parsing table for this grammar is:


NT\Term   +          *          (        )       id      $
E                               E→TE'            E→TE'
E'        E'→+TE'                        E'→є            E'→є
T                               T→FT'            T→FT'
T'        T'→є       T'→*FT'             T'→є            T'→є
F                               F→(E)            F→id

Working Of Non-Recursive Predictive Parsing:

Stack          Input            Rule#   Output
$ E            id + id * id $   3       M[E, id] = E → TE'
$ E' T         id + id * id $   3       M[T, id] = T → FT'
$ E' T' F      id + id * id $   3       M[F, id] = F → id
$ E' T' id     id + id * id $   2       POP X & advance the input
$ E' T'        + id * id $      3       M[T', +] = T' → є
$ E'           + id * id $      3       M[E', +] = E' → +TE'
$ E' T +       + id * id $      2       POP X & advance the input
$ E' T         id * id $        3       M[T, id] = T → FT'
$ E' T' F      id * id $        3       M[F, id] = F → id
$ E' T' id     id * id $        2       POP X & advance the input
$ E' T'        * id $           3       M[T', *] = T' → *FT'
$ E' T' F *    * id $           2       POP X & advance the input
$ E' T' F      id $             3       M[F, id] = F → id
$ E' T' id     id $             2       POP X & advance the input
$ E' T'        $                3       M[T', $] = T' → є
$ E'           $                3       M[E', $] = E' → є
$              $                1       Input string is correct

First and Follow:

First Set:
DEF "If a non-terminal has many RHS then the set constituted by
considering the beginning terminals of all the RHS is called the First Set of
that non-terminal."
Consider the following grammar:
x → ABm
  | JKT
  | Pq
m → sj
  | yk   …………


(In this grammar it has been assumed that lower case letters represent
non-terminals and upper case letters represent terminals.)
FIRST(x) = { A, J, P }
Q – What should be done if a RHS begins with a non-terminal?
In this case the first set of that beginning non-terminal
is included in the first set of the non-terminal on the LHS. Now
consider the following grammar:
x → ABm
  | jkT
  | Pq
m → sj
  | yk
j → DMa
  | CNB
FIRST(x) = { A, FIRST(j), P }
         = { A, { D, C }, P }
         = { A, D, C, P }
A function that computes the first sets of the non-terminals of a grammar is
called a First Function.
Rules For Finding First Sets:
1 – If x is a terminal then FIRST(x) = { x }.
2 – If x → є then є is included in FIRST(x).
3 – Consider x → Aα, where A is a terminal and α is a sequence of zero or
more terminals and non-terminals. Then A is included in FIRST(x).
4 – Consider x → bα, where b is a non-terminal and α is a sequence of
zero or more terminals and non-terminals. Then FIRST(b) is included
in FIRST(x); and if b can derive є then FIRST(α) is included as well, i.e.
FIRST(x) includes FIRST(b) U FIRST(α).
Example Find the first sets of all the non-terminals of the following
grammar:
E → TE'
E' → +TE' | є
T → FT'
T' → *FT' | є
F → (E) | id
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, є }
FIRST(T) = { (, id }
FIRST(T') = { *, є }


FIRST(F) = { (, id }
Exercise Find the first sets of all the non-terminals of the following
grammar:
stmt → expr ;
expr → term expr'
expr' → + term expr' | є
term → factor term'
term' → * factor term' | є
factor → (expr) | id
Follow Set:
DEF "A terminal is included in the follow set of a non-terminal if it follows
(comes immediately after) that non-terminal in any production."
Example Consider the following grammar, in which it has been
assumed that lower case letters represent non-terminals and upper case
letters represent terminals:
s → AbM
  | bTk
k → JbS
  | XY
…………
FOLLOW(b) = { M, T, S }
Q – What should be done if a non-terminal is followed by another non-
terminal?
s → AbM
  | btk
k → JbS
  | XY
t → GnU | BX

In this case, instead of the second non-terminal, its first set is considered:

New FOLLOW(b) = { M, FIRST(t), S }
              = { M, G, B, S }

Note: A non-terminal can never be a member of a follow set; follow sets
contain terminals only.

A function that computes the follow sets of the non-terminals of a grammar is
called a Follow Function.

Rules For Finding Follow Sets:

1 – If s is the starting symbol then $ is included in FOLLOW(s).
2 – Consider a production of this form:
s → …aB

where 'a' is a non-terminal


and 'B' can be a terminal or a non-terminal.

If B is a terminal then B is included in FOLLOW(a).
If B is a non-terminal then FIRST(B) is included in FOLLOW(a).
(If B is a terminal, FIRST(B) = { B }, so in both of the cases
FIRST(B) is included in FOLLOW(a).)
To generalize, consider:
s → …αaβ
where 'α' is a sequence of zero or more terminals and non-
terminals,
and 'β' is a sequence of terminals and non-terminals.
In this case FIRST(β) is included in FOLLOW(a).
3 – Consider s → …a
where 'a' is the last non-terminal on the RHS.
Then everything in the follow set of s is included in FOLLOW(a).
Proof Prove that,
if s → …a, where 'a' is the last non-terminal on the RHS, then everything in
the follow set of s is included in FOLLOW(a).

To prove this rule consider the following grammar:

s → AbM   ----- 1
b → Jn    ----- 2
In this particular grammar lower case letters represent non-terminals and
upper case letters represent terminals.
M is included in FOLLOW(b). Substitute the value of b in production 1, i.e.
s → AJnM
⇒ M is included in FOLLOW(n)
Now consider production 2:
⇒ M is included in FOLLOW(b)
⇒ M is also included in FOLLOW(n)
⇒ Everything in FOLLOW(b) is included in the follow set of the
last non-terminal on the RHS of b.
To generalize this, consider:
s → …aβ
where 'β' is a nullable non-terminal (a non-terminal is nullable if one
of its RHS is є). Then everything in FOLLOW(s) is included
in FOLLOW(β), as well as in FOLLOW(a).

How To Find Follow Set:


Finding follow sets is a multi-pass (multi-step) process. During the first
pass rules 1 and 2 are applied. During the subsequent passes rule 3 is
repeatedly applied until it doesn't add anything new to the follow sets.
Example Find the follow sets of all the non-terminals of the following
grammar:
E → TE'
E' → +TE' | є
T → FT'
T' → *FT' | є
F → (E) | id

FOLLOW         I Pass     II Pass

FOLLOW(E)      { ), $ }   { ), $ }
FOLLOW(E')     { }        { ), $ }
FOLLOW(T)      { + }      { +, ), $ }
FOLLOW(T')     { }        { +, ), $ }
FOLLOW(F)      { * }      { *, +, ), $ }
Exercise Find the follow sets of all the non-terminals in the following
grammars:
1 – stmt → expr ;
    expr → term expr'
    expr' → + term expr' | є
    term → factor term'
    term' → * factor term' | є
    factor → (expr) | id
2 – s → e
    e → te'
    e' → +te' | є
    t → f t'
    t' → * f t' | є
    f → (e) | id

Construction Of The Parsing Table:

Input: A grammar G
Output: A Parsing Table M
Method:
1 – For each production A → α do steps 2 and 3.
2 – Find FIRST(α); for each terminal 'a' in FIRST(α) make
M[A, a] = A → α.

3 – If є is included in FIRST(α), then find FOLLOW(A). For each terminal
b in FOLLOW(A) make M[A, b] = A → є (and, if $ is in FOLLOW(A),
M[A, $] = A → є).


4– Make each empty slot an error.


Example Construct a Parsing Table for the following grammar:
E → TE'
E' → +TE' | є
T → FT'
T' → *FT' | є
F → (E) | id
Non-terminals: E, E', T, T', F
Terminals: +, *, (, ), id, $
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, є }
FIRST(T) = { (, id }
FIRST(T') = { *, є }
FIRST(F) = { (, id }

FOLLOW         I Pass     II Pass

FOLLOW(E)      { ), $ }   { ), $ }
FOLLOW(E')     { }        { ), $ }
FOLLOW(T)      { + }      { +, ), $ }
FOLLOW(T')     { }        { +, ), $ }
FOLLOW(F)      { * }      { *, +, ), $ }

1 – Consider: E → TE'
Compare with: A → α
⇒ A = E, α = TE'
FIRST(α) = FIRST(TE') = FIRST(T) = { (, id }
a = (, id
For a = (:
M[A, a] = A → α
M[E, (] = E → TE'
For a = id:
M[A, a] = A → α
M[E, id] = E → TE'

2 – Consider: E' → +TE' | є
Compare with: A → α
⇒ A = E', α = +TE' | є
FIRST(α) = FIRST(+TE' | є) = { +, є }
a = +, є
For a = +:

M[A, a] = A → α


M[E', +] = E' → +TE'
For a = є:
Find FOLLOW(A)
⇒ FOLLOW(E') = { ), $ }
b = ), $
For b = ):
M[A, b] = A → є
M[E', )] = E' → є
For b = $:
M[E', $] = E' → є

3 – Consider: T → FT'
Compare with: A → α
⇒ A = T, α = FT'
FIRST(α) = FIRST(FT') = FIRST(F) = { (, id }
a = (, id
For a = (:
M[A, a] = A → α
M[T, (] = T → FT'
For a = id:
M[A, a] = A → α
M[T, id] = T → FT'
4 – Consider: T' → *FT' | є
Compare with: A → α
⇒ A = T', α = *FT' | є
FIRST(α) = FIRST(*FT' | є) = { *, є }
a = *, є
For a = *:
M[A, a] = A → α
M[T', *] = T' → *FT'
For a = є:
Find FOLLOW(A)
⇒ FOLLOW(T') = { +, ), $ }
b = +, ), $
For b = +:
M[A, b] = A → є
M[T', +] = T' → є
For b = ):
M[A, b] = A → є
M[T', )] = T' → є

For b = $:


M[A, b] = A → є
M[T', $] = T' → є

5 – Consider: F → (E) | id
Compare with: A → α
⇒ A = F, α = (E) | id
FIRST(α) = FIRST((E) | id) = { (, id }
a = (, id
For a = (:
M[A, a] = A → α
M[F, (] = F → (E)
For a = id:
M[A, a] = A → α
M[F, id] = F → id

Parsing Table
NT\Term   +          *          (        )       id      $
E                               E→TE'            E→TE'
E'        E'→+TE'                        E'→є            E'→є
T                               T→FT'            T→FT'
T'        T'→є       T'→*FT'             T'→є            T'→є
F                               F→(E)            F→id
Exercise Find the FIRST and FOLLOW sets of all the non-terminals and
construct the parsing tables of the following grammars:
1 – stmt → expr ;
    expr → term expr'
    expr' → + term expr' | є
    term → factor term'
    term' → * factor term' | є
    factor → (expr) | id
2 – s → e
    e → te'
    e' → +te' | є
    t → f t'
    t' → * f t' | є
    f → (e) | id

Bottom-Up parsing:


In this type of parsing the construction of the tree starts from the
leaves and proceeds up to the root. This type of parsing is usually done by a
parser generated by a tool.

Bottom-Up parsing uses a push-down automaton. For this purpose two
things are required:
1 – A state machine to drive the parser.
2 – A stack to remember the states.
Here we shall not consider the mechanics of the state machine.
Rather we shall consider the STACK and the input only. Initially the STACK is
empty. The parser works as follows:
1 – If a few items on the top of the STACK form the RHS of a production and
the top-most item is not followed by the current input token in any
production, then POP these items off the STACK and PUSH the
corresponding LHS onto the stack. This operation is called a REDUCE
operation.
2 – Otherwise PUSH the current input token onto the STACK and
advance the input. This operation is called a SHIFT operation.
3 – If a REDUCE operation brings the starting symbol to the top of the STACK,
accept the input: it is correct and the parsing is complete.
Note: Initially the STACK is empty.
Advancing the input means moving on to the next input token.
$ represents the end of the input.
Exercise To illustrate the working of a Bottom-Up parser, let us consider
the following grammar:
s → e
e → e + t | t
t → t * f | f
f → (e) | num
num → 0 | 1 | 2 | … | 9
In this grammar:
Non-terminals: s, e, t, f, num
Terminals: +, *, (, ), 0, 1, 2, …, 9
Consider the following input: 2 * (3 + 4)
The steps performed by the parser are shown below; the table shows the
input and the stack after each step.

Working Of The Bottom-Up Parsing:


STACK            Input          Next Action

                 2 * (3 + 4)$   Shift
2                * (3 + 4)$     Reduce (num → 2)
num              * (3 + 4)$     Reduce (f → num)
f                * (3 + 4)$     Reduce (t → f)
t                * (3 + 4)$     Shift
t *              (3 + 4)$       Shift
t * (            3 + 4)$        Shift
t * ( 3          + 4)$          Reduce (num → 3)
t * ( num        + 4)$          Reduce (f → num)
t * ( f          + 4)$          Reduce (t → f)
t * ( t          + 4)$          Reduce (e → t)
t * ( e          + 4)$          Shift
t * ( e +        4)$            Shift
t * ( e + 4      )$             Reduce (num → 4)
t * ( e + num    )$             Reduce (f → num)
t * ( e + f      )$             Reduce (t → f)
t * ( e + t      )$             Reduce (e → e + t)
t * ( e          )$             Shift
t * ( e )        $              Reduce (f → (e))
t * f            $              Reduce (t → t * f)
t                $              Reduce (e → t)
e                $              Reduce (s → e)
s                $              Accept

A reduce operation has brought the starting symbol to the top of the stack:
accept the input. The input is correct and parsing is complete.

Construction Of The Parse Tree:
Construction of the tree starts from the leaves. On each Reduce
operation we move one step upward, and on each Shift operation we
move one step horizontally towards the right:

            s
            |
            e
            |
            t
         /  |  \
        t   *   f
        |     / | \
        f    (  e  )
        |      / | \
       num    e  +  t
        |     |     |
        2     t     f
              |     |
              f    num
              |     |
             num    4
              |
              3

Exercise Illustrate the steps performed by a Bottom-Up parser in
processing the following inputs:
s → e
e → e + t | t
t → t * f | f
f → (e) | num
num → 0 | 1 | 2 | … | 9
1 – 2 + 3 * 4
2 – 2 * 3 + 4 * 5
3 – (2 + 3) * 4
Also construct the parse trees.

Q – How to develop a parser program?

There are two methods to construct the parser program, i.e.
1 – Write your own program.
2 – Generate the parser program using a tool named yacc (in UNIX) or b_yacc
(in DOS). Yacc means Yet Another Compiler Compiler.

Q – How to generate a parser program using the yacc tool?

Step 1 – Write a specification for the parser program using the yacc
language and save it in a .y file; the .y extension indicates a yacc
specification file (say first.y).

Step 2 – Process the specification using the yacc tool. For this purpose use
the following command in the UNIX environment:


$ yacc first.y
This generates the parser program in the form of a function
yyparse() and by default saves it in a file y.tab.c. It can be
renamed if desired.
Step 3 – Compile this C program. For this purpose use the following
command:
$ cc y.tab.c -ly
where, 'cc' means C compiler,
'y.tab.c' is the default file name,
'-ly' links the yacc library.
This generates an executable file a.out, which is the parser.

How To Write yacc Specification:


Like lex, yacc specification has three parts:
I– Definition & Declarations. -- Optional
%%
II – Translation Rules. -- Compulsory
%%
III – User defined routines. -- Optional

I – Definition & Declarations: This part contains:
– include directives,
– definitions of constants and declarations of variables.
II – Translation Rules: consists of rules; each rule has two components, i.e.
– The first component is a production.
– The second is an action: one or more statements in the C language,
enclosed within { and }.
This part looks as follows:
production 1 { action 1 }
production 2 { action 2 }
: :
production n { action n }
III – User defined routines: It consists of the definitions of the functions used in
the translation rules part.
Example Let us construct a simple desk calculator that reads an
arithmetic expression, evaluates it, and then prints its numeric
value.
To begin with, let us develop a desk calculator for the following grammar:
E → E + T | T
T → T * F | F
F → (E) | DIGIT
Step 1 – Write yacc specification, i.e.


//yacc specification
%{
#include <stdio.h>
#include <ctype.h>   /* for type checking (isdigit) */
%}
%token DIGIT
%%
line : expr '\n'       { printf("%d\n", $1); }
     ;
expr : expr '+' term   { $$ = $1 + $3; }
     | term
     ;
term : term '*' factor { $$ = $1 * $3; }
     | factor
     ;
factor : '(' expr ')'  { $$ = $2; }
       | DIGIT
       ;
%%
//lexical analyzer
int yylex()
{
    int c;
    c = getchar();        /* read the next input character */
    if (isdigit(c))       /* function in ctype.h */
    {
        yylval = c - '0'; /* the token's value */
        return DIGIT;     /* the token's type */
    }
    return c;
}
Save it in a file first.y and perform steps 2 and 3 as mentioned above.
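For example, an illustrative build-and-run session (assuming the steps above; exact prompts and output vary by system):
$ yacc first.y
$ cc y.tab.c -ly
$ ./a.out
2+3*4
14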
Exercise Develop the lexical analyzer and the parser program separately
for the desk calculator.


Chapter 4
Semantic Analysis


Repeat
In this phase it is verified that a grammatically correct statement is
meaningful.
In programming, 'meaningful' is here confined to type checking
only.


Chapter 5
Intermediate Code Generation


HLL program ---(theoretically)---> Assembly Language
HLL program ---(practically)---> Intermediate Code ---> Assembly Language

The objective of a compiler is to translate a HLL program into
Assembly Language. But the compiler doesn't perform this translation directly. It
first translates the HLL program into an intermediate language, and then the
intermediate language is translated into Assembly Language.
Repeat
Advantages:
1 – Portability: if the machine changes we need to change only the
back end of the compiler, and if the language changes we need to change
only the front end of the compiler.
2 – Machine independent code optimization can be done.

Intermediate Languages:
There are many intermediate languages, but we shall consider only
three or four intermediate languages / intermediate representations.
1 – Syntax Tree: It represents the hierarchical structure of the statement.
For example, the syntax tree of the statement a = b * -c + b * -c is shown
in Fig I:

            =
          /   \
         a     +
             /   \
            *     *
           / \   / \
          b  u- b  u-
             |     |
             c     c

      Fig I  Syntax Tree   (u- = uminus)

2 – DAG (Directed Acyclic Graph): It contains the same information as the
syntax tree but in a more compact form: the common subexpression b * -c
appears only once. The DAG for the above statement is shown in Fig II:

            =
           / \
          a   +
              |   (both operands of + are the same * node)
              *
             / \
            b   uminus
                  |
                  c

      Fig II  DAG


3 – TAC (Three Address Code): The syntax tree can be stored as a table of
records (Table I; Fig III drew the same structure as a tree), where each node
is represented as a record with fields for its operator and its children:

Table I (nodes of the syntax tree of a = b * -c + b * -c)
 0   id      b
 1   id      c
 2   uminus  1
 3   *       0  2
 4   id      b
 5   id      c
 6   uminus  5
 7   *       4  6
 8   +       3  7
 9   id      a
10   =       9  8

Example A statement containing more than one operator is broken down
into TAC as follows, where t1, t2, … are temporary variables generated
by the compiler to hold the results of the intermediate evaluation of
expressions.

TAC is a linearised form of the syntax tree or DAG in which explicit names
correspond to the interior nodes of the graph.
TAC for the syntax tree and DAG of the statement
a = b * -c + b * -c is as follows:

TAC for the Syntax Tree     TAC for the DAG
t1 = uminus c               t1 = uminus c
t2 = b * t1                 t2 = b * t1
t3 = uminus c               t3 = t2 + t2
t4 = b * t3                 a = t3
t5 = t2 + t4
a = t5
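The linearisation can be mechanised. The following is a minimal C sketch (an illustration with assumed node and helper names, not the notes' code) of a recursive walk that emits TAC from a syntax tree:

#include <stdio.h>

/* a syntax-tree node: a leaf holds a name, an interior node an operator */
struct node {
    const char *op;            /* "+", "*", "uminus", or NULL for a leaf */
    const char *name;          /* identifier/number text for leaves */
    struct node *left, *right; /* right is NULL for unary operators */
};

static int tcount = 0;

/* emit TAC for the subtree and return the name holding its value */
static const char *gen(struct node *n) {
    static char temps[32][8];
    const char *l, *r = NULL;
    if (n->op == NULL) return n->name;      /* leaf: its own name */
    l = gen(n->left);
    if (n->right) r = gen(n->right);
    char *t = temps[tcount];                /* fresh temporary t1, t2, ... */
    sprintf(t, "t%d", ++tcount);
    if (r) printf("%s = %s %s %s\n", t, l, n->op, r);   /* binary operator */
    else   printf("%s = %s %s\n", t, n->op, l);         /* unary, e.g. uminus */
    return t;
}

int main(void) {
    /* the subexpression b * -c of the statement above */
    struct node c = {NULL, "c", 0, 0}, b = {NULL, "b", 0, 0};
    struct node um = {"uminus", NULL, &c, NULL};
    struct node mul = {"*", NULL, &b, &um};
    printf("a = %s\n", gen(&mul));  /* emits: t1 = uminus c ; t2 = b * t1 ; a = t2 */
    return 0;
}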


Types of TAC Statements:


1 – Assignment Statements of the form:
x = y b-op z
where b-op is a binary operator i.e. an operator which requires two
operands.
2 – Assignment Instruction of the form:
x = u-op y
where u-op is a unary operator i.e. an operator which requires one
operand. Examples of the unary operators are NOT, NEG, etc.
3 – Copy Statement of the form:
x=y
i.e. contents of y are copied into x.
4 – Unconditional Jump of the form:
GOTO L
which means that the statement at label L is executed next.
5 – Conditional Jump of the form:
if ( x rel-op y ) GOTO L
where rel-op is one of <, >, <=, >=, ==, !=;
the statement at label L is executed if x rel-op y is true.
6 – Procedure-call statements of the form param x, call p, n, and return y,
where y represents an optional return value.
7 – Index Assignment of the form:
x = y[ i ]; ----- I
y[ i ] = x; ----- II
where, I sets the value of x to the value in memory location i
units beyond y.
II sets the value in memory location i units beyond y to
the value of x.
8 – Address and Pointer Assignments i.e.
x = &y; ----- I
x = *y; ----- II
*x = y; ----- III
where, I means address of y is copied into x.
II means the contents of memory location pointed to by
y become r-value of x.
III means the contents of y become the contents of the
memory location pointed to by x.
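Inside a compiler these statements are commonly stored as quadruples (operator, two arguments, result); the following C sketch shows one possible layout (an assumption for illustration, not a layout prescribed by the notes):

#include <stdio.h>

/* one TAC statement stored as a quadruple: op, two arguments, result */
enum tac_kind { OP_BINARY, OP_UNARY, OP_COPY, OP_GOTO, OP_IF_GOTO,
                OP_PARAM, OP_CALL, OP_RETURN };

struct quad {
    enum tac_kind kind;
    const char *op;     /* "+", "uminus", "<", ... where applicable */
    const char *arg1;
    const char *arg2;   /* NULL for unary and copy forms */
    const char *result; /* destination variable or jump label */
};

int main(void) {
    /* temp1 = 1.8 * centigrade ; temp2 = temp1 + 32 ; F.H = temp2 */
    struct quad code[] = {
        { OP_BINARY, "*", "1.8",   "centigrade", "temp1" },
        { OP_BINARY, "+", "temp1", "32",         "temp2" },
        { OP_COPY,   "=", "temp2", NULL,         "F.H"   },
    };
    for (int i = 0; i < 3; i++) {
        if (code[i].kind == OP_BINARY)
            printf("%s = %s %s %s\n", code[i].result,
                   code[i].arg1, code[i].op, code[i].arg2);
        else if (code[i].kind == OP_COPY)
            printf("%s = %s\n", code[i].result, code[i].arg1);
    }
    return 0;
}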

r-values and l-values of a variable:

Consider the following declaration of a variable:
int x = 5;
In this case a memory location big enough to
store an integer is reserved and the name x is
bound to that memory location:

x: [ 5 ]   at address 00AB8h
   (r-value of x = 5, l-value of x = 00AB8h)

Now x has two attributes associated with it:

1 – The address of the memory location to which x is bound; this is called the
l-value of x.
2 – The value contained in the memory location bound to x; this is called the
r-value of x.
In the example illustrated in the previous figure:
l-value of x = 00AB8h
r-value of x = 5

Binding Of Names:
In programming language semantics the Environment is a
function that maps a name to a storage location,
and the State is a function that binds a storage location to the value held
there. Hence,
environment : name → storage location
state : storage location → value
OR
environment : name → l-value
state : l-value → r-value
The environment and state are different, e.g. an
assignment changes the state but not the environment,
i.e. after
float pi = 0;
the assignment
pi = 3.14;
changes the value contained in the location of pi, but
the address associated with pi (say 00AB8h) remains the same.

Thus an assignment statement changes the r-value while the l-value remains
the same.

Runtime Organization Of The Storage:

The following organization of the storage at runtime applies to
languages such as FORTRAN, Pascal and C.
When a compiler obtains a block of storage from the operating system
for a compiled program to run in, the runtime storage might be divided to
hold:
1 – Generated Target Code: The size of the target code doesn't change,
so it can be placed in a statically determined area.

2 – Data Objects: These can be static as well as dynamic.

Static data objects are kept in a statically
determined area adjacent to the code. Dynamic data objects
require another area of memory, obtained from a reservoir of
memory locations called the Heap.
3 – STACK: It is used during the activation of procedures and functions.
The information regarding the execution of procedures is saved on the
stack. This includes the IP and other registers.
The size of the stack keeps on changing, and
the sizes of the stack and heap can change as
the program executes. This is why the stack and heap are placed at
opposite ends of the memory:

+---------------+
|     Code      |
+---------------+
|  Static Data  |
+---------------+
|     STACK     |
|      |        |
|      v        |
|               |
|      ^        |
|      |        |
|     Heap      |
+---------------+

Activation Record:
DEF "An activation record is used to store the information regarding a
single execution of a procedure or function."
In languages like Pascal, C, etc. the activation record is usually pushed onto
the runtime stack when a procedure is called. It is popped off when control
returns to the caller. The activation record is a record / structure having the
following fields (a C sketch of such a record follows this list):

+-----------------------+
|      Return Value     |
|   Actual Parameters   |
| Optional Control Link |
|  Optional Access Link |
|  Saved Machine Status |
|       Local Data      |
|      Temporaries      |
+-----------------------+

1 – Temporaries: The intermediate values of the
expressions are evaluated and placed in temporary
variables.
2 – Local Data: This field contains the data local to
the execution of a procedure / function.
3 – Saved Machine Status: It holds the information
about the state of the machine just before calling the
procedure, i.e. this field stores the values of the IP and
other registers.
4 – Optional Access Link: Refers to non-local data held in other
activation records. This field is not required in the case
of FORTRAN, but is required in the case of Pascal, C, etc.
5 – Optional Control Link: Points to the activation record of the calling
procedure / function.
6 – Actual Parameters: This field is used by the calling procedure to
supply actual parameters.
7 – Return Value: This is used by the called procedure to return a value
to the calling procedure.
All of these fields are not always used. Sometimes registers are used in
place of some of the fields.
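As a minimal C sketch of such a record (an assumed layout for illustration; real layouts are machine and language dependent):

/* one activation record, pushed on the runtime stack at each call */
struct activation_record {
    long   return_value;          /* used by the callee to return a value    */
    long   actual_params[4];      /* supplied by the caller (size assumed)   */
    struct activation_record *control_link; /* caller's activation record    */
    struct activation_record *access_link;  /* record holding non-local data */
    void  *saved_ip;              /* saved machine status: IP ...            */
    long   saved_regs[8];         /* ... and other registers                 */
    long   locals[8];             /* data local to this execution            */
    long   temporaries[8];        /* intermediate values of expressions      */
};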

Note: Functions return a value, but procedures/routines do not.

Symbol Table:
DEF “A Symbol Table is a data structure which contains information about
tokens and their attributes.”


A compiler uses the symbol table to keep track of scope and binding
information about names. Every time a new name is encountered the symbol
table is searched. Changes to the table occur when a new name, or new
information about an existing name, is discovered. Thus the symbol table
mechanism should allow us to:
1 – Enter new information.
2 – Modify existing information.
Both operations should be done efficiently.
Consider two widely used data structures for the symbol table:
1 – Linear List: simple, but not efficient if the number of items is large.
2 – Hash Table: more complex, but efficient.
The entries to be made in the symbol table are not uniform. To make the
entries uniform, each entry is defined as a structure with two fields:
I – Name,
II – Pointer to the information associated with this name.

        Name | Pointer --> Information

The information itself is stored somewhere else, and is entered into the
symbol table in several steps. For example, keywords may be entered when
the table is initialized, even before invoking the lexical analyzer.
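A uniform two-field entry might look like this in C++ (a sketch; the field
names and sizes are assumptions, not from the notes):

struct SymbolInfo                  // stored elsewhere, pointed to by the entry
{
    char type[16];                 // e.g. "int", "float", "keyword"
    int  rel_address;              // relative address in the data area
};

struct SymbolEntry
{
    char        name[32];          // field I  - the identifier itself
    SymbolInfo *info;              // field II - pointer to its information
};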

1 – The List Data Structure (Linear lists):


A linear list stores the entries one after another. Each entry has
two fields:
I – Identifier,
II – Information.
A pointer, AVAILABLE, marks the end of the list. The search is done in the
backward direction; if we reach the beginning of the list, the name is not
in the list, and this name and the related information must be entered into it.
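A minimal sketch of this lookup, assuming a fixed-size array and an
AVAILABLE index (all names are illustrative):

# include <cstring>

struct Entry { char name[32]; void *info; };

Entry list_[1000];
int   available = 0;                 // marks the end of the list

int lookup(const char *s)            // search backwards from the end
{
    for (int i = available - 1; i >= 0; --i)
        if (strcmp(list_[i].name, s) == 0)
            return i;                // found: position of the name
    return -1;                       // reached the beginning: not present
}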
Making an entry for a name and searching for a name are independent
activities. Suppose a symbol table contains n names. If we make an insertion
without checking for the name then the work is constant. If multiple entries
of a name are not allowed then a name is entered only if it is not already
present in the symbol table. To find an existing name in a list we have to
search half of the list on the average, so the effort required to make an
inquiry about an existing item in a list of n items is proportional to n/2:
Effort for one inquiry ∝ n/2 ≈ Cn,
where C is a constant.
Effort for making e inquiries = eCn

Effort required to enter one name ∝ n, since a new name is entered only
if it does not already exist anywhere in the list, i.e.
≈ Cn
Effort required to enter n names into the list:
≈ n * Cn
≈ Cn²
Total effort for n entries and e inquiries:
≈ Cn² + eCn
≈ ( n + e )Cn
In a medium size program we might have n = 100, e = 1000:
Effort = ( 100 + 1000 ) * 100 C
≈ 1100 * 100 C
≈ 110,000 C = E (say)
If the size of the program becomes 10 times larger, then
n’ = 1000; e’ = 10,000
Effort = ( n’ + e’ ) n’ C
≈ ( 1000 + 10,000 ) 1000 C
≈ 11,000,000 C
≈ 100 ( 110,000 C )
≈ 100 E
Thus, if the size of a program increases 10 times, the effort required to
compile that program increases 100 times.
Here lies the inefficiency.
2 – Hash Table:
We shall consider the Open Hash Table, i.e.
I – A hash table consisting of an array of n pointers.
II – m separate linked lists called buckets ( m <= n ).

(Figure: a Closed Hash Table stores the data directly in the array slots
0 .. 210, while an Open Hash Table stores in each slot a pointer to a
linked list of data records ending in NULL.)
Each record appears in exactly one of the lists. To convert an entry s
into an index of the array we apply a hash function h(s), which returns an
integer between zero and (n-1). If s is in the symbol table then it is in
the list h(s); otherwise the element is inserted at the front of that list.
As a rule of thumb, if n names have been entered into the table, the average
length of a list is n / m. Hence for a hash table the effort for e inquiries
is eCn / m and the effort for an entry is Cn / m, because the length of each
list reduces to n / m items.
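A small open-hashing sketch (the hash function and all names here are
assumptions for illustration, not from the notes):

# include <cstring>

const int M = 211;                   // number of buckets (a prime)

struct Bucket { char name[32]; Bucket *next; };

Bucket *table_[M];                   // array of M pointers, initially NULL

unsigned h(const char *s)            // simple multiplicative hash
{
    unsigned v = 0;
    while (*s) v = v * 31 + (unsigned char)*s++;
    return v % M;                    // index between 0 and M-1
}

Bucket *lookup_or_insert(const char *s)
{
    unsigned i = h(s);
    for (Bucket *p = table_[i]; p != NULL; p = p->next)
        if (strcmp(p->name, s) == 0)
            return p;                // already in the list h(s)
    Bucket *q = new Bucket;          // not found: insert at the front
    strncpy(q->name, s, 31); q->name[31] = '\0';
    q->next = table_[i];
    table_[i] = q;
    return q;
}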

Effort for e inquiries and n entries respectively:
E = eCn/m + nCn/m
≈ ( e + n )Cn/m
where m can be made as large as we like. This method is more efficient
than the linked list. Suppose m = n / 2, then
E = ( e + n )Cn / ( n/2 )
≈ 2 ( e + n )C
For a medium size program:
n = 100; e = 1000
E = 2 ( e + n ) C
≈ 2 ( 1000 + 100 ) C
≈ 2 ( 1100 ) C
≈ 2200 C = E (say)
If the program becomes 10 times larger, then on the average:
n = 1000; e = 10,000
Now, E = 2 ( e + n ) C
≈ 2 ( 10,000 + 1000 ) C
≈ 2 ( 11,000 ) C
≈ 22,000 C
≈ 10 ( 2200 C )
≈ 10 E
This time the effort also becomes only 10 times larger; thus the effort
varies linearly with the size of the program.

Space Required:
1 – n words for hash table.
2 – bn words for n entries, where b is the space required for one entry.

Formal & Actual Parameter:


DEF “The arguments or parameters used in the definition of a function are
called Formal Parameters.”
Example Consider the following function that computes the volume of a
cylinder:
float volume( int r, int h )
{
float v;                   // float, so the fractional part is not lost
v = M_PI * r * r * h;      // M_PI comes from <cmath>
return v;
}

OR
float volume( int r, int h )
{
return M_PI * r * r * h;
}
In this example r and h are formal / dummy parameters.
Values of these parameters must be supplied when the function is
called. These values are called Actual Parameters.
Exercise Find out the Actual and Formal Parameters in the following
functions definitions:
1–
# include < ---- >
inline int square( int x ){ return x * x; }

void main(void)
{
int a = 5, b;
b = square(a);
cout<<“The square of “<<a <<” is “<<b;
}

2–
# include < ---- >
inline int square( int x ){ return x * x; }

void main(void)
{
cout<<“The square of “<<5 <<” is “<<square(5);
}

Parameter Passing:
DEF “Associating actual parameters with formal parameters at the time of a
function call is called Parameter Passing.”

There are four methods of parameter passing, i.e.


1 – Call by value,
2 – Call by reference,
3 – Call by name,
4 – Copy restore.
It is important for the compiler writer to know the methods of
parameter passing used by a language.

1 – Call By Value:
The formal parameters are treated as local variables so that there is
room for them in the activation record. When the function or procedure is
called the actual parameters are copied into the formal parameters.


All subsequent processing takes place on the copies; therefore the
actual parameters remain unchanged.
Example Consider the following program:
# include < ---- >
void swap(int x, int y);

void main(void)
{
int a = 5, b = 6;
cout <<”Before swapping the value of a is “<<a
<<” and the value of b is “<<b<<endl;
swap( a, b );
cout <<”After swapping the value of a is “<<a
<<” and the value of b is “<<b;
}
void swap(int x, int y)
{
int temp;
temp = x;
x = y;
y = temp;
}
Exercise What is the output of the above program?
Although the values of the formal parameters have changed, this change is
not reflected in the actual parameters, whose values remain intact.
The distinguishing feature of this method is that the values of the actual
parameters don’t change.
Note: The local variables of a procedure or function are created when that
procedure or function is activated, and are discarded when it terminates.

2 – Call By Reference:
In this method the addresses (l-values) of the actual parameters are
passed to the formal parameters, so that each pair of actual and formal
parameters refers to the same location. In this case the effects of
processing are reflected in the actual parameters.

Example Consider the following program:


# include < ---- >
void swap(int &x, int &y);


void main(void)
{
int a = 5, b = 6;
cout <<”Before swapping the value of a is “<<a
<<” and the value of b is “<<b<<endl;
swap( a, b );
cout <<”After swapping the value of a is “<<a
<<” and the value of b is “<<b;
}

void swap(int &x, int &y)


{
int temp;
temp = x;
x = y;
y = temp;
}

Now the interchange of values x and y is reflected in a and b.


Exercise What is the output of the above program?
The distinguishing feature of this method is that changes to the formal
parameters are reflected in the actual parameters.

3 – Call By Name:
This method works as follows:
1 – The procedure is treated as a macro, i.e. its body is substituted
for the call in the caller with actual parameters literally
substituted for the formals. Such a literal substitution is called
Macro Expansion or Inline Expansion.
2 – The local names of the called procedure are kept distinct from
the names of the calling procedure. We can think of each local
name of the called procedure being systematically renamed into
a distinct new name before macro expansion is done.
3 – The actual parameters are placed within parenthesis if
necessary to preserve their integrity.
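A C-style macro gives the flavour of call by name: the text of the actual
parameter is substituted literally, which is why point 3 above matters. An
illustrative sketch (the macro and names are made up):

# define SQUARE(x) ((x) * (x))    // parentheses preserve the argument's integrity

int main(void)
{
    int a = 2, b = 3;
    int s = SQUARE(a + b);        // expands to ((a + b) * (a + b)) = 25;
                                  // without parentheses: a + b * a + b = 11
    return s;
}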

4 – Copy Restore:
It is a hybrid between call by value and call by reference (also known
as COPY-IN COPY-OUT).
The working of this method is:


1 – Before control is transferred to the called procedure the actual
parameters are evaluated. The actual parameters are passed by value
to the called procedure (i.e. the r-values of the actual parameters
are copied into the r-values of the formal parameters). In addition,
the l-values of the actual parameters are also determined before the
call.
2 – Just before the termination of the called procedure the contents
of the formals are copied back into the actual parameters, so
that the results of processing appear in the actual parameters.
This method is used by some implementations of FORTRAN.
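Copy-restore can be simulated in C++ like this (a sketch; the helper
function and its names are hypothetical, not a real FORTRAN mechanism):

# include <iostream>
using namespace std;

void increment(int x, int *x_home)   // x receives the copied-in r-value
{
    x = x + 1;                       // all work happens on the local copy
    *x_home = x;                     // copy back just before returning
}

int main(void)
{
    int a = 5;
    increment(a, &a);                // l-value of a determined before the call
    cout << "a is now " << a;        // prints: a is now 6
    return 0;
}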


Chapter 6
Code Optimization


Let us translate x = x + 1 into two Assembly routines, i.e.


1– MOV AX, x
ADD AX, 1
MOV x, AX
2– INC x ;Machine having increment instruction


Chapter 7
Code Generation
The final phase of a compiler is the code generator. The input to this
phase is an intermediate representation (IR) of the source program with
optimized code, and its output is an equivalent program in a target language.

Source  Front IR Code IR … Code Target


Program End  Optimizer  … Generation Program

Error
Handler
Symbol
Table

Position of Code Generator:

It is required that:
1– The generated code must be correct.
2– The generated code must be of high quality (i.e. it makes effective
use of resources).
3– The code generator itself should be efficient.

Design Issues Of A Code Generator:

1– Input to the code generator: It consists of:


a– IR produced by the front end of the compiler.
b– Information in the symbol table, which is used to determine the run-time
addresses of the data objects denoted by the names in the IR.
As we know, there are several choices for the IR, such as postfix notation
(linear representation), stack machine code (virtual machine representation),
TAC, syntax trees, or DAGs (graphical representations). Although the
algorithms here are couched in terms of TAC, many of the techniques apply to
the other representations as well.
We assume that the front end has produced an error-free and detailed IR of
the source program, so that the values of the names appearing in the IR can
be represented by quantities that the target machine can directly manipulate
(like bits, bytes, integers, pointers etc.). We also assume that type
checking has taken place and that type conversions have been inserted where
necessary. (In some compilers this kind of type checking is done together
with code generation.)

2 – Target Program: The target program is the output of the code generator.
Like the IR it can have many forms, i.e.
a – Absolute Machine Code can be placed in a fixed location in
memory and immediately executed.
b – Relocatable Machine Code allows subprograms to be compiled
separately and then linked and executed.
c – Assembly Code makes the process of code generation easier.
We can generate symbolic instructions and use the assembler
to generate the final code.

3 – Memory Management: Mapping names in the source program to the
addresses of data objects in runtime memory is done co-operatively by the
front end of the compiler and the code generator. The type in a declaration
determines the width, i.e. the amount of storage needed for the declared
object. From the symbol table information a relative address can be
determined for each name in the data area of the procedure. If machine code
is being generated then each label in a Three Address Statement (TAS) must
be converted into the address of an instruction.

4 – Instruction Selection: The instruction selection depends upon the
instruction set of the target machine. If the target machine doesn’t support
each data type of the HLL, then each exception to the general rule must be
handled specially.
If we don’t care about the efficiency of the target program then
instruction selection is simple. For each type of TAC we can design a code
skeleton that outlines the target code to be generated for that construct.
For example, every TAS of the form:
x=y+z
where x, y and z are statically allocated,
can be translated as follows:
MOV AX, y
ADD AX, z
MOV x, AX
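Such a skeleton can be realized as a tiny emitting routine (a hypothetical
sketch; the routine name and printf-based output are assumptions, not the
compiler’s actual interface):

# include <cstdio>

// Emit the code skeleton for a TAS of the form  x = y + z
void gen_add(const char *x, const char *y, const char *z)
{
    printf("MOV AX, %s\n", y);
    printf("ADD AX, %s\n", z);
    printf("MOV %s, AX\n", x);
}

// gen_add("a", "b", "c"); gen_add("d", "a", "e");  would produce the six
// statements discussed below, including the redundant reload of a.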
Unfortunately, this statement-by-statement code generation produces
poor code. For example, consider the following two statements:
a=b+c
d=a+e
This would be translated into Assembly language as follows:
Statement 1
1 MOV AX, b


2 ADD AX, c
3 MOV a, AX
Statement 2
4 MOV AX, a
5 ADD AX, e
6 MOV d, AX
In this code the 4th statement is redundant, since the value of a is
already in register AX (and so is the 3rd if a is not used later).

The quality of the generated code is determined by its speed and size.
A target machine with a rich instruction set may provide several ways of
doing the same thing. Since the cost differences between different
implementations may be significant, a naïve translation may lead to correct
but unacceptably inefficient code. For example, if the target machine has an
INC instruction, then
we can translate x = x + 1 into two Assembly routines, i.e.
1 – MOV AX, x
ADD AX, 1
MOV x, AX
2 – INC x ;Machine having increment instruction
The second one is more efficient.
The selection of a best Assembly code sequence for a three address
construct also depends upon the context in which the construct appears.

Instruction Cost:
Cost of an instruction
= 1 + cost associated with the source and destination addressing modes
This cost corresponds to the length, in words, of the instruction.
Addressing modes involving:
a – Registers have cost zero.
b – A memory location has cost 1.
c – Literals (constants) have cost 1.
In order to save space we must minimize the length of an instruction.
This has an additional benefit as well.
For most machines and for most instructions the time taken to fetch an
instruction from memory exceeds the time spent executing it. Thus by
minimizing the instruction length we can also minimize the time taken to
perform that instruction.
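The cost model above can be captured in a few lines (an illustrative
sketch; the mode codes and function names are made up):

// Mode codes (assumed): 'r' register, 'm' memory, 'l' literal
int mode_cost(char mode)
{
    if (mode == 'r') return 0;    // register: cost 0
    return 1;                     // memory location or literal: cost 1
}

int instruction_cost(char dst, char src)
{
    return mode_cost(dst) + mode_cost(src) + 1;   // +1: the instruction word
}

// instruction_cost('r','r') == 1    e.g. MOV AX, BX
// instruction_cost('m','r') == 2    e.g. MOV a, AX
// instruction_cost('r','l') == 2    e.g. ADD AX, 1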
Consider the following examples:
MOV AX, BX
Cost associated with source = 0
Cost associated with destination = 0
Cost associated with source and destination modes = 0
Cost of instruction = 0 + 1 = 1 (the instruction itself occupies one word).


MOV a, AX
Cost of instruction = (1 + 0) + 1(execution cost) = 2
ADD AX, 1
Cost of instruction = (0 + 1) + 1(execution cost) = 2
We assume that the machine is byte-addressable, has n general-purpose
registers, and a four-byte word.

Consider the following statement:
a = b + c
where a, b and c are simple variables located in distinct memory
locations. This statement can be implemented in the following ways:
1 – MOV AX, b ;Cost = (0 + 1) + 1(execution cost) = 2
ADD AX, c ;Cost = (0 + 1) + 1(execution cost) = 2
MOV a, AX ;Cost = (1 + 0) + 1(execution cost) = 2
Total cost of this implementation = 6
2 – MOV a, b* ;Cost = (1 + 1) + 1(execution cost) = 3
ADD a, c ;Cost = (1 + 1) + 1(execution cost) = 3
Total cost of this implementation = 6
* In some advanced architectures a memory-to-memory move is allowed.
3 – If SI, DI, BP contain the addresses of a, b and c respectively, then we
can use:
MOV AX, [DI] ;Cost = (0 + 0) + 1(execution cost) = 1
ADD AX, [BP] ;Cost = (0 + 0) + 1(execution cost) = 1
MOV [SI], AX ;Cost = (0 + 0) + 1(execution cost) = 1
Total cost of this implementation = 3
4 – If AX and BX contain the values of b and c respectively, and b is not
needed after the assignment, then:
ADD AX, BX ;Cost = (0 + 0) + 1(execution cost) = 1
MOV a, AX ;Cost = (1 + 0) + 1(execution cost) = 2
Total cost of this implementation = 3
Thus, in order to generate good code for a machine, its addressing
capabilities must be used efficiently.
