Module 1

The document outlines the course on Compiler Construction at Amity School of Engineering and Technology, detailing its objectives, learning outcomes, and syllabus. It covers essential topics such as compiler phases, lexical analysis, and tools like LEX and YACC, while also emphasizing the importance of understanding programming language translation and optimization techniques. The course aims to equip students with practical skills in designing and implementing compilers, alongside theoretical knowledge of compiler structures and processes.

Amity School of Engineering and Technology

COMPILER CONSTRUCTION
Module I
Introduction to Compilers
Dr. A. K. Jayswal
ASET(CSE)
PhD(CSE)-JNU
MTech(CSE)-JNU
GATE(CS),
UGC-NET(CS)

Module I:
❑ Introduction to Compilers
❑ Cousins of the Compiler
❑ Phases of a Compiler
❑ Lexical Analysis
❑ Finite State Machines, Regular Expressions (R.E.)
❑ Compiler writing tools: LEX, YACC
❑ CFG: Derivation, Ambiguity
Course Title: Compiler Construction [CSE304]

Course Objectives:

The objective of this course is to describe the use of formal grammars by parsers, especially bottom-up and top-down approaches and their associated algorithms, and to teach techniques for designing parsers using appropriate software. The course covers the theory and practice of programming language translation, compilation, and run-time systems, organized around a significant programming project: building a compiler for a simple but nontrivial programming language. Students will learn to understand, design, and implement a parser; to design code generation schemes; and to understand code optimization and run-time environments.
Pre-requisites: Computer architecture or equivalent; data structures and algorithms or equivalent; systems programming or equivalent; familiarity with Java.
Course Title: Compiler Construction [CSE304]

Course Learning Outcomes:


At the end of this course, the student will be able to
1. Describe compiler concepts and their utilization.
2. Design various types of compilers.
3. Analyze and implement SLR and LALR parsing techniques.
4. Synthesize code generation techniques.
5. Demonstrate the process of implementing source-code optimization.
6. Understand the structure of compilers.
7. Understand the basic techniques used in compiler construction, such as lexical analysis, top-down and bottom-up parsing, context-sensitive analysis, and intermediate code generation.
8. Understand the basic data structures used in compiler construction, such as abstract syntax trees, symbol tables, three-address code, and stack machines.
9. Design and implement a compiler using a software engineering approach.
Course Contents/Syllabus

Assessment: Theory/Lab


Recommended Reading
Textbooks:
• Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, “Compilers: Principles, Techniques, and Tools”, Second Edition, Pearson, 2007.

• A. A. Puntambekar, “Compiler Construction”, Technical Publications, 2009.

Reference Book:
• Des Watson, “A Practical Approach to Compiler Construction”, First Edition, Springer, 2017.
Amity School of Engineering and Technology

OBJECTIVES

After completing this section (Module I), you will be able to
1.1 Understand the working of a compiler.
1.2 Differentiate between the phases of a compiler.
1.3 Explain the lexical analyzer.
1.4 Analyze and differentiate the compiler writing tools LEX and YACC.
1.5 Explain finite state machines, regular expressions, and ambiguity in CFGs.

Module-1 Assessment

• Quiz (conceptual and numerical based): 2 marks

• Assignment: 2 marks

❑ All computers understand only machine language.

Therefore, high-level language (HLL) instructions must be translated into machine language prior to execution.

Translator
A translator is a program that takes as input a program written in one programming language (the source program) and translates it into an equivalent program in another language (the target language).

Source program → Translator → Target program

A compiler is a translator.

Compiler
A compiler is software that converts a program written in a high-level language (the source language) into a low-level language (the object/target/machine language).

Language processing systems (using a compiler)

Key responsibilities of the preprocessor are:

Macro expansion: macros are defined using #define directives, which are used to define constants or to create short, reusable code. The preprocessor replaces macro invocations with their definitions.

File inclusion: preprocessor directives such as #include are used to include header files in the source program. The preprocessor is responsible for inserting the contents of the included files into the source program.

Note: the preprocessor removes these directive lines and substitutes the entire contents of the header files into the source program.

Cousins of the Compiler (other translators)

1. Interpreter: a program that executes other programs. Instead of producing a target program as a translation, it executes the source program statement by statement.

An interpreter is slower than a compiler but provides better error diagnostics.

Compiler Vs Interpreter


Cousins of the Compiler (other translators)

2. Preprocessor: translates a program written in one high-level language into another high-level language program.

3. Assembler: translates an assembly language program into a machine language program.

Structure of a Compiler
There are two phases of the whole compilation process:
• Analysis (machine independent / language dependent)
• Synthesis (machine dependent / language independent)

Compilation is a complex process, so it is partitioned into a series of sub-processes called phases.
Phases: a phase is logically an operation which takes as input one representation of the source program and produces as output another representation.
There are six different phases of a compiler.

Structure of a Compiler


Structure of a Compiler

Semantic analysis does the type checking and generates an annotated parse tree (a parse tree with semantic actions, called an SDT, or equivalently a parse tree plus data-type information).

Output of the compiler phases:

Front-end and Back-end of a compiler:

❑ The front-end phases are lexical, syntax, and semantic analysis. These form the "analysis phase", as they all perform some kind of analysis.

❑ The back-end phases are called the "synthesis phase", as they synthesize the intermediate and target language, and hence the program, from the representation created by the front-end phases. The advantages are that not only can lots of code be reused, but also, since the compiler is well structured, it is easy to maintain and debug.

(Numerical questions) Lexical Analysis (number of tokens)

Q.1 printf("%d Hai", &x);

Q.2 int max(x, y)
int x, y;
/* find the maximum of x and y */
{
return x > y ? x : y;
}

Q.3: main
{
a = b +++----+++==; }
Q.4: main
{
int a = 10;
char b = "abc";
in t c = 30;
ch ar d = "xyz";
in /* comment */ t m = 40.5; }
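The token counts asked for above can be checked mechanically. A minimal sketch of a regex-based counter for C-like fragments, in Python; the token classes and regular expressions here are illustrative assumptions (whitespace and comments are skipped, a string literal counts as one token), not the course's official answer key:

```python
import re

# Illustrative token specification: tried in order at each position.
TOKEN_RE = re.compile(r"""
    /\*.*?\*/            |   # comment (not a token)
    \s+                  |   # whitespace (not a token)
    "[^"]*"              |   # string literal: one token
    \d+(\.\d+)?          |   # numeric constant
    [A-Za-z_]\w*         |   # identifier or keyword
    \+\+|--|==|[-+*/=<>?:;,&(){}]    # operators and punctuation
""", re.VERBOSE)

def count_tokens(code):
    """Count lexical tokens, skipping whitespace and comments."""
    n = 0
    pos = 0
    while pos < len(code):
        m = TOKEN_RE.match(code, pos)
        if not m:                 # unknown character: count it as one token
            pos += 1
            n += 1
            continue
        text = m.group(0)
        if not text.isspace() and not text.startswith("/*"):
            n += 1
        pos = m.end()
    return n

print(count_tokens('int a = 10;'))   # int, a, =, 10, ;  -> 5
```

Under this specification, Q.1 above comes out to 8 tokens: printf, (, the string literal, the comma, &, x, ), and ;.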


Analysis-Synthesis model of a compiler

We basically have two parts of a compiler:
1. The analysis part
2. The synthesis part

1. The analysis phase creates an intermediate representation from the given source code:
• Lexical analyzer
• Syntax analyzer
• Semantic analyzer
2. The synthesis phase creates an equivalent target program from the intermediate representation:
• Intermediate code generator
• Code optimizer
• Code generator

Grouping of Phases
Phases of Compiler are grouped into two:
1. Front End
2. Back End

1. Front End: It consists of those phases, or parts of phases, that primarily depend on the source language and are independent of the target machine. It includes lexical and syntax analysis, creation of the symbol table, semantic analysis, and generation of intermediate code. It also includes the error handling that goes along with each of these phases.
2. Back End: It includes those phases, or parts of phases, that depend on the target machine and are independent of the source language. It includes code optimization and code generation, along with the necessary error handling and symbol table operations.

Example


Some Important terms (related to lexical analyzer):


1. Pattern: a rule which describes the set of lexemes that can represent a particular token in the source program. E.g., an identifier can be described as a letter followed by letters or digits.

2. Lexeme: lexemes are the smallest logical units of a program. A lexeme is a sequence of characters in the source program for which a token is produced. E.g.: 10, int, +, etc.

3. Token: a sequence of characters that can be treated as a unit in the grammar of the programming language.
Classes of similar lexemes are identified by the same token.
E.g.: identifier, keyword, operator, constant, delimiter, etc.
A pattern is a rule; the pattern for id is letter(letter|digit)*.

Example: FORTRAN statement E = M * C ** 2, where the identifiers E, M, and C are entered into the symbol table.

Notation: r+ means one or more instances; r? means zero or one instance.

Regular definitions: token recognition can be done using regular definitions of the form
D1 → r1
D2 → r2
...
where each Di is a new symbol not in ∑ and each ri is a regular expression over ∑.

No token is produced for whitespace (ws): blank, tab, newline, etc.
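The identifier pattern letter(letter|digit)* can be tried out directly. A small Python sketch using the re module; the character-class spelling of "letter" and "digit" is the usual ASCII assumption:

```python
import re

# letter(letter|digit)* : a letter followed by any mix of letters and digits
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*\Z")

def is_identifier(s):
    """Return True iff `s` matches the identifier pattern exactly."""
    return ID.match(s) is not None

print(is_identifier("count1"))   # True
print(is_identifier("1count"))   # False: must start with a letter
```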

Scanning the Input (concept of buffer scheme)

The source program lies in the input buffer. For example, consider the statement E=M*C**2.
Two pointers are used to recognize the lexeme:
lexemeBegin: marks the beginning of the current lexeme, whose extent we are attempting to determine.
forward: scans ahead until a pattern match is found.

Scanning the Input (concept of buffer scheme)


The lexical analyzer scans the given input from left to right, one character at a time. It uses two pointers:
1. the begin pointer (bp), or lexemeBegin, and
2. the forward pointer (fp), to keep track of how much of the input has been scanned.

Initially both pointers point to the first character of the input string.
fp moves ahead to find whitespace (i.e., the end of the lexeme); after encountering whitespace, bp and fp are set to the next token.
Reading characters from secondary storage is very costly, hence a buffering technique is used: a block of data is first placed from main memory into a buffer, and then passed from the buffer to the lexical analyzer.
There are two methods used in this context:
1. the one-buffer scheme, and
2. the two-buffer scheme.

One Buffer Scheme


In this scheme, only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long compared to the length of the buffer, it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.

Two Buffer Scheme


To overcome the problem of the one-buffer scheme, a two-buffer scheme is used. In this method two buffers are used to store the input string, and they are scanned alternately: when the end of the current buffer is reached, the other buffer is filled. To identify the boundary of the first buffer, an end-of-buffer (eof) character is placed at its end.

Similarly, the end of the second buffer is recognized by the end-of-buffer mark at its end. When fp encounters the first eof, the end of the first buffer is recognized and filling of the second buffer starts. In the same way, when the second eof is reached, it indicates the end of the second buffer. Alternately, the buffers are filled until the end of the input program, and the stream of tokens is identified. The eof character introduced at the end is called a sentinel and is used to identify the end of a buffer.
Thus, both tests (end of buffer and end of input) can be combined by extending each buffer to hold a sentinel character (eof) at the end.
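The combined sentinel test can be sketched in a few lines. This Python simulation is illustrative (buffer size, names, and the choice of "\0" as sentinel are my assumptions): the forward pointer advances until it hits the sentinel at the end of the current buffer, at which point it either swaps to the other, already-filled buffer or stops at the real end of input.

```python
SENTINEL = "\0"   # the 'eof' sentinel; assumed not to occur in the source text

def two_buffer_chars(source, bufsize=4):
    """Yield the characters of `source`, simulating the two-buffer scheme:
    each buffer ends in a sentinel, and hitting that sentinel means either
    'switch to the other (already filled) buffer' or 'real end of input'."""
    pos = 0
    def load():
        nonlocal pos
        chunk = list(source[pos:pos + bufsize])
        pos += len(chunk)
        chunk.append(SENTINEL)        # sentinel terminates every buffer
        return chunk
    cur = load()                      # buffer currently being scanned
    nxt = load()                      # the other buffer, filled ahead
    i = 0                             # the 'forward' pointer
    while True:
        ch = cur[i]
        if ch == SENTINEL:
            if i == len(cur) - 1 and len(nxt) > 1:
                cur, nxt, i = nxt, load(), 0   # end of buffer: swap and refill
                continue
            return                             # end of input
        yield ch
        i += 1

print("".join(two_buffer_chars("E=M*C**2", bufsize=3)))   # E=M*C**2
```

A real scanner must additionally check that a sentinel seen mid-buffer is a genuine character of the source; here we simply assume "\0" never occurs in the input.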
LEX (lexical analyzer generator):
▪ LEX is a software tool to generate a lexical analyzer (the 1st phase of a compiler). It is based on regular expressions.
▪ YACC is a tool to generate a syntax analyzer, which checks syntax (the 2nd phase of a compiler). It is based on a grammar (production rules).
Construction of a lexical analyzer with the LEX tool:

❑ First, a Lex specification or Lex source file (lex.l) is prepared. It consists of regular expressions with associated Lex actions (the Lex language).
❑ Then the Lex compiler is used to convert the Lex specification file (lex.l) into the C source file lex.yy.c (the lexical analyzer as a C program). To run anything we need an executable, so a C compiler is used to convert this C file (lex.yy.c) into an executable (binary) file, called a.out.
❑ This executable a.out is now ready to generate tokens: any input stream fed to it produces a sequence of tokens, e.g., identifiers, keywords, operators, etc.
Structure of a Lex program:

The declaration section starts with %{ and ends with %}, and holds information such as #include <stdio.h> and any constant or variable declarations.
In the translation (rules) section, each pattern is defined with an associated action inside braces { }. An action is C code.
The auxiliary section starts with main() and contains the yylex() function; the job of this function is to transfer execution to the translation section. Program execution always starts with the main program.

Lex pattern notation: square brackets [abc] mean a or b or c.
How to write patterns:
[a-z]: any letter in the range a to z
[a\-z]: an occurrence of a, - (literal hyphen), or z
[^ab]: negation (anything except a and b)
?: zero or one occurrence
|: the or operator

For example, vowels are represented as [aeiou], and consonants as [b-d f-h j-n p-t v-z].
Lex compiler:

Compiling a Lex program takes 3 steps:
❑ The Lex file is prepared and saved with the .l extension.
❑ cc invokes the C compiler on the generated lex.yy.c.
❑ The flag -lfl links the Lex (flex) library, libfl, which supplies default definitions such as main() and yywrap() that the generated scanner needs.
A simple LEX program for C "tokens"

For example, suppose we enter the string int a, b, c;
The generated tokens are:
keyword, identifier, special symbol, identifier, special symbol, identifier, special symbol.
YACC (yet another compiler-compiler)
Structure of YACC program:
Executing a YACC program:
Compiling a YACC program:
For Example:
Consider a grammar to accept the language
L = {set of all strings starting with 01}
Lex file abc.l
YACC program
Now compile the program by using the following steps:
Amity School of Engineering and Technology

Formal Definition:



Context Free Grammar

Definition. A context-free grammar is a 4-tuple (∑, NT, R, S), where:

• ∑ is an alphabet (each character in ∑ is called a terminal)
• NT is a set (each element of NT is called a nonterminal)
• R, the set of rules, is a subset of NT × (∑ ∪ NT)*
• S, the start symbol, is one of the symbols in NT

If (α, β) ∈ R, we write the production α → β

β is called a sentential form

Examples of
context Free
Grammar:

Language of
a Grammar

Example:

Context Free
Language
(Definition):

Example

Another Example


Another Example


Derivation Order and Parse Tree


Derivation Order and Parse Tree


Parse Tree
A parse tree of a derivation is a tree in which:

• Each internal node is labeled with a nonterminal

• If a rule A → A1A2…An occurs in the derivation then A is a parent node of nodes labeled A1, A2, …, An

(The slide shows an example parse tree whose leaves, read left to right, are a, a, b, ε.)

Parse Tree
S → A | AB            Sample derivations:
A → ε | a | Ab | AA   S ⇒ AB ⇒ AAB ⇒ aAB ⇒ aaB ⇒ aabB ⇒ aabb
B → b | bc | Bc | bB  S ⇒ AB ⇒ AbB ⇒ Abb ⇒ AAbb ⇒ Aabb ⇒ aabb

These two derivations use the same productions, but in different orders. This ordering difference is often uninteresting; derivation trees give a way to abstract away ordering differences.

          S           Root label = start symbol.
        /   \
       A     B        Each interior label = variable.
      / \   / \
     A   A b   B      Each parent/child relation = derivation step.
     |   |     |
     a   a     b      Each leaf label = terminal or ε.

All leaf labels together = derived string = yield (here aabb).

Leftmost, Rightmost Derivations


Definition. A left-most derivation of a sentential form is one in which rules transforming
the left-most nonterminal are always applied

Definition. A right-most derivation of a sentential form is one in which rules transforming


the right-most nonterminal are always applied


Leftmost, Rightmost Derivations


S → A | AB            Sample derivations:
A → ε | a | Ab | AA   S ⇒ AB ⇒ AAB ⇒ aAB ⇒ aaB ⇒ aabB ⇒ aabb
B → b | bc | Bc | bB  S ⇒ AB ⇒ AbB ⇒ Abb ⇒ AAbb ⇒ Aabb ⇒ aabb

These two derivations are special:
the 1st derivation is leftmost (it always expands the leftmost variable);
the 2nd derivation is rightmost (it always expands the rightmost variable).
Both correspond to the same parse tree, with yield aabb.
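A leftmost derivation is mechanical enough to replay in code. The sketch below uses a hypothetical helper (uppercase letters as nonterminals, "" standing for ε) and reproduces the first sample derivation; note the helper replaces the leftmost occurrence of the *named* nonterminal, which in this particular derivation is always the leftmost nonterminal overall:

```python
def leftmost_step(form, nonterminal, rhs):
    """Replace the leftmost occurrence of `nonterminal` in the sentential
    form `form` by `rhs` (use "" for an epsilon right-hand side)."""
    i = form.index(nonterminal)          # leftmost occurrence
    return form[:i] + rhs + form[i + 1:]

# S => AB => AAB => aAB => aaB => aabB => aabb
steps = ["S"]
for nt, rhs in [("S", "AB"), ("A", "AA"), ("A", "a"),
                ("A", "a"), ("B", "bB"), ("B", "b")]:
    steps.append(leftmost_step(steps[-1], nt, rhs))
print(" => ".join(steps))   # S => AB => AAB => aAB => aaB => aabB => aabb
```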

Ambiguity in CFG


How to check whether a given CFG is ambiguous?

Q. Check whether the following CFGs are ambiguous:

(a) E → E + E | E * E | id
(b) S → aS | Sa | a
(c) S → aSbS | bSaS | ε
(d) R → R + R | R.R | R* | a | b | c

Ambiguous Grammar (Example 1, Example 2)

Check Your Progress-1

Check whether the following CFG is ambiguous:

bExp → bExp OR bExp
bExp → bExp AND bExp
bExp → NOT bExp
bExp → TRUE
bExp → FALSE

Converting Ambiguous to Unambiguous Grammar:


Converting Ambiguous to Unambiguous Grammar:


Example1:

Note: removing left recursion removes the ambiguity from this expression grammar (now there is only one parse tree for any w), but problems of precedence and associativity still remain.

Converting Ambiguous to Unambiguous Grammar:
Example 2:

Note: removing left recursion and/or left factoring from a grammar can remove the ambiguity in a CFG.

Converting Ambiguous to Unambiguous Grammar:


Example3:

Note: removing left recursion and/or left factoring from a grammar can remove the ambiguity in a CFG.

Ambiguity in CFG (some more examples):

Example


Ambiguity in CFG

▪ Two different parse trees may cause problems in applications which use the derivation tree.
▪ For example: evaluating expressions (take a = 2) and, in general, in compilers for programming languages.

Q.1 Consider the following parse tree for the expression a#b$c$d#e#f, involving two binary
operators $ and #.
Which one of the following is correct for the given parse tree?
(A) $ has higher precedence and is left associative; # is right associative
(B) # has higher precedence and is left associative; $ is right associative
(C) $ has higher precedence and is left associative; # is left associative
(D) # has higher precedence and is right associative; $ is left associative

Ambiguous Grammar
Definition. A grammar G is ambiguous if there is a word w ∈ L(G) having at least two different parse trees, or two or more leftmost derivations (LMDs), or two or more rightmost derivations (RMDs).

S → A
S → B
S → AB
A → aA
B → bB
A → ε
B → ε

Notice that a has at least two left-most derivations.


Two Derivation Trees for the same w => Ambiguous Grammar

S → A | AB
A → ε | a | Ab | AA
B → b | bc | Bc | bB

Other derivation trees for w = aabb? Infinitely many others are possible (for example, by expanding A → AA and A → ε over and over).

(The slide shows two different parse trees for w = aabb.)

Another Example of Ambiguous Grammar

Two Derivation Trees for w

Important Note for Ambiguity in CFG

Writing Unambiguous Grammar:

Consider the following grammar:

E → E + E
E → E * E
E → (E)
E → a

Writing Ambiguous to Unambiguous CFG

Check Your Progress-2

Remove the ambiguity in the following grammars:

Ex2: R → R + R | R.R | R* | a | b | c

Ex3: bExp → bExp OR bExp
     bExp → bExp AND bExp
     bExp → NOT bExp
     bExp → TRUE
     bExp → FALSE

Ambiguity

A CFG is ambiguous ⇔ any of the following equivalent statements holds:

• ∃ a string w with multiple (2 or more) derivation trees.
• ∃ a string w with multiple (2 or more) leftmost derivations.
• ∃ a string w with multiple (2 or more) rightmost derivations.

Note: this defines ambiguity of a grammar, not of a language.
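For small grammars and short strings, the "multiple leftmost derivations" test can be brute-forced. The sketch below is illustrative code (the representation and the ε-free assumption are mine): it counts distinct leftmost derivations of a target token string, pruning any sentential form whose terminal prefix already disagrees with the target or which is longer than the target (valid because, in an ε-free grammar, forms never shrink).

```python
def count_leftmost_derivations(grammar, start, target):
    """Count distinct leftmost derivations of `target` (a list of terminal
    tokens) from `start`. Assumes an epsilon-free grammar. The grammar is
    ambiguous for `target` iff the count exceeds 1."""
    count = 0
    stack = [(start,)]                # sentential forms as tuples of symbols
    while stack:
        form = stack.pop()
        nts = [j for j, s in enumerate(form) if s in grammar]
        if not nts:                   # all terminals: a complete derivation
            count += list(form) == target
            continue
        i = nts[0]                    # leftmost nonterminal
        # prune: terminal prefix must match, and the form must not be too long
        if list(form[:i]) != target[:i] or len(form) > len(target):
            continue
        for rhs in grammar[form[i]]:
            stack.append(form[:i] + tuple(rhs) + form[i + 1:])
    return count

AMBIG = {"E": [["E", "+", "E"], ["E", "*", "E"], ["id"]]}
print(count_leftmost_derivations(AMBIG, "E", ["id", "+", "id", "*", "id"]))  # 2
```

The count of 2 matches the two parse trees of id+id*id: id+(id*id) and (id+id)*id. Note this is only a semi-check on particular strings; ambiguity of a CFG is undecidable in general.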



Rules for converting an ambiguous grammar to an unambiguous grammar

Generally, productions are ambiguous when they have more than one occurrence of a given non-terminal on their right-hand side.

Rules (for expression grammars, i.e., for precedence and associativity):

1. Check the precedence of the operators involved.
2. Operators with different precedence are treated differently when removing ambiguity.
3. First remove the ambiguity for the minimum-precedence operator, and so on.
4. Then take operator associativity into account while writing the grammar.

Rules for Converting Ambiguous to Unambiguous Grammar:


Rules for Disambiguating a grammar


Example: If a production rule is
S → ASBSy | a1 | a2 | … | an
then ambiguity can be removed by rewriting the production rule as
S → ASBSy | S'
S' → a1 | a2 | … | an

If the operator is left associative, then change the rightmost symbol.
E.g., E → E*E | id can be replaced by
E → E*E' | E'
E' → id

If the operator is right associative, then change the leftmost symbol.
E.g., for a right-associative operator ∘, E → E∘E | id can be replaced by
E → E'∘E | E'
E' → id

Removal of Left Recursion and Left Factoring


Elimination of Left Recursion

A grammar is left recursive if the first symbol on the right-hand side of a rule is the same non-terminal as the one on the left-hand side.
E.g.: S → Sa

Rule to remove left recursion from a grammar of the form
S → Sα1 | Sα2 | … | Sαn | β1 | β2 | … | βm
Rewrite it as
S → β1S' | β2S' | … | βmS'
S' → α1S' | α2S' | … | αnS' | ε
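The rule above can be written down directly. A sketch in Python; the representation is an assumption of mine (productions as lists of symbols, [] standing for ε, and the new nonterminal named by appending a prime):

```python
def remove_immediate_left_recursion(nt, prods):
    """Apply S -> Sa1|...|San|b1|...|bm  =>  S -> b1S'|...|bmS',
    S' -> a1S'|...|anS'|eps. Productions are lists of symbols; [] is eps."""
    alphas = [p[1:] for p in prods if p and p[0] == nt]   # left-recursive tails
    betas = [p for p in prods if not p or p[0] != nt]
    if not alphas:
        return {nt: prods}                                # nothing to do
    new = nt + "'"
    return {nt: [b + [new] for b in betas],
            new: [a + [new] for a in alphas] + [[]]}      # [] is eps

print(remove_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
# {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}
```

This handles only immediate left recursion (S → S…); indirect left recursion needs the full substitution-based algorithm.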


Elimination of Left Factoring

Left factoring is a process which isolates the common parts of two or more productions into a single production. After left factoring, the resulting grammar is suitable for top-down parsing.

Rule to remove left factoring from a grammar of the form
A → αβ1 | αβ2 | … | αβm
Rewrite it as
A → αA'
A' → β1 | β2 | … | βm
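The left-factoring rule can be sketched the same way (same illustrative representation as above: productions as symbol lists, [] for ε):

```python
def left_factor(nt, prods):
    """Apply A -> ab1|ab2|...|abm  =>  A -> aA', A' -> b1|...|bm, where a is
    the longest common prefix of all alternatives. Returns the grammar
    unchanged when there is nothing to factor."""
    prefix = []
    for column in zip(*prods):            # walk the alternatives in lockstep
        if all(sym == column[0] for sym in column):
            prefix.append(column[0])
        else:
            break
    if not prefix:
        return {nt: prods}
    new = nt + "'"
    return {nt: [prefix + [new]],
            new: [p[len(prefix):] for p in prods]}   # may contain [] (eps)

print(left_factor("A", [["a", "b"], ["a", "c"]]))
# {'A': [['a', "A'"]], "A'": [['b'], ['c']]}
```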


End of Module-1

**************


Extra slide for Module-1

**************

Regular Expression


Introduction of Lexical Analysis


Lexical analysis is the first phase of the compiler, also known as the scanner. Its main task is to read the input characters and produce a sequence of tokens that the parser (the next phase) uses for syntax analysis.

Lexical analysis can be implemented with a deterministic finite automaton (DFA).

The output is a sequence of tokens that is sent to the parser for syntax analysis.

Some Important terms:


1. Pattern: a rule which describes the set of lexemes that can represent a particular token in the source program. E.g., an identifier can be described as a letter followed by letters or digits.

2. Lexeme: lexemes are the smallest logical units of a program. A lexeme is a sequence of characters in the source program for which a token is produced. E.g.: 10, int, +, etc.

3. Token: a sequence of characters that can be treated as a unit in the grammar of the programming language.
Classes of similar lexemes are identified by the same token.
E.g.: identifier, keyword, operator, constant, delimiter, etc.

Introduction of Lexical Analysis

A token may have two parts: <token-name, token-value>.
token-name is the type, and token-value points to an entry in the symbol table for the token.
Examples of non-tokens:
comments, preprocessor directives, macros, blanks, tabs, newlines, etc.

Symbol Table in Compiler


The symbol table is an important data structure created and maintained by the compiler in order to keep track of the semantics of variables, i.e., it stores information about the scope and binding of names, and about instances of various entities such as variable and function names, classes, objects, etc.
It is built in the lexical and syntax analysis phases.
The information is collected by the analysis phases of the compiler and is used by the synthesis phases to generate code.
It is used by the compiler to achieve compile-time efficiency.

Symbol Table in Compiler


Items stored in the symbol table:
• Variable names and constants
• Procedure and function names
• Literal constants and strings
• Compiler-generated temporaries
• Labels in source languages

Information used by the compiler from the symbol table:
• Data type and name
• Declaring procedure
• Offset in storage
• If a structure or record, a pointer to the structure table
• For parameters, whether passing is by value or by reference
• Number and types of arguments passed to a function
• Base address

Specification of Tokens (Patterns)
Regular expressions are an important notation for specifying patterns. A regular expression is a grammar: it defines the rule. For the grammar, first we have to define the alphabet.

An alphabet is a finite, non-empty set of symbols.
• We use the symbol ∑ (sigma) to denote an alphabet.
• Examples:
  • Binary: ∑ = {0,1}
  • All lower-case letters: ∑ = {a,b,…,z}
  • Digits: ∑ = {0,1,…,9}

Regular Expression

A string or word is a finite sequence of symbols chosen from ∑.

• The empty string is ε (or "epsilon").

• The length of a string w, denoted |w|, is equal to the number of (non-ε) characters in the string.
  • E.g., x = 010100, |x| = 6
  • x = 01 ε 0 ε 1 ε 00 ε ⇒ |x| = ? (still 6; ε adds nothing)

• xy = the concatenation of the two strings x and y.

Example of Minimization of DFA


Transition diagram for Different types of Tokens


1. Identifiers:

2. Relational Operator


Transition diagram for Different types of Tokens


3. Keyword


Introduction of Lexical Analysis: Valid Tokens

Example 1: for a small C program, the generated tokens are:
'int', 'main', '(', ')', '{', 'int', 'a', ',', 'b', ';', 'a', '=', '10', ';', 'return', '0', ';', '}'

Introduction of Lexical Analysis


Example 2: → there are 5 valid tokens in this printf statement.

Example 3: int max(int i);
• The lexical analyzer first reads int, finds it to be valid, and accepts it as a token.
• max is read and found to be a valid function name after reading (.
• int is also a token, then i is another token, and finally ).
Answer: total number of tokens is 7: int, max, (, int, i, ), ;

Bootstrapping
Bootstrapping is widely used in compiler development.
Bootstrapping is used to produce a self-hosting compiler: a compiler that can compile its own source code.
A bootstrap compiler is used to compile the compiler; this compiled compiler can then compile everything else, as well as future versions of itself.

A compiler can be characterized by three languages:
• Source language
• Target language
• Implementation language

Bootstrapping
The T-diagram shows a compiler, written SCIT, for source language S and target language T, implemented in language I.

Follow these steps to produce a compiler for a new language L on machine A:

1. Create a compiler SCAA for a subset S of the desired language L, using language A, such that the compiler runs on machine A.

2. Create a compiler LCSA for language L, written in the subset S of L.

Bootstrapping
3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L which runs on machine A and produces code for machine A.

Writing a Program in C for any given Regular Expression (or any Language)

Program #1: Write a program in C/C++ to implement a deterministic finite automaton (DFA) accepting the language
L = {a^n b^m | n mod 2 = 0, m ≥ 1}

The regular expression for the language L is
L = (aa)*.b+

Example (output):

Input: aabbb
Output: ACCEPTED        // n = 2 (even), m = 3 (>= 1)
Input: aaabbb
Output: NOT ACCEPTED    // n = 3 (odd), m = 3

L = (aa)*.b+

Approach: there are 3 steps which result in acceptance of a string:

1. Construct an FA for (aa)*, i.e., an even number of a's.
2. Construct an FA for b+, i.e., one or more b's.
3. Concatenate the two FAs into a single DFA.
Any other combination results in rejection of the input string.

The DFA has the following states: state 3 leads to acceptance of the string, whereas states 0, 1, 2, and 4 lead to rejection of the string.
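The three construction steps can be checked with a direct table-driven simulation. A Python sketch; the state numbering follows the convention above that only state 3 accepts, but the exact transition table is my reconstruction, not copied from the slides:

```python
# DFA for (aa)*b+ : state 3 accepts; state 4 is the dead (reject) state.
DELTA = {
    (0, "a"): 1, (0, "b"): 3,   # even a's seen; a b may start the b-run
    (1, "a"): 0, (1, "b"): 4,   # odd a's seen: a b here is fatal
    (3, "a"): 4, (3, "b"): 3,   # inside the b-run; an a can never follow
    (4, "a"): 4, (4, "b"): 4,   # dead state
}

def accepts(s):
    state = 0
    for ch in s:
        state = DELTA.get((state, ch), 4)
        if state == 4:
            return False
    return state == 3

print(accepts("aabbb"))    # True  (n = 2 even, m = 3 >= 1)
print(accepts("aaabbb"))   # False (n = 3 odd)
```

Note that "b" alone is accepted, since (aa)* matches the empty string (n = 0 is even).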


Problem #2
Design a deterministic finite automaton (DFA) with ∑ = {0, 1} that accepts the language of strings ending with "01" over {0, 1}.
Solution:
L = {01, 001, 10001, …}
The minimum string length is 2, so a minimum of 3 states is required, and its DFA is:
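That 3-state machine can be written out as a transition table. A Python sketch (the state names are illustrative: q0 is the start, q1 means "just saw 0", and q2, the accepting state, means "just saw 01"):

```python
# DFA for (0+1)*01 : q2 (just read "01") is the only accepting state.
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q1", ("q2", "1"): "q0",
}

def ends_with_01(s):
    state = "q0"
    for ch in s:
        state = DELTA[(state, ch)]
    return state == "q2"

print(ends_with_01("11001"))   # True
print(ends_with_01("0110"))    # False
```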


Problem #2
Solution:
L = {01, 001, 10001, …} = (0+1)*01

Problem #3

L={Odd no of 0’s and Odd no of 1’s}
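For Problem #3, the standard construction is a four-state parity DFA whose state is the pair (zeros mod 2, ones mod 2), accepting exactly in state (odd, odd). A Python sketch (the tuple encoding of the states is my choice):

```python
def odd_zeros_odd_ones(s):
    """Four-state parity DFA: the state is (zeros mod 2, ones mod 2);
    accept exactly in state (1, 1), i.e., both counts odd."""
    z, o = 0, 0                      # start state: (even, even)
    for ch in s:
        if ch == "0":
            z ^= 1                   # flip parity of 0's
        else:
            o ^= 1                   # flip parity of 1's
    return (z, o) == (1, 1)

print(odd_zeros_odd_ones("01"))     # True  (one 0, one 1)
print(odd_zeros_odd_ones("0011"))   # False (both counts even)
```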
