CD Unit-1
Dr. V. Padmavathi
Associate Professor
Dept. of CSE
CBIT
Text Books:
1. Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, “Compilers: Principles, Techniques and Tools”, Pearson Education, 2nd Edition, 2013.
2. Steven Muchnick, “Advanced Compiler Design and Implementation”, Morgan Kaufmann, 1998.
Suggested Reading:
1. Kenneth C. Louden, “Compiler Construction: Principles and Practice”, Cengage Learning, 2005.
2. Keith D. Cooper & Linda Torczon, “Engineering a Compiler”, 2nd Edition, Morgan Kaufmann, 2004.
3. John R. Levine, Tony Mason, Doug Brown, “Lex & Yacc”, 3rd Edition, Shroff Publisher, 2007.
Online Resources:
1. http://www.nptel.ac.in/courses/106108052
2. https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/
3. http://en.wikibooks.org/wiki/Compiler_Construction
4. http://dinosaur.compilertools.net/
5. http://epaperpress.com/lexandyacc/
• Pre-requisites: Formal Language and Automata
Theory, Data Structures.
int x = 10;
int is a lexeme for the token keyword
x is a lexeme for the token identifier
Lexemes: int, x, =, 10
Phases of a compiler

Source program
→ Lexical analyzer
→ Syntax analyzer (produces the parse tree)
→ Semantic analyzer (produces the annotated syntax tree)
→ Intermediate code generator (produces intermediate code)
→ Code optimizer (produces optimized code)
→ Code generator
→ Target program

The symbol table and the error handler interact with every phase.
Example: position := initial + rate * 10

Lexical analyzer:
    id1 := id2 + id3 * 10

Syntax analyzer (syntax tree):
    :=
    ├── id1
    └── +
        ├── id2
        └── *
            ├── id3
            └── 10

Semantic analyzer (inserts the int-to-real conversion):
    :=
    ├── id1
    └── +
        ├── id2
        └── *
            ├── id3
            └── inttoreal(10)

Intermediate code generator:
    temp1 := inttoreal(10)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1   := temp3

Code optimizer:
    temp1 := id3 * 10.0
    id1   := id2 + temp1

Code generator:
    MOVF id3, R2
    MULF #10.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1
Phases and Passes
Pass:
1) A pass is a physical scan over the source program. The portions of one or more phases are combined into a module called a pass.
2) Splitting a compiler into more passes reduces memory requirements.
3) A single-pass compiler is faster than a two-pass compiler.

Phase:
1) A phase is a logically cohesive operation that takes input in one form and produces output in another form.
2) Splitting a compiler into more phases reduces the complexity of the program.
3) Reducing the number of phases increases execution speed.

• The repetitions needed to process the entire source program before generating code are referred to as passes.
Compiler Writing Tools
• The compiler writer can use some specialized tools that
help in implementing various phases of a compiler. These
tools assist in the creation of an entire compiler or its parts.
• Some commonly used compiler construction tools include:
• Parser Generator –
It produces syntax analyzers (parsers) from input based on a grammatical description of the programming language, i.e., a context-free grammar. Such tools are useful because the syntax analysis phase is highly complex and consumes considerable development time when written by hand.
Example: Yacc
• Scanner Generator –
It generates lexical analyzers from the input that consists of
regular expression description based on tokens of a language.
It generates a finite automaton to recognize the regular
expression.
Example: Lex
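For illustration, here is a minimal sketch of a Lex specification; the token set and the printed names are illustrative assumptions, not taken from any particular compiler:

%{
#include <stdio.h>
%}
%%
[ \t\n]+                 ;  /* skip white space */
int|while|for            { printf("KEYWORD  %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*   { printf("ID       %s\n", yytext); }
[0-9]+                   { printf("NUM      %s\n", yytext); }
.                        { printf("OTHER    %s\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }

Running this file through lex (or flex) produces lex.yy.c, which a C compiler turns into a standalone scanner; note how each rule pairs a regular expression with a C action.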
• Syntax directed translation engines –
They generate intermediate code in three-address format from input that consists of a parse tree. These engines have routines to traverse the parse tree and produce the intermediate code; each node of the parse tree is associated with one or more translations (a sketch follows below).
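As a rough illustration (not the textbook's code), this C sketch walks an expression tree bottom-up and emits three-address code, creating one temporary per interior node; all names here are hypothetical:

#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    char op;                    /* '+', '*', ...; 0 marks a leaf */
    const char *name;           /* identifier or constant text, for leaves */
    struct Node *left, *right;
} Node;

static int tempCount = 0;

/* emit code for the subtree; return the place that holds its value */
static const char *gen(const Node *n) {
    if (n->op == 0) return n->name;        /* leaf: no code emitted */
    const char *l = gen(n->left);
    const char *r = gen(n->right);
    char *t = malloc(16);
    snprintf(t, 16, "temp%d", ++tempCount);
    printf("%s := %s %c %s\n", t, l, n->op, r);
    return t;
}

int main(void) {
    /* the tree for id2 + id3 * 10 from the running example */
    Node ten = {0, "10", NULL, NULL}, id3 = {0, "id3", NULL, NULL};
    Node id2 = {0, "id2", NULL, NULL};
    Node mul = {'*', NULL, &id3, &ten};
    Node add = {'+', NULL, &id2, &mul};
    gen(&add);   /* prints: temp1 := id3 * 10  and  temp2 := id2 + temp1 */
    return 0;
}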
• Assembler implementation
• Online text searching (GREP, AWK) and word processing
• Website filtering
• Command language interpreters
• Scripting language interpretation (Unix shell, Perl, Python)
• XML Parsing and documentation tree construction
• Database query interpreters
Other uses of Program Analysis Techniques
• Converting sequential loop to a parallel loop
• Program analysis to determine if programs are data-race free
• Profiling programs to determine busy regions
• Program slicing
• Data-flow analysis approach to software testing
- Uncovering errors along all the paths
- Dereferencing null pointers
- Buffer overflows and memory leaks
• Worst case execution time (WCET) estimation and energy analysis
Applications of
Compiler technology
Compilers are everywhere…
• Draws results from mathematical logic, lattice theory, linear algebra, probability,
etc.
- type checking, static analysis, dependence analysis, loop parallelization, cache
analysis, etc.
• Greedy algorithms - register allocation
• Heuristic search - list scheduling
• Graph algorithms - dead code elimination, register allocation
• Dynamic programming - instruction selection
• Optimization techniques - instruction scheduling
• Finite automata - lexical analysis
• Pushdown automata - parsing
• Fixed-point algorithms - data-flow analysis
• Complex data structures - symbol tables, parse trees, data dependence graphs
• Computer architecture - machine code generation
Bootstrapping
• Bootstrapping is the process by which a simple language is used to translate a more complicated language, which in turn translates an even more complicated language, and so on.
Full Bootstrapping
• A full bootstrap is necessary when we are building a new
compiler from scratch.
• Example:
• We want to implement an Ada compiler for machine M. We
don’t currently have access to any Ada compiler (not on M,
nor on any other machine).
• Idea: Ada is very large, so we implement the compiler in a subset of Ada and bootstrap it from a compiler for that subset written in another language (e.g., C).
• The process illustrated by the T-diagrams is
called bootstrapping.
• A compiler is characterized by three languages:
1) Source language
2) Target language
3) Implementation language
Data Structures
• The algorithms used by the phases of a compiler interact closely with the data structures that support those phases.
• A compiler should compile a program in time proportional to the size of the program:
Time ∝ Size, that is, O(n), where n is a measure of program size (usually the number of characters).
The following data structures are used by the phases of a compiler:
▪ Tokens
▪ Syntax tree
▪ Symbol table
▪ Literal table
▪ Intermediate Code
▪ Temporary Files
Tokens
• The scanner converts characters to tokens.
• A token is represented as a value of an enumerated datatype.
• The lexeme (string of characters) is preserved, along with other information derived from it, such as the name associated with an identifier or the value of a number.
• The scanner generates one token at a time (single-symbol lookahead).
• A single global variable is typically used to hold the current token's information; alternatively, an array of tokens may be used. A sketch of such a representation follows.
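A minimal sketch in C of such a token representation (the token kinds and field names are illustrative assumptions):

/* token kinds as an enumerated datatype */
typedef enum { TOK_KEYWORD, TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_SEMI } TokenKind;

/* a token records its kind, the matched characters, and derived information */
typedef struct {
    TokenKind kind;
    char      lexeme[64];   /* the preserved string of characters */
    long      value;        /* numeric value when kind == TOK_NUM */
} Token;

/* the single global variable holding the current token */
Token currentToken;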
Syntax tree
• A standard pointer-based structure, dynamically allocated as parsing proceeds.
• The entire tree can be kept as a single variable pointing to the root.
• Each node is a record that collects information from the parser and from later phases; for example, the datatype of an expression may be kept as a field in the syntax-tree node.
• Different nodes may require different attributes to be stored, as in the sketch below.
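A sketch in C of a dynamically allocated syntax-tree node (the kinds and fields are illustrative):

#include <stdlib.h>

typedef enum { N_PLUS, N_TIMES, N_ID, N_NUM } NodeKind;
typedef enum { T_UNKNOWN, T_INT, T_REAL } DataType;

typedef struct TreeNode {
    NodeKind kind;
    DataType type;              /* e.g., the expression's datatype, filled in later */
    const char *lexeme;         /* for leaves: identifier name or number text */
    struct TreeNode *left, *right;
} TreeNode;

/* allocate one node as parsing proceeds */
TreeNode *newNode(NodeKind k, TreeNode *l, TreeNode *r) {
    TreeNode *n = malloc(sizeof *n);
    n->kind = k; n->type = T_UNKNOWN; n->lexeme = NULL;
    n->left = l; n->right = r;
    return n;
}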
Literal table
• Provides quick insertion and lookup.
• Stores the constants and strings used in a program.
• Does not allow deletions, since its data applies globally to the program and a constant or string needs to appear only once in this table.
• Important in reducing the size of the program in memory, by allowing the reuse of constants and strings.
• Needed by the code generator to construct symbolic addresses for literals and to enter data definitions in the target code file. A sketch follows.
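A simple sketch in C (linear search for brevity; a real compiler would hash, and strdup is POSIX; all names are illustrative):

#include <string.h>

#define MAX_LITS 1024
static const char *lits[MAX_LITS];
static int nlits = 0;

/* return the index of s, inserting it on first sight; no deletions ever occur */
int literal_index(const char *s) {
    for (int i = 0; i < nlits; i++)
        if (strcmp(lits[i], s) == 0) return i;   /* reuse the existing entry */
    lits[nlits] = strdup(s);                     /* store each literal once */
    return nlits++;
}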
Intermediate code
• Depending on the kind of intermediate code and the kinds of optimizations performed, the code may be kept as an array of text strings, as temporary text files, or as a linked list of structures (sketched below).
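For the linked-list option, a quadruple-style sketch in C (the field layout is an assumption):

/* one three-address instruction, e.g., ("*", "id3", "temp1", "temp2") */
typedef struct Quad {
    const char *op, *arg1, *arg2, *result;
    struct Quad *next;    /* next instruction in the list */
} Quad;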
Temporary Files
• Historically, compilers could not hold an entire program in memory, so intermediate results were written to temporary files between passes; temporary files may still be used, for example, to hold generated code while addresses are back-patched.
Symbol table
• Keeps the information associated with identifiers: functions, variables, constants and datatypes.
• Interacts with every phase of the compiler.
• Since the symbol table is accessed so frequently, insertion, deletion and access operations need to be efficient, preferably constant-time.
• Hence a hash table or a tree structure may be used; a hash-table sketch follows.
• Sometimes several tables are used and maintained in a list or a stack.
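A sketch in C of a chained hash table giving near-constant-time lookup and insertion (the table size, hash function, and names are illustrative; strdup is POSIX):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211                    /* a prime spreads the hash well */

typedef struct Symbol {
    const char *name;
    int type;                           /* one illustrative attribute */
    struct Symbol *next;                /* collision chain */
} Symbol;

static Symbol *buckets[NBUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* look the name up; insert a fresh entry if it is absent */
Symbol *lookup_insert(const char *name) {
    unsigned h = hash(name);
    for (Symbol *p = buckets[h]; p; p = p->next)
        if (strcmp(p->name, name) == 0) return p;
    Symbol *p = malloc(sizeof *p);
    p->name = strdup(name);
    p->type = 0;
    p->next = buckets[h];
    buckets[h] = p;
    return p;
}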
Lexical Analysis
Role of Lexical Analyzer
1. It reads the characters of the source program and produces as output a sequence of tokens.
2. It removes comments and white space (blank, tab, newline characters) from the source program.
3. If the lexical analyzer identifies a token of type identifier, it places the identifier in the symbol table.
4. The lexical analyzer may keep track of the number of newline characters seen, so that a line number can be associated with an error message.
Lexical Analyzer in Perspective
source program → lexical analyzer → token → parser
(the parser calls the lexical analyzer with “get next token”; both consult the symbol table)
Input Buffering
• The speed of the lexical analyzer is a concern.
• Lexical analysis needs to look ahead several characters before a match can be announced.
• Because moving characters consumes a large amount of time, specialized buffering techniques have been developed to reduce the overhead required to process an input character.
• A two-buffer input scheme is useful when lookahead is necessary:
- Buffer pairs
- Sentinels
The figure shows the buffer pairs used to hold the input data.
Scheme
• Consists of two buffers, each of N characters, which are reloaded alternately.
• N-Number of characters on one disk block, e.g., 4096.
• N characters are read from the input file to the buffer using one system read
command.
• eof is inserted at the end if the number of characters is less than N.
Pointers
Two pointers, lexemeBegin and forward, are maintained:
- lexemeBegin points to the beginning of the current lexeme, which is yet to be found.
- forward scans ahead until a match for a pattern is found.
• Once the next lexeme is determined, forward is set to the character at its right end. After the lexeme is recorded, lexemeBegin is set to the character immediately after the lexeme just found.
• The current lexeme is the set of characters between the two pointers.
• The two buffer halves alternate: when forward moves past the end of the first half, the second half is filled with fresh characters to read; when forward moves past the end of the second half, the first half is refilled, and so on.
• Example: statement int a,b;
Before the token ("int") can be identified, the character (a blank space) beyond it must be examined. After processing the token ("int"), both pointers are set to the start of the next token ('a'), and this procedure continues throughout the program.
Disadvantages of the scheme
• This scheme works well most of the time, but the amount of
look ahead is limited.
• This limited lookahead may make it impossible to recognize
tokens in situations where the distance that the forward pointer
must travel is more than the length of the buffer.
(e.g.) DECLARE(ARG1, ARG2, . . . , ARGn) in PL/I
• The scanner cannot determine whether DECLARE is a keyword or an array name until it sees the character that follows the right parenthesis.
Sentinels
• In the previous scheme, each time the forward pointer is moved, a check must be made to ensure that one half of the buffer has not been moved off; if it has, then the other half must be reloaded.
• Therefore, the ends of the buffer halves require two tests for each advance of the forward pointer:
Test 1: for the end of the buffer.
Test 2: to determine what character is read.
• The usage of sentinel reduces the two tests to one by extending each buffer
half to hold a sentinel character at the end.
• The sentinel is a special character that cannot be part of the source
program. (eof character is used as sentinel).
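A C-flavoured sketch of the sentinel scheme (the buffer names, the reload helper, and the choice of '\0' as the sentinel are assumptions for illustration; call reload(buf1, in) once before scanning):

#include <stdio.h>

#define N 4096
#define SENTINEL '\0'          /* stands in for the eof sentinel */

static char buf1[N + 1], buf2[N + 1];
static char *forward = buf1;

/* read up to N characters into one half, then plant the sentinel */
static void reload(char *half, FILE *in) {
    size_t n = fread(half, 1, N, in);
    half[n] = SENTINEL;
}

/* advance by one character; reload a half only when its sentinel is hit */
static int advance(FILE *in) {
    char c = *forward++;
    if (c != SENTINEL) return c;              /* common case: a single test */
    if (forward == buf1 + N + 1) {            /* sentinel ended the first half */
        reload(buf2, in);
        forward = buf2;
        return advance(in);
    }
    if (forward == buf2 + N + 1) {            /* sentinel ended the second half */
        reload(buf1, in);
        forward = buf1;
        return advance(in);
    }
    return EOF;                               /* sentinel inside a half: real end */
}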
Advantages
• For each ordinary character, only a single test (on the character itself) is needed; the end-of-buffer check runs only when the sentinel is actually read.
Lexical Analyzer…
• Construction of a lexical analyzer requires the specification of
tokens, patterns, and lexemes for the source language.
– Tokens
• Terminal symbols in the grammar for the source
language
– Patterns
• Rules describing the set of lexemes that can represent a
particular token in source programs.
– Lexemes
• Sequence of characters in the source program that are
matched by the pattern for a token.
Lexical Analyzer
• Example:
Token        Pattern                                    Sample Lexemes
while        the characters w, h, i, l, e               while
for          the characters f, o, r                     for
identifier   letter followed by letters and digits      total, result, a1, b4
integer      digit followed by digits                   125, 23, 4567
string       characters surrounded by double quotes     “Hello World”, “a+b”
Lexical Analyzer…
• Attributes for Tokens
– Required when more than one lexeme matches a pattern.
– Lexical analyzer must provide additional information in such cases.
– Example:
• Consider the following C statement:
– X = Y - Z++
• The token names and associated attributes will be as shown below:
– <id, pointer to symbol-table entry for X>
– <assign_op>
– <id, pointer to symbol-table entry for Y>
– <arith_op, minus>
– <id, pointer to symbol-table entry for Z>
– <incre_op, post>
How to Construct a Lexical Analyzer?
• Construction of a Lexical Analyzer mainly
involves two operations:
– Specification of Tokens
• This requires the knowledge of Languages and Regular Expressions.
◼ Regular expressions: a declarative way to express the pattern of any string over an alphabet, or an algebraic way to describe languages.
• If E is a regular expression, then L(E) is the language it defines.
• A language denoted by a regular expression is said to be a regular set.
– Recognition of Tokens
• This requires the knowledge of Finite Automata.
• Finite Automata is a recognizer that takes an input string &
determines whether it’s a valid sentence of the language.
Specification of Tokens
◼ Alphabet
◼ Strings (words)
◼ Language
◼ Longest Match Rule
◼ Operations
◼ Notations
◼ Regular Expression
What is a Language?
• Alphabet (character class)
– Denotes any finite set of symbols.
– ∑ (sigma) is used to denote an alphabet
– Example: {0,1} is binary alphabet
• String (sentence, or word)
– Finite sequence of symbols drawn from an alphabet.
– Example: { 00110, 01110101 }
• Language
– Denotes any set of strings over some fixed alphabet.
– Example: { 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111 }
Operations on Languages
• Union, concatenation, closure, and exponentiation.
• Also called the regular operations.
• Definitions:
Operation                   Definition and Notation
Union of L and M            L ∪ M = { s | s is in L or s is in M }
Concatenation of L and M    LM = { st | s is in L and t is in M }
Kleene closure of L         L* = L⁰ ∪ L¹ ∪ L² ∪ … (zero or more concatenations of L)
Positive closure of L       L⁺ = L¹ ∪ L² ∪ … (one or more concatenations of L)
Operations on Languages…
• Example:
– Let L be the set {A, B, . . . , Z, a, b, . . . , z} and
– D the set {0, 1, . . . , 9}, then
• L ∪ D is the set of letters and digits.
• LD is the set of strings consisting of a letter followed by a digit.
• L⁴ is the set of all four-letter strings.
• L* is the set of all strings of letters, including ε, the empty string.
• L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
• D⁺ is the set of all strings of one or more digits.
Operations on Languages…
• Another Example:
– Let the alphabet be the standard 26 letters {a, b, …, z}.
– If A = {good, bad} and B = {boy, girl}, then what is the value
of the following expressions?
• AUB
• AB
• A*
• A+
Longest Match Rule
• It states that the lexeme scanned should be determined by the longest match among all available tokens; for example, on the input >=, the analyzer returns the single token >= rather than > followed by =.
• This rule is used to resolve ambiguities when deciding the next token in the input stream.
• During analysis of source code, the lexical analyzer will scan code letter by letter
and when an operator, special symbol or whitespace is detected it decides
whether a word is complete.
• The lexical analyzer also follows rule priority, where a reserved word (e.g., a keyword) of the language is given priority over user-defined names: if a lexeme matches an existing reserved word, the analyzer generates the reserved-word token rather than an identifier token.
What are Regular Expressions?
• Regular expressions are built using regular operations to
describe languages.
• The value of a regular expression is a language.
• Examples:
• (0|1)0*
The value of this expression is the language consisting of all strings starting with a 0 or a 1, followed by any number of 0s.
• (0|1)*
The value of this expression is the language consisting of all possible strings of 0s and 1s.
What are Regular Expressions?
• Formal Definition: R is a regular expression over an alphabet if R is (1) a symbol a from the alphabet, (2) ε, or (3) ∅, or R is built from smaller regular expressions R₁ and R₂ as (4) R₁ | R₂ (union), (5) R₁R₂ (concatenation), or (6) R₁* (Kleene star).
• Examples:
– Assume that the alphabet B is {0,1}.
– What languages do the following regular expressions represent?
• 0*10*
• B*1B*
• B*001B*
• 1*(01+)*
• (BB)*
• (BBB)*
• 01 | 10
• 0B*0 | 1B*1 | 0 | 1
Regular Definitions
• Example:
– C identifiers are strings of letters, digits, and underscores.
The following is a regular definition for the language of C
identifiers.
• letter_ → A | B | … | Z | a | b | … | z | _
• digit → 0|1|…|9
• id → letter_ ( letter_ | digit )*
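This definition can be checked directly in code; a small C sketch (the function name is illustrative):

#include <ctype.h>

/* returns 1 if s matches id -> letter_ ( letter_ | digit )* */
int is_identifier(const char *s) {
    if (!(isalpha((unsigned char)*s) || *s == '_'))
        return 0;                              /* must start with letter_ */
    for (s++; *s; s++)
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return 0;                          /* only letter_ or digit after */
    return 1;
}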
Extensions of Regular Expressions
• Common extensions include: one or more instances (r+), zero or one instance (r?), and character classes such as [a-z] (shorthand for a|b|…|z).
Review of Finite Automata
(FA)
Non-deterministic: has more than one alternative action for the same input symbol.
Non-deterministic finite automata (NFAs) represent regular expressions naturally, but are less efficient to simulate than DFAs.
Construction of a Lexical Analyzer:
Recognition of Tokens
• Example 3
• Example 4: Transition diagram for white space
• Example 5: Transition diagram for unsigned numbers
Converting Transition Diagram into a Program
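A transition diagram is turned into a program by keeping a state variable and a switch with one case per state. Here is a C sketch for the unsigned-number diagram of Example 5, simplified to digit digit* and reading from a string (the names and interface are illustrative assumptions):

#include <ctype.h>

/* returns 1 and advances *pos past the number if a digit digit* lexeme
   starts at s[*pos]; returns 0 (without consuming input) otherwise */
int recognize_number(const char *s, int *pos) {
    int state = 0;
    for (;;) {
        char c = s[*pos];
        switch (state) {
        case 0:                                 /* start: must see a digit */
            if (isdigit((unsigned char)c)) { state = 1; (*pos)++; }
            else return 0;
            break;
        case 1:                                 /* accepting: absorb digits */
            if (isdigit((unsigned char)c)) (*pos)++;
            else return 1;                      /* "retract": c not consumed */
            break;
        }
    }
}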
Recognition of Tokens
Lexemes, their tokens and attribute values
Lex Tool