UNIT II SYNTAX ANALYSIS
Role of Parser – Grammars – Context-Free Grammars – Writing a Grammar – Top-Down
Parsing – General Strategies – Recursive Descent Parser – Predictive Parser – LL(1)
Parser – Shift-Reduce Parser – LR Parser – LR(0) Items – Construction of SLR Parsing
Table – Introduction to LALR Parser – Error Handling and Recovery in Syntax Analyzer –
YACC Tool – Design of a Syntax Analyzer for a Sample Language
Need and Role of the Parser
A parser (or syntax analyzer) is a crucial component of the compiler.
Parsing or syntax analysis is a process that takes the input in the form of tokens
(from the lexical analyzer) and produces one of the following outputs:
A parse tree (also called a syntax tree), or
A set of syntactic errors
The parser ensures that the source code follows the grammatical rules of the
programming language.
Syntax Tree or Parse Tree
A syntax tree is a hierarchical (tree-structured) representation of a statement.
In a syntax tree:
CS3501 – Compiler Design Page 1
o Interior and root nodes are operators
o Leaf nodes are operands
Parse Tree for: a = b + 10
      =
     / \
    a   +
       / \
      b   10
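The tree above can be modeled with a tiny data structure. A minimal Python sketch (the representation and helper names are our own, not part of the syllabus) showing that the root and interior nodes are operators while the leaves are operands:

```python
# A node is either a leaf (an operand) or a tuple (operator, left, right).
tree = ("=", "a", ("+", "b", "10"))

def leaves(node):
    """Operands: the leaf nodes, read left to right."""
    if isinstance(node, tuple):
        _, left, right = node
        return leaves(left) + leaves(right)
    return [node]

def operators(node):
    """Operators: the root and interior nodes."""
    if isinstance(node, tuple):
        op, left, right = node
        return [op] + operators(left) + operators(right)
    return []

print(leaves(tree))     # ['a', 'b', '10']
print(operators(tree))  # ['=', '+']
```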
Separating Lexical Analyzer from Syntax Analyzer - (Nov/Dec 2005 – 2 marks)
The separation of these two phases (lexical and syntax) has two advantages:
1. It accelerates the process of compilation.
2. The errors in the source input can be identified precisely.
Context-Free Grammar (Nov/Dec 2009 – 2 marks)
Every programming language has rules or formats that define the syntactic
structure of a program.
Syntax of Programming Language Using Context-Free Grammar (CFG)
The syntax of programming language can be described by Context-Free
Grammar (CFG).
Example: CFG for Assignment Statement
S → id := E
E → E + E | E * E | E - E | E / E | (E) | id
Advantages of CFG
1. Easy to understand
2. Syntax of programming languages is specified using CFG, so errors can be
easily identified.
3. Source programs can easily be converted into object programs from parse
tree structures.
CFG Definition
A Context-Free Grammar (CFG) is defined as:
G = {V, T, P, S}
Where:
V – Set of Variables or Non-Terminals
T – Set of Terminals
P – Set of Production Rules
S – Starting Non-Terminal Symbol
Example :
CFG for assignment statement in programming language:
S → id := E
E → E * E | E / E | E + E | E - E | (E) | id
Notational Conventions of CFG
Terminals are represented using lowercase letters, operators, digits, and
symbols.
Non-Terminals are represented using uppercase letters.
Start Symbol: The letter S is generally used as the start symbol.
Derivation: The process of getting the given string from the starting non-
terminal of a grammar is called derivation.
o If leftmost variable is replaced in each step of derivation, it is called
Leftmost Derivation.
o If rightmost variable is replaced in each step of derivation, it is called
Rightmost Derivation.
Example
Consider the grammar:
G: E → E * E | E + E | id
Derive the string id * id + id using LMD and RMD.
Leftmost Derivation (LMD)
E ⇒ E * E
  ⇒ id * E        [E → id]
  ⇒ id * E + E    [E → E + E]
  ⇒ id * id + E   [E → id]
  ⇒ id * id + id  [E → id]
Rightmost Derivation (RMD)
Given Grammar:
E → E * E | E + E | id
Derive the string id * id + id using Rightmost Derivation:
E ⇒ E + E
  ⇒ E + id        [E → id]
  ⇒ E * E + id    [E → E * E]
  ⇒ E * id + id   [E → id]
  ⇒ id * id + id  [E → id]
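Each derivation step simply replaces one non-terminal in the sentential form by a production body. A small Python sketch (the helper name lmd_step is our own) replaying the leftmost derivation above as string rewriting:

```python
def lmd_step(form, body):
    """Replace the LEFTMOST non-terminal 'E' in the sentential form."""
    i = form.index("E")
    return form[:i] + body + form[i + 1:]

# E => E*E => id*E => id*E+E => id*id+E => id*id+id
form = ["E"]
for body in (["E", "*", "E"], ["id"], ["E", "+", "E"], ["id"], ["id"]):
    form = lmd_step(form, body)
print(" ".join(form))  # id * id + id
```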
Parse Tree (or) Derivation Tree
The hierarchical structure of a derivation is called a derivation tree (or parse tree).
Each interior node of the parse tree is a non-terminal.
The children of an interior node are the symbols on the right side of the production
applied at that node, read from left to right.
The string formed by the leaves of the parse tree, read from left to right, is called
the yield or frontier of the tree.
Ambiguity (Asked in Nov/Dec 2005 & 2007 – 2 marks)
A grammar that produces more than one parse tree for some input string is called an
ambiguous grammar.
Example
Prove that the following grammar is ambiguous:
G: E → E + E | E * E | id
Sample String:
id + id * id
Parse Tree 1 (* grouped below +):
        E
      / | \
     E  +  E
     |    /|\
    id   E * E
         |   |
        id  id
→ Final leaves (yield): id + id * id
Parse Tree 2 (+ grouped below *):
        E
      / | \
     E  *  E
    /|\    |
   E + E  id
   |   |
  id  id
→ Final leaves (yield): id + id * id
Conclusion: The grammar is ambiguous because the same string has two different
parse trees.
Parsing Techniques and General Strategies
1. Top-Down Approach
2. Bottom-Up Approach
Top-Down Approach:
In Top-Down parsing, the parse tree is generated from top to bottom, i.e.,
from root to leaves.
This derivation continues until we reach the input string.
Example for Top-Down Parsing:
Given Grammar:
G: S → xPz
   P → yW | y
Construct a parse tree for the input "xyz".
Step 1:
    S
  / | \
 x  P  z
Step 2:
Try P → yW.
The sentential form becomes xyWz, which cannot match the input "xyz", so the parser
backtracks.
Step 3:
Try P → y.
The sentential form becomes xyz, which matches the input.
S
/|\
x P z
|
y
Bottom-Up Parsing
In bottom-up parsing, the parse tree is generated from leaves to root, i.e., from
bottom to top.
The input string is taken first and we try to reduce the string to the starting
non-terminal.
The process of parsing terminates as soon as we reach the starting non-
terminal.
Given Grammar:
G: S → xPz
P → yW | y
Construct parse tree for the input "xyz" using bottom-up approach
Step 1:
x y z
Step 2:
P
/ | \
x y z
Step 3:
S
/ | \
x P z
|
y
Disadvantages of top-down parsing / writing a grammar
1. Backtracking
2. Left Recursion
3. Left Factoring
Backtracking:
In top-down parsing, the parse tree is formed by trying the first production rule of a
non-terminal.
→ If the first production rule does not match the required input string, the parser
goes back and tries the next production rule of that non-terminal.
This is called backtracking. It adds a lot of overhead.
Left Recursion
A grammar is said to be left recursive if it has a production of the form:
A → Aα
where α is any string of terminals and/or non-terminals.
A top-down parser enters an infinite loop because of left recursion.
It can be removed by the following rule:
If A → Aα | β is a production, then it can be rewritten as:
A → βA′
A′ → αA′ | ε
Example:
Eliminate left recursion from the following grammar:
A → Aa / b
Answer:
A → bA′
A′ → aA′ / ε
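The elimination rule above can be mechanized. A hedged Python sketch (the function name and grammar encoding are our own: a production body is a list of symbols, and [] stands for ε):

```python
def eliminate_left_recursion(nt, bodies):
    """Split the A-productions into left-recursive ones (A -> A alpha) and
    the rest (A -> beta), then rewrite as A -> beta A', A' -> alpha A' | eps."""
    rec = [b[1:] for b in bodies if b and b[0] == nt]       # the alphas
    nonrec = [b for b in bodies if not b or b[0] != nt]     # the betas
    if not rec:
        return {nt: bodies}                                 # nothing to do
    new = nt + "'"
    return {
        nt:  [b + [new] for b in nonrec],
        new: [a + [new] for a in rec] + [[]],               # [] is epsilon
    }

# A -> A a | b   becomes   A -> b A',  A' -> a A' | epsilon
print(eliminate_left_recursion("A", [["A", "a"], ["b"]]))
```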
Left Factoring
Left factoring is performed when the variable produces more than one similar type of
production rules.
→ If A → αβ₁ | αβ₂ is a production, then it is left factored as:
A → αA′
A′ → β₁ | β₂
Example:
Do left factoring for the given grammar
G : S → iEtS | iEtSeS | a
Ans : S → iEtSS' | a
      S' → eS | ε
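One round of left factoring can be sketched the same way (the function name and encoding are our own: bodies are symbol lists, and [] stands for ε):

```python
def left_factor(nt, bodies):
    """One round of left factoring: bodies sharing a first symbol get their
    longest common prefix pulled out into a new non-terminal nt'."""
    groups = {}
    for b in bodies:
        groups.setdefault(b[0] if b else None, []).append(b)
    out, extra = [], {}
    for head, group in groups.items():
        if head is None or len(group) < 2:
            out.extend(group)
            continue
        pre = group[0]
        for b in group[1:]:                 # longest common prefix of the group
            n = 0
            while n < min(len(pre), len(b)) and pre[n] == b[n]:
                n += 1
            pre = pre[:n]
        new = nt + "'"
        out.append(pre + [new])
        extra[new] = [b[len(pre):] for b in group]   # [] plays the role of epsilon
    return {nt: out, **extra}

# S -> iEtS | iEtSeS | a   becomes   S -> iEtS S' | a,  S' -> eps | eS
print(left_factor("S", [["i", "E", "t", "S"], ["i", "E", "t", "S", "e", "S"], ["a"]]))
```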
Difference between Top/down parsing and Bottom-up parsing:
Top-down parsing                            Bottom-up parsing
(i) The parse tree is built from the        (i) The parse tree is built from the
    root to the leaves.                         leaves to the root.
(ii) It is less efficient. It has the       (ii) It has no backtracking or
    following problems:                         left-recursion problems.
    * Backtracking
    * Left recursion
    * Left factoring
(iii) Examples:                             (iii) Examples:
    (i) Recursive descent parser                (i) LR parser
    (ii) Predictive parser                      (ii) Shift-reduce parser
Recursive Descent parser (or) predictive parser
Procedure –
Step 1: Eliminate left recursion
Step 2: Perform left factoring
Step 3: Compute the FIRST and FOLLOW functions
Step 4: Construct the parsing table
Step 5: Parse the input
FIRST and FOLLOW Function –
FIRST Function – FIRST(α) is the set of terminal symbols that can appear first in a
string derived from α.
Procedure to compute the FIRST function
Step 1: If a is a terminal symbol, then
    FIRST(a) = { a }
Step 2: If there is a production X → ε, then
    add ε to FIRST(X)
Step 3: If there is a production A → XY, then
    FIRST(A) = FIRST(X) if ε ∉ FIRST(X)
    otherwise FIRST(A) = (FIRST(X) − {ε}) ∪ FIRST(Y)
FOLLOW Function –
→ FOLLOW(A) is defined as the set of terminal symbols that can appear immediately to
the right of A in some sentential form.
Procedure to compute the FOLLOW function
Step 1: For the start symbol S, place $ in FOLLOW(S)
Step 2: If there is a production A → αBβ, then
    add FIRST(β) − {ε} to FOLLOW(B)
Step 3: If there is a production A → αBβ where ε ∈ FIRST(β), or a production
    A → αB, then
    add FOLLOW(A) to FOLLOW(B)
Construction of Parsing Table
For each production A → α of the grammar:
Step 1: For each terminal a in FIRST(α),
    add A → α to M[A, a]
Step 2: If ε is in FIRST(α),
    add A → α to M[A, b] for each b in FOLLOW(A)
Step 3: Make each undefined entry of M an error
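The FIRST/FOLLOW rules above fit in a short fixed-point computation. A Python sketch under our own encoding (a production body is a list of symbols, [] is ε, and the string "eps" marks ε inside a FIRST set), run on the left-recursion-free expression grammar used in the next example:

```python
# Grammar after left-recursion removal; a body is a list of symbols, [] is epsilon.
G = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
EPS = "eps"

def first_of(seq, FIRST):
    """FIRST of a symbol string: scan left to right while eps is derivable."""
    out = set()
    for X in seq:
        f = FIRST.get(X, {X})          # for a terminal a, FIRST(a) = {a}
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                       # the whole string can vanish
    return out

def compute_first(G):
    FIRST = {A: set() for A in G}
    changed = True
    while changed:                     # iterate to a fixed point
        changed = False
        for A, bodies in G.items():
            for body in bodies:
                f = first_of(body, FIRST)
                if not f <= FIRST[A]:
                    FIRST[A] |= f
                    changed = True
    return FIRST

def compute_follow(G, FIRST, start):
    FOLLOW = {A: set() for A in G}
    FOLLOW[start].add("$")             # Step 1: $ into FOLLOW(start)
    changed = True
    while changed:
        changed = False
        for A, bodies in G.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B not in G:     # only non-terminals get FOLLOW sets
                        continue
                    f = first_of(body[i + 1:], FIRST)
                    new = f - {EPS}    # Step 2: FIRST(beta) minus eps
                    if EPS in f:       # Step 3: beta can vanish
                        new |= FOLLOW[A]
                    if not new <= FOLLOW[B]:
                        FOLLOW[B] |= new
                        changed = True
    return FOLLOW

FIRST = compute_first(G)
FOLLOW = compute_follow(G, FIRST, "E")
print(sorted(FIRST["E"]), sorted(FOLLOW["F"]))  # ['(', 'id'] ['$', ')', '*', '+']
```

The printed sets match the FIRST and FOLLOW tables computed by hand below.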
Example:
Consider the grammar:
E → E + T | T
T → T * F | F
F → ( E ) | id
(i) Construct the predictive parsing table
(ii) Parse the input id + id * id (May/Jun 2010 – 16 marks)
Step1: Elimination of Left Recursion
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
Step2: No need to do left factoring
Step3: First and Follow function
First:
FIRST(E) = { (, id }
FIRST(E') = { +, ε }
FIRST(T) = { (, id }
FIRST(T') = { *, ε }
FIRST(F) = { (, id }
FIRST(() = { ( }
FIRST()) = { ) }
FIRST(id) = { id }
FIRST( + ) = { + }
FIRST( * ) = { * }
Follow:
FOLLOW(E) = { ), $ }
FOLLOW(E') = { ), $ }
FOLLOW(T) = { +, ), $ }
FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { *, +, ), $ }
Step4: Parsing Table
        id          +             *             (           )          $
E    E → T E'                               E → T E'
E'              E' → + T E'                             E' → ε     E' → ε
T    T → F T'                               T → F T'
T'              T' → ε       T' → * F T'                T' → ε     T' → ε
F    F → id                                 F → ( E )
Step 5: Parsing the Input
Stack         Input             Action
$ E           id + id * id $    E → T E'
$ E' T        id + id * id $    T → F T'
$ E' T' F     id + id * id $    F → id
$ E' T' id    id + id * id $    match id
$ E' T'       + id * id $       T' → ε
$ E'          + id * id $       E' → + T E'
$ E' T +      + id * id $       match +
$ E' T        id * id $         T → F T'
$ E' T' F     id * id $         F → id
$ E' T' id    id * id $         match id
$ E' T'       * id $            T' → * F T'
$ E' T' F *   * id $            match *
$ E' T' F     id $              F → id
$ E' T' id    id $              match id
$ E' T'       $                 T' → ε
$ E'          $                 E' → ε
$             $                 Accept
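The trace above can be reproduced with a table-driven parser. A minimal Python sketch (the table is hard-coded from Step 4; the names are our own):

```python
# LL(1) table from Step 4: M[(non-terminal, lookahead)] -> production body.
M = {
    ("E", "id"): ["T", "E'"],      ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],      ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"],
    ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],           ("F", "("): ["(", "E", ")"],
}
NONTERMS = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    """Table-driven predictive parse; True iff the input is accepted."""
    stack = ["$", "E"]                 # start symbol on top of $
    toks = tokens + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        if top in NONTERMS:
            body = M.get((top, toks[i]))
            if body is None:
                return False           # error entry in the table
            stack.extend(reversed(body))
        elif top == toks[i]:
            i += 1                     # match a terminal (and finally $)
        else:
            return False
    return i == len(toks)

print(parse(["id", "+", "id", "*", "id"]))  # True
print(parse(["id", "+", "*"]))              # False
```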
LL(1) Parser or LL(1) Grammar
If each cell of the predictive parser table has a single entry, then the grammar
is called LL(1).
Predictive parsing algorithm is used in LL(1) grammar.
LL(1) means:
o L – Left-to-right scan of the input.
o L – Leftmost derivation (LMD).
o 1 – One lookahead symbol at a time.
Example
Prove that the following grammar is LL(1):
G:
S → AaAb | BbBa
A → ε
B → ε
Step 1: No left recursion
No left recursion present.
Step 2: No left factoring
No left factoring needed.
Step 3: First and Follow sets
First sets:
FIRST(S) = { a, b }
FIRST(Aa) = { a }
FIRST(Bb) = { b }
FIRST(A) = { ε }
FIRST(B) = { ε }
Follow sets:
FOLLOW(S) = { $ }
FOLLOW(A) = { a, b }
FOLLOW(B) = { b, a }
Step 4: Predictive parser table
Non-terminal Input Symbol Production
S a S→AaAb
S b S→BbBa
A a A→ε
A b A→ε
B b B→ε
B a B→ε
No multiple entries → Grammar is LL(1).
Bottom-Up Parsing: Shift-Reduce Parser (Nov/Dec 2006 – 8 marks)
A Shift-Reduce parser attempts to construct a parse tree from leaves to root.
It works on the same principle as other bottom-up parsers.
A Shift-Reduce parser requires the following data structures:
1. Input Buffer → Stores the input string.
2. Stack → Holds grammar symbols; the handle to be reduced appears on top of the
stack.
Initial Configuration of a Shift-Reduce Parser
Stack Input Buffer
$ w$
Handle Pruning
A handle is a substring that matches the RHS of a production rule in the
grammar.
The process of detecting handles and using them in reduction is called handle
pruning.
Shift-Reduce Parser — Basic Operations
1. Shift
Moving symbols from the input buffer onto the stack.
2. Reduce
If a handle appears on the top of the stack, replace it with the corresponding
LHS of the production rule.
3. Accept
If the stack contains only the start symbol and the input buffer is empty,
parsing is successful.
4. Error
A situation where neither shift nor reduce is possible.
Example
Grammar:
E→E-E
E→E*E
E → id
Input String:
id₁ - id₂ * id₃
Stack           Input Buffer        Action
$               id₁ - id₂ * id₃ $   Shift
$ id₁           - id₂ * id₃ $       Reduce E → id
$ E             - id₂ * id₃ $       Shift
$ E -           id₂ * id₃ $         Shift
$ E - id₂       * id₃ $             Reduce E → id
$ E - E         * id₃ $             Shift
$ E - E *       id₃ $               Shift
$ E - E * id₃   $                   Reduce E → id
$ E - E * E     $                   Reduce E → E * E
$ E - E         $                   Reduce E → E - E
$ E             $                   Accept
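The trace above can be reproduced with a stack machine, provided we add a small precedence heuristic to decide between shifting and reducing (the grammar itself is ambiguous, so a plain "reduce whenever a handle appears" strategy would reduce E - E too early). A hedged Python sketch with our own names:

```python
PREC = {"-": 1, "*": 2}   # * binds tighter than -

def parse(tokens):
    """Shift-reduce sketch for E -> E - E | E * E | id.  id is reduced to E
    immediately; E op E is reduced only when the lookahead operator does not
    bind tighter, which resolves the ambiguity the same way as the trace."""
    stack, toks = ["$"], tokens + ["$"]
    i = 0
    while True:
        look = toks[i]
        # Reduce E op E while the handle is on top and the lookahead allows it.
        while (len(stack) >= 4 and stack[-1] == "E" and stack[-2] in PREC
               and stack[-3] == "E"
               and (look not in PREC or PREC[look] <= PREC[stack[-2]])):
            stack[-3:] = ["E"]                   # reduce E -> E op E
        if look == "$":
            return stack == ["$", "E"]           # accept or error
        stack.append(look)                       # shift
        i += 1
        if stack[-1] == "id":
            stack[-1] = "E"                      # reduce E -> id

print(parse(["id", "-", "id", "*", "id"]))  # True
```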
SLR parsing (or) Construction of SLR parsing Table:
Procedure:-
Step 1: Find the LR(0) items (canonical collection of sets of items)
Step 2: Construct the SLR parsing table
Step 3: Parse the input
LR(0) Items (or) Canonical Collection of Sets of Items:
→ LR(0) items of a grammar G are its productions with a dot at some position in the
right side of the production rule.
e.g.  A → • xy    A → x • y    A → xy •
Construction of LR(0) items:-
Step 1: For the grammar G, initially add S' → • S to the set C
Step 2: For each set of items I in C, perform the closure operation and the goto
operation
Step 3: Repeat until no more sets of items can be added to C
Closure Operation - If I is a set of items for the grammar G, then Closure(I) is
constructed as follows:
Step 1: Initially add every item of I to Closure(I)
Step 2: If A → α•Bβ is in Closure(I) and B → γ is a production, then add B → •γ to
Closure(I)
Step 3: Repeat this step until no more new items are added to Closure(I)
Goto Operation → If there is an item A → α•Xβ in I, then Goto(I, X) contains
A → αX•β; that is, the dot is simply shifted one position forward over the grammar
symbol X (terminal or non-terminal).
Example: Consider the grammar G₁: X → Xb | a | XA, A → a. Augment it with X' → X
and compute Closure(I₀) and Goto.
Closure(I₀):
X' → •X
X → •Xb
X → •a
X → •XA
Goto:
Goto(I₀, X) = { X' → X•, X → X•b, X → X•A, A → •a }
Goto(I₀, a) = { X → a• }
Construction of SLR parsing Table:
Procedure:-
Step 1: Initially find the LR(0) items
Step 2: If Goto(Iᵢ, a) = Iⱼ, then set Action[i, a] = shift j, where a must be a
terminal
Step 3: If Goto(Iᵢ, A) = Iⱼ, then set Goto[i, A] = j, where A must be a non-terminal
Step 4: If S' → S• is in Iᵢ, then set Action[i, $] = accept
Step 5: If A → α• is in Iᵢ, then set Action[i, a] = reduce A → α for every a in
FOLLOW(A)
Example: Check whether the following grammar is SLR(1) or not: E → E * E | id
Step 1: LR(0) items
I₀: E' → •E
    E → •E * E
    E → •id
Goto(I₀, E) = I₁: E' → E•
                  E → E • * E
Goto(I₀, id) = I₂: E → id•
Goto(I₁, *) = I₃: E → E * • E
                  E → •E * E
                  E → •id
Goto(I₃, E) = I₄: E → E * E•
                  E → E • * E
Goto(I₃, id) = I₂
Goto(I₄, *) = I₃
Step 2: In state I₄, on the input *, Action[4, *] gets both shift 3 (from
Goto(I₄, *) = I₃) and reduce E → E * E (since * ∈ FOLLOW(E)). This shift-reduce
conflict means the grammar is not SLR(1).
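The Closure and Goto computations used above are short enough to sketch directly. A Python version under our own item encoding, where an item is a tuple (head, body, dot-position):

```python
# Augmented grammar for the SLR example: E' -> E,  E -> E * E | id.
G = {"E'": [["E"]], "E": [["E", "*", "E"], ["id"]]}

def closure(items):
    """An item is (head, body, dot).  Whenever the dot stands before a
    non-terminal B, add B -> .gamma for every B-production."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in G:
                for prod in G[body[dot]]:
                    item = (body[dot], tuple(prod), 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(items, X):
    """Shift the dot over X in every item where X follows the dot."""
    moved = [(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == X]
    return closure(moved)

I0 = closure([("E'", ("E",), 0)])       # the initial item set
I1 = goto(I0, "E")
print(sorted(I1))
# [('E', ('E', '*', 'E'), 1), ("E'", ('E',), 1)]
```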
Introduction to LALR Parser:
In this type of parser, a lookahead symbol is generated for each set of items. The
tables obtained by this method are smaller than those of the canonical LR(k) parser.
In fact, the number of states in the SLR and LALR parsing tables is always the same.
Most programming languages use LALR parsers.
Procedure
Step 1: Construct the LR(1) items (the canonical sets of items along with their
lookaheads)
Step 2: Build the LALR parsing table by merging item sets that have the same core
Step 3: Parse the input using the LALR table
Example: G₁: S → CC, C → aC, C → d
Construct the LR(1) items (the lookahead is written after the comma; the states
reached on a and d from I₂ merge with I₃ and I₄, giving the combined lookaheads
a/d/$):
I₀: S' → •S, $
    S → •CC, $
    C → •aC, a/d
    C → •d, a/d
I₁: Goto(I₀, S): S' → S•, $
I₂: Goto(I₀, C): S → C•C, $
                 C → •aC, $
                 C → •d, $
I₃: Goto(I₀, a): C → a•C, a/d/$
                 C → •aC, a/d/$
                 C → •d, a/d/$
I₄: Goto(I₀, d): C → d•, a/d/$
I₅: Goto(I₂, C): S → CC•, $
I₆: Goto(I₃, C): C → aC•, a/d/$
Errors Encountered in Different Phases -
→ An important role of the compiler is to report any errors in the source program
that it detects during the entire translation process.
→ Each phase of the compiler can encounter errors; after they are detected, the
errors must be corrected for the compilation process to proceed.
→ The syntax and semantic phases handle a large number of errors in the compilation
process.
Types of Errors -
(1) Lexical Errors → The lexical analyzer detects errors in the input characters,
e.g., a keyword typed incorrectly, such as switch written as swich.
(2) Syntax Errors → Syntax errors are detected by the syntax analyzer, e.g., a
missing semicolon or unbalanced parentheses, such as ((a+b)*(c-d) in a statement.
(3) Semantic Errors → Data-type mismatch errors are handled by the semantic
analyzer, e.g., an incompatible value assignment such as assigning a string value to
an integer:
    int a = "fifty"; // Semantic error
(4) Logical Errors → Errors the compiler cannot detect, such as infinite loops,
misuse of operators, or code written after the end of the main() block, e.g.
    int f = 1;
    if (f == 0) { } // logical error: the branch can never be taken
Note: Logical errors pass through intermediate code generation, code optimization,
and code generation undetected; the program compiles but does not produce the
desired output.
→ Not getting the desired output
Error Recovery Strategy:
(i) Panic Mode            (ii) Phrase-Level Recovery
(iii) Error Productions   (iv) Global Correction
Error Recovery Strategies in Syntax Analyzer (May/Jun-2007) 6 marks
Panic Mode: The simplest error-recovery mechanism: the parser discards input symbols
one at a time until a synchronizing token is found.
eg. else; // Error { }
Phrase-Level Recovery: Here the parser performs local corrections on the input, like
swapping letters, replacing operators, inserting a missing symbol, etc.
eg. fi (a > b) // Error: 'fi' is corrected to 'if' { }
Error Productions: For other common errors, like an unused symbol, the grammar is
augmented with productions that carry warning messages. If such an error is
encountered, the appropriate warning message is generated.
Global Correction: The compiler makes as few changes as possible in processing an
incorrect input string, choosing a least-cost sequence of corrections over the whole
input.
YACC or Design of Parser or Language for Specifying Syntax Analyzers - Yet Another
Compiler Compiler
YACC is a tool used to specify the syntax analyzer. It automatically constructs the
parser.
Creating a parser generator with YACC:
The following diagram shows how the parser is created from the specification file.
YACC Specification:
A YACC specification file consists of three parts:
(i) Declaration part
(ii) Translation rules
(iii) Supporting C functions
declarations
%%
translation rules
%%
supporting C functions
Declaration part: C declarations are placed between %{ and %}. Grammar tokens are
also declared in this section using %token.
Translation rules: This part consists of the production rules of the CFG with their
corresponding actions.
Rule 1    Action 1
Rule 2    Action 2
  ⋮          ⋮
Rule n    Action n
C Functions Section: This section consists of supporting C functions, including a
main function in which yyparse() is called.
Example YACC Specification of Simple desk Calculator:
%{
#include <ctype.h>
%}
%token DIGIT
%%
line   : expr '\n'        { printf("%d\n", $1); }
       ;
expr   : expr '+' term    { $$ = $1 + $3; }
       | term
       ;
term   : term '*' factor  { $$ = $1 * $3; }
       | factor
       ;
factor : '(' expr ')'     { $$ = $2; }
       | DIGIT
       ;
%%
int yylex() {
    int c = getchar();
    if (isdigit(c)) {
        yylval = c - '0';
        return DIGIT;
    }
    return c;
}
Design of a Syntax Analyzer for a Sample Language
Introduction:
Syntax Analysis (Parsing) is the second phase of the compiler. It receives
input in the form of tokens from the lexical analyzer and produces a parse tree
(also known as a syntax tree) that represents the grammatical structure of the
source code.
Purpose:
The main goal of designing a syntax analyzer is to ensure the input source code
follows the grammatical rules of the language and to construct a parse tree or abstract
syntax tree for further processing.
Sample Language Definition
Let us define a simple expression language that supports:
Identifiers (e.g., id)
Arithmetic operators: +, *
Parentheses for grouping
Grammar:
E→E+T|T
T→T*F|F
F → ( E ) | id
This grammar is left-recursive and therefore not suitable for predictive parsing.
So, we refactor it:
Grammar After Removing Left Recursion:
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
3. Components of the Syntax Analyzer
Grammar Representation
The syntax analyzer uses CFG.
Productions are written in BNF (Backus–Naur Form).
FIRST and FOLLOW Sets
Required for building LL(1) parsing tables.
Help in predictive decisions.
Parse Table (for LL(1) Parser)
Rows → Non-terminals
Columns → Terminals (including $)
Each cell → Production rule or error
Parsing Stack
Simulates top-down parsing.
Starts with start symbol and $ at the bottom.
Parsing Algorithm
Repeatedly match stack top and current input token.
Apply productions from parse table.
Detect and report syntax errors.
4. Parsing Approaches
LL(1) Parser (Top-Down)
Working:
Uses FIRST and FOLLOW sets
Selects rule based on lookahead (1 token)
Pros: Simple, easy to implement
Cons: Cannot handle left recursion or ambiguity
LR Parser (Bottom-Up)
Working:
Uses shift and reduce actions
Builds parse tree from leaves (input) to root
Types:
LR(0), SLR(1), LALR(1), Canonical LR(1)
Pros: More powerful, handles larger class of grammars
Cons: Complex table construction
5. Implementation Strategy
Step-by-Step Design:
1. Define Tokens
o id, +, *, (, )
2. Define Grammar
o Use simplified or transformed grammar as shown above
3. Construct FIRST and FOLLOW sets
4. Build Parsing Table
o Use rules and sets to fill LL(1) parse table
5. Write Parser Code
o Stack-based simulation of parsing
o Error handling
6. Integrate with Lexer
o Use tools like LEX (scanner) and YACC (parser)
6. Example
Let’s parse the string:
id + id * id
Steps (Using LL(1) Table):
1. Stack: E $, Input: id + id * id $
2. Expand E → T E'
3. Expand T → F T', F → id
4. Match id, Input: + id * id
5. Expand T' → ε, then E' → + T E'
6. And so on...
Finally, if all input is matched and stack is empty (except $), parsing is successful.
7. Error Handling Techniques:
1. Panic Mode: Skips input symbols until a synchronizing token is found.
2. Phrase-Level Recovery: Replaces or inserts symbols to continue parsing.
3. Error Productions: Modify the grammar to include common errors.
4. Global Correction: Minimizes the total number of changes needed.
8. Tools for Implementation
LEX (Lexical Analyzer Generator)
Tokenizes input based on regular expressions
YACC (Yet Another Compiler Compiler)
Generates parsers from grammar definitions
Works with LEX
Sample Integration Flow:
lex sample.l
yacc -d sample.y
gcc lex.yy.c y.tab.c -o parser
9. Output of Syntax Analyzer
Parse Tree
Errors, if any
(Optionally) Intermediate code (for further stages)
10. Conclusion:
A well-designed syntax analyzer ensures that the source program is syntactically valid.
Whether using a table-driven LL(1) parser or a powerful LR parser, the analyzer plays
a crucial role in translating human-readable code into a machine-understandable
structure.