
Introduction To Javacc: Cheng-Chia Chen

This document provides an introduction to JavaCC, a parser generator written in Java. It discusses: - What a parser generator is and how it can generate a scanner and parser from a lexical and grammar specification. - The key features of JavaCC including that it is a top-down LL(K) parser generator that allows for lexical and grammar specifications in one file along with tree building, customization options, documentation generation, and internationalization. - The basic steps to use JavaCC which include writing a .jj specification file defining the grammar and actions, running JavaCC to generate source files, writing a program that uses the generated parser, and compiling and running the program. An example regular expression

Uploaded by

AsmaBatoolNaqvi
Copyright
© Attribution Non-Commercial (BY-NC)

Introduction to JavaCC

Cheng-Chia Chen

1
What is a parser generator

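[Figure: the Scanner turns the character stream "T o t a l = p r i c e + t a x ;" into the tokens Total, =, price, +, tax, ;. The Parser turns the tokens into a tree: an assignment whose left side is Total and whose right side is an Expr of the form id + id (price, tax). The parser generator (JavaCC) produces both the Scanner and the Parser from a lexical + grammar specification.]

2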
JavaCC
• JavaCC (Java Compiler Compiler) is a scanner and
parser generator;
• It produces a scanner and/or a parser written in Java, and is itself
also written in Java;
• There are many parser generators.
– yacc (Yet Another Compiler-Compiler) for C programming
language (dragon book chapter 4.9);
– Bison from gnu.org
• There are also many parser generators written in Java
– JavaCUP;
– ANTLR;
– SableCC

3
More on classification of java parser generators
• Bottom up Parser Generators Tools
– JavaCUP;
– jay, YACC for Java www.inf.uos.de/bernd/jay
– SableCC, The Sable Compiler Compiler www.sablecc.org
• Topdown Parser Generators Tools
– ANTLR, Another Tool for Language Recognition www.antlr.org
– JavaCC, Java Compiler Compiler javacc.dev.java.net

4
Features of JavaCC
• Top-down LL(k) parser generator
• Lexical and grammar specifications in one file
• Tree building preprocessor
– with JJTree
• Extremely customizable
– many different options selectable
• Document Generation
– by using JJDoc
• Internationalized
– can handle full unicode
• Syntactic and Semantic lookahead

5
Features of JavaCC (cont’d)
• Permits extended BNF specifications
– can use | * ? + () on the RHS.
• Lexical states and lexical actions
• Case-sensitive/insensitive lexical analysis
• Extensive debugging capability
• Special tokens
• Very good error reporting

6
JavaCC Installation
• Download the file javacc-4.X.zip from https://javacc.dev.java.net/
• Unzip javacc-4.X.zip to a directory %JCC_HOME%
• Add the %JCC_HOME%\bin directory to your %path%.
– javacc, jjtree, and jjdoc can now be invoked directly from the command
line.

7
Steps to use JavaCC
• Write a JavaCC specification (.jj file)
– Defines the grammar and actions in a file (say, calc.jj)
• Run JavaCC to generate a scanner and a parser
– javacc calc.jj
– This generates the parser, scanner, token, … Java sources
• Write your program that uses the parser
– For example, UseParser.java
• Compile and run your program
– javac -classpath . *.java
– java -cp . mainpackage.MainClass

8
Example 1: parse a spec of regular expressions
and match it with input strings
• Grammar : re.jj
• Example
– % all strings ending in "ab"
– (a|b)*ab;
– aba;
– ababb;
• Our tasks:
– For each input string (Line 3,4) determine whether it matches the
regular expression (line 2).
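
As a rough cross-check of the expected answers, the matching task can be sketched with Java's built-in java.util.regex package. This is a stand-in technique, not the NFA/DFA construction that re.jj performs; the class name ReSpecCheck and the method matchesSpec are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks test strings against the spec's regular
// expression with java.util.regex (a stand-in for the deck's FA classes).
final class ReSpecCheck {
    // The spec syntax (a|b)*ab happens to be valid java.util.regex syntax too.
    static boolean matchesSpec(String regex, String input) {
        return Pattern.matches(regex, input); // whole-string match
    }

    public static void main(String[] args) {
        String regex = "(a|b)*ab";
        for (String s : new String[] {"aba", "ababb", "ab"}) {
            System.out.println(s + " -> " + matchesSpec(regex, s));
        }
    }
}
```

Neither example string (aba, ababb) ends in "ab", so both should be rejected, while a string such as ab should be accepted.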

9
the overall picture

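[Figure: javaCC reads re.jj and generates REParserTokenManager and REParser. The input file (a "% comment" line, the regular expression (a|b)*ab; and the test strings a; and ab;) is read by REParserTokenManager, which passes tokens to REParser; MainClass drives REParser and outputs the result.]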

10
Format of a JavaCC input Grammar
• javacc_options
• PARSER_BEGIN ( <IDENTIFIER>1 )
java_compilation_unit
PARSER_END ( <IDENTIFIER>2 )
• ( production )*

11
the input spec file (re.jj)
options {
USER_TOKEN_MANAGER=false;
BUILD_TOKEN_MANAGER=true;
OUTPUT_DIRECTORY="./reparser";
STATIC=false;
}

12
13
re.jj
PARSER_BEGIN(REParser)
package reparser;

import java.lang.*;

import dfa.*;

public class REParser {


public FA tg = new FA();

// output error message with current line number


public static void msg(String s) {
System.out.println("ERROR"+s);
}

public static void main(String args[]) throws Exception {

REParser reparser = new REParser(System.in);

reparser.S();
}
}
PARSER_END(REParser)

14
re.jj (Token definition)
TOKEN : {
<SYMBOL: ["0"-"9","a"-"z","A"-"Z"] >
| <EPSILON: "epsilon" >
| <LPAREN: "(" >
| <RPAREN: ")" >
| <OR: "|" >
| <STAR: "*" >
| <SEMI: ";" >
}

SKIP: {
< ( [" ","\t","\n","\r","\f"] )+ >
|
< "%" ( ~ ["\n"] )* "\n" > { System.out.println(image); }
}

15
re.jj (productions)
void S() : { FA d1; }
{
d1 = R() <SEMI>
{ tg = d1; System.out.println("------NFA"); tg.print();

System.out.println("------DFA");
tg = tg.NFAtoDFA(); tg.print();

System.out.println("------Minimize");
tg = tg.minimize(); tg.print();

System.out.println("------Renumber");
tg=tg.renumber(); tg.print();

System.out.println("------Execute");
}
testCases()
}

16
re.jj
void testCases() : {}
{ (testCase() )+ }

void testCase(): { String testInput ;}


{ testInput = symbols()
<SEMI>
{ tg.execute( testInput) ; }
}

String symbols() :
{Token token = null; StringBuffer result = new StringBuffer(); }
{
(
token = <SYMBOL>
{ result.append( token.image) ; }
)*
{ return result.toString(); }
}

17
re.jj (regular expression)
// R --> RUnit | RConcat | RChoice

FA R() : {FA result ;}


{ result = RChoice() { return result; } }

FA RUnit() :
{ FA result ; Token d1; }
{
(
<LPAREN> result = RChoice() <RPAREN>
|
<EPSILON> { result = tg.epsilon(); }
|
d1 = <SYMBOL> { result = tg.symbol( d1.image ); }
)
{ return result ; }
}

18
re.jj
FA RChoice() : { FA result, temp ;}
{
result = RConcat()
( <OR> temp = RConcat() { result = result.choice( temp ) ;} )*
{return result ; } }

FA RConcat() : { FA result, temp ;}


{ result = RStar()
( temp = RStar() { result = result.concat( temp ) ;} )*
{return result ; } }

FA RStar() : {FA result;}


{ result = RUnit()
( <STAR> { result = result.closure(); } )*
{ return result; } }

19
Format of a JavaCC input Grammar
javacc_input ::= javacc_options
PARSER_BEGIN ( <IDENTIFIER>1 )
java_compilation_unit
PARSER_END ( <IDENTIFIER>2 )
( production )*
<EOF>

color usage (in the original slides):
– blue --- nonterminal
– <orange> --- a token type
– purple --- token lexeme (a reserved word, i.e., consisting of the literal itself)
– black --- meta symbols

20
Notes
• <IDENTIFIER> means any Java identifier, such as var, class2, …
– IDENTIFIER (without angle brackets) means the literal word IDENTIFIER only.
• <IDENTIFIER>1 must be equal to <IDENTIFIER>2.
• java_compilation_unit is any Java code that, as a whole, can legally
appear in a file.
– It must contain a main class declaration with the same name as
<IDENTIFIER>1 .
• Ex:
PARSER_BEGIN ( MyParser )
package mypackage;
import myotherpackage….;
public class MyParser { … }
class MyOtherUsefulClass { … } …
PARSER_END (MyParser)

21
The input and output of javacc

Input (MyLangSpec.jj):
PARSER_BEGIN ( MyParser )
package mypackage;
import myotherpackage….;
public class MyParser { … }
class MyOtherUsefulClass { … } …
PARSER_END (MyParser)

Output of javacc:
– MyParser.java
– MyParserTokenManager.java
– MyParserConstants.java
– Token.java
– ParseException.java
– TokenMgrError.java

22
Notes:
• Token.java and ParseException.java are the same for all inputs and
can be reused.
• The package declaration in the .jj file is copied to all 3 outputs.
• Import declarations in the .jj file are copied to the parser and token
manager files.
• The parser file is given the file name <IDENTIFIER>1 .java
• The parser file has the contents:
…class MyParser { …
//generated parser is inserted here.
…}
• The generated token manager provides one public method:
Token getNextToken() throws TokenMgrError;

23
Lexical Specification with JavaCC

24
javacc options
javacc_options ::=
[ options { ( option_binding )* } ]

• Each option_binding is of the form:
– <IDENTIFIER>3 = <java_literal> ;
– where <IDENTIFIER>3 is not case-sensitive.
• Ex:
options {
USER_TOKEN_MANAGER=true;
BUILD_TOKEN_MANAGER=false;
OUTPUT_DIRECTORY="./sax2jcc/personnel";
STATIC=false;
LOOKAHEAD=2;
}

25
More Options
• LOOKAHEAD
– java_integer_literal (1)
• CHOICE_AMBIGUITY_CHECK
– java_integer_literal (2) for A | B … | C
• OTHER_AMBIGUITY_CHECK
– java_integer_literal (1) for (A)*, (A)+ and (A)?
• STATIC (true)
• DEBUG_PARSER (false)
• DEBUG_LOOKAHEAD (false)
• DEBUG_TOKEN_MANAGER (false)
• OPTIMIZE_TOKEN_MANAGER
– java_boolean_literal (false)
• OUTPUT_DIRECTORY (current directory)
• ERROR_REPORTING (true)

26
More Options
• JAVA_UNICODE_ESCAPE (false)
– replaces escapes such as \u2245 with the actual Unicode character (6 chars -> 1 char)
• UNICODE_INPUT (false)
– the input stream is in Unicode form
• IGNORE_CASE (false)
• USER_TOKEN_MANAGER (false)
– generate TokenManager interface for user’s own scanner
• USER_CHAR_STREAM (false)
– generate CharStream.java interface for user’s own inputStream
• BUILD_PARSER (true)
– java_boolean_literal
• BUILD_TOKEN_MANAGER (true)
• SANITY_CHECK (true)
• FORCE_LA_CHECK (false)
• COMMON_TOKEN_ACTION (false)
– invoke void CommonTokenAction(Token t) after every getNextToken()
• CACHE_TOKENS (false)

27
Example: Figure 2.2
   Regular expression                              Token
1. if                                              IF
2. [a-z][a-z0-9]*                                  ID
3. [0-9]+                                          NUM
4. ([0-9]+"."[0-9]*) | ([0-9]*"."[0-9]+)           REAL
5. ("--"[a-z]*"\n") | (" " | "\n" | "\t")+         nonToken, WS
6. .                                               error
• javacc notations:
1. "if" or "i" "f" or ["i"]["f"]
2. ["a"-"z"](["a"-"z","0"-"9"])*
3. (["0"-"9"])+
4. (["0"-"9"])+ "." (["0"-"9"])* |
(["0"-"9"])* "." (["0"-"9"])+

28
JavaCC spec for the tokens from Fig 2.2
PARSER_BEGIN(MyParser) class MyParser{}
PARSER_END(MyParser)
/* For the regular expression on the right, the token on the left will be
returned */
TOKEN : {
< IF: "if" >
| < #DIGIT: ["0"-"9"] >
| < ID: ["a"-"z"] ( ["a"-"z"] | <DIGIT> )* >
| < NUM: (<DIGIT>)+ >
| < REAL: (<DIGIT>)+ "." (<DIGIT>)* |
(<DIGIT>)* "." (<DIGIT>)+ >
}

29
JavaCC spec for the tokens from Fig 2.2 (continued)
/* The regular expressions here will be skipped during lexical analysis */
SKIP : { < " " > | < "\t" > | < "\n" > }

/* like SKIP, but the skipped text is accessible from parser actions */
SPECIAL_TOKEN : {
< "--" (["a"-"z"])* ( "\n" | "\r" | "\r\n" ) >
}

/* For any substring not matching the lexical spec, javacc will throw an
error */
/* main rule */
void start() : {}
{ (<IF> | <ID> | <NUM> | <REAL>)* }
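
To make the matching rule behind such a token spec concrete: at each position the token manager takes the longest match, and among matches of equal length it prefers the rule listed first. The sketch below imitates that rule for the Fig 2.2 tokens in plain Java with java.util.regex. It is an illustration only, not the code JavaCC generates; the class Fig22Scanner and its output format are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hand-written scanner imitating the Fig 2.2 token rules.
final class Fig22Scanner {
    private static final String[] KINDS = {"IF", "ID", "NUM", "REAL", "SKIP"};
    private static final Pattern[] RULES = {
        Pattern.compile("if"),
        Pattern.compile("[a-z][a-z0-9]*"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+"),
        Pattern.compile("--[a-z]*\\n|[ \\n\\t]+"),   // comments and whitespace
    };

    // Returns the non-skipped tokens as "KIND:image" strings.
    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestLen = 0;
            int bestRule = -1;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input);
                m.region(pos, input.length());
                // Strictly longer wins, so earlier rules win ties.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = i;
                }
            }
            if (bestRule < 0) {
                throw new IllegalArgumentException("lexical error at offset " + pos);
            }
            if (!KINDS[bestRule].equals("SKIP")) {
                out.add(KINDS[bestRule] + ":" + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return out;
    }
}
```

With this sketch, "if" is classified as IF (a tie with ID, resolved by rule order), while "iffy" becomes an ID because the ID match is longer.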
30
31
Grammar Specification with JavaCC

32
The Form of a Production
java_return_type java_identifier ( java_parameter_list ) :
java_block
{ expansion_choices }

• EX :
void XMLDocument(Logger logger): { int msg = 0; }
{ <StartDoc> { print(token); }
Element(logger)
<EndDoc> { print(token); }
| else()
}

33
Example ( Grammar 3.30 )
1. P -> L
2. S -> id := id
3. S -> while id do S
4. S -> begin L end
5. S -> if id then S
6. S -> if id then S else S
7. L -> S
8. L -> L ; S

1, 7, 8 : P -> S ( ; S )*

34
JavaCC Version of Grammar 3.30
PARSER_BEGIN(MyParser)
public class MyParser{}
PARSER_END(MyParser)

SKIP : { " " | "\t" | "\n" }

TOKEN: {
<WHILE: "while"> | <BEGIN: "begin"> | <END: "end">
| <DO: "do"> | <IF: "if"> | <THEN: "then">
| <ELSE: "else"> | <SEMI: ";"> | <ASSIGN: "=">
| <#LETTER: ["a"-"z"]>
| <ID: <LETTER>(<LETTER> | ["0"-"9"] )* >
}

35
JavaCC Version of Grammar 3.30 (cont’d)
void Prog() : { } { StmList() <EOF> }

void StmList(): { } {
Stm() ( ";" Stm() )*
}

void Stm(): { } {
<ID> "=" <ID>
| "while" <ID> "do" Stm()
| <BEGIN> StmList() <END>
| "if" <ID> "then" Stm() [ LOOKAHEAD(1) "else" Stm() ]
}
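
For intuition about what a top-down parser for these productions does, here is a minimal hand-written recursive-descent recognizer in plain Java with one method per nonterminal. It is a sketch under simplifying assumptions, not JavaCC's generated code: tokens must be separated by whitespace, and the class Grammar330Parser and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical recursive-descent recognizer for the Grammar 3.30 productions.
final class Grammar330Parser {
    private static final Set<String> KEYWORDS =
        Set.of("while", "do", "begin", "end", "if", "then", "else");
    private final List<String> toks;
    private int pos = 0;

    private Grammar330Parser(List<String> toks) { this.toks = toks; }

    private String peek() { return pos < toks.size() ? toks.get(pos) : "<EOF>"; }

    private void expect(String literal) {
        if (!peek().equals(literal)) {
            throw new IllegalStateException("expected " + literal + " but found " + peek());
        }
        pos++;
    }

    private void id() {
        String t = peek();
        if (!t.matches("[a-z][a-z0-9]*") || KEYWORDS.contains(t)) {
            throw new IllegalStateException("expected <ID> but found " + t);
        }
        pos++;
    }

    // StmList -> Stm ( ";" Stm )*
    private void stmList() {
        stm();
        while (peek().equals(";")) { pos++; stm(); }
    }

    // Stm: one token of lookahead selects the alternative.
    private void stm() {
        String t = peek();
        if (t.equals("while")) { pos++; id(); expect("do"); stm(); }
        else if (t.equals("begin")) { pos++; stmList(); expect("end"); }
        else if (t.equals("if")) {
            pos++; id(); expect("then"); stm();
            if (peek().equals("else")) { pos++; stm(); } // else binds to the nearest if
        } else { id(); expect("="); id(); }
    }

    // Prog -> StmList <EOF>
    static boolean accepts(String source) {
        String trimmed = source.trim();
        List<String> toks = trimmed.isEmpty()
            ? List.of() : Arrays.asList(trimmed.split("\\s+"));
        Grammar330Parser p = new Grammar330Parser(toks);
        try {
            p.stmList();
            return p.pos == toks.size();
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

For example, Grammar330Parser.accepts("begin x = y ; while a do b = c end") should return true, and the optional else is taken greedily, which mirrors the LOOKAHEAD(1) hint in the production.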

36
Types of productions
• production ::= javacode_production
| regular_expr_production
| bnf_production
| token_manager_decls

Note:
1 and 3 are used to define the grammar.
2 is used to define tokens.
4 is used to embed code into the token manager.

37
JAVACODE production
• javacode_production ::= "JAVACODE"
java_return_type java_id "(" java_param_list ")"
java_block

• Note:
– Used to define nonterminals for recognizing something that is hard to
parse using normal productions.

38
Example JAVACODE
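JAVACODE void skip_to_matching_brace()
{
  Token tok;
  int nesting = 1;   // the opening brace has already been consumed
  while (true) {
    tok = getToken(1);   // peek at the next token without consuming it
    if (tok.kind == LBRACE) nesting++;
    if (tok.kind == RBRACE) {
      nesting--;
      if (nesting == 0) break;   // matching brace found; leave it unconsumed
    }
    tok = getNextToken();   // consume the token and continue
  }
}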

39
Note:
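• Do not use a nonterminal defined by JAVACODE at a choice
point without giving a LOOKAHEAD.
• Not allowed (the first choice starts with a JAVACODE nonterminal):
void NT() : {} {
skip_to_matching_brace()
| some_other_production()
}
• Allowed (each choice starts with a distinguishing token):
void NT() : {} {
"{" skip_to_matching_brace()
| "(" parameter_list() ")"
}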

40
41
TOKEN_MANAGER_DECLS
token_manager_decls ::=
TOKEN_MGR_DECLS : java_block

• The token manager declarations start with the reserved
word "TOKEN_MGR_DECLS", followed by a ":" and then a
set of Java declarations and statements (the Java block).
• These declarations and statements are written into the
generated token manager (MyParserTokenManager.java)
and are accessible from within lexical actions.
• There can be only one token manager declaration in a
JavaCC grammar file.

42
regular_expression_production
regular_expr_production ::=
[ lexical_state_list ]
regexpr_kind [ [ IGNORE_CASE ] ] :
{ regexpr_spec ( | regexpr_spec )* }

• regexpr_kind::=
TOKEN | SPECIAL_TOKEN | SKIP | MORE

• TOKEN is used to define normal tokens.
• SKIP is used to define skipped tokens (not passed on to the parser).
• MORE is used to define semi-tokens (i.e., only part of a token).
• SPECIAL_TOKEN lies between TOKEN and SKIP: it is
passed on to the parser and is accessible to parser actions, but is
ignored by production rules (not counted as a token). Useful for
representing comments.

43
lexical_state_list
lexical_state_list::=
< * > | < java_identifier ( , java_identifier )* >
• The lexical state list describes the set of lexical states for
which the corresponding regular expression production
applies.
• If this is written as "<*>", the regular expression
production applies to all lexical states. Otherwise, it
applies to all the lexical states in the identifier list within
the angular brackets.
• if omitted, then a DEFAULT lexical state is assumed.

44
regexpr_spec
regexpr_spec::=
regular_expression1 [ java_block ] [ : java_identifier ]

• Meaning:
• When a regular_expression1 is matched then
– if java_block exists then execute it
– if java_identifier appears, then transition to that lexical state.

45
regular_expression
regular_expression ::=
java_string_literal
| < [ [#] java_identifier : ] complex_regular_expression_choices >
| <java_identifier>
| <EOF>

• (1) java_string_literal is matched only by the string it denotes; it is used for unnamed tokens.
• (2) is used to define a labeled regular expression; if # occurs, the label is
not visible outside the current TOKEN section.
• (3) <java_identifier> is a reference to another labeled regular_expression.
– used in bnf_production
• <EOF> is matched by end-of-file only.

46
Example
<DEFAULT, LEX_ST2> TOKEN [IGNORE_CASE] : {
< FLOATING_POINT_LITERAL:
(["0"-"9"])+ "." (["0"-"9"])* (<EXPONENT>)? (["f","F","d","D"])? |
"." (["0"-"9"])+ (<EXPONENT>)? (["f","F","d","D"])? |
(["0"-"9"])+ <EXPONENT> (["f","F","d","D"])? |
(["0"-"9"])+ (<EXPONENT>)? ["f","F","d","D"] >
{ /* do something */ } : LEX_ST1
| < #EXPONENT: ["e","E"] (["+","-"])? (["0"-"9"])+ >
}
• Note: if # is omitted, E123 will be recognized erroneously
as a token of kind EXPONENT.

47
Structure of complex_regular_expression
• complex_regular_expression_choices::=
complex_regular_expression (| complex_regular_expression )*
• complex_regular_expression ::=
( complex_regular_expression_unit )*
• complex_regular_expression_unit ::=
java_string_literal | < java_identifier >
| character_list
| ( complex_regular_expression_choices ) [+|*|?]

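• Note:
– units are combined by concatenation (juxtaposition) into a complex_regular_expression
– complex_regular_expressions are combined by choice ( | ) into complex_regular_expression_choices
– complex_regular_expression_choices become a unit again via ( . ) [+|*|?]
• Principle:
Concatenation binds tighter than choice; a repetition operator can only be applied to a parenthesized expression.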

48
character_list
character_list::=
[~] [ [ character_descriptor ( , character_descriptor )* ] ]
character_descriptor::=
java_string_literal [ - java_string_literal ]
java_string_literal ::= // reference to java grammar
" singleCharString* "
Note: java_string_literal here is restricted to length 1.
Ex:
– ~["a","b"] --- all chars but a and b.
– ["a"-"f", "0"-"9", "A","B","C","D","E","F"] --- hexadecimal digit.
– ["a","b"]+ is not a regular_expression_unit. Why ?
• It should be written ( ["a","b"] )+ instead.

49
bnf_production
• bnf_production::=
java_return_type java_identifier "(" java_parameter_list ")"
":"
java_block
"{" expansion_choices "}"

• expansion_choices::= expansion ( "|" expansion )*


• expansion::= ( expansion_unit )*

50
expansion_unit
• expansion_unit::=
local_lookahead
| java_block
| "(" expansion_choices ")" [ "+" | "*" | "?" ]
| "[" expansion_choices "]"
| [ java_assignment_lhs "=" ] regular_expression
| [ java_assignment_lhs "=" ]
java_identifier "(" java_expression_list ")"
Notes:
1 is for lookahead; 2 is for semantic actions
4 = ( … )?
5 is for matching a token
6 is for matching another nonterminal
51
lookahead
• local_lookahead::= "LOOKAHEAD" "("
[ java_integer_literal ] [ "," ] [ expansion_choices ] [ "," ]
[ "{" java_expression "}" ] ")"

• Notes:
• 3 components: max # of lookahead tokens + syntax + semantics
• examples:
– LOOKAHEAD(3)
– LOOKAHEAD(5, Expr() <INT> | <REAL> , { true } )
• More on LOOKAHEAD
– see minitutorial on javacc.dev.java.net

52
JavaCC API
• Non-Terminals in the Input Grammar
• If NT is a nonterminal, then the method
returntype NT(parameters) throws ParseException;
is generated in the parser class.

• API for Parser Actions
• Token token;
– This variable always holds the last token consumed and can be used in
parser actions.
– It is exactly the same as the token returned by getToken(0).
– Two other methods, getToken(int i) and getNextToken(), can
also be used in actions to traverse the token list.

53
Token class
• public int kind;
– 0 for <EOF>
• public int beginLine, beginColumn, endLine, endColumn;
• public String image;
• public Token next;
• public Token specialToken;
• public String toString()
• { return image; }
• public static final Token newToken(int ofKind)

54
Error reporting and recovery
• It is not user friendly to throw an exception and exit
parsing as soon as a syntax error is encountered.

• Two exceptions
– ParseException -> can be recovered from
– TokenMgrError -> not expected to be recovered from

• Error reporting
– modify ParseException.java or TokenMgrError.java
– the generateParseException method can always be invoked in a parser
action to report an error

55
Error Recovery in JavaCC:
• Shallow Error Recovery
• Deep Error Recovery

• Shallow Error Recovery
• Ex:
void Stm() : {} {
IfStm()
| WhileStm() }

If getToken(1) is neither "if" nor "while", a shallow error occurs.

56
Shallow recovery
A shallow error can be recovered from by adding an additional choice:
void Stm() : {} {
IfStm()
| WhileStm()
| error_skipto(SEMICOLON)
}
where
JAVACODE
void error_skipto(int kind) {
ParseException e = generateParseException(); // generate the exception object
System.out.println(e.toString()); // print the error message
Token t;
do { t = getNextToken(); } while (t.kind != kind); // skip up to and including a token of the given kind
}
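
The skipping idea can be tried out in isolation with a small model in plain Java, where the token stream is just a list of strings. This is a sketch, not JavaCC API: SkipToDemo and skipTo are hypothetical names, and the explicit end-of-input check is an added assumption that the loop in error_skipto does not have.

```java
import java.util.List;

// Hypothetical model of "skip to the next token of a given kind".
final class SkipToDemo {
    // Returns the index just past the first token equal to `kind` found at or
    // after `pos`, or toks.size() if the input ends first (added EOF guard).
    static int skipTo(List<String> toks, int pos, String kind) {
        while (pos < toks.size()) {
            if (toks.get(pos++).equals(kind)) {
                break; // the synchronizing token itself is consumed
            }
        }
        return pos;
    }
}
```

After skipTo returns, parsing would resume at the returned index, i.e., with the statement that follows the semicolon.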

57
Deep Error Recovery
• Same example: void Stm() : {} { IfStm() | WhileStm() }
• But this time the error occurs during parsing inside
IfStm() or WhileStm() instead of at the lookahead entry.
• The approach: use the Java try-catch construct.
void Stm() : {} {
try {
( IfStm() | WhileStm() )
} catch (ParseException e) {
error_skipto(SEMICOLON);
}
}
Note: this try-catch form is a new syntax in javacc bnf_production.

58
References:

• javaCC web site :


http://Javacc.dev.java.net

• JavaCC documentation :
https://javacc.dev.java.net/doc/docindex.html

59
Looking ahead in javacc

60
What’s LOOKAHEAD?
• What strings are matched by Input() ?
– abcc (yes)
– abc (no!!)
• Why ?
– javacc's default greedy lookahead alg.

void Input() :
{}
{
"a" BC() "c"
}

void BC() :
{}
{
"b" [ "c" ]
}

61
Why matching abcc ?
Step       Action                 Remaining input (before the step)
Input()                           abcc
"a"        consume a              abcc
BC()                              bcc
"b"        consume b              bcc
["c"]      greedily consume c     cc
"c"        consume c              c
succeed!                          (empty)

62
Why abc not matched ?
Step       Action                 Remaining input (before the step)
Input()                           abc
"a"        consume a              abc
BC()                              bc
"b"        consume b              bc
["c"]      greedily consume c     c
           (even if consuming nothing seems better)
"c"        needs a 'c'            (empty) -> doesn't match
fail!

• Why such behavior ?
– 1. one-symbol lookahead (for performance)
– 2. avoiding backtracking!
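
The greedy, no-backtracking behavior can be reproduced with a tiny hand-written recognizer that mirrors Input() and BC() and decides with one symbol of lookahead. This is an illustrative sketch, not generated parser code; the class GreedyDemo is hypothetical, and requiring the whole string to be consumed is an added assumption.

```java
// Hypothetical recognizer mirroring Input() and BC() with 1-symbol lookahead.
final class GreedyDemo {
    private final String in;
    private int pos = 0;

    private GreedyDemo(String in) { this.in = in; }

    private char peek() { return pos < in.length() ? in.charAt(pos) : '\0'; }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalStateException("expected '" + c + "' at offset " + pos);
        }
        pos++;
    }

    // void BC() : {} { "b" [ "c" ] }
    private void bc() {
        expect('b');
        if (peek() == 'c') pos++; // greedy: take the optional "c" whenever it is there
    }

    // void Input() : {} { "a" BC() "c" }
    private void input() {
        expect('a');
        bc();
        expect('c'); // no backtracking: the choice made inside bc() is final
    }

    static boolean matches(String s) {
        GreedyDemo p = new GreedyDemo(s);
        try {
            p.input();
            return p.pos == s.length(); // assumption: all input must be consumed
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

Here GreedyDemo.matches("abcc") succeeds while GreedyDemo.matches("abc") fails, because the optional "c" in bc() swallows the only c.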

63
How to match both inputs ?
• Rewrite the grammar!
• Increase the lookahead number

64
What about these rewritings ?

Rewriting 1 (good!):
void Input() :
{}
{
"a" "b" "c" [ "c" ]
}

Rewriting 2:
void Input() :
{}
{
"a" "b" "c" "c"
|
"a" "b" "c"
}

Rewriting 3:
void Input() :
{}
{
"a" ( BC1() | BC2() )
}
void BC1() :
{}
{
"b" "c" "c"
}
void BC2() :
{}
{
"b" "c" [ "c" ]
}
65
Looking Ahead
• Backtracking is unacceptable in a language parser
• LOOKAHEAD:
– The process of exploring tokens further in the input
stream to determine decision at various choice points.
– once making a decision, it commits to it and there is no
backtracking!
– Since some of these decisions may be made with less
than perfect information you need to know something
about LOOKAHEAD to make your grammar work
correctly.
• The two ways in which you make the choice
decisions work properly are:
1. Modify the grammar to make it simpler.
2. Insert hints at the more complicated choice points to
help the parser make the right choices.
66
Four Choice Points in javacc
• ( exp1 | exp2 | ... | expn )
– which one to match ?
• ( … )?
– To match content inside () or bypass ?
• ( … )*
– To leave or match and then repeat ?
• ( … )+ = (…)(…)*
– To leave or match and repeat after first
match ?

67
The Default Algo for choice |
• The default choice determination algorithm looks ahead 1
token in the input stream and uses this to help make its
choice at choice points

void basic_expr() :
{}
{
<ID> "(" expr() ")" // Choice 1
|
"(" expr() ")" // Choice 2
|
"new" <ID> // Choice 3
}

The choice determination algorithm :
if (next token is <ID>) {
choose Choice 1
} else if (next token is "(") {
choose Choice 2
} else if (next token is "new") {
choose Choice 3
} else {
produce an error message
}

68
A Modified Grammar

void basic_expr() :
{}
{
<ID> "(" expr() ")" // Choice 1
|
"(" expr() ")" // Choice 2
|
"new" <ID> // Choice 3
|
<ID> "." <ID> // Choice 4
}
What happens on <ID>? Why?
Warning: Choice conflict involving two expansions at line 25,
column 3 and line 31, column 3 respectively. A common
prefix is: <ID> Consider using a lookahead of 2 for earlier expansion.
69
Greedy behavior for (…)*
void identifier_list() :
{}
{
<ID> ( "," <ID> )*
}
Note: the choice determination algorithm does not look beyond the (...)*

• Suppose the first <ID> has already been matched and that the parser has
reached the choice point (the (...)* construct). Here's how the choice
determination algorithm works:

while (next token is ",") {


choose the nested expansion (i.e., go into the (...)* construct)
consume the "," token
if (next token is <ID>) consume it, otherwise report error
}

70
Another Example
• When making a choice at ( "," <ID> )*, it will always go into
the (...)* construct if the next token is a ",".
– It will do this even when identifier_list was called from
funny_list and the token after the "," is an <INT>.
– Intuitively, the right thing to do in this situation is to skip
the (...)* construct and return to funny_list

void identifier_list() :
{}
{
<ID> ( "," <ID> )*
}

void funny_list() :
{}
{
identifier_list() "," <INT>
}

71
A Concrete input
On the input "id1, id2, 5",
the parser will complain that it encountered a 5 when it was
expecting an <ID>.
• Note: during parser generation, javacc would give the
warning message:

Warning: Choice conflict in (...)* construct at line 25, column 8.
Expansion nested within construct and expansion following
construct have common prefixes, one of which is: "," Consider
using a lookahead of 2 or more for nested expansion.

• Essentially, JavaCC is saying it has detected a situation
which may cause the parser to do strange things. The
generated parser will still work, except that it probably
doesn't do what you expect.

72
Multiple Token Lookaheads Specs
• The default algorithm works fine in most situations. In
cases where it does not work well, javacc provides you
with warning messages.

• If javacc processes your file without producing any warnings,
then the grammar is an LL(1) grammar.

• There are two options for lookahead if your grammar is
not LL(1):
– Modify your grammar to make it LL(1).
– Give more lookahead, globally or where needed.

73
Option 1 - Modify your grammar
• You can modify your grammar so that the warning
messages go away. That is, you can attempt to make
your grammar LL(1) by making some changes to it.
• But this does not always work!

void basic_expr() : {} {
<ID> "(" expr() ")" // Choice 1
|
"(" expr() ")" // Choice 2
|
"new" <ID> // Choice 3
|
<ID> "." <ID> // Choice 4
}

Factor common left parts:

void basic_expr() : {} {
<ID> ( "(" expr() ")" | "." <ID> )
|
"(" expr() ")"
|
"new" <ID>
}
74
Factoring does not always work!!
void basic_expr() : {}
{
{ initMethodTables(); } <ID> "(" expr() ")"
|
"(" expr() ")"
|
"new" <ID>
|
{ initObjectTables(); } <ID> "." <ID>
}

• Since the actions are different, left-factoring cannot be


performed.
75
Option 2 – Increase lookaheads
• You can provide the generated parser with some hints to
help it out in the non-LL(1) situations.

• All such hints are specified by either
– setting the global LOOKAHEAD option to a larger value, or
– using the LOOKAHEAD(...) construct to provide a local hint at
the problematic choice points.

• Comparison between the two options
– Option 1 makes your grammar perform better.
– Option 2 gives you a simpler grammar, one that is
easier to develop and maintain and more
human friendly.
– Sometimes Option 2 is the only choice.

76
• Global Option LOOKAHEAD
– options { LOOKAHEAD=2; … }
• local lookahead :

void basic_expr() :
{}
{
LOOKAHEAD(2)
<ID> "(" expr() ")" // Choice 1
|
"(" expr() ")" // Choice 2
|
"new" <ID> // Choice 3
|
<ID> "." <ID> // Choice 4
}

The choice determination algorithm :
if (next 2 tokens are <ID> and "(" ) {
choose Choice 1
} else if (next token is "(") {
choose Choice 2
} else if (next token is "new") {
choose Choice 3
} else if (next token is <ID>) {
choose Choice 4
} else {
produce an error message
}
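
The effect of the local LOOKAHEAD(2) on this choice can be modeled directly in plain Java. The sketch below only imitates the choice determination described on the slide; BasicExprChoice is a hypothetical name and tokens are represented as plain strings.

```java
import java.util.List;

// Hypothetical model of the choice point in basic_expr().
final class BasicExprChoice {
    // Returns the number of the chosen alternative (1 to 4).
    // With lookahead2 == false, Choice 1 is selected on <ID> alone.
    static int choose(List<String> toks, boolean lookahead2) {
        String t1 = toks.size() > 0 ? toks.get(0) : "<EOF>";
        String t2 = toks.size() > 1 ? toks.get(1) : "<EOF>";
        if (t1.equals("ID") && (!lookahead2 || t2.equals("("))) return 1;
        if (t1.equals("(")) return 2;
        if (t1.equals("new")) return 3;
        if (t1.equals("ID")) return 4;
        throw new IllegalStateException("parse error at " + t1);
    }
}
```

With one token of lookahead, the input ID . ID is sent to Choice 1 and Choice 4 is never reached; with two tokens it is routed to Choice 4 as intended.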
77
void identifier_list() :
{}
{
<ID>
( LOOKAHEAD(2) "," <ID> )*
}

while (next 2 tokens are "," and <ID>) {


choose the nested expansion (i.e., go into the (...)* construct)
consume the "," token
consume the <ID> token
}
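
The difference between the default one-token decision and LOOKAHEAD(2) on the input "id1, id2, 5" can be checked with a small simulation in plain Java. It is a sketch with hypothetical names (FunnyListDemo, funnyList), and the token stream is simply a list of token kinds.

```java
import java.util.List;

// Hypothetical simulation of
//   funny_list()      := identifier_list() "," <INT>
//   identifier_list() := <ID> ( "," <ID> )*
final class FunnyListDemo {
    private final List<String> toks;
    private int pos = 0;

    private FunnyListDemo(List<String> toks) { this.toks = toks; }

    // k-th lookahead token, 1-based, like getToken(k).
    private String la(int k) {
        int i = pos + k - 1;
        return i < toks.size() ? toks.get(i) : "<EOF>";
    }

    private void consume(String kind) {
        if (!la(1).equals(kind)) {
            throw new IllegalStateException("Encountered " + la(1) + ", was expecting " + kind);
        }
        pos++;
    }

    private void identifierList(int lookahead) {
        consume("ID");
        // lookahead 1: enter the loop on ","; lookahead 2: only on "," followed by ID
        while (la(1).equals(",") && (lookahead < 2 || la(2).equals("ID"))) {
            consume(",");
            consume("ID");
        }
    }

    static boolean funnyList(List<String> toks, int lookahead) {
        FunnyListDemo p = new FunnyListDemo(toks);
        try {
            p.identifierList(lookahead);
            p.consume(",");
            p.consume("INT");
            return p.pos == toks.size();
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

For the token kinds ID , ID , INT the call with lookahead 1 fails inside the loop, while the call with lookahead 2 leaves the loop in time and succeeds.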

78
Syntactic lookahead
• How many lookaheads are needed in the java type
declaration ?

void TypeDeclaration() :
{}
{
ClassDeclaration() // public static final class
|
InterfaceDeclaration() // public abstract abstract interface
}

79
Solution 1

void TypeDeclaration() :
{}
{
LOOKAHEAD(2147483647) ClassDeclaration()
|
InterfaceDeclaration()
}

• Where 2147483647 is Integer.MAX_VALUE.


• Maybe 100 is ok as well !

80
Solution 2 – syntactic lookahead

void TypeDeclaration() :
{}
{
LOOKAHEAD( ClassDeclaration() )
ClassDeclaration()
|
InterfaceDeclaration()
}

• Lookahead of a complete ClassDeclaration() takes too
much time and performs a lot of unnecessary checking.

81
Solution 3 – a better one

void TypeDeclaration() :
{}
{
LOOKAHEAD( ( "abstract" | "final" | "public" )* "class" )
ClassDeclaration()
|
InterfaceDeclaration()
}

82
Solution 4 – syntactic lookahead + number bound

void TypeDeclaration() :{}


{
LOOKAHEAD( 10, ( "abstract" | "final" | "public" )* "class" )
ClassDeclaration()
|
InterfaceDeclaration()
}
• Meaning: Look ahead at most 10 tokens, if not violating
the pattern ( "abstract" | "final" | "public" )* "class" try
ClassDeclaration().
• default max numbers of tokens to be looked ahead is
Integer.MAX_VALUE for syntactic lookahead.
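
The bounded syntactic lookahead described above can be modeled as a scan over at most `limit` tokens. This plain-Java sketch follows the slide's reading that the lookahead succeeds if the limit is reached without violating the pattern; ClassLookahead and looksLikeClass are hypothetical names.

```java
import java.util.List;
import java.util.Set;

// Hypothetical model of
//   LOOKAHEAD( limit, ( "abstract" | "final" | "public" )* "class" )
final class ClassLookahead {
    private static final Set<String> MODIFIERS = Set.of("abstract", "final", "public");

    static boolean looksLikeClass(List<String> toks, int limit) {
        for (int i = 0; i < limit; i++) {
            String t = i < toks.size() ? toks.get(i) : "<EOF>";
            if (t.equals("class")) return true;        // pattern completed
            if (!MODIFIERS.contains(t)) return false;  // pattern violated
        }
        return true; // limit reached without violating the pattern
    }
}
```

A declaration starting with public final class is accepted after three tokens, whereas public abstract interface is rejected as soon as interface is seen.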

83
Semantic lookahead
• Could we make the parser choose the 2nd alternative on the
input "a" "a" without changing the order ?
void Input() : {}
{
"a"
| "a" "a"
}

• Syntactic lookahead alone cannot do this, since it cannot express
that the next token is "a" and the following token is not "a".

84
Solution: semantic lookahead
void Input() : {}
{
LOOKAHEAD( { getToken(1).kind == A && getToken(2).kind != A } )
<A:"a">
| "a" "a"
}
• syntactic + semantic
void Input() : {}
{
LOOKAHEAD("a", { getToken(2).kind != A } )
<A:"a">
| "a" "a"
}
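
The decision made by this semantic lookahead can be written out as ordinary Java. The sketch below models only the choice, with tokens as plain strings; SemanticLookaheadDemo and chooseAlternative are hypothetical names, not part of the generated parser.

```java
import java.util.List;

// Hypothetical model of the choice in Input() with semantic lookahead.
final class SemanticLookaheadDemo {
    // k-th lookahead token, 1-based, like getToken(k).
    private static String la(List<String> toks, int k) {
        return k <= toks.size() ? toks.get(k - 1) : "<EOF>";
    }

    // Returns 1 for the alternative "a" and 2 for the alternative "a" "a".
    static int chooseAlternative(List<String> toks) {
        boolean firstIsA = la(toks, 1).equals("a");
        boolean secondIsA = la(toks, 2).equals("a");
        if (firstIsA && !secondIsA) return 1; // the semantic condition holds
        if (firstIsA) return 2;               // fall through to the second choice
        throw new IllegalStateException("expected \"a\" but found " + la(toks, 1));
    }
}
```

On the input a a the condition is false, so the second alternative is chosen without reordering the grammar.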

85
Complete LOOKAHEAD directive

LOOKAHEAD( amount,
expansion,
{ boolean_expression } ) followExpansion

• At least one of the three entries must be present.
• The default value of each entry is:
– { boolean_expression } -> { true }
– expansion -> followExpansion
– amount -> expansion present ? 2147483647 : 0
– Note: if amount = 0, no syntactic LOOKAHEAD is performed.

86
References on javacc lookahead
• https://javacc.dev.java.net/doc/lookahead.html
• http://userpages.umbc.edu/~vick/431/Lectures/Spring06/4_Parsing/1_LL/Looking_ahead_in_javacc.ppt

87
