Introduction to JavaCC
Cheng-Chia Chen
What is a parser generator
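[Figure: the character stream "Total = price + tax;" enters the Scanner, which emits tokens; the Parser builds the tree assignment(Total = Expr(id + id)), where the two ids are price and tax. The parser generator (JavaCC) produces the scanner and the parser from a lexical + grammar specification.]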
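To make the scanner half of the picture concrete, here is a minimal hand-written sketch in Java of what a generated scanner does for this one statement. The class name TinyScanner and the token kind names ID, ASSIGN, PLUS, and SEMI are assumptions made for illustration; they are not output of JavaCC.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner; JavaCC would generate equivalent code
// from a lexical specification.
class TinyScanner {
    // Splits the input into token descriptions, skipping white space.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // white space is skipped
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) {
                    i++;                               // take the longest run of letters/digits
                }
                tokens.add("ID(" + input.substring(start, i) + ")");
            } else if (c == '=') {
                tokens.add("ASSIGN"); i++;
            } else if (c == '+') {
                tokens.add("PLUS"); i++;
            } else if (c == ';') {
                tokens.add("SEMI"); i++;
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [ID(Total), ASSIGN, ID(price), PLUS, ID(tax), SEMI]
        System.out.println(scan("Total = price + tax;"));
    }
}
```

The parser then works on this token sequence rather than on individual characters.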
JavaCC
• JavaCC (Java Compiler Compiler) is a scanner and
parser generator.
• It produces a scanner and/or a parser written in Java, and
is itself written in Java.
• There are many parser generators.
– yacc (Yet Another Compiler-Compiler) for the C programming
language (dragon book, chapter 4.9);
– Bison from gnu.org
• There are also many parser generators written in Java
– JavaCUP;
– ANTLR;
– SableCC
More on the classification of Java parser generators
• Bottom-up parser generator tools
– JavaCUP;
– jay, YACC for Java www.inf.uos.de/bernd/jay
– SableCC, The Sable Compiler Compiler www.sablecc.org
• Top-down parser generator tools
– ANTLR, Another Tool for Language Recognition www.antlr.org
– JavaCC, Java Compiler Compiler javacc.dev.java.net
Features of JavaCC
• Top-down LL(k) parser generator
• Lexical and grammar specifications in one file
• Tree-building preprocessor
– with JJTree
• Extremely customizable
– many different options selectable
• Document generation
– by using JJDoc
• Internationalized
– can handle full Unicode
• Syntactic and semantic lookahead
Features of JavaCC (cont’d)
• Permits extended BNF specifications
– can use | * ? + () on the right-hand side.
• Lexical states and lexical actions
• Case-sensitive/insensitive lexical analysis
• Extensive debugging capability
• Special tokens
• Very good error reporting
JavaCC Installation
• Download the file javacc-4.X.zip from https://javacc.dev.java.net/
• Unzip javacc-4.X.zip to a directory %JCC_HOME%
• Add the %JCC_HOME%\bin directory to your %path%.
– javacc, jjtree, and jjdoc can now be invoked directly from the command
line.
Steps to use JavaCC
• Write a JavaCC specification (.jj file)
– Define the grammar and actions in a file (say, calc.jj)
• Run JavaCC to generate a scanner and a parser
– javacc calc.jj
– This generates the parser, scanner, token, … Java sources
• Write your program that uses the parser
– For example, UseParser.java
• Compile and run your program
– javac -classpath . *.java
– java -cp . mainpackage.MainClass
Example 1: parse a spec of regular expressions
and match it with input strings
• Grammar : re.jj
• Example
– % all strings ending in "ab"
– (a|b)*ab;
– aba;
– ababb;
• Our tasks:
– For each input string (lines 3 and 4), determine whether it matches the
regular expression (line 2).
The overall picture
[Figure: overall picture of processing re.jj]
Format of a JavaCC input Grammar
• javacc_options
• PARSER_BEGIN ( <IDENTIFIER>1 )
java_compilation_unit
PARSER_END ( <IDENTIFIER>2 )
• ( production )*
the input spec file (re.jj)
options {
USER_TOKEN_MANAGER=false;
BUILD_TOKEN_MANAGER=true;
OUTPUT_DIRECTORY="./reparser";
STATIC=false;
}
re.jj
PARSER_BEGIN(REParser)
package reparser;
import java.lang.*;
…
import dfa.*;
reparser.S();
}
}
PARSER_END(REParser)
re.jj (Token definition)
TOKEN : {
<SYMBOL: ["0"-"9","a"-"z","A"-"Z"] >
| <EPSILON: "epsilon" >
| <LPAREN: "(" >
| <RPAREN: ")" >
| <OR: "|" >
| <STAR: "*" >
| <SEMI: ";" >
}
SKIP: {
< ( [" ","\t","\n","\r","\f"] )+ >
|
< "%" ( ~ ["\n"] )* "\n" > { System.out.println(image); }
}
re.jj (productions)
void S() : { FA d1; }
{
d1 = R() <SEMI>
{ tg = d1; System.out.println("------NFA"); tg.print();
System.out.println("------DFA");
tg = tg.NFAtoDFA(); tg.print();
System.out.println("------Minimize");
tg = tg.minimize(); tg.print();
System.out.println("------Renumber");
tg=tg.renumber(); tg.print();
System.out.println("------Execute");
}
testCases()
}
re.jj
void testCases() : {}
{ (testCase() )+ }
String symbols() :
{Token token = null; StringBuffer result = new StringBuffer(); }
{
(
token = <SYMBOL>
{ result.append( token.image) ; }
)*
{ return result.toString(); }
}
re.jj (regular expression)
// R --> RUnit | RConcat | RChoice
FA RUnit() :
{ FA result ; Token d1; }
{
(
<LPAREN> result = RChoice() <RPAREN>
|
<EPSILON> { result = tg.epsilon(); }
|
d1 = <SYMBOL> { result = tg.symbol( d1.image ); }
)
{ return result ; }
}
re.jj
FA RChoice() : { FA result, temp ;}
{
result = RConcat()
( <OR> temp = RConcat() { result = result.choice( temp ) ;} )*
{return result ; } }
Format of a JavaCC input Grammar
javacc_input ::= javacc_options
PARSER_BEGIN ( <IDENTIFIER>1 )
java_compilation_unit
PARSER_END ( <IDENTIFIER>2 )
( production )*
<EOF>
Color usage:
– blue: nonterminal
– <orange>: a token type
– purple: token lexeme (a reserved word, i.e., the literal itself)
– black: meta symbols
Notes
• <IDENTIFIER> means any Java identifier like var, class2, …
– IDENTIFIER means the literal IDENTIFIER only.
• <IDENTIFIER>1 must equal <IDENTIFIER>2
• java_compilation_unit is any Java code that, as a whole, can legally
appear in a file.
– It must contain a main class declaration with the same name as
<IDENTIFIER>1 .
• Ex:
PARSER_BEGIN ( MyParser )
package mypackage;
import myotherpackage….;
public class MyParser { … }
class MyOtherUsefulClass { … } …
PARSER_END (MyParser)
The input and output of javacc
[Figure: javacc reads MyLangSpec.jj and generates Java sources, including MyParserTokenManager.java, MyParserConstants.java, Token.java, and TokenMgrError.java.]
Notes:
• Token.java and ParseException.java are the same for all inputs and
can be reused.
• The package declaration in *.jj is copied to all 3 outputs.
• Import declarations in *.jj are copied to the parser and token
manager files.
• The parser file is assigned the file name <IDENTIFIER>1 .java
• The parser file has contents:
…class MyParser { …
//generated parser is inserted here.
…}
• The generated token manager provides one public method:
Token getNextToken() throws TokenMgrError;
Lexical Specification with JavaCC
javacc options
javacc_options ::=
[ options { ( option_binding )* } ]
More Options
• LOOKAHEAD
– java_integer_literal (1)
• CHOICE_AMBIGUITY_CHECK
– java_integer_literal (2) for A | B … | C
• OTHER_AMBIGUITY_CHECK
– java_integer_literal (1) for (A)*, (A)+ and (A)?
• STATIC (true)
• DEBUG_PARSER (false)
• DEBUG_LOOKAHEAD (false)
• DEBUG_TOKEN_MANAGER (false)
• OPTIMIZE_TOKEN_MANAGER
– java_boolean_literal (false)
• OUTPUT_DIRECTORY (current directory)
• ERROR_REPORTING (true)
More Options
• JAVA_UNICODE_ESCAPE (false)
– replace \u2245 with the actual Unicode character (6 chars → 1 char)
• UNICODE_INPUT (false)
– input stream is in Unicode form
• IGNORE_CASE (false)
• USER_TOKEN_MANAGER (false)
– generate the TokenManager interface for the user’s own scanner
• USER_CHAR_STREAM (false)
– generate the CharStream.java interface for the user’s own input stream
• BUILD_PARSER (true)
– java_boolean_literal
• BUILD_TOKEN_MANAGER (true)
• SANITY_CHECK (true)
• FORCE_LA_CHECK (false)
• COMMON_TOKEN_ACTION (false)
– invoke void CommonTokenAction(Token t) after every getNextToken()
• CACHE_TOKENS (false)
Example: Figure 2.2
   Regular expression                           Token
1. if                                           IF
2. [a-z][a-z0-9]*                               ID
3. [0-9]+                                       NUM
4. ([0-9]+"."[0-9]*) | ([0-9]*"."[0-9]+)        REAL
5. ("--"[a-z]*"\n") | (" " | "\n" | "\t")+      nonToken, WS
6. .                                            error
• javacc notations
1. "if" or "i" "f" or ["i"]["f"]
2. ["a"-"z"](["a"-"z","0"-"9"])*
3. (["0"-"9"])+
4. (["0"-"9"])+ "." (["0"-"9"])* |
(["0"-"9"])* "." (["0"-"9"])+
JavaCC spec for the tokens from Fig 2.2
PARSER_BEGIN(MyParser) class MyParser{}
PARSER_END(MyParser)
/* For each regular expression on the right, the token kind on the left will be
returned */
TOKEN : {
< IF: "if" >
| < #DIGIT: ["0"-"9"] >
| < ID: ["a"-"z"] ( ["a"-"z"] | <DIGIT> )* >
| < NUM: (<DIGIT>)+ >
| < REAL: (<DIGIT>)+ "." (<DIGIT>)* |
(<DIGIT>)* "." (<DIGIT>)+ >
}
JavaCC spec for the tokens from Fig 2.2 (continued)
/* The regular expressions here will be skipped during lexical analysis */
SKIP : { < " " > | < "\t" > | < "\n" > }
/* For any substring not matching the lexical spec, javacc will throw an
error */
/* main rule */
void start() : {}
{ (<IF> | <ID> | <NUM> | <REAL>)* }
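The token rules above rely on two conventions: the longest match wins, and on a tie the rule listed first wins (so "if" is IF while "if8" is ID). The following Java sketch imitates that behavior with java.util.regex. It is an illustration only, not the code javacc generates; the class name Fig22Scanner and the "KIND:image" output format are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical longest-match scanner for the Figure 2.2 tokens.
class Fig22Scanner {
    // Token kinds in priority order: on equal length the earlier kind wins.
    static final String[] KINDS = { "IF", "ID", "NUM", "REAL" };
    static final Pattern[] PATTERNS = {
        Pattern.compile("if"),
        Pattern.compile("[a-z][a-z0-9]*"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+")
    };
    static final Pattern SKIP = Pattern.compile("[ \\t\\n]+");

    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            // White space is skipped and never produces a token.
            Matcher ws = SKIP.matcher(input).region(pos, input.length());
            if (ws.lookingAt()) { pos = ws.end(); continue; }

            int bestLen = 0;
            String bestKind = null;
            for (int k = 0; k < KINDS.length; k++) {
                Matcher m = PATTERNS[k].matcher(input).region(pos, input.length());
                // Strictly longer matches replace the current best, so ties keep the earlier kind.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestKind = KINDS[k];
                }
            }
            if (bestKind == null) {
                throw new IllegalArgumentException("Lexical error at position " + pos);
            }
            out.add(bestKind + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [IF:if, ID:if8, NUM:42, REAL:3.14, ID:x]
        System.out.println(scan("if if8 42 3.14 x"));
    }
}
```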
Grammar Specification with JavaCC
The Form of a Production
java_return_type java_identifier ( java_parameter_list ) :
java_block
{ expansion_choices }
• EX :
void XMLDocument(Logger logger): { int msg = 0; }
{ <StartDoc> { print(token); }
Element(logger)
<EndDoc> { print(token); }
| else()
}
Example ( Grammar 3.30 )
1. P → L
2. S → id := id
3. S → while id do S
4. S → begin L end
5. S → if id then S
6. S → if id then S else S
7. L → S
8. L → L;S
1,7,8 : P → S (;S)*
JavaCC Version of Grammar 3.30
PARSER_BEGIN(MyParser)
public class MyParser{}
PARSER_END(MyParser)
TOKEN: {
<WHILE: "while"> | <BEGIN: "begin"> | <END: "end">
| <DO: "do"> | <IF: "if"> | <THEN: "then">
| <ELSE: "else"> | <SEMI: ";"> | <ASSIGN: "=">
| <#LETTER: ["a"-"z"]>
| <ID: <LETTER>(<LETTER> | ["0"-"9"] )* >
}
JavaCC Version of Grammar 3.30 (cont’d)
void Prog() : { } { StmList() <EOF> }
void StmList(): { } {
Stm() ( ";" Stm() )*
}
void Stm(): { } {
<ID> "=" <ID>
| "while" <ID> "do" Stm()
| <BEGIN> StmList() <END>
| "if" <ID> "then" Stm() [ LOOKAHEAD(1) "else" Stm() ]
}
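A top-down parser for these productions has one method per nonterminal, which is the shape of the code javacc generates. The sketch below is a hand-written recursive-descent recognizer in Java, not actual javacc output. The class name Grammar330, the method accepts, and the simplification that tokens are separated by white space are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical recursive-descent recognizer for Prog, StmList, and Stm.
class Grammar330 {
    private static final List<String> KEYWORDS =
        Arrays.asList("while", "begin", "end", "do", "if", "then", "else");

    private final List<String> toks;   // tokens, assumed to be separated by white space
    private int pos = 0;

    Grammar330(String input) { toks = Arrays.asList(input.trim().split("\\s+")); }

    private String peek() { return pos < toks.size() ? toks.get(pos) : "<EOF>"; }

    private void expect(String t) {
        if (!peek().equals(t)) throw new IllegalStateException("Expected " + t + " but found " + peek());
        pos++;
    }

    private void id() {
        String t = peek();
        if (!t.matches("[a-z][a-z0-9]*") || KEYWORDS.contains(t))
            throw new IllegalStateException("Expected ID but found " + t);
        pos++;
    }

    // Prog ::= StmList <EOF>
    void prog() {
        stmList();
        if (pos != toks.size()) throw new IllegalStateException("Unexpected " + peek());
    }

    // StmList ::= Stm ( ";" Stm )*
    void stmList() { stm(); while (peek().equals(";")) { pos++; stm(); } }

    // Stm: one token of lookahead selects the alternative.
    void stm() {
        switch (peek()) {
            case "while": pos++; id(); expect("do"); stm(); break;
            case "begin": pos++; stmList(); expect("end"); break;
            case "if":
                pos++; id(); expect("then"); stm();
                if (peek().equals("else")) { pos++; stm(); }  // else binds to the nearest if
                break;
            default: id(); expect("="); id();
        }
    }

    static boolean accepts(String input) {
        try { new Grammar330(input).prog(); return true; }
        catch (IllegalStateException e) { return false; }
    }
}
```

For example, accepts("begin x = y ; while a do b = c end") returns true, while accepts("while a x = y") returns false because "do" is missing.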
Types of productions
• production ::= javacode_production
| regular_expr_production
| bnf_production
| token_manager_decls
Note:
1 and 3 are used to define the grammar.
2 is used to define tokens.
4 is used to embed code into the token manager.
JAVACODE production
• javacode_production ::= "JAVACODE"
java_return_type java_identifier "(" java_parameter_list ")"
java_block
• Note:
– Used to define nonterminals for recognizing something that is hard to
parse using normal productions.
Example JAVACODE
JAVACODE void skip_to_matching_brace()
{
  Token tok;
  int nesting = 1;
  while (true) {
    tok = getToken(1);
    if (tok.kind == LBRACE) nesting++;
    if (tok.kind == RBRACE) {
      nesting--;
      if (nesting == 0) break;
    }
    tok = getNextToken();
  }
}
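The same nesting-counter logic can be tried outside a generated parser. The Java sketch below runs it over a plain list of token strings; the class name BraceSkipper and the list-based token stream are assumptions for illustration. Unlike the JAVACODE version, this sketch stops and returns -1 when the tokens run out instead of looping.

```java
import java.util.List;

// Hypothetical stand-alone version of the brace-matching loop.
class BraceSkipper {
    // Assumes one "{" has already been consumed before index start.
    // Returns the index of the matching "}" (which is left unconsumed,
    // as in the JAVACODE production), or -1 if there is none.
    static int skipToMatchingBrace(List<String> tokens, int start) {
        int nesting = 1;
        for (int i = start; i < tokens.size(); i++) {
            String tok = tokens.get(i);            // corresponds to getToken(1): peek
            if (tok.equals("{")) nesting++;
            if (tok.equals("}")) {
                nesting--;
                if (nesting == 0) return i;        // matching brace found
            }
            // advancing i corresponds to getNextToken(): consume
        }
        return -1;
    }
}
```

With the tokens a { b } c } d and start 0, the inner pair is skipped and the method returns 5, the index of the second "}".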
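Note:
• Do not use a nonterminal defined by JAVACODE at a choice
point without giving a LOOKAHEAD.
• Problematic: the parser cannot tell when to choose skip_to_matching_brace()
void NT() : {} {
skip_to_matching_brace()
| some_other_production()
}
• Fine: the choice is decided by the preceding token
void NT() : {} {
"{" skip_to_matching_brace()
| "(" parameter_list() ")"
}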
TOKEN_MANAGER_DECLS
token_manager_decls ::=
TOKEN_MGR_DECLS : java_block
regular_expression_production
regular_expr_production ::=
[ lexical_state_list ]
regexpr_kind [ [ IGNORE_CASE ] ] :
{ regexpr_spec ( | regexpr_spec )* }
• regexpr_kind::=
TOKEN | SPECIAL_TOKEN | SKIP | MORE
lexical_state_list
lexical_state_list::=
< * > | < java_identifier ( , java_identifier )* >
• The lexical state list describes the set of lexical states for
which the corresponding regular expression production
applies.
• If this is written as "<*>", the regular expression
production applies to all lexical states. Otherwise, it
applies to all the lexical states in the identifier list within
the angle brackets.
• If omitted, the DEFAULT lexical state is assumed.
regexpr_spec
regexpr_spec::=
regular_expression1 [ java_block ] [ : java_identifier ]
• Meaning:
• When a regular_expression1 is matched then
– if java_block exists then execute it
– if java_identifier appears, then transition to that lexical state.
regular_expression
regular_expression ::=
java_string_literal
| < [ [#] java_identifier : ] complex_regular_expression_choices >
| <java_identifier>
| <EOF>
Example
<DEFAULT, LEX_ST2> TOKEN [IGNORE_CASE] : {
< FLOATING_POINT_LITERAL:
(["0"-"9"])+ "." (["0"-"9"])* (<EXPONENT>)? (["f","F","d","D"])? |
"." (["0"-"9"])+ (<EXPONENT>)? (["f","F","d","D"])? |
(["0"-"9"])+ <EXPONENT> (["f","F","d","D"])? |
(["0"-"9"])+ (<EXPONENT>)? ["f","F","d","D"] >
{ /* do something */ } : LEX_ST1
| < #EXPONENT: ["e","E"] (["+","-"])? (["0"-"9"])+ >
}
• Note: if # is omitted, E123 will be recognized erroneously
as a token of kind EXPONENT.
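As a rough cross-check of the four alternatives, the same literal can be written as a java.util.regex pattern. This is only an approximation of the javacc token for experimentation (it ignores IGNORE_CASE and lexical states); the class name FloatLiteral and the method isFloat are assumptions. EXPONENT is only a building block here, which mirrors the role of the # marker.

```java
import java.util.regex.Pattern;

// Hypothetical regex counterpart of FLOATING_POINT_LITERAL.
class FloatLiteral {
    // Private helper expression, like < #EXPONENT: ... >: never a token on its own.
    static final String EXPONENT = "[eE][+-]?[0-9]+";

    static final Pattern FLOAT = Pattern.compile(
          "[0-9]+\\.[0-9]*(" + EXPONENT + ")?[fFdD]?"    // 1.  1.5  1.5e10f
        + "|\\.[0-9]+(" + EXPONENT + ")?[fFdD]?"         // .5  .5e-3
        + "|[0-9]+" + EXPONENT + "[fFdD]?"               // 1e3  1e3d
        + "|[0-9]+(" + EXPONENT + ")?[fFdD]");           // 10f  10e2d

    static boolean isFloat(String s) { return FLOAT.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isFloat("1.5e10f"));  // true
        System.out.println(isFloat("E123"));     // false: an exponent alone is not a literal
    }
}
```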
Structure of complex_regular_expression
• complex_regular_expression_choices::=
complex_regular_expression (| complex_regular_expression )*
• complex_regular_expression ::=
( complex_regular_expression_unit )*
• complex_regular_expression_unit ::=
java_string_literal | < java_identifier >
| character_list
| ( complex_regular_expression_choices ) [+|*|?]
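• Note: how each level is built from the one below
– unit → complex_regular_expression: concatenation (juxtaposition)
– complex_regular_expression → complex_regular_expression_choices: choice ( | )
– complex_regular_expression_choices → unit: ( … ) [+|*|?]
• Principle:
Concatenation binds before choice; a repetition operator can be applied only after adding parentheses.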
character_list
character_list::=
[~] [ [ character_descriptor ( , character_descriptor )* ] ]
character_descriptor::=
java_string_literal [ - java_string_literal ]
java_string_literal ::= // reference to java grammar
" singleCharString* "
Note: java_string_literal here is restricted to length 1.
Ex:
– ~["a","b"] --- all chars but a and b.
– ["a"-"f", "0"-"9", "A","B","C","D","E","F"] --- hexadecimal digit.
– ["a","b"]+ is not a regular_expression_unit. Why?
• It should be written ( ["a","b"] )+ instead.
bnf_production
• bnf_production::=
java_return_type java_identifier "(" java_parameter_list ")"
":"
java_block
"{" expansion_choices "}"
expansion_unit
• expansion_unit::=
local_lookahead
| java_block
| "(" expansion_choices ")" [ "+" | "*" | "?" ]
| "[" expansion_choices "]"
| [ java_assignment_lhs "=" ] regular_expression
| [ java_assignment_lhs "=" ]
java_identifier "(" java_expression_list ")"
Notes:
1 is for lookahead; 2 is for a semantic action
4 = ( … )?
5 is for a token match
6 is for a match of another nonterminal
lookahead
• local_lookahead::= "LOOKAHEAD" "("
[ java_integer_literal ] [ "," ] [ expansion_choices ] [ "," ]
[ "{" java_expression "}" ] ")"
• Notes:
• 3 components: max # of lookahead tokens + syntax + semantics
• Examples:
– LOOKAHEAD(3)
– LOOKAHEAD(5, Expr() <INT> | <REAL> , { true } )
• More on LOOKAHEAD
– see the mini-tutorial on javacc.dev.java.net
JavaCC API
• Non-terminals in the input grammar
• If NT is a nonterminal, then
returntype NT(parameters) throws ParseException;
is generated in the parser class
Token class
• public int kind;
– 0 for <EOF>
• public int beginLine, beginColumn, endLine, endColumn;
• public String image;
• public Token next;
• public Token specialToken;
• public String toString() { return image; }
• public static final Token newToken(int ofKind)
Error reporting and recovery
• It is not user-friendly to throw an exception and exit
parsing on encountering the first syntax error.
• Two exceptions
– ParseException: can be recovered from
– TokenMgrError: not expected to be recovered from
• Error reporting
– modify ParseException.java or TokenMgrError.java
– the generateParseException method can always be invoked in a parser
action to report an error
Error Recovery in JavaCC:
• Shallow Error Recovery
• Deep Error Recovery
Shallow recovery
An error at the choice point can be recovered from by an additional choice:
void Stm() : {} {
IfStm()
| WhileStm()
| error_skipto(SEMICOLON)
}
where
JAVACODE
void error_skipto(int kind) {
ParseException e = generateParseException(); // generate the exception object
System.out.println(e.toString()); // print the error message
Token t;
do { t = getNextToken(); } while (t.kind != kind);
}
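The effect of error_skipto can be seen in a small stand-alone simulation. The Java sketch below parses statements of a made-up form, ID "=" ID ";", over a list of token strings and, on an error, discards tokens up to and including the next ";" before continuing. The class name PanicMode, the statement form, and the log format are assumptions for illustration, not part of JavaCC.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of skip-to-semicolon recovery.
class PanicMode {
    static boolean isId(String s) { return s.matches("[a-z]+"); }

    static List<String> parse(List<String> tokens) {
        List<String> log = new ArrayList<>();
        int pos = 0;
        while (pos < tokens.size()) {
            boolean ok = pos + 3 < tokens.size()
                    && isId(tokens.get(pos)) && tokens.get(pos + 1).equals("=")
                    && isId(tokens.get(pos + 2)) && tokens.get(pos + 3).equals(";");
            if (ok) {
                log.add("ok: " + tokens.get(pos) + " = " + tokens.get(pos + 2));
                pos += 4;
            } else {
                log.add("error at token " + pos);          // report, then recover
                // Like error_skipto(SEMICOLON): consume up to and including ";".
                while (pos < tokens.size() && !tokens.get(pos).equals(";")) pos++;
                pos++;
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // prints [ok: a = b, error at token 4, ok: d = e]
        System.out.println(parse(Arrays.asList("a = b ; c + ; d = e ;".split(" "))));
    }
}
```

The statement after the faulty one is still parsed, which is the point of recovering instead of stopping at the first error.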
Deep Error Recovery
• Same example: void Stm() : {} { IfStm() | WhileStm() }
• But this time the error occurs during parsing inside
IfStm() or WhileStm() rather than at the lookahead entry.
• The approach: use the Java try-catch construct.
void Stm() : {} {
try {
( IfStm() | WhileStm() )
} catch (ParseException e) {
error_skipto(SEMICOLON);
}
}
Note: this try-catch form is new syntax for a javacc bnf_production.
References:
• JavaCC documentation :
https://javacc.dev.java.net/doc/docindex.html
Looking ahead in javacc
What’s LOOKAHEAD?
void Input() :
{}
{
"a" BC() "c"
}
void BC() :
{}
{
"b" [ "c" ]
}
• What strings are matched by Input()?
– abcc (yes)
– abc (no!!)
• Why?
– javacc's default greedy lookahead algorithm.
Why is abcc matched?
(the input remaining before each step is shown after the colon)
• Input() :abcc
• "a" consume a :abcc
• BC() :bcc
• "b" consume b :bcc
• ["c"] greedily consume c :cc
• "c" consume c :c
• succeed! :
Why is abc not matched?
• Input() :abc
• "a" consume a :abc
• BC() :bc
• "b" consume b :bc
• ["c"] greedily consume c :c
• (even though consuming nothing here would be better)
• "c" needs a 'c' but no input is left : no match
• fail!
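The two traces can be reproduced with a few lines of Java that make the same greedy decision: inside BC, the optional "c" is taken whenever the next character is a c, with no backtracking. This is a hand-written sketch, not javacc output; the class name GreedyDemo and the extra requirement that the whole input be consumed are assumptions.

```java
// Hypothetical greedy, non-backtracking recognizer for Input and BC.
class GreedyDemo {
    private final String in;
    private int pos = 0;

    GreedyDemo(String in) { this.in = in; }

    private boolean next(char c) { return pos < in.length() && in.charAt(pos) == c; }

    private void expect(char c) {
        if (!next(c)) throw new IllegalStateException("Expected '" + c + "' at position " + pos);
        pos++;
    }

    // Input ::= "a" BC "c"
    void input() { expect('a'); bc(); expect('c'); }

    // BC ::= "b" [ "c" ]   (the option is taken whenever a 'c' comes next)
    void bc() { expect('b'); if (next('c')) pos++; }

    static boolean matches(String s) {
        GreedyDemo p = new GreedyDemo(s);
        try { p.input(); return p.pos == s.length(); }
        catch (IllegalStateException e) { return false; }
    }

    public static void main(String[] args) {
        System.out.println(matches("abcc"));  // true
        System.out.println(matches("abc"));   // false: BC already took the only 'c'
    }
}
```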
How to match both inputs?
• Rewrite the grammar!
• Increase the lookahead number
What about these rewritings?
The Default Algorithm for Choice |
• The default choice determination algorithm looks ahead 1
token in the input stream and uses this to help make its
choice at choice points.
void basic_expr() :
{}
{
<ID> "(" expr() ")" // Choice 1
|
"(" expr() ")" // Choice 2
|
"new" <ID> // Choice 3
}
The choice determination algorithm:
if (next token is <ID>) {
choose Choice 1
} else if (next token is "(") {
choose Choice 2
} else if (next token is "new") {
choose Choice 3
} else {
produce an error message
}
A Modified Grammar
void basic_expr() :
{}
{
<ID> "(" expr() ")" // Choice 1
|
"(" expr() ")" // Choice 2
|
"new" <ID> // Choice 3
|
<ID> "." <ID> // Choice 4
}
• What happens on <ID>? Why?
Warning: Choice conflict involving two expansions at line 25,
column 3 and line 31, column 3 respectively. A common
prefix is: <ID> Consider using a lookahead of 2 for earlier expansion.
Greedy behavior for (…)*
void identifier_list() :
{}
{
<ID> ( "," <ID> )*
}
• Note: the choice determination algorithm does not look beyond the (...)*.
• Suppose the first <ID> has already been matched and that the parser has
reached the choice point (the (...)* construct). Here's how the choice
determination algorithm works:
if (next token is ",") {
go into the (...)* construct
} else {
go beyond the (...)* construct
}
Another Example
void funny_list() :
{}
{
identifier_list() "," <INT>
}
• When making a choice at ( "," <ID> )*, the parser will always go into
the (...)* construct if the next token is a ",".
– It will do this even when identifier_list was called from
funny_list and the token after the "," is an <INT>.
– Intuitively, the right thing to do in this situation is to skip
the (...)* construct and return to funny_list.
A Concrete Input
• On the input "id1, id2, 5",
the parser will complain that it encountered a 5 when it was
expecting an <ID>.
• Note: during parser generation, javacc gives a choice conflict
warning for this construct.
Multiple Token Lookaheads Specs
• The default algorithm works fine in most situations. In
cases where it does not work well, javacc provides you
with warning messages.
Option 1 - Modify your grammar
• You can modify your grammar so that the warning
messages go away. That is, you can attempt to make
your grammar LL(1) by making some changes to it.
• But this does not always work!
Before:
void basic_expr() : {} {
<ID> "(" expr() ")" // Choice 1
|
"(" expr() ")" // Choice 2
|
"new" <ID> // Choice 3
|
<ID> "." <ID> // Choice 4
}
After factoring the common left parts:
void basic_expr() : {} {
<ID> ( "(" expr() ")" | "." <ID> )
|
"(" expr() ")"
|
"new" <ID>
}
Factoring does not always work!!
void basic_expr() : {}
{
{ initMethodTables(); } <ID> "(" expr() ")"
|
"(" expr() ")"
|
"new" <ID>
|
{ initObjectTables(); } <ID> "." <ID>
}
• Global option LOOKAHEAD
– options { LOOKAHEAD=2; … }
• Local lookahead:
void basic_expr() :
{}
{
LOOKAHEAD(2)
<ID> "(" expr() ")" // Choice 1
|
"(" expr() ")" // Choice 2
|
"new" <ID> // Choice 3
|
<ID> "." <ID> // Choice 4
}
The resulting choice determination algorithm:
if (next 2 tokens are <ID> and "(") {
choose Choice 1
} else if (next token is "(") {
choose Choice 2
} else if (next token is "new") {
choose Choice 3
} else if (next token is <ID>) {
choose Choice 4
} else {
produce an error message
}
void identifier_list() :
{}
{
<ID>
( LOOKAHEAD(2) "," <ID> )*
}
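The difference between one and two tokens of lookahead at the ( "," <ID> )* choice point can be simulated directly. In the Java sketch below, k is the lookahead used at that choice point and the result is the number of tokens identifier_list consumes, or -1 for a parse error. The class name LookaheadDemo, the list-of-strings token stream, and the -1 convention are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of identifier_list with lookahead 1 or 2.
class LookaheadDemo {
    static boolean isId(String s) { return s.matches("[a-z][a-z0-9]*"); }

    // identifier_list ::= <ID> ( "," <ID> )*
    static int identifierList(List<String> toks, int k) {
        int pos = 0;
        if (pos >= toks.size() || !isId(toks.get(pos))) return -1;
        pos++;
        while (pos < toks.size() && toks.get(pos).equals(",")) {
            boolean idFollows = pos + 1 < toks.size() && isId(toks.get(pos + 1));
            if (k >= 2 && !idFollows) break;   // LOOKAHEAD(2): leave the loop, "," stays unconsumed
            if (!idFollows) return -1;         // lookahead 1: already committed to the loop, so it fails
            pos += 2;                          // consume "," and <ID>
        }
        return pos;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("id1", ",", "id2", ",", "5");
        System.out.println(identifierList(input, 1));  // -1
        System.out.println(identifierList(input, 2));  // 3
    }
}
```

With two tokens of lookahead the list stops after id2, leaving "," and 5 for the caller.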
Syntactic lookahead
• How much lookahead is needed for the Java type
declaration?
void TypeDeclaration() :
{}
{
ClassDeclaration() // public static final class
|
InterfaceDeclaration() // public abstract abstract interface
}
Solution 1
void TypeDeclaration() :
{}
{
LOOKAHEAD(2147483647) ClassDeclaration()
|
InterfaceDeclaration()
}
Solution 2 – syntactic lookahead
void TypeDeclaration() :
{}
{
LOOKAHEAD( ClassDeclaration() )
ClassDeclaration()
|
InterfaceDeclaration()
}
Solution 3 – a better one
void TypeDeclaration() :
{}
{
LOOKAHEAD( ( "abstract" | "final" | "public" )* "class" )
ClassDeclaration()
|
InterfaceDeclaration()
}
Solution 4 – syntactic lookahead + number bound
Semantic lookahead
• Could we make the parser choose the 2nd alternative on
input "a" "a" without changing the order?
void Input() : {}
{
"a"
| "a" "a"
}
Solution: semantic lookahead
void Input() : {}
{
LOOKAHEAD( { getToken(1).kind == A && getToken(2).kind != A } )
<A:"a">
| "a" "a"
}
• Syntactic + semantic
void Input() : {}
{
LOOKAHEAD("a", { getToken(2).kind != A } )
<A:"a">
| "a" "a"
}
Complete LOOKAHEAD directive
LOOKAHEAD( amount,
expansion,
{ boolean_expression } ) followExpansion
References on javacc lookahead
• https://javacc.dev.java.net/doc/lookahead.html
• http://userpages.umbc.edu/~vick/431/Lectures/Spring06/4_Parsing/1_LL/Looking_ahead_in_javacc.ppt