|
XIDEK
|
CLL SyntaxIntroductionThe file cll.syn contains the syntax that describes the C-like Language used by one set of baseline interpreters in the Extensible Interpreter Development Kit. This syntax supports the basic elements of a script language: recognition of variable names and constants, evaluation of expressions, and execution of conditional and loop statements, all based on the familiar C syntax. These may fairly be regarded as the portions of C that have become more or less standard and are not the subject of any significant controversy. Omitted are declarations, pointers, subscripts, and switch statements.The syntax also contains three additional elements. To illustrate changes in expression syntax, Fortran style exponentiation has been added. To aid in debugging scripts and to illustrate how to add new statements to the grammar, dump and print statements have been added. The files dci.syn and ast.syn use this syntax unchanged except for the addition of reduction procedures. The file dxi.syn introduces a number of syntactic changes in order to facilitate computation. The changes do not, however, affect the language recognized by the syntax. cll.syn is a convenient peg on which to hang a discussion of those features of the script language and the parsers that are common to all the interpreters in the development kit. Apart from the lexical section of the grammar, it lacks reduction procedures, since they differ from interpreter to interpreter. cll.syn is not itself used in any of the interpreters in the kit. It can be used by developers as a clean starting point for their own development. This document describes the elements of cll.syn in the order in which they appear.
C PrologueThe "C Prologue" is the initial piece of embedded C/C++ code in a syntax file. AnaGram copies it to the parser file before writing any generated code, so it is the appropriate place to include definitions and declarations needed by the parser. In particular, any data type that is used as a token type must be defined in the C prologue, either explicitly or by means of an include file.There are two header files that are required for all the parsers:
Configuration Section
IntroductionThe "configuration section" of a syntax file contains switches and declarations used to specify specific attributes of a parser. The following configuration parameters are common to all of the syntax files in the kit.
Statement Syntax
IntroductionA script is a sequence of statements. The types of statements supported are the same as in C with the exception of switch statements and declarations. In addition, there are simple dump and print statement.
If and If-Else StatementsIt is conventional to write the syntax for the if and if-else statements in C and C++ as follows:
if statement
-> "if", '(', expression, ')', statement
-> "if", '(', expression, ')', statement, "else", statement
This has the advantage of succinctness and clarity. It has the disadvantage
of ambiguity. In the statement
if (condition_one) if (condition_two) do_something();
else do_something_different();
Does the else belongs to the first or the second if?
Although, by convention
the else belongs to the second if, that is, to the
if physically closest to
it, both interpretations are consistent with the above syntax. This problem
is often referred to as the "dangling else problem", or the "if-else
ambiguity".
There are two ways to deal with the problem: resort to some sort of trick to force the parser to select the correct interpretation, or rewrite the grammar to avoid the difficulty. Using AnaGram, it is possible to use a sticky directive to handle the problem. Generally, however, it is best not to resort to such tricks. The parsers in the development kit, therefore, have been written in such a manner as to avoid this ambiguity. In order to support this technique, statements are classified as "open statements" or "closed statements", depending on whether or not they could be followed by an else clause. There are, therefore, both open and closed if statements:
open statement
-> if condition, statement
-> if condition, closed statement, "else", open statement
closed statement
-> if condition, closed statement, "else", closed statement
Notice that a simple if statement is always an open statement.
If-else statements are open or closed depending on whether the statement
following the else is open or closed.
The script language designer should note that the dangling else problem is a direct result of what is fundamentally unnecessary generality in the definition of C. If the definitions were
if statement
-> "if", '(', expression, ')', compound statement
-> "if", '(', expression, ')', compound statement,
"else", compound statement
there would be no problem. In other words, by allowing the user to avoid
writing a pair of curly braces, substantial complication has been
introduced into the language, as well as substantial opportunities for
introducing bugs into programs. Certainly every programmer has seen the
following problem:
if (condition)
do_something();
do_something_else();
where the call to do_something_else() is not controlled by the if, in spite
of the suggestive indentation. If the language had been defined to only
allow compound statements to be controlled by if statements, this opportunity
for error would not exist.
While StatementsWhile statements are exactly as in C. In order to handle the dangling else problem, the syntax contains two while statements: an open while statement and a closed while statement.
open statement
-> WHILE, '(', expression, ')', open statement
closed statement
-> WHILE, '(', expression, ')', closed statement
Note that a while statement is open or closed depending on whether the
statement controlled by the while is open or closed.
For StatementsFor statements are exactly as in C. In order to handle the dangling else problem, the syntax contains two for statements: an open for statement and a closed for statement.
open statement
-> FOR, '(', optional expression, ';',
optional expression, ';',
optional expression, ')', open statement
closed statement
-> FOR, '(', optional expression, ';',
optional expression, ';',
optional expression, ')', closed statement
Note that a for statement is open or closed depending on whether the
statement controlled by the for is open or closed.
Do-While StatementsThe do-while statement is exactly as in C. Since the statement controlled by the do and while keywords is entirely enclosed by them, the scope of an if between the do and while keywords is limited by the while. Thus there is no opportunity for a dangling else, and the do-while statement is a closed statement.
Break and Continue StatementsThe break and continue statements are exactly as in C. These statements, though apparently simple, are among the most difficult to implement in interpreters. The reason is that all other statements fit into a neat hierarchical structure. The break and continue statements do not fit neatly in that hierarchy. Thus a good deal of the complexity of the interpreters in the development kit is directly a result of choosing to implement break and continue. A script language designer might well choose to avoid break and continue in the interest of simplicity.
Expression StatementsAn expression statement is exactly like an expression statement in C: simply an expression followed by a semicolon.
Null StatementThe conventional definition of an expression statement consists of an optional expression followed by a semicolon. It is more convenient to think of the expression statement as always having a non-vanishing expression, and adding a "null statement" to deal with a free-standing semicolon.
Compound StatementA compound statement, as in C, consists of zero or more statements enclosed in curly braces.
Return StatementA return statement, as in C, consists of the keyword "return" and an optional expression. The return statement causes the interpreter to exit to the calling program returning the value of the specified expression. If no expression is provided, the return value is a Value object with type equal to uninitType.
Dump StatementThe dump statement has been added to aid in testing and debugging. It also serves as an example of how to add non-C statements to the syntax.The dump statement consists of the keyword "dump", followed by a comma-delimited list of variable names. All the interpreters implement this statement by printing the name of each listed variable and its value, each variable on a line by itself.
Print StatementThe print statement is a second example of adding a non-C statement to the syntax. It takes a comma delimited list of assignment expressions and prints each value in turn. Note that the print statement could just as well have been written
print statement
-> "print", arg list
Expression Syntax
IntroductionThe expression syntax used for all the interpreters in the development kit is almost that of C and C++. It does not support subscripts, pointers, casts other than (long) or (double), or the general function call syntax. On the other hand, it does support exponentiation using the ** operator as in Fortran. If exponentiation is not desired, it may be removed quite easily, as described here.The expression syntax, following conventional usage, implements operator precedence using normal syntactic conventions. It does not use shortcuts such as operator precedence declarations. The conventional method for writing expression syntax is to define a token for each level of precedence, beginning with the lowest level. At least two rules are written for each of these tokens. The first rule simply consists of the token that defines the next higher level of precedence. The other rules define the actual operators at the given precedence level. These latter rules are written as left or right recursive rules, depending on whether the relevant operator groups to the left or to the right. Operators are said to "group to the left" or "group to the right" depending on the implicit parenthesization that is understood. Subtraction, for example, groups to the left. We interpret
a - b - c
as
(a - b) - c
It would contravene mathematical conventions to compute it as
a - (b - c)
On the other hand, the logical or operator groups to the right:
x || y || z
is fairly interpreted as
x || (y || z)
From a mathematical point of view, associative operators may be grouped
either to the left or the right. Nonassociative operators, to be
consistent with conventional usage, must be grouped to the left.
Since grouping operators to the right requires right recursion and right recursion runs a remote risk of parser stack overflow, it is preferable to group operators to the left wherever it makes sense to do so. In the following, the tokens that define expression syntax are described in order of increasing precedence.
expression
expression
-> assignment expression
-> expression, ',', assignment expression
These rules define the comma operator as the lowest level of precedence in
the expression syntax. Since expression is left recursive, items
group to the left. The next higher level of precedence is specified by
assignment expression.
Note that the expression to the left of the comma must be evaluated and its value discarded in order to allow for side effects. The proper functioning of the initialization, condition, and increment fields of the for statement depend on this property of the comma operator. Note that if you need a comma delimited list of expressions, as in arg list and the print statement, you should use assignment expression rather than expression to avoid confusion with the comma operator.
assignment expression
assignment expression
-> conditional expression
-> name, assignment op, assignment expression
assignment op
-> '='
-> "+="
-> "-="
-> "*="
-> "/="
-> "%="
-> "|="
-> "&="
-> "^="
-> "<<="
-> ">>="
Assignment operators group to the right, since it would not make sense for
them to group to the left. They take precedence over the comma operator.
conditional expression
conditional expression
-> logical or expression
-> logical or expression, '?', expression,
':', conditional expression
The conditional expression is a ternary operator that groups to the right.
The conditional expression, as defined in C, presents an implementation
challenge, since of the two expressions selected by the logical expression,
only one is to be executed. If both were to be executed and the results of
one simply discarded, there would be a problem with possible side effects
created by the expression that was discarded.
The ? operator groups to the right and takes precedence over the comma operator.
logical or expression
logical or expression
-> logical and expression
-> logical and expression, "||", logical or expression
The logical or expression groups to the left. Note that like
the conditional expression and the logical and
expression there is a problem with side effects since the expression
to the right of the operator is to be evaluated only if the expression to
the left returns false.
The || operator groups to the right and takes precedence over ?.
logical and expression
logical and expression
-> inclusive or expression
-> inclusive or expression, "&&", logical and expression
The logical and expression groups to the left. Note that like
the conditional expression and the logical or
expression there is a problem with side effects since the expression
to the right of the operator is to be evaluated only if the expression to
the left returns true.
The && operator groups to the right and takes precedence over ||.
inclusive or expression
inclusive or expression
-> exclusive or expression
-> inclusive or expression, '|', exclusive or expression
The | operator groups to the left and takes precedence over &&.
exclusive or expression
exclusive or expression
-> and expression
-> exclusive or expression, '^', and expression
The ^ operator groups to the left and takes precedence over |.
and expression
and expression
-> equality expression
-> and expression, '&', equality expression
The & operator groups to the left and takes precedence over ^.
equality expression
equality expression
-> relational expression
-> equality expression, equality op, relational expression
equality op
-> "=="
-> "!="
Equality operators group to the left and take precedence over
&.
relational expression
relational expression
-> shift expression
-> relational expression, relational op, shift expression
relational op
-> '<'
-> "<="
-> '>'
-> ">="
Relational operators group to the left and take precedence over
equality operators.
shift expression
shift expression
-> additive expression
-> shift expression, shift op, additive expression
shift op
-> "<<"
-> ">>"
Shift operators group to the left and take precedence over
relational operators.
additive expression
additive expression
-> multiplicative expression
-> additive expression, additive op, multiplicative expression
additive op
-> '+'
-> '-'
Note that additive operators group to the left and take precedence over
shift operators.
multiplicative expression
multiplicative expression
-> unary expression
-> multiplicative expression, multiplicative op, unary expression
multiplicative op
-> '*'
-> '/'
-> '%'
Multiplicative operators group to the left and take precedence over
additive operators.
unary expression
unary expression
-> factor
-> '+', unary expression
-> unary op, unary expression
unary op
-> '-'
-> '!'
-> '~'
The unary + is treated separately, since it does not normally
require any computation.
Unary operators group to the right and take precedence over multiplicative operators.
factor
factor
-> primary
-> primary, "**", unary expression
The exponentiation operator has a distinctive operator precedence, in that
its precedence is different to the right and to the left. It takes
precedence over leading minus signs for example:
-x ** y
is construed as
-(x ** y)
but at the same time, it is possible to write:
x ** -y
without the need for parentheses around the right hand operand. These
rules reflect normal mathematical conventions, though they are difficult
to achieve with simple precedence grammars.
To remove support for exponentiation, it is sufficient to rewrite the productions for unary expression replacing factor with primary:
unary expression
-> primary
-> '+', unary expression
-> unary op, unary expression
primary
primary
-> '(', expression, ')'
-> constant
-> lvalue
-> "++", lvalue
-> "--", lvalue
-> lvalue, "++"
-> lvalue, "--"
-> function call
-> '(', "long", ')', primary
-> '(', "double", ')', primary
Because the development kit language does not support arbitrary casts, subscripts,
or pointers, primary is somewhat different from the
corresponding token definition in a standard C syntax.
Note that primary has simple right recursion for the two cast operators that are defined and the much more complex center recursion for parentheses. Otherwise, primary is the leaf node of the expression tree. Unlike C and C++, the auto-increment and auto-decrement operators are defined simply as operating directly on variables. This is possible because the development kit does not support pointers.
function call
function call
-> name, '(', optional arg list, ')'
The function call definition
is designed to allow function calls with a comma delimited argument
list. The rules for the arg list token use the assignment
expression token instead of expression since, in a calling
sequence, comma is used to delimit arguments. If the grammar had been
written only for use with parsers that generate an abstract syntax tree, it
would have been possible to write the function call rule as:
function call
-> name, '(', optional expression, ')'
since tree-walk functions could separate out the individual arguments.
lvalueAn lvalue is a reference to a value. In the Value class, lvalues are implemented as pointers. It would have been possible to implement the interpreters without using an lvalue construct, but extensions, such as adding arrays and subscripting, would have been noticeably more complex, since they would require changes to the baseline syntax rather than the simple addition of new productions.
constantThe interpreters in the kit support recognition of real, integer, string and character constants. In each case the constant value is stored in a Value object.
nameThe syntactic rules for name are the same as for variable names in C and C++. Names consist of an arbitrary length string of alphabetic characters, digits and underscores, subject to the side condition that the first character is not a digit.
Lexical RulesIntroductionAnaGram parsers do not customarily use separate lexical scanners. Instead they use conventional Backus-Naur form to describe the grammar right down to the character level. The significant difference between AnaGram and conventional parsers, is that the terminal tokens represent character sets rather than individual characters, although many character sets may well consist of a single character.Character set expression may be used directly as tokens in any rule. As an alternative, they may be named for convenience. In this grammar, four character sets are given explicit names. Since the parser is configured to take its input from a string in memory, eof is defined simply as a null byte. If, on the other hand, the parser were intended to take stream input, one might write:
eof = -1 + 0 + ^D + ^Z
Such a definition would allow for all of the conventional representations
of end of file.
The definitions of the name token and the white space token are self-explanatory.
String ConstantsString constants follow the same rules as C and C++. Support for strings is entirely dependent on the handling of string values in the Value class.
Numeric ConstantsThe recognition of numeric constants would be self-explanatory except for one significant problem. In the C/C++ scheme of things, the conventional scheme for representing an octal integer is to begin it with a leading zero. Decimal integers must begin with a non-zero digit, so there is no difficulty in distinguishing decimal from octal integers. The difficulty arises because there is no restriction on the initial digit of a floating point number.The upshot of this is that what starts out looking as though it is an octal integer can suddenly turn into a floating point number. If this should happen, the octal conversion is wrong and needs to be redone. The makeDecimal() function recovers the original digits and reconverts them as decimal. The token hybrid integer is used to accumulate these constants.
|
| Table of Contents | | | Parsifal Software Home Page |
|---|
Interpreter Development Kit
Copyright © 1997-2002, Parsifal Software.
All Rights Reserved.