|
Language Processing v2.0 |
|||||||||
| PREV PACKAGE NEXT PACKAGE | FRAMES NO FRAMES | |||||||||
See:
Description
| Interface Summary | |
|---|---|
| BnfParser.yyInput | must be implemented by a scanner object to supply input to the parser. |
| Dump.Helper | what a dump helper must do. |
| RfcBuilder.Suffix | what a suffix must do. |
| RfcParser.yyInput | must be implemented by a scanner object to supply input to the parser. |
| Class Summary | |
|---|---|
| BnfBuilder | builder for recognizer built from bnf.rfc. |
| BnfParser | pj2-based parser for the BNF input language. |
| BnfParser.yyLex | |
| Boot | command line for the parser generator bootstrapped with oops3
(only first letter of each option is significant). |
| Dump | dump trees. |
| Gen | generate code for pj2. |
| Launcher | command line for self-compiled parser generator (only first letter of each option is significant). |
| Parser | factory to represent a grammar in jay-style BNF. |
| RfcBuilder | builder for recognizer built from rfc.rfc. |
| RfcParser | oops3- and pj2-based parser for the RFC input language. |
| RfcParser.yyLex | |
| Utils | useful functions for parsers. |
| Visitor | infrastructure for the visitor pattern. |
| Exception Summary | |
|---|---|
| BnfParser.yyException | thrown for irrecoverable syntax errors and stack overflow. |
| RfcParser.yyException | thrown for irrecoverable syntax errors and stack overflow. |
Compiler frontend generator.
This is the homepage of pj2, a preprocessor for jay and JLex which largely automates the
generation of trees from programs.
pj2 reads a grammar annotated with Java class names, and generates a scanner with JLex, grammar rules for jay which
include error recovery in iterations, and a template method yybuild as the link between the
parser constructed by jay and actions carried out during recognition. As implemented, yybuild builds trees that are suitable for processing with vig or expresses the input in XML,
but the method can be overridden to implement other observation activities.
pj2 supports several input languages:
jay with self-defining literals, tokens defined with patterns,
precedence levels, and rules with alternatives. pj2 combines ideas from an earlier preprocessor pj with the architecture and tree
building paradigm of oops3. It is available through wcs and for binary and source Downloads.
The following diagrams are linked to the class documentation and illustrate the data and control flow when creating a parser to build trees from arithmetic expressions and when bootstrapping pj2 itself. Command lines to carry out these operations can be found below, see Generating code.
black arrows: create tree builder for arithmetic expressions
For BNF the main program is part of BnfParser. The command line is interpreted by Launcher which uses BnfParser to read the grammar expr.bnf and request BnfBuilder to create a parser tree over
Parser. Similarly, for expr.rfc the RFC-specific classes
RfcParser and RfcBuilder are used instead.
Launcher will usually invoke Gen to visit the parser tree, let JLex construct a scanner, create rules,
and emit a source file which can be processed using jay and compiled, see Generating a frontend.
purple arrows: compile arithmetic expression into tree
The resulting program can read input conforming to the original grammar, represent it as a tree, and
serialize the result. The serialized tree could be displayed using Dump.
red arrow: control code templates
The resulting program is created from a code template resource which can be selected with the
property Gen.properties. This can be used to implement a visitor pattern for the tree or to
represent the tree with XML.
blue arrows: create bootstrap parser
The RFC grammar rfc.rfc describes the RFC input language. oops3 is used to produce a serialized parser tree from this grammar; this is possible because there is a
common subset of RFC which is acceptable to both parser generators, oops3 and pj2.
red arrows: bootstrap parser tree
Boot interprets an oops3-style command line and is instructed to load the
serialized parser (over the oops3 parser library), read the same RFC grammar rfc.rfc (using the oops3 scanner), and reflect on RfcBuilder to build another parser tree for the grammar, this time over Parser.
black arrows: compile parser tree into parser
Gen visits the parser tree and sends rules and Java code to jay, combined with
Java code for a scanner produced by JLex from yyLex.lex. jay constructs the parser tables from the rules and emits
Java code under control of pj2.skeleton which is compiled
into a frontend RfcParser.
green arrows: self-create parser tree
Launcher interprets a command line and is instructed to call RfcParser to
read the RFC grammar rfc.rfc using the scanner RfcParser.yyLex, and reflect on RfcBuilder to build another parser tree for the grammar,
again over Parser. This tree is processed as before and will again produce RfcParser.
The BNF grammar bnf.rfc describes the BNF input language for input
to oops3 so that almost the same procedure can be applied to create BnfParser using
BnfBuilder. However, the parser trees then recognize BNF rather then RFC, i.e., the BNF
grammar additionally has to be described in BNF, see bnf.bnf, to be
acceptable to the BNF parser. This grammar was carefully crafted to fit the same builder -- reflection
amounts to a very loose coupling between recognition and tree building, see Writing an observer.
pj2 generates code for jay which implements a LALR(1)-based algorithm for recognition.
Therefore, the simplest input language for pj2 is a variant of BNF with some additional
information. However, pj2 is implemented using pj2 -- a bootstrap parser is created using
oops3 -- and the input language can easily be changed. This section describes the BNF input language.
The input format for pj2 borrows from jay. Literals can have any length, are enclosed
in single quotes, and appear inline. Tokens must be defined as patterns before they are used and must be
typed; void as type of a pattern indicates that matches are not passed to the parser. Patterns
are enclosed in single quotes; backslash escape sequences follow Java conventions.
Input to pj2 has three sections, separated by lines containing %% only. Text in the first and last section is arbitary; it can be used for Java code to embed code generated by pj2, see Generating code.
The second section is free format. White space and comments ranging from # to the end of the line are ignored.
The second section contains token definitions, grammar rules, and precedence levels in any order. However, a token must be defined before it can be used in a rule and the first grammar rule defines the start symbol of the grammar.
| # token definition <type> name = string; |
| # integer represented as Long <Long> Number = '[0-9]+'; # comment and white space to be skipped <void> comment = '#.*'; <void> spaces = '[ \n\r\t]+'; |
A token definition associates a token name with a pattern string. Strings are enclosed in single quotes and use Java-style backslash escapes. The pattern string can be handed to JLex as a macro definition for scanner generation, see Generating code.
A token definition must be typed. If the type is void in any form, the token definition
specifies text to be discarded by the scanner. Otherwise the type must be a Java class name which has a
constructor accepting a single String argument. Matching input will be collected as an
object of the class, see Collecting values.
In case of ambiguity the order of token definitions is important: a JLex-based scanner tries to recognize the
longest possible input string; in case of ambiguities, a multi-character literal has precedence over any
token, an earlier token definition takes precedence over a later one, and a single-character token
definition takes precedence over a single-character literal. A generated scanner contains a main program
(in the class yyLex nested into the parser) which can be executed to see all tokens and literals
which the scanner extracts. If necessary, the input for JLex can be produced and checked for
problems, see Generating code.
| # grammar rule <type> ? name : explanation; |
| # bit sequence bits: | bits '0' | bits '1'; # arithmetic expression sum: product | sum '+' product | sum '-' product; product: term | product '*' term | product '/' term; term: Number | '(' sum ')'; |
A grammar rule associates a nonterminal name with an explanation
which in the BNF input language consists of one or more alternatives separated by |. Each
alternative is a sequence of literals, token names, and nonterminal names; only the first
alternative may be empty. Each alternative can be followed by %prec and a literal or token name;
this is passed on to jay and disambiguates during parser generation. Every nonterminal must appear
as the name of exactly one rule; the first rule presented to pj2 explains the start
symbol of the grammar.
A grammar rule may be typed, but not as void in any form. The type is an unqualified
Java class name which should be returned by the builder method used to filter data collected by the rule,
see Writing a builder, and which would be generated as a subclass of AbstractList if pj2 generates a tree factory, see Generating code. The class name can be followed by a colon and a comma-separated list of interface names
which would be referenced by the class generated within the tree factory.
If the rule is typed the type phrase can be followed by ?. In this case, if pj2
generates a tree factory, the factory method will pass a single descendant through and not generate a
tree node. This provides an effective means to flatten parse trees, see Collecting values.
A literal is a string, enclosed in single quotes and using Java-style backslash escapes. A token name must have been defined before use as described above. A nonterminal name must be explained with a rule somewhere.
| # precedence level %left literal or token name ...; %right literal or token name ...; %nonassoc literal or token name ...; |
| # arithmetic expression %left '+' '-'; %left '*' '/' '%'; %right '~'; expr: add | subtract | multiply | divide | mod | minus | plus | Number | '(' expr ')'; add: expr '+' expr; subtract: expr '-' expr; multiply: expr '*' expr; divide: expr '/' expr; mod: expr '%' expr; plus: '+' expr %prec '~'; minus: '-' expr %prec '~'; |
A precedence level specifies literals and tokens which have equal precedence and
associativity. It is passed on to jay and disambiguates during parser generation. Precedence
increases in input order, i.e., a precedence level near the end of the input is higher then a precedence
level near the beginning of the input.
The RFC input language (see rfc.rfc) extends BNF with parentheses for
grouping and suffixes to denote iterations, optional items, and possibly empty or definitely non-empty
lists with delimiters. Alternatives are still denoted with |, but an alternative is not allowed
to be empty; in that respect RFC is not an exact superset of BNF.
| # grouping item: Id | literal | '(' xor ')'; # iterations term: item ( '?' | ('+'|'*') ('/' item)? )?; ( alternatives ) # grouping item ? # optional item item + # one or more times item * # zero or more times item +/ delimiter # non-empty, delimited list item */ delimiter # empty, item, or delimited list |
| # non-empty and empty sequences of identical bits bits: '0'+ | '1'*; # comma-separated, non-empty list of bit sequences list: bits +/ ','; # non-empty sequence of arbitrary bits bits: (zero | '1')+; zero: '0'; |
Suffixes for iterations have precedence over concatenation for sequences, which in turn has
precedence over | for selections. Parentheses can be used for grouping and to change the default
precedence.
The suffix ? denotes that an item is optional (can but does not have to appear in the
input). + denotes that an item must appear one or more times. * is the combination
of both, i.e., an item can appear zero or more times. Finally, / can be appended together
with a delimiter which will have to appear between iterated items. Iterations are subject to the
restriction that the body of a mandatory iteration cannot be optional; this implies that a list can have
an optional body or an optional delimiter but not both.
pj2 only has a loose coupling between value collection and tree shaping and the grammar, see
Collecting values. Therefore, the additional constructs are converted to BNF, additional rules
are added, and error handling is inserted to make the result acceptable to jay. The suffixes in
the grammar lists.rfc
| line: opt | some | many | parens | brackets; opt: 'opt' Number?; some: 'some' Number+; many: 'many' Number*; parens: '(' Number +/ ',' ')'; brackets: '[' Number */ ',' ']'; |
are converted as follows (edited for clarity):
| some: 'some' yy.3; yy.3: Number | yy.3 Number { yyErrorFlag = 0; } | error | yy.3 error; many: 'many' yy.4; yy.4: | yy.4 Number { yyErrorFlag = 0; } | yy.4 error; parens: '(' yy.5 ')'; yy.5: Number | yy.5 ',' Number { yyErrorFlag = 0; } | error | yy.5 error Number { yyErrorFlag = 0; } | yy.5 ',' error; brackets: '[' yy.7 ']' ; |
The insertion of error recovery rules can be modified, see Generating code.
pj2 can instrument a parser so that it builds a tree to represent the recognized input. By
default, a parser generated by pj2 will ignore all recognized literals and collect all input
values recognized as tokens in input order into a flat ArrayList. The grammar a.rfc
| <void> spaces = '[ \n\r\t]+'; <Integer> Number = '[0-9]+'; sequence: Number sequence*; |
describes a scanner which will ignore white space and represent decimal digit strings as Integer values. The (ambiguous) grammar rule defines a sequence of numbers. If this parser is
compiled and executed (see Generating a frontend) it will create an ArrayList with the input numbers which Dump might display as follows:
| ArrayList Integer 1 Integer 2 Integer 3 null |
By default, the collection algorithm observing the parser will add null whenever a construct
recognizes no input; here this is the case when the second iteration of sequence recognizes
3 and has to accept no input for another sequence*. This behavior of the parser can be
toggled when the parser is generated by supplying an argument to the main program generated by pj2
or by manipulating the instance variable yyAddNull in the parser; see Generating code.
By default, a parser generated by pj2 will collect a flat list. A builder can be used to structure the list as a tree using problem-specific classses, and it can even reshape the tree.
The parser generated by pj2 and jay performs recognition and tree building by
executing a stack machine and receiving literals and tokens from a scanner. A builder is
an object specified as an argument during construction of the parser. A builder is called by (silent)
reflection on the name of a rule:
public method with the name of the rule and an int and a
List -- or only a List -- argument, the method receives the input
position (if possible) and the List of values collected for the rule. ArrayList its elements are added,
otherwise the result itself is added. Therefore, a builder can add any number of values to
represent the right-hand side of a rule, simply by wrapping them into an ArrayList,
or it can create a tree branch, e.g., by wrapping values into a subclass of AbstractList other then ArrayList. ArrayList will be part of the resulting tree only at the top level (if the
start symbol's rule is untyped) or if a builder wraps it into something else. As described in Generating code, a builder and result classes derived from AbstractList can be generated just by annotating grammar rules with result types. The grammar
b.rfc
| <Item> sequence: Number sequence*; |
describes a parser and tree factory which create the following result from the same input as before:
| 2 Item Integer 1 2 Item Integer 2 2 Item Integer 3 null |
The generated tree nodes record the input position associated with the node and Dump
reflects to display it if present. Nodes are created under control of the grammar; therefore, the input
position can depend on the grammar. The following left-recursive BNF grammar b.bnf
| <Item> sequence: | sequence Number; |
is not ambiguous and accepts the same input but creates a different tree:
| 1 Item 1 Item 1 Item 1 Item null Integer 1 Integer 2 Integer 3 |
Just by using an additional, typed rule a sequence of items can easily be collected as a tree branch. The grammar c.rfc
| <Item> sequence: Number nest*; <Nest> nest: sequence; |
describes a parser and tree factory which add a level of nesting when presented with the same input as before:
| 2 Item Integer 1 2 Nest 2 Item Integer 2 2 Nest 2 Item Integer 3 null |
The default code templates for a tree factory interpret the grammar annotations as class names.
However, as described in Generating code, a different set of code templates interpret the
annotations as XML element names and result in a tree factory which outputs XML. If there is a non-zero
input position available, an element will specify it as the value of an attribute yyLine.
| # token definition <type> name = string; |
| <Integer> Number = '[0-9]+'; |
A scanner is expected to represent the text of a token as Text of an Element. The generated scanner will use the type as the element name, i.e., the type name
should be acceptable as an XML element name.
| # grammar rule <type> ? name : explanation; |
| sequence: nest*; <Nest> nest: Number nest*; |
The tree factory uses the types as element names, too. If null is stored it is represented as
an empty element yyNull. If the start symbol is not annotated, the parser will return an ArrayList which the generated main program will represent with the element name java.util.ArrayList. Otherwise the parser returns the Element produced by the start
symbol. Each invocation of the parser creates a new Document which can be found with
getOwnerDocument; the caller of the parser should use appendChild to attach the top-level Element returned by the parser to the Document.
The generated main program displays the XML document as standard output; programs such as tidy can be used to format the result. If a tree builder and an
XML builder are generated from doc-files/xml.rfc, an input consisting of three numbers
produces the object tree
| ArrayList null 2 Nest Integer 1 null 2 Nest Integer 2 null 2 Nest Integer 3 null |
and the (tidied) XML document
| <?xml version="1.0" encoding="utf-8"?> <java.util.ArrayList> <yyNull /> <Nest yyline="2"> <Integer yyline="1">1</Integer> <yyNull /> <Nest yyline="2"> <Integer yyline="1">2</Integer> <yyNull /> <Nest yyline="2"> <Integer yyline="1">3</Integer> <yyNull /> </Nest> </Nest> </Nest> </java.util.ArrayList> |
A builder is an object which is responsible for building a tree from the token values which the parser collects. pj2 can generate a tree factory which is used as a builder by default. This tree factory can be subclassed to systematically replace one or more tree classes or to change the shape of a tree. The grammar d.bnf
| <Item> ? sequence: | sequence Number; |
builds a tree with each Item holding another Item and an Integer;
the innermost Item holds a null resulting from the empty alternative:
| 1 Item 1 Item 1 Item null Integer 1 Integer 2 Integer 3 |
However, the default tree factory yyTree can be subclassed, for example as part of the source
d.bnf processed with pj2:
| public static class Reshape extends yyTree { public Object sequence (int position, List list) { if (list.get(0) == null) return new ArrayList(); // no null int n = (int)((Integer)list.remove(list.size()-1)); list.add(new Double((double)n)); return super.sequence(position, list); } } |
If Reshape is used as a builder (e.g., by specifying the class name as an argument to
pj2, see Generating code) the numbers are recoded as Double
and null is removed:
| 1 Item 1 Item Double 1.0 Double 2.0 Double 3.0 |
Note that the rule type Item in the grammar has to be followed with ? in this case,
because the factory method sequence must be able to return Object rather then
Item for trees and Double for single numbers.
Builders can be generated or written from scratch. A generated builder is implemented as a tree factory, i.e., it is designed to be extended by subclassing. Builders are accessed by (silent) reflection on every rule in order to further facilitate subclassing.
In the implementation of pj2 itself, a parser is created by an instance of BnfBuilder or RfcBuilder as a builder, depending on the grammar input language. The
sources of these builders provide examples of a builder and subclasses written entirely from scratch.
Finally, the builder is invoked by reflection from a single method yybuild which is part of
the generated parser and can be overwritten in a subclass. This architecture was chosen deliberately so
that an entirely different observation framework can be implemented, see Writing an observer.
Although pj2 is intended to generate trees which can be processed with vig, it can
also support the classic visitor pattern. A tree factory can be generated to contain a
visitor interface, see Generating code. The grammar expr.rfc for arithmetic expressions
| <Sum>? sum: product (add | subtract)*; <Add> add: '+' product; <Subtract> subtract: '-' product; <Product>? product: term (multiply | divide)*; <Multiply> multiply: '*' term; <Divide> divide: '/' term; term: Number | '(' sum ')'; |
creates from the expression 1 + 2 * 3 / ( 4 - 5 ) the following tree (without storing null):
| 2 Sum Integer 1 2 Add 2 Product Integer 2 1 Multiply Integer 3 1 Divide 1 Sum Integer 4 1 Subtract Integer 5 |
Eval is a visitor which uses double arithmetic to
evaluate the tree. eval is called with the root of a tree and visits a branch node or evaluates a
leaf node:
| public Double eval (Object o) { try { return (Double)((Visit)o).visit(this); } catch (ClassCastException e) { return new Double(o.toString()); } } |
Only two visit methods need to be implemented, one for Sum and a very similar one for
Product nodes:
| public Object visit (Sum node) { Double result = null; for (Object child: node) if (result == null) result = eval(child); else try { result = (double)result + eval(((Add)child).get(0)); } catch (ClassCastException e) { result = (double)result - eval(((Subtract)child).get(0)); } return result; } |
All other (required) visit methods of the generated Visitor interface can be
defaulted:
| public Object visit (Add node) { return visit((Visit)node); } // ... public Object visit (Visit node) { throw new UnsupportedOperationException("should not visit "+node.getClass().getSimpleName()); } |
It should be noted that this is not a very elegant solution. The grammar could specify a Leaf
class so that an Integer leaf node could be "visited" with impunity. A grammar such as
expr.bnf can create the tree in the more familiar binary shape.
The relationship between recognition and tree building is usually based on the observer
pattern. However, to implement recognition pj2 uses jay which is based on yacc and dates back to a time well before
object orientation and design patterns. pj2 is primarily intended to automate the process of
representing input as a tree but it provides a template method yybuild in the generated parser so
that observation can be arranged by subclassing the parser and overwriting the method. For example, the
grammar g.bnf
| <Item> sequence: Number | sequence nest; <Nest> nest: sequence; |
produces the following tree when three numbers are input:
| 2 Item 1 Item Integer 1 2 Nest 2 Item 1 Item Integer 2 2 Nest 1 Item Integer 3 |
Trace is a subclass of the parser which overrides yybuild to trace recognition instead of communicate with a builder:
| protected Object yybuild (yyInput scanner, Object builder, String rule, Object... arg) { List result = Arrays.asList(arg); System.err.println("reduce "+rule+" "+result); return result; } |
yybuild is called whenever a rule in the grammar is satisfied. It has access to the scanner
(which should only be consulted about input position), the builder if any, the name of the nonterminal
which the rule explains, and the list of values which have been collected for the rule, i.e., the values
produced by the scanner for tokens (but not literals) and the values returned by yybuild for
nonterminals reduced earlier. The output illustrates in what order the rules are satisfied:
| reduce sequence [1] reduce sequence [2] reduce sequence [3] reduce nest [[3]] reduce sequence [[2], [[3]]] reduce nest [[[2], [[3]]]] reduce sequence [[1], [[[2], [[3]]]]] [[1], [[[2], [[3]]]]] |
The trace shows how the rules are completed bottom-up -- from the leaves to the
root of the tree -- and how the parser created from the ambiguous grammar operates: it tries to use the
longer alternative sequence nest if possible.
pj2 associates a call to yybuild with every explanation of every nonterminal. Unlike an
action processed by jay yybuild has access only to the values associated with tokens and
nonterminals, not literals; the result of yybuild becomes the value associated with the
nonterminal explained by the current rule in the parent rule. For BNF grammars this supports a programming
style quite similar to jay but with a better separation between grammar and Java code. Consider
the grammar for arithmetic expressions expr.bnf:
| %left '+' '-'; %left '*' '/' '%'; %right '~'; expr: add | subtract | multiply | divide | mod | minus | plus | lit | '(' expr ')'; add: expr '+' expr; subtract: expr '-' expr; multiply: expr '*' expr; divide: expr '/' expr; mod: expr '%' expr; plus: '+' expr %prec '~'; minus: '-' expr %prec '~'; lit: Number; |
yybuild can rely on the grammar which guarantees that some rules such as add will
provide two values, and all others will provide one value. Tracing makes it clear that the calls to yybuild will be made in postorder relative to an arithmetic expression. Consequently, Interpret can implement yybuild as a simple dispatcher
which by default returns the incoming argument:
| protected Object yybuild (yyInput scanner, Object builder, String rule, Object... arg) { Object result = arg[0]; Operations.Binary bin; Operations.Unary sign; if (rule.equals("lit")) result = ((Integer)arg[0]).doubleValue(); // lit: Number else if ((bin = Operations.binary.get(rule)) != null) // e.g., add: expr '+' expr result = bin.eval((double)((Double)arg[0]), (double)((Double)arg[1])); else if ((sign = Operations.unary.get(rule)) != null) // e.g., minus: '-' expr result = sign.eval((double)((Double)arg[0])); System.err.println("reduce "+rule+" "+result); return result; } |
Operations contains hashtables which map rule names to
functions implementing the corresponding unary or binary operations:
| public interface Binary { double eval (double left, double right); } public static final HashMap<String,Binary> binary = new HashMap<String,Binary>(); // ... static { binary.put("add", new Binary(){ public double eval (double left, double right) { return left + right; } }); // ... } |
The trace for the input 1 + 2 * -3 shows how evaluation proceeds in postorder:
| reduce lit 1.0 reduce expr 1.0 reduce lit 2.0 reduce expr 2.0 reduce lit 3.0 reduce expr 3.0 reduce minus -3.0 reduce expr -3.0 reduce multiply -6.0 reduce expr -6.0 reduce add -5.0 reduce expr -5.0 -5.0 |
It must be emphasized that yybuild is completely dependent on the grammar rules. The
following grammar expr.rfc describes arithmetic expressions using
iterative constructs:
| <Sum>? sum: product (add | subtract)*; <Add> add: '+' product; <Subtract> subtract: '-' product; <Product>? product: term (multiply | divide)*; <Multiply> multiply: '*' term; <Divide> divide: '/' term; term: Number | '(' sum ')'; |
Because certain rule names are the same, the interpreter Stack can reuse the operations tables; however this grammar is geared toward a stack-based
evaluation:
| Stack<Double> stack = new Stack<Double>(); protected Object yybuild (yyInput scanner, Object builder, String rule, Object... arg) { Object result = null; Operations.Binary bin; Operations.Unary sign; if (rule.equals("term")) try { result = ((Integer)arg[0]).doubleValue(); // term: Number stack.push((Double)result); } catch (NullPointerException e) { } // term: '(' sum ')' else if ((bin = Operations.binary.get(rule)) != null) { // e.g., add, subtract... double right = (double)stack.pop(); result = bin.eval((double)stack.pop(), right); stack.push((Double)result); } System.err.println("reduce "+rule+" "+result); return result; } |
The trace for the input 1 + 2 * 3 shows that the iterative constructs result in a number of
"anonymous" rules which pj2 must create for the benefit of jay and which will be visible in
yybuild:
| reduce term 1.0 reduce yy.6 null reduce product null reduce term 2.0 reduce term 3.0 reduce multiply 6.0 reduce yy.4 null reduce yy.5 null reduce yy.6 null reduce product null reduce add 7.0 reduce yy.1 null reduce yy.2 null reduce yy.3 null reduce sum null 7.0 |
In summary, yybuild provides similar access to the recognition process as jay;
however, iterative constructs can make this access very difficult to use -- which is the good reason for
jay itself to not provide extensions to BNF.
pj2 is packaged as an executable jar archive which by default calls commands which implements a command line interface for the various algorithms
implemented in pj2. Any number of commands can be given to one invocation of the program. At this
point all commands and keywords can be abbreviated to their first letter.
commands manipulates a collection of named trees. New parser trees can
be generated by invoking a parser. Trees can be displayed. Code for jay and Java code for
scanners and tree factories can be generated from parser trees. There is a -generate command to
batch the typical commands for creating input for jay which includes Java code for a driver,
scanner, and tree factory from a single input file.
| java -jar pj2.jar -command argument ... |
The command -build tries to parse a file and build a tree associated with the name
new.
The second example below shows that the input language in file depends on the class with the
main program; by default the class is pj2.RfcParser which supports RFC (a superset of BNF).
| -build new file |
| # build the RFC parser from its own description java -jar pj2.jar -build rfc rfc.rfc # build the BNF parser from its own description java -cp pj2.jar pj2.BnfParser -build bnf bnf.bnf # build a parser for arithmetic expressions java -jar pj2.jar -build expr expr.bnf |
The command -dump displays any List-based tree as an indented list of
(unqualified) class names on standard output. If no name is specified, a serialized tree is read
from standard input; otherwise the tree associated with name is displayed.
| -dump [name] |
| # build and display the default parser java -jar pj2.jar -build rfc rfc.rfc -dump rfc |
Most of these commands extract information from the parser tree associated with name and
generate code on standard output which is usually processed with jay and then compiled with a
Java compiler.
-jay is the most important command and should be specified after -prolog and before
the others in this section. It emits code so that jay can generate a parser for the rules in the
parser source. Iterative constructs are converted to BNF and error recovery code is inserted.
Error recovery code might introduce some conflicts into a RFC grammar; however, the default recovery
inserted by RfcBuilder should still be approppriate. flags can be specified to toggle
individual flags which control generation of error recovery code, see RfcBuilder. e
suppresses resetting yyErrorFlag; this can result in fewer errors to be reported within a cascade
of errors. msMS suppresses the insertion of error recovery rules into the "many" and "some"
constructs, i.e., for the suffixes * and + and their list forms, respectively. Finally,
oNT causes error recovery rules to be inserted for the suffixes ?, *, and +, respectively, which are more likely to lead to conflicts in parser generation.
-Jay operates just like -jay; however, the parser generated by jay will not
store null whenever empty input is reduced. This behavior can be toggled at runtime, see
below.
-prolog and -epilog emit the first and last section of the parser definition which is
usually used to embed the other pieces into a class.
-main emits a main program for a compiler which instantiates and runs a scanner and a parser.
If an argument is supplied to the main program, the parser's behavior as defined by -jay vs.
-Jay is toggled, i.e., the presence of an argument controls whether or not the parser will store
null whenever empty input is reduced. The main program assumes that the scanner class is static, named yyLex, and immediately nested into the parser class. The tree factory class must
be named yyTree. If this is not the case, -main can be used to suggest a main program and
later omitted, and a more suitable -- and probably simpler -- main program can be included with the
parser source for -build.
If animate is specified the parser can be traced or animated: a digit 0 through
3 invokes yyAnim, otherwise yyDebugAdapter produces a
trace.
If tree is specified the parser is connected to a tree factory and the resulting tree, if
any, is serialized to standard output. tree can be followed by a class name; an instance of that
class is used in place of the tree factory yyTree created by the command -tree. The class
must have a constructor which requires no arguments.
-scanner emits Java code for a scanner class yyLex created by JLex and based on the parser tree
associated with name. Alternatively, -scanner can apply JLex to input from a file name.lex
and emit the resulting Java code.
-Scanner shows the input that would be sent to JLex; this is useful to decide if there
are ambiguity conflicts between tokens and literals.
Finally, -tree generates a tree factory yyTree and tree classes derived from AbstractList which can be used as a builder for the parser associated with name.
All code fragments are read from a properties file which is read as a resource. The property Gen.properties can be specified, e.g., with the option -D for java, to override the stem
of the file name, see Gen. The property value pj2/Xml requests a properties file
which will create an XML document, see Building with XML. The property value pj2/Visitor requests a properties file which is configured to generate interfaces Visitor and
Visit internal to yyTree and implement Visit in each tree class to support the
classic visitor pattern. The properties file can be copied and reconfigured to add a parameter to all
methods.
| -prolog name -jay name [flags] -Jay name [flags] -main [animate [0|1|2|3]|tree [builder]] -scanner name[.lex] -Scanner name -tree name -epilog name |
| # check a grammar for arithmetic expressions java -jar pj2.jar -build expr expr.bnf \ -jay expr > jay.tmp jay <&- -v jay.tmp # create a package directory mkdir pj2 # build and generate jay code for the default parser, avoid conflicts java -jar pj2.jar -build rfc rfc.rfc \ -prolog rfc -jay rfc S -scanner yyLex.lex -epilog rfc > jay.tmp # run jay and compile the result jay < pj2.skeleton jay.tmp > pj2/RfcParser.java javac -cp .:pj2.jar pj2/RfcParser.java |
-generate is a shorthand for the most common use of the commands described in the previous
section, see Launcher. It reads the specified grammar file and emits the code for the driver,
scanner, and tree factory to standard output. The option is send to the -main command and
controls if there is animation or a tree builder. The result has to be processed with jay and a
pj2-specific skeleton file and finally compiled with javac.
-Generate operates just like -generate; however, the parser generated by jay
will not store null whenever empty input is reduced. This behavior can be toggled at
runtime as described in the previous section.
| -generate file [option] -Generate file [option] |
| # compile a tree builder for arithmetic expressions java -jar pj2.jar -generate expr.bnf tree > jay.tmp jay < pj2.skeleton jay.tmp > test/Test.java javac test/Test.java # compile and dump an arithmetic expression echo '1+2-3*4' | \ java test.Test | \ java -cp .:pj2.jar pj2.RfcParser -dump # compile an animated parser for arithmetic expressions java -jar pj2.jar -generate expr.bnf anim > jay.tmp jay < pj2.skeleton -t jay.tmp > test/Test.java javac -cp .:yydebug.jar test/Test.java # animate parsing an arithmetic expression java -cp .:yydebug.jar test.Test 3 |
As discussed, the input languages can describe themselves (and each other); however, the scanner and
the tree factory are hand-crafted. Trees for arithmetic expressions can be built
solely from the annotated grammar.
The annotated source tree and an executable archive are here.
|
(c) 2008 Axel T. Schreiner |
|||||||||
| PREV PACKAGE NEXT PACKAGE | FRAMES NO FRAMES | |||||||||