|
Object-oriented Parser System v3.6 |
|||||||||
| PREV NEXT | FRAMES NO FRAMES | |||||||||
mops, a parser system based on LL(n) grammars and functional
programming; and of scanner, a scanner system based only on Perl-like
regular expressions.
See:
Description
| Packages | |
|---|---|
| gl | Example: two interpreters for the Guarded Language. |
| lambda | Example: two input languages and an interpreter for the Lambda Calculus -- with thanks to Andreas Borchert. |
| mops | Monadic LL(n) parser generator system, based on Monadic Parsing using JavaScript . |
| oops3 | The oops3 parser generator. |
| scanner | An implementation of scanners which only requires Perl-like regular expressions. |
Here are the homepages of oops3, an object-oriented parser system
based on LL(1) grammars, factory, observer, and visitor design patterns; of
mops, a parser system based on LL(n) grammars and functional
programming; and of scanner, a scanner system based only on Perl-like
regular expressions. oops3 generates scanners, recognizers, tree
builders, builder interfaces, and tree factories with optional visitor
interfaces automatically. mops and scanner can be used
invisibly if oops3 generates code for a functional parser; mops
additionally implements a monadic notation for parsers which can be translated
into Java and JavaScript.
Core algorithms (lookahead, grammar checking, parsing, and code generation) are implemented using the visitor pattern. This allows oops3 to be bootstrapped with a hand-crafted tree, and the input language for grammars can easily be changed.
There are, in fact, three grammar description languages, each of which can describe itself:
Boot, which is
used to implement the other two. mops. The lookahead algorithm detects left recursion, unconnected rules, and ambiguous choices for LL(1). The parsing algorithm for LL(1) is greedy and therefore deterministic. Full LL(1) checking is also supported. There is no checking for functional parsers; however, the lookahead algorithm can be employed to make sure there is no left recursion.
The following diagram is linked to the class documentation and illustrates the data and control flow when creating a LL(1) parser to build trees from arithmetic expressions. Command lines to carry out these operations can be found below.
Boot calls on ParserFactory to
construct a parser tree for an RFC grammar over Parser
and its inner classes. The serialized tree is available as resource
rfc.ser.
Lookahead visits the bootstrap parser tree and
adds lookahead sets. Build visits the bootstrap
parser tree and obtains symbols from yyLex which
tokenizes a grammar for arithmetic expressions contained
in the file expr.rfc. Build uses RfcBuilder as a builder which
in turn calls on ParserFactory to construct a parser
tree for the grammar for arithmetic expressions over Parser. Lookahead visits the new parser tree
and adds lookahead sets. The final tree is serialized and positioned
as a resource expr.ser.
Gen visits the new parser tree and generates
Java code for a main program, a JLex-based
scanner, and a tree factory which will act as a builder and
contains classes derived from AbstractList to
represent arithmetic expressions. The code is stored in a file
Expr.java and compiled.
The generated (and compiled) main program reads the parser
tree for arithmetic expressions as a resource. Build
visits the parser tree and obtains symbols from the generated
scanner which tokenizes an arithmetic expression contained in the
file expr.txt. Build
uses the generated tree factory as a builder and constructs a
tree for the expression over the generated tree classes.
The following diagram is linked to the class documentation and illustrates the data and control flow when creating a functional LL(n) parser to build trees from arithmetic expressions. Command lines to carry out these operations can be found below.
This aspect is identical to generating an LL(1) parser:
Lookahead visits the bootstrap parser tree and
adds lookahead sets. Build visits the bootstrap
parser tree and obtains symbols from yyLex which
tokenizes a grammar for arithmetic expressions contained
in the file expr.rfc. Build uses RfcBuilder as a builder which
in turn calls on ParserFactory to construct a parser
tree for the grammar for arithmetic expressions over Parser.
Gen and MopsGen visit the new parser tree
and generate Java code for a main program, a scanner-based factory
for functional token parsers, mops-based functional parsers for
the grammar rules which execute value collection as described below, and a tree factory which will act as
a builder and contains classes derived from AbstractList
to represent arithmetic expressions. The code is stored in a file
Expr.java and compiled.
The generated (and compiled) main program calls the parser for the start symbol of the grammar and obtains symbols from the generated scanner to tokenize an arithmetic expression contained in the file expr.txt. The parser uses the generated tree factory as a builder and constructs a tree for the expression over the generated tree classes.
oops3 implements a greedy, LL(1)-based algorithm for recognition and
can use mops to implement a functional LL(n) parser. Therefore, the
input language for oops3 is some variant of extended BNF. However,
oops3 is implemented using oops3 -- the first parser is
serialized by running Boot -- and the input language can easily
be changed. This section describes the input language RFC which is a form of extended BNF
inspired by the Internet RFCs.
The input format for oops3 ultimately borrows from yacc; however, rules can use extended BNF (including alternatives nested into parentheses) to avoid left recursion and only the result of a rule is (optionally) typed. Literals are enclosed in single quotes and appear inline. Tokens must be defined as patterns before they are used and must be typed; void indicates that matches are not passed to the parser. Patterns are enclosed in single quotes; backslash escape sequences follow Java conventions.
Input to oops3 has three sections, separated by lines containing %% only. Text in the first and last section is arbitary; it can be used for Java code to embed code generated by oops3, see below.
The second section is free format. White space and comments ranging from # to the end of the line are ignored.
The second section contains token definitions and grammar rules in any order. However, a token must be defined before it can be used in a rule and the first grammar rule defines the start symbol of the grammar.
| # token definition <type> name = string; |
| # integer represented as Long <Long> Number = '[0-9]+'; # comment and white space to be skipped <void> comment = '#.*'; <void> spaces = '[ \n\r\t]+'; |
A token definition associates a token name with a pattern string. Strings are enclosed in single quotes and use Java-style backslash escapes. For an LL(1) parser the pattern string can be handed to JLex as a macro definition for scanner generation, see below.
A token definition must be typed. If the type is void
in any form, the token definition specifies text to be discarded
by the scanner. Otherwise the type must be a Java class name
which has a constructor accepting a single String argument. Matching input will be collected as an object
of the class, see below.
In case of ambiguity the order of token definitions is important:
a scanner tries to recognize the longest possible input string; in
case of ambiguities, a multi-character literal has precedence
over any token, an earlier token definition takes precedence over
a later one, and a single-character token definition takes
precedence over a single-character literal. If necessary, the
scanner can be executed under control of testScanner, a functional parser can be traced,
or the input for JLex can
be produced and checked for problems, see below.
| # grammar rule <type> ? name : explanation; |
| # bit sequence bits: ('0' | '1')*; # arithmetic expressions sum: product (('+'|'-') product)*; product: term / ('*'|'/'); term: Number | '(' sum ')'; |
A grammar rule associates a nonterminal name with an explanation consisting of literals, token names, nonterminal names, iteration and selection constructs, and parentheses for grouping. Every nonterminal must appear as the name of exactly one rule; the first rule presented to oops3 explains the start symbol of the grammar.
A grammar rule may be typed, but not as void in any
form. The type is an unqualified Java class name which
should be returned by the builder method used to filter data
collected by the rule, see below,
and which would be generated as a subclass of AbstractList if oops3 generates a
tree factory, see below. If the
rule is typed the type can be marked with ?. If marked
and if oops3 generates a tree factory, the factory method
will pass a single descendant through and not generate a tree
node. This provides an effective means to flatten parse trees,
see below.
The type itself may be omitted. In this case Build or
a functional parser create a left-associative tree from the collected nodes,
see below.
A literal is a string, enclosed in single quotes and using Java-style backslash escapes. A token name must have been defined before use as described above. A nonterminal name must eventually be explained with a rule. The selection and iteration constructs for the default input language of oops3 are described next. The description uses the fact that oops3 can describe its own input language.
| # exclusive-or selection rule: Id ':' xor ';'; xor: term+ | xor '|' term+; |
| # single bit bit: '0' | '1'; # if/else (functional parser) stmt: 'if' 'condition' stmt 'else' stmt | 'if' 'condition' stmt | 'other'; |
Exclusive-or selection has low precedence and uses | to separate alternatives. The exclusive-or selection matches if exactly one alternative matches. Alternatives can be sequences of literals, etc. as well as other constructs, subject to the following restriction: for an LL(1) parser any two alternatives must not be able to start with the same literal or token so that the selection among alternatives can be deterministic, for a functional parser the longer alternative must be specified first because there is no backtracking once an alternative succeeds.
The oops3 algorithms support other selections such as a subset to describe any permutation of some alternatives (inclusive or) or a set (and) to describe any permutation of all alternatives; however, the RFC input language described in this section cannot represent these selections and a functional parser cannot implement them.
| # iterations term: item ( '?' | ('+'|'*') ('/' item)? )?; |
| # non-empty and empty bit sequences bits: '0'+ | '1'*; # comma-separated, non-empty list of bit sequences list: bits +/ ','; |
Iterations have precedence over selections. ? denotes that an item is optional (can but does not have to appear in the input). + denotes that an item must appear one or more times. * is the combination of both, i.e., an item can appear zero or more times. Finally, / is used to specify a delimiter to appear between iterated items. Iterations are subject to the restriction that the body of a mandatory iteration cannot be optional; this implies that a list can have an optional body or an optional delimiter but not both.
The oops3 algorithms support other iterations such as a counted iteration with an explicit range or delimited permutations; however, the RFC input language described in this section cannot represent these iterations and a functional parser cannot implement them.
| # grouping item: Id | literal | '(' xor ')'; |
| # bit sequences bits: (zero | '1')+; zero: '0'; |
Simple items are literals which usually represent themselves, (previously defined) token names which usually represent a category of inputs selected by patterns, and (eventually explained) nonterminal names. Additionally, any of the constructs can be enclosed in parentheses to achieve highest precedence, and the result counts as another item.
LL(1) parsing cannot handle left recursion, see below. Therefore, to describe a grammar for LL(1) parsing, BNF is usually extended with constructs for grouping and iteration. The EBNF input language described in this section supports | for exclusive-or selection, parentheses for grouping, brackets for optional items, and braces for iterations of one(!) or more. [This is not the "classic" definition of braces; however, with this variant there is no need for code duplication to express a non-empty iteration.]
| # iterations item: Id | literal | '(' xor ')' | '[' xor ']' | '{' xor '}'; |
| # non-empty and empty bit sequences bits: { '0' } | [{ '1' }]; # comma-separated, non-empty list of bit sequences list: bits {[ ',' bits ]}; |
Brackets and braces can be nested to express arbitrary iterations, equivalent to * in the RFC input language. The EBNF input language does not support a construct for delimited lists; however, it would not be too hard to add one.
The XBNF input language described in this section is a superset of the RFC input language described above. The actual extensions are not supported by a functional parser. In addition to ? as a suffix for optional items and + and * for mandatory and optional iterations it supports explicitly counted iteration ranges which can be combined with delimiters:
| # iterations term: item range?; range: '?' | ( '+' | '*' | Number | Number? '..' Number? ) list?; list: '/' item; |
| # non-empty and empty bit sequences bits: '0' 1.. | '1' ..; # comma-separated list of two to four bit sequences list: bits 2..4 / ','; |
A counted range consists of one or two positive integers, the latter separated by a pair of periods. One integer denotes an exact count, two integers denote a lower and an upper bound; the range may not be empty. An omitted lower bound represents zero and an omitted upper bound denotes no limit.
The XBNF input language also supports permuted subsets (inclusive or) or full sets (and) of alternatives, which can be delimited. The alternatives are exclusive-or selections. Neither the alternatives nor the delimiter may accept an empty sequence of input symbols.
| # iterations item: Id | literal | '(' xor ')' | or | and; or: '[' xor 2../ ',' ']' list?; and: '{' xor 2../ ',' '}' list?; |
| # some or all of the letters, in any order some: [ 'a', 'b', 'c', 'd' ]; all: { 'a', 'b', 'c', 'd' }; # comma-separated list of six digits in any order list: { '1' 1, '2' 2, '3' 3 } / ','; # up to two pairs, one of letters and one of bits, in any order # elements separated by commas, pairs by semicolon pairs: [ ('a'|'b') 2/',', ('0'|'1') 2/',' ] / ';'; |
If alternatives of a permutation are iterations they are counted at the level of the permutation, i.e., the iterated items can be permuted. This means that the list above accepts 1,2,2,3,3,3 as well as 3,2,3,2,1,3 and other arrangements.
However, any other kind of alternative, specifically including delimited lists, must be completed as one element in the permutation, i.e., the constituents of any other kind of alternative cannot be permuted. This means that the pairs above can be a,b;0,0 or 1,1;b,a etc. but not a,0;b,0 etc.
A parser can build a tree to represent the recognized input. By default, a
parser will ignore all recognized literals and collect all input values
recognized as tokens in input order into a flat ArrayList. If there is no input a parser will return null. The grammar
| <void> spaces = '[ \n\r\t]+'; <Integer> Number = '[0-9]+'; sequence: Number sequence*; |
describes a scanner which will ignore white space and represent
decimal digit strings as Integer values. The (ambiguous)
grammar rule defines a sequence of numbers. If this parser is
compiled and executed as described below, it will create an ArrayList with the input numbers which Dump
might display as follows:
| ArrayList Integer 1 Integer 2 Integer 3 null |
By default, the building algorithm implemented by Build or by
a functional parser will add null whenever a construct recognizes no
input; here this is the case when the second iteration of sequence
recognizes 3 and has to accept no input for another
sequence*. The default behavior of either parser can be changed by
construction parameters and from the command line of a generated main program,
see below.
By default, a parser will collect a flat list. A builder can be used to structure the list as a tree using problem-specific classses and it can even reshape the tree.
A Build object performs recognition and tree
building by visiting a parser tree and receiving literals and
tokens from a scanner. A builder is an object specified as an
argument during construction of the Build object.
A functional parser has a global reference to a builder object.
A builder is called by (silent) reflection on the name of a
rule:
Build object or functional parser completes
recognition of the right-hand side of a rule and if the builder implements a
public method with the name of the rule and an int and a
List -- or only a List --
argument, the method receives the input position (if possible) and the
List of values collected for the rule. Build object or functional parser then considers the
result of a successful call as the value(s) collected for the right-hand
side of the rule and adds them to the parent collection. ArrayList its elements are added, otherwise the result itself
is added. Therefore, a builder can add any number of values to represent the
right-hand side of a rule, simply by wrapping them into an ArrayList, or it can create a tree branch, e.g., by
wrapping values into a subclass of AbstractList other then ArrayList. ArrayList will be part of
the resulting tree only at the top level (if the start rule is
untyped) or if a builder wraps it into something else. As described below, a builder
and result classes derived from AbstractList can be generated just by annotating grammar rules
with result types. The grammar
| <void> spaces = '[ \n\r\t]+'; <Integer> Number = '[0-9]+'; <Item> sequence: Number sequence*; |
describes a parser and tree factory which create the following result from the same input as before:
| 1 Item Integer 1 1 Item Integer 2 1 Item Integer 3 null |
The generated tree nodes record the input position associated
with the node and Dump reflects to display it.
Using an additional, typed rule a sequence of items can easily be collected as a tree branch. The grammar
| <void> spaces = '[ \n\r\t]+'; <Integer> Number = '[0-9]+'; <Item> sequence: Number nest*; <Nest> nest: sequence; |
describes a parser and tree factory which add a level of nesting when presented with the same input as before:
| 1 Item Integer 1 1 Nest 1 Item Integer 2 1 Nest 1 Item Integer 3 null |
The XBNF input language
can be used to specify permutations of subsets and full sets of
alternatives. To simplify further processing Build
arranges for the alternatives to be collected in the order of
specification, not in the order in which permuted input is received;
empty alternatives are represented by null. If permutations are
separated by delimiters, a delimiter is collected together with
the alternative it follows -- normally, delimiters are expected
to be literals which are ignored in any case. This grammar and input
file demonstrate how various iterations are collected, for
example for the rule
| <void> spaces = '[ \n\r\t]+'; <Integer> Number = '[0-9]+'; <OrList> orList: 'orList' [ a, b 1..2, c 1..3, d 2/'.' ]/Number; <A> a: 'a'; ... |
and the input orList d.d 1 a 2 b 3 b Dump
displays the following tree:
| 21 OrList 21 A Integer 2 21 B Integer 3 21 B null 21 D 21 D Integer 1 |
Permutations in any form are not supported by functional parsers.
If possible, a grammar for arithmetic expressions defines left-associative operators using left-recursive grammar rules. Unfortunately, LL(1) parsing cannot handle left recursion. However, left recursion is employed so that the resulting parse tree reflects left association of the operators. It is still possible to construct a left-associative tree even if the grammar rule uses an iteration construct rather then left recursion.
As discussed above, the grammar
| <void> spaces = '[ \n\r\t]+'; <Integer> Number = '[0-9]+'; <Item> sequence: Number sequence*; |
describes a parser and tree factory which create the following (right-associative) result:
| 1 Item Integer 1 1 Item Integer 2 1 Item Integer 3 null |
The modified grammar
| <> sequence: Number item*; <Item> item: Number; |
describes a parser and tree factory which create the following left-associative result:
| 1 Item 1 Item Integer 1 Integer 2 Integer 3 |
The generated tree factory contains classes such as Item to
represent branch nodes. These classes are derived from AbstractList and are mutable lists. If a rule is typed
and the type itself is omitted, and if the builder does not implement a method
corresponding to the rule, Build or a functional parser inserts
each collected value as a descendant of it's right neighbor. Effectively, such
a rule builds a left-associated tree.
The next section shows how a similar effect could be accomplished by subclassing the tree factory (or by implementing a builder from scratch).
The grammar for the Guarded Language demonstrates how to create
left-associative trees for typical arithmetic operators
and how to leverage greedy repetition to implement embedded,
right-associative assignment.
A tree factory can be subclassed to systematically replace one or more tree classes or to change the shape of a tree. This grammar
| <void> spaces = '[ \n\r\t]+'; <Integer> Number = '[0-9]+'; <Item> ? sequence: Number+; |
builds a flat tree with one Item holding all Integer objects representing input. However,
| %% public static class LeftTree extends Tree { public Object sequence (List list) { if (list.size() == 1) return list.get(0); Left left = new Left(list.subList(0, 2)); for (int n = 2; n < list.size(); ++ n) { list.set(n-1, left); left = new Left(list.subList(n-1, n+1)); } return left; } public class Left extends Item { protected Left (List list) { super(list); } } } } |
if this subclass and factory method are placed in the third section of the input to oops3 and if LeftTree is called as the builder (see below), the resulting tree represents input based on left association:
| 1 Left 1 Left 1 Left Integer 1 Integer 2 Integer 3 Integer 4 |
Note that the rule type Item in the grammar has to
be followed with ? in this case, because the factory
method must be able to return Object
rather then Item (or Left) in the subclass if
there is only a single descendant.
Builders can be generated or written from scratch. As described below, an interface can be generated from an annotated grammar which serves as an advisory for implementing a builder. A generated builder is implemented as a tree factory, i.e., it is designed to be extended by subclassing. Builders are accessed by (silent) reflection in order to further facilitate subclassing.
A LL(1) parser is created by an instance of EbnfBuilder,
RfcBuilder, or XbnfBuilder as a
builder, depending on the input language. The sources of these
builders provide examples of a builder and subclasses written
entirely from scratch.
Finally, Build is a subclass of Recognize which implements recognition. This architecture
was chosen deliberately so that an entirely different observation
framework could be implemented. RIT
oops, the predecessor of oops3, uses a different, much
more event-oriented framework. It is possible to recreate
this with a different subclass of Recognize,
see below.
Although oops3 is intended to generate trees which can
be processed with jag, it can also support the classic
visitor pattern. A tree factory can be generated to contain a
visitor interface, see below. This
grammar for arithmetic expressions
| <void> spaces = '[ \n\r\t]+'; <Integer> Number = '[0-9]+'; <Sum>? sum: product (add | subtract)*; <Add> add: '+' product; <Subtract> subtract: '-' product; <Product>? product: term (multiply | divide)*; <Multiply> multiply: '*' term; <Divide> divide: '/' term; term: Number | '(' sum ')'; |
creates from the expression 1 + 2 * 3 / ( 4 - 5 ) the following tree (without storing null):
| 1 Sum Integer 1 1 Add 1 Product Integer 2 1 Multiply Integer 3 1 Divide 1 Sum Integer 4 1 Subtract Integer 5 |
oops3/doc-files/Eval.java is a visitor
which uses double arithmetic to evaluate the tree.
eval is called with the root of a tree and visits a
branch node or evaluates a leaf node:
| public Double eval (Object o) { try { return (Double)((Visit)o).visit(this); } catch (ClassCastException e) { return new Double(o.toString()); } } |
Only two visit methods need to be implemented, one for Sum and a very similar one for Product nodes:
| public Object visit (Sum node) { Double result = null; for (Object child: node) if (result == null) result = eval(child); else try { result = (double)result + eval(((Add)child).get(0)); } catch (ClassCastException e) { result = (double)result - eval(((Subtract)child).get(0)); } return result; } |
All other visit methods of the generated Visitor interface can be defaulted:
| public Object visit (Add node) { return visit((Visit)node); } // ... public Object visit (Visit node) { throw new UnsupportedOperationException("should not visit"); } |
Note that this is not a very elegant solution. The grammar
could specify a Leaf class so that an Integer
leaf node could be "visited" with impunity. A builder could
shape the tree into the more familiar binary shape, see the
interpreters for the Guarded Language.
oops3 is primarily intended to automate the process of
representing input as a tree. However, Observe shows
how to extend Recognize to implement a classic
observer pattern to react to events triggered by recognition.
An observer for Observe must implement Observer. To simplify implementation
there is an Adapter which implements
all required methods to do nothing. Additionally, trace can be used to proxy observers to trace
all calls. For example, the grammar
| <void> spaces = '[ \n\r\t]+'; <Integer> Number = '[0-9]+'; sequence: Number sequence*; |
produces the following observation trace when two numbers are input:
| init sequence shift Number 1 init sequence shift Number 2 reduce shift Observer reduce shift Observer |
In principle, three kinds of events can be observed: init
rule indicates that recognition is about to activate a rule,
variants of shift indicate that a Literal or a Token has just been
accepted, reduce indicates that a rule has been completed;
the last event is followed by another shift to indicate
that the observer for a rule for a nonterminal has just completed
its job.
Observer is declared to be a
factory: init rule must return the observer to which the
events for the rule -- up to and including reduce -- will
be directed. While the Adapter always
returns itself, this can be used to introduce rule- or
activation-specific observers. The latter is particularly convenient
because it avoids the need for an explicit stack: an observer can
compute a value while watching input and nonterminal completion
for its rule activation; following reduce the observer is
sent to its own creator with shift and the creator can
retrieve the computed value. Note that init rule can be
used to pass information from the creator to a new observer.
Interpret contains an observer which uses double arithmetic to evaluate expressions as they are recognized:
| public abstract class Interpret { public interface Eval extends Observer { double value(); } public static class Impl extends Adapter implements Eval { protected double value; // current result public double value () { return value; } public String toString() { return ""+value; } // ... } } |
Following recognition, the top-level observer can be asked for
the value which it computed. init is implemented to return rule-specific subclasses:
| public Observer init (int position, String name) { name = name.intern(); if (name == "sum" // sum: product (add | subtract)*; || name == "product") // product: term (multiply | divide)*; return new Impl(scanner); if (name == "add") // add: '+' product; return new Impl(this) { public void shift (int position, Observer sender) { value += ((Eval)sender).value(); } }; // ... if (name == "term") // term: Number | '(' sum ')'; return new Impl(scanner) { public void shift (int position, Token sender) { value = ((Number)scanner.value()).doubleValue(); } }; throw new RuntimeException(position+": unexpected rule "+name); |
Because of the use of Adapter,
only those methods have to be implemented which will actually be
called and which work to compute the expression value. For a
Number Token the actual value
has to be retrieved from the scanner, which is a construction
parameter for the outermost observer. Observers for arithmetic
rules such as add are constructed with the initial value
as the left operand
| protected final Scanner scanner; public Impl (Scanner scanner) { this.scanner = scanner; } protected Impl (Impl left) { this(left.scanner); value = left.value; } |
and retrieve the right operand from the corresponding observer once it has completed its activities. The other rules simply fetch the current value from a completed observer back to its creator:
| public void shift (int position, Observer sender) { value = ((Eval)sender).value(); } |
At first glance Interpret
may look convoluted; however, it has proven to be a pretty typical
architecture for these observers. The interface Eval is
needed when trace is used to trace
interpretation. In this case all visible observers are proxies
which can only be cast to an interface such as Eval and
not to the actual class Impl.
toString should be implemented
if trace is used because it is called
to show the state of the observer during each traced event. Here
is the trace for the input 1+2*3:
| init sum 0.0 init product 0.0 init term 0.0 shift Number 1 1.0 reduce 1.0 shift Observer 1.0 reduce 1.0 shift Observer 1.0 init add 1.0 shift + 1.0 init product 0.0 init term 0.0 shift Number 2 2.0 reduce 2.0 shift Observer 2.0 init multiply 2.0 shift * 2.0 init term 0.0 shift Number 3 3.0 reduce 3.0 shift Observer 6.0 reduce 6.0 shift Observer 6.0 reduce 6.0 shift Observer 7.0 reduce 7.0 shift Observer 7.0 reduce 7.0 shift Observer 7.0 |
oops3 is packaged as an executable jar archive
which by default runs Main which implements a command
line interface for the various algorithms implemented in oops3.
Any number of commands can be given to one invocation of the
program. At this point all commands can be abbreviated to their
first letter.
| java -jar oops3.jar -command argument... |
Main manipulates a collection of named trees. New
trees can be loaded from files or resources, or they can be
generated by invoking a parser. Trees can be displayed or serialized
into files. Java code for scanners and tree factories can be
generated from parser trees. There is a generate command to batch
the typical commands for creating a serialized parser and the
Java code for a driver, scanner, and tree factory from a single input
file.
Generated LL(1) parsers require parts of oops3 as runtime support,
functional parsers require mops and scanner.
| -input name file -Input name resource |
| # load default parser java -jar oops3.jar -Input boot rfc.ser |
Either version of the command loads a serialized tree and
assigns it a name for future reference within this run of
Main. A resource is found relative to
the classpath (e.g., in the jar archive). The output from
Boot, i.e., a parser to generate parsers from
grammar specifications, is available as resource rfc.ser.
Similarly, the resources ebnf.ser and xbnf.ser
represent the EBNF and XBNF input languages. One of these resources
is typically loaded with the name boot.
| -dump name -Dump name -Print name |
| # load and display default parser java -jar oops3.jar -Input boot rfc.ser \ -Print boot -dump boot |
Either version of -dump displays any List-based tree associated with the name
as an indented list of (unqualified) class names on standard
output. More information is produced if the tree is a Parser; specifically, -Dump will also
display the lookahead set index and the lookahead set for each
node.
-Print displays the patterns and rules of a Parser using the XBNF input language.
| -lookahead name |
| # add lookahead to default parser java -jar oops3.jar -Input boot rfc.ser \ -lookahead boot |
This command adds lookahead sets to the Parser associated with name. The sets are necessary before
the tree can be used for checking, recognition, or building. The algorithm
also detects ambiguities in the grammar, see above.
| -check name -Check name -follow name |
| # check the default parser java -jar oops3.jar -Input boot rfc.ser \ -lookahead boot -Check boot |
Once a Parser contains lookahead sets, these commands
validate that the grammar is suitable for LL(1) parsing.
-check analyzes if the rules are infinitely recursive,
see Recursive. This example
shows what needs to be avoided.
-Check additionally analyzes if the grammar is LL(1); basically this means that lookahead and follow sets of each node are different when the parser otherwise would have to rely on a greedy algorithm. This example, the infamous dangling else, shows what should be avoided; however, it also shows that the greedy LL(1) algorithm or a functional parser may well be acceptable.
-follow dumps the parser tree and shows the lookahead set index (which is the same as the follow index) and the lookahead and follow set for each node.
| -recognize name file [scanner-class] -Recognize name file [scanner-class] |
| # recognize the parser's own description java -jar oops3.jar -Input boot rfc.ser -lookahead boot \ -recognize boot rfc.rfc # trace recognizing a parser for arithmetic expressions java -jar oops3.jar -Input boot rfc.ser -lookahead boot \ -Recognize boot expr.rfc # trace recognizing an arithmetic expression java -cp expr.jar:oops3.jar oops3.Main -input expr expr.ser \ -Recognize expr expr.txt expr.ExprParser\$yyLex |
Either version of the command uses the Parser associated
with name to recognize text from file. The second
version of the command provides a trace during recognition.
The default scanner-class is yyLex, i.e.,
the scanner used with Parser to recognize grammars.
| -observe new name file [observer-class [scanner-class]] -Observe new name file [observer-class [scanner-class]] |
| # observe recognizing an arithmetic expression java -cp expr.jar:oops3.jar oops3.Main -input expr expr.ser \ -Observe new expr expr.txt null expr.ExprParser\$yyLex # interpret an arithmetic expression java -cp expr.jar:oops3.jar oops3.Main -input expr expr.ser \ -observe value expr expr.txt expr.Interpret\$Impl expr.ExprParser\$yyLex \ -dump value |
Either version of the command uses the Parser
associated with name to recognize the file and
associate the initial observer with the name new. If
specified, observer-class can be null or the
(fully qualified) name of a class implementing Observer which is instantiated and used
as an observer for the recognition process. The second version
of the command uses trace to proxy
the observers for tracing.
The default scanner-class is yyLex, i.e.,
the scanner used with Parser to recognize grammars.
The scanner is optionally supplied as an argument for instantiating
the observer.
| -build new name file [builder-class [scanner-class]] -Build new name file [builder-class [scanner-class]] |
# build and rebuild the parser from its own description
java -jar oops3.jar -Input boot rfc.ser -lookahead boot \
-build rfc boot rfc.rfc oops3.RfcBuilder \
-lookahead rfc \
-build copy rfc rfc.rfc oops3.RfcBuilder
# trace building a flat tree from a grammar for arithmetic expressions
java -jar oops3.jar -Input boot rfc.ser -lookahead boot \
-Build expr boot expr.rfc
# build a parser for arithmetic expressions
java -jar oops3.jar -Input boot rfc.ser -lookahead boot \
-build expr boot expr.rfc oops3.RfcBuilder
# build a tree for an arithmetic expression
java -cp expr.jar:oops3.jar oops3.Main -input expr expr.ser \
-build tree expr expr.txt \
expr.ExprParser\$Tree expr.ExprParser\$yyLex |
Either version of the command uses the Parser
associated with name to recognize the file and build
a tree associated with the name new. If specified,
builder-class can be null or the (fully qualified)
name of a class which is instantiated and used as a builder to
influence the building process. The second version of the command
shows a trace during the build process.
The default scanner-class is yyLex, i.e.,
the scanner used with Parser to recognize grammars.
oops3 can generate LL(1) and functional parsers. This section and the following section describe the commands to generate code for LL(1) parsers. Two more sections deal with functional parsers.
| -write name file -prolog name -main resource [builder-class] -Main resource [builder-class] -actions name -scanner name -Scanner name -tree name [builder-class] -epilog name |
# store default parser complete with lookahead
java -jar oops3.jar -Input boot rfc.ser -lookahead boot \
-write boot new.ser
# create the package directory
mkdir expr
# generate a parser tree, check the grammar,
# and generate code for a parser for arithmetic expressions
java -jar oops3.jar -Input boot rfc.ser -lookahead boot \
-build expr boot expr.rfc oops3.RfcBuilder \
-lookahead expr \
-Check expr \
-write expr expr.ser \
-prolog expr -main expr.ser Tree -scanner expr \
-tree expr Tree -epilog expr > expr/Expr.java
# compile the parser
javac -classpath .:oops3.jar expr/Expr.java
# test the scanner
java -cp .:oops3.jar expr.ExprParser\$yyLex < expr.txt
# execute the parser, display a full expression tree
java -cp .:oops3.jar oops3.Main -input expr expr.ser \
-build tree expr expr.txt \
expr.ExprParser\$Tree expr.ExprParser\$yyLex \
-dump tree
# execute the parser, display a simplified expression tree
java -cp .:oops3.jar expr.ExprParser SIMPLE < expr.txt | \
java -cp .:oops3.jar oops3.Main -input tree /dev/fd/0 -dump tree |
-write stores the tree associated with the name into the file. This command is normally used to store a new parser for later use by the -input command or by the main program generated for an LL(1) parser.
The remaining commands generate Java code, usually for the Parser associated with name, on standard output.
-prolog and -epilog emit the first and last section of the parser definition which is usually used to embed the other pieces into a class.
-main creates a main program for a LL(1)-based compiler which
will load a serialized tree from a resource, run Build with an instance of builder-class (or
null) as the builder, and send the serialized new tree
to standard output. -Main rather then -main
defines the default so that null is not stored
to indicate a match of no input. This can be toggled by specifying
an argument (such as SIMPLE above) on the command line
for the generated main program.
-scanner generates Java code for a scanner class yyLex created by JLex and based on the parser tree associated with name. The scanner is only useful for a LL(1)-based parser. -Scanner shows the input that would be sent to JLex; this is useful to decide if there are ambiguity conflicts between tokens and literals.
-actions generates an interface which a builder for the parser associated with name should implement. However, the interface is advisory only, because the builder is contacted by reflection which is allowed to fail silently.
-tree generates a tree factory Tree and tree
classes derived from AbstractList
which can be used as a builder for the parser associated with
name. The class name defaults to Tree.
All code fragments are read from a properties file which is
read as a resource. The property Gen.properties can be
specified, e.g., with the option -D for java, to
override the stem of the file name, see Gen. The
value oops3/Visitor requests a properties file which is
configured to generate interfaces Visitor and Visit
internal to Tree and implement Visit in each
tree class to support the classic visitor pattern, see above. The properties file can
be copied and reconfigured to add a parameter to all methods.
Generating a LL(1)-based frontend
| -generate [path]name.style > source.java -Generate [path]name.style > source.java |
| # create the package directory mkdir expr # generate a parser tree, check the grammar, # and generate code for a parser for arithmetic expressions java -jar oops3.jar -generate expr.rfc > expr/Expr.java # generate a parser for arithmetic expressions # which generates trees without null and with a visitor pattern java -DGen.properties=oops3/Visitor -jar oops3.jar \ -generate expr.rfc > expr/ExprParser.java |
-generate is a shorthand for the most common use of
the commands described in the previous
section, see Main. It reads the specified grammar
file, checks the grammar, serializes the parser into a file
name.ser and emits the code for the driver,
scanner, and tree factory to standard output.
-Generate is the same shorthand but invokes -Main rather then -main, i.e, the resulting parser will not store null for no input by default.
oops3 can generate LL(1) and functional parsers. This section and the following section describe commands to generate Java code for functional parsers.
| -Functional -main resource [builder-class] # null for empty input -Main resource [builder-class] # nothing for empty input -Scanner name -scanner name -write name |
# create the package directory
mkdir expr
# generate a parser tree, check the grammar,
# and generate code for a functional parser for arithmetic expressions
java -jar oops3.jar -Input boot rfc.ser -lookahead boot \
-build expr boot expr.rfc oops3.RfcBuilder \
-Functional \
-prolog expr -scanner expr -write expr -main expr mopsTreeFactory \
-tree expr mopsTreeFactory -epilog expr > expr/ExprParser.java
# compile the parser
javac -classpath .:mops.jar expr/ExprParser.java
# test the scanner
java -cp .:mops.jar expr.ExprParser trace < expr.txt > /dev/null
# execute the parser, display an expression tree with null
java -cp .:oops3.jar:mops.jar expr.ExprParser dump < expr.txt
# execute the parser, serialize and display a simplified expression tree
java -cp .:mops.jar expr.ExprParser null < expr.txt | \
java -cp .:oops3.jar oops3.Main -input tree /dev/fd/0 -dump tree |
-Functional toggles the functional
flag which is off by default. If it is set, the commands described below
generate the specific Java code based on Parser for a functional
parser to implement the Parser tree associated with name,
on standard output. The examples above illustrate that other commands
described previously, such as -Input, -lookahead, and
-build to represent a grammar as a tree, and -prolog,
-tree and -epilog to emit the first and last section of the
parser definition and generate a tree factory as a builder, are usually used
to complete the Java code.
Once functional is set -main or
-Main creates a main program for a compiler based on a functional
parser which runs the parser with an instance of builder-class (or
null) as the builder and sends the serialized new tree to standard
output. The main program supports command line arguments: dump
displays the new tree and suppresses serialization, null will toggle
storing null if empty input is accepted, and trace provides
diagnostic output from the scanner.
-Scanner calls on scanner to generate a parser factory
for literals and tokens in the grammar, -scanner additionally
generates Java code for the token parsers. The result is only useful as
building blocks for a functional parser.
Finally, -write generates the code for the functional parsers corresponding to the grammar rules which call on the builder set up by the main program for collection.
All code fragments are read from a properties file which is
read as a resource. The property Gen.properties can be
specified, e.g., with the option -D for java, to
override the stem of the file name, see Gen. The
value oops3/Visitor requests a properties file which is
configured to generate interfaces Visitor and Visit
internal to the tree factory and implement Visit in each
tree class to support the classic visitor pattern, see above. The properties file can
be copied and reconfigured to add a parameter to all methods.
Generating a functional frontend
| -Functional -generate [path]name.style > source.java -Generate [path]name.style > source.java |
| # create the package directory mkdir expr # generate a parser tree, check the grammar, # and generate code for a functional parser for arithmetic expressions java -jar oops3.jar -Functional -Generate expr.rfc > expr/ExprParser.java |
With functional set, -generate and
-Generate are shorthands for the most common use of the commands
described above, see Main. Either command reads the specified
grammar file and emits Java code for the main program, scanner, functional
parser, and tree factory. For -generate the parser will store
null for empty input, for -Generate it will not store
anything for empty input; either behavior can be toggled with null as
a command line argument when executing the main program.
As discussed, the input languages can describe themselves (and each other);
however, yyLex and the builders are hand-crafted. Trees for arithmetic expressions can be built
solely from the annotated grammar.
The annotated source tree and an executable archive are here. The C# version of oops3 is here.
|
(c) 2008 Axel T. Schreiner |
|||||||||
| PREV NEXT | FRAMES NO FRAMES | |||||||||