Package pj

This is the homepage of pj, a preprocessor for jay and JLex which largely automates the generation of parse trees from programs.

See:
          Description

Interface Summary
Dump.Helper can be implemented if a class wants to do it's own dump.
Leaf what a leaf in a pj tree can do.
Rfc.yyActions interface between grammar and actions.
Rfc.yyInput must be implemented by a scanner object to supply input to the parser.
 

Class Summary
Dump output an indented tree.
PJ preprocessor for jay and JLex.
PJServlet Deprecated. Replaced by Servlet.
Rfc RFC-style grammar for pj.
Rfc.Scanner embedded pj scanner.
SubProcess arranges to store text and process it through a native subprocess.
SubProcess.Copy daemon thread to copy a stream to and from the subprocess.
 

Exception Summary
Log thrown to back out of a servlet with a report.
Rfc.yyException thrown for irrecoverable syntax errors and stack overflow.
 

Package pj Description

This is the homepage of pj, a preprocessor for jay and JLex which largely automates the generation of parse trees from programs.

pj reads a grammar specified in BNF, extended with optional and iteration constructs and annotated only with class names, and generates a scanner with JLex, jay grammar rules for error recovery, an interface between the parser constructed by jay and actions carried out during recognition, and an implementation for this interface which can build parse trees that are suitable for processing with jag.

pj is available through wcs and for binary and source download.

Command Line

java -jar pj.jar [options] [--] [source|-]

By default pj will either process an input file specified on the command line or standard input and will produce a grammar for jay as standard output. If the input is annotated with types, the output will include an action interface. If the input is correct, the exit code is zero.

The following options can be specified in any order:

-dump dump the parser tree.
-flatten squeeze out a single ArrayList parameter.
-jay[=file] postprocess with jay, keep jay input file.
-packages list package names to search for base classes; the first character of the list is the delimeter between names. The default is :java.lang:java.util.
-recover generate error recovery for iterations.
-scanner[:static][=file] - generate nested scanner class, static, keep JLex input file.
-scanner[:static] file insert nested scanner class from JLex source file, static.
-trace[:0|1|2|3] postprocess with jay and compile jay trace, with animation simulating nothing, input, and/or output.
-tree[:static] - generate nested tree factory class, static.
-tree file only generate tree factory and insert into source file to replace a line containing %%. This option can only be combined with -dump.
-verbose postprocess with jay and create parser description.
-wcs allow System spoofing.

Input Format

pj borrows heavily from jay. Input is mostly free format with Java-style comments. However, the input file consists of several sections which are separated by symbols that must appear at the beginning of lines which are otherwise empty. The sections and the symbols are as follows:

    prolog
    directives
    macro definitions
  %%
    rules
  %%
    epilog

Prolog and Epilog

prolog and epilog together with their delimiting symbols are optional. However, the typical use of pj is to create a Java source file which starts with the prolog and ends with the epilog. Each is copied verbatim without the delimiting symbols; in this case the prolog must at least contain a class header and the epilog must terminate this class.

The class name from the first class header in the prolog is assumed to be the class name of the parser, i.e., the parser should not be an inner class.

Directives

directives are optional. They are useful to control the behaviour of a JLex scanner generated by pj and to simplify a grammar which pj forwards to jay. Each kind of directive can be specified multiple times and their effect is cumulative. Each directive starts with a specific symbol at the beginning of a line.

Directives can be commented in the style of Java.

In strings, \ is used in the style of Java to escape a backslash, single and double quote, backspace, formfeed, newline, return and tab character.

%scan "string" ...

scan can specify (double-quoted) strings. This is only useful to influence the order of JLex rules if pj is used to generate a scanner.

%screen {macro-name} ...

screen is used to specify patterns to use for screening for literals if pj is used to generate a scanner.

screen designates one or more macros which must be defined later. Input matching the regular expression of any such macro will be checked against a hashtable of all literals (single quoted strings) used in the grammar specification. Screening literals can significantly reduce the size of a scanner generated by pj.

screen must be specified if multi-character literals are used. Unfortunately, at this point there is no check that the macro definitions actually find all literals. Single-character literals will at least be found by the default rule generated by pj for JLex.

%skip {macro-name} ...

skip is used to describe comments and other ignorable text if pj is used to generate a scanner.

skip designates one or more macros which must be defined later. Input matching the regular expression of any such macro will be ignored by the JLex scanner generated by pj.

%token <classname> name representation ...

token is used to provide information about input symbols. Typically, it will at least be used to define names for symbol categories such as numbers, strings, and user-defined identifiers.

token can start with one (qualified) Java classname. In this case the scanner's value() method must provide an object of this class whenever the token() method returns one of the tokens defined in this directive. If pj is used to generate the scanner, the class must have a constructor that accepts a string; this is true for most of the wrapper classes.

If token does not start with a classname the input symbols will later not be made available to actions. The directive is still useful, however, to combine a symbol name with a representation.

token contains one or more unique, alphanumeric input symbol names, which must start with a letter. Each name can then be used in the rules as a terminal symbol. Each name can be followed by a representation: a double- or single-quoted literal string (the latter may require a screen directive), or a macro defined later to designate a regular expression that describes how the input symbol is represented, i.e., in this case a name can represent many similar inputs in the rules.

If a name is not followed by a representation, it is not known how this input symbol would be represented. This can still be useful, as long as the scanner is not generated using pj.

%left symbol ...
%right symbol ...
%nonassoc symbol ...

These directives are used to designate symbols as left-, right-, or non-associative operators and to assign a precedence to a group of such symbols. The precedence increases with each successive such directive.

Each directive contains one or more symbols. A symbol is a name which may be defined with token or a literal string in single or double quotes.

These directives are useful to specify more compact grammars for jay. The tutorial Step 2 demonstrates how they simplify the specification of arithmetic expressions.

%prec symbol ...

This precedence directive follows the items on the right hand side of a rule and references the precedence table created by the aforementioned directives. It uses the referenced precedence to resolve conflicts involving the right hand side.

Macro Definitions

macro definitions are optional and can only follow the directives. They are useful to assign names to regular expressions when pj is used to generate a scanner with JLex. Each macro definition consists of a single line:

name = regular expression

The name must be unique, alphanumeric, and start with a letter. The regular expression must conform to the requirements of JLex as described by the link above. It is usually advisable to enclose a regular expression in parentheses.

Comments can only appear on lines between macro definitions.

Rules

The rules section is preceded by the symbol %% (which is not optional) and contains one or more rules:

<classname> name : right-hand-side <actionname> |... ;

A rule can start with a Java classname. The default is java.lang.Object. Each action in this rule has to return an object from this class. It is advisable to fully qualify this classname or to use the -packages option on the command line. Each right-hand side of a rule can end with an actionname in angle brackets.

A rule defines a unique, alphanumeric non-terminal name which must start with a letter. The definition consists of one or more right-hand-sides, separated by |, and terminated with a semicolon. The first rule defines the start symbol of the grammar.

A right-hand-side is a sequence of zero or more items, optionally followed by an inline precedence directive and/or an action. When the grammar is processed by jay at most one right hand side should be empty.

An item can be a literal string in single or double quotes, a name defined as a token, a name that has already or will be defined as a non-terminal, or finally an otherwise undefined name which will be treated as if it had been defined using a token directive.

An item can also be an optional or iterative construct:

name ? specifies that name appears at most once and produces null or the corresponding value().
name * specifies that name appears zero or more times and produces a (potentially empty) java.util.ArrayList of all occurrences.
name + specifies that name appears one or more times and produces a java.util.ArrayList of all occurrences.
name / item specifies that name appears one or more times, separated by a (non-iterative) item as described above, and produces a java.util.ArrayList of all occurrences.

For iterative constructs pj generates additional non-terminals with appropriate rules and optionally right-hand-sides to support a relatively robust error recovery mechanism. As indicated, most of these rules use java.lang.ArrayList as a classname and provide actions that collect values as required.

Explicit error recovery as implemented by jay is also possible because error is a reserved word that can be used as an item.

Scanner Generation

If requested pj will generate a scanner using JLex and insert the result as a statically nested or member class of the parser.

This scanner class is called yyLex and satisfies the yyInput interface required by the parser generated by jay. macro definitions are simply passed to JLex.

The central part of the scanner is a table of regular expressions and actions. JLex permits the regular expressions to be ambiguous; if there is a conflict the longest possible input string is given to the first regular expression that can recognize it.

pj emits literal strings in double quotes as regular expressions before macro references. Within each group the elements appear in order of appearance in the pj input. As a consequence, the screen and skip expressions will appear relatively early in the second group. If there is no skip directive, white space will be ignored with a very late regular expression. The various directives can be specified in any order in order to influence the order of the regular expressions in the scanner.

A default rule at the end of the table recognizes each single character that has not yet been recognized; these characters correspond to the single quoted characters in the input of pj unless these are found by screening. As a consequence, a single character in double quotes is recognized rather then the same single character in single quotes; this may have unfortunate results if both are specified in the same grammar.

The JLex actions are generated to return appropriate token() values to the parser. For names that were typed using token directives value() will return appropriate objects as long as a constructor with a string argument exists.

A test driver for a scanner is quite trivial to write. If pj generates the scanner with the option -scanner:static - a test driver like the following is included into yyLex:

  public static void main (String[] args) throws Exception {
    yyLex scanner = new yyLex(System.in);
    while (scanner.advance())
      System.out.println(scanner.token()+"\t\""+scanner.value()+"\"");
  }

Parser Generation

pj uses jay to check the grammar and generate a parser. The relevant input to jay consists of some of the precedence directives and of the rules, each modified to accommodate the syntax requirements of jay.

A number of symbols have to be replaced throughout: Single-quoted single characters are acceptable to jay. All other single- and double-quoted literals are replaced by identifiers and mentioned in token directives. The identifiers start with one or two underscores and use the text of an alphanumeric literal or a unique number.

Left-recursive rules are generated for the iterative constructs. If requested with -recover, these rules contain right-hand sides involving error; this may introduce otherwise harmless shift/reduce conflicts. Rules with an empty alternative are generated for optional constructs. The prefix opt_ and the suffix s is used for the new nonterminals; unique numbers are appended to avoid conflicts.

The classname at the beginning of each rule, if any, is mentioned in a type directive. Iterative constructs specify ArrayList. Optional constructs specify the type of the optional item, if any.

Actions

Each right-hand side of a rule can end with an action consisting of an actionname enclosed in angle brackets:

<classname> name : right-hand-side <actionname> |... ;

pj generates an interface yyActions as a member of the parser class. yyActions contains each actionname as a method returning the type specified as classname for the rule. The parameter types of the method result from the typed symbols in the right-hand side, i.e., nonterminals and optional symbols contribute their type, iterative constructs contribute ArrayList, and terminals contribute their type if one was specified with a token directive.

pj also generates a jay action for almost every right-hand side. If there is an actionname, the action is a call on the method declared in the interface yyActions. Otherwise the action collects the values corresponding to the typed symbols in the right-hand side, if any: if there is only a single such symbol it's value is returned by the action, if there are two or more, their values are returned in a ArrayList. Only a right-hand side without any typed symbols has no action.

An action for an optional construct returns null or the value corresponding to the optional symbol. An action for an iterative construct returns a ArrayList of values corresponding to the symbols actually found; for the * and / constructs this list may be empty. It should be noted that only typed symbols can be subject to optional and iterative constructs; if a seperator in the / construct is typed, it is also collected.

It is the user's responsibility to maintain matching types. While actions involving yyActions are declared and generated correctly, pj is not able to check semantic correctness of collection actions. Mismatches will show up as errors once the generated code is submitted to a Java compiler.

Tree Generation

yyActions is intended to specify a factory for a parse tree. If requested with -tree pj will generate the factory and either insert it into an explicitly specified source file or into the parser class. In the latter case, the parser's test driver will use the factory, create a parse tree, and display it with Dump.

Each actionname is generated as an inner class name and the action generated for jay calls on a public factory method which in turn calls on a suitable protected constructor generated for the class. In this fashion the factory can be subclassed and the method overridden to return an object from a different class.

The generated inner class is derived from the return type of the action. It contains access methods named after the symbols which contributed types to the action method; the methods return the values corresponding to the symbols.

The intention is to create parse trees which use List in their branch nodes for the benefit of jag. Again, it is up to the user to maintain suitable types because pj cannot completely anticipate the results.

Files

pj creates temporary and permanent files if any of the -scanner or -jay options are specified. Before these options can be used, however, pj.properties must contain properties that locate the processors. The property pj.properties can be specified, e.g., on the command line, to point to an alternate properties file which would be located as a resource. jay and jay.skeleton are the full paths to jay and the default skeleton, respectively. java and jlex.jar are the full paths to the Java executor and an executable jar file containing JLex, respectively.

-jay
-trace
-verbose
input for jay is stored in a temporary file; jay creates other temporary files and, if requested, stores a parser description in y.output.
-jay=file input for jay is stored in file.
-scanner[:static] - input for JLex is stored in a temporary file; Java code generated by JLex is stored in another temporary file before it becomes part of the standard output or is sent to jay.
-scanner[:static]=file - input for JLex is stored in file; Java code generated by JLex is stored in a temporary file before it becomes part of the standard output or is sent to jay.
-scanner[:static] file input for JLex is read from file; Java code generated by JLex is stored in a temporary file before it becomes part of the standard output or is sent to jay.
-tree file file is read and should contain code so that a single line %% can be replaced by the classes and methods of the tree factory, the result is written as standard output.

Installation

Depending on the command line options, pj runs JLex and jay as subprocesses. The file pj/pj.properties inside pj.jar contains four properties with paths which have to be adjusted for the local system:

  1. In a directory that only contains the archive pj.jar,
  2. jar xf pj.jar pj/pj.properties
    extract the file from the archive,
  3. edit the first four properties so that java points to the Java VM interpreter, jlex.jar points to an executable archive with JLex, jay points to the jay executable, and jay.skeleton points to the default Java skeleton for jay,
  4. jar u pj/pj.properties < pj.jar > new.jar
    update the archive,
  5. and run a test with new.jar.

To produce the executable JLex archive, download the file Main.java into a directory JLex, apply this patch, compile, create a file manifest containing only Main-Class: JLex.Main and

jar cfm jlex.jar manifest JLex/*.class

To produce the jay executable, download and compile the C sources with an ANSI C compiler.

Projects

Downloads

Version:
1.5, Pittsford, December 2005.
Author:
Axel T. Schreiner.