Package step4

Step 4: Checking Recognition.

See:
          Description

Interface Summary
expr.yyInput must be implemented by a scanner object to supply input to the parser.
If.yyInput must be implemented by a scanner object to supply input to the parser.
inc.yyInput must be implemented by a scanner object to supply input to the parser.
 

Class Summary
expr recognition for lines with arithmetic expressions.
expr.yyLex  
If lexical analysis for lines with open if statements and arithmetic expressions with identifiers.
If.yyLex  
inc recognition for lines with arithmetic expressions.
inc.yyLex  
 

Exception Summary
expr.yyException thrown for irrecoverable syntax errors and stack overflow.
If.yyException thrown for irrecoverable syntax errors and stack overflow.
inc.yyException thrown for irrecoverable syntax errors and stack overflow.
 

Package step4 Description

Step 4: Checking Recognition.

Step 2 presented a typical grammar for lines with arithmetic expressions. Step 3 showed how pj can be used to generate and test a scanner that assembles input characters to represent symbols such as a Number. A parser takes input symbols from a scanner and matches them to a grammar. It is now time to generate a parser and check if plausible lines with expressions actually conform to the grammar.

The good news is that all it takes is to add a scanner to a grammar to be able to check if the grammar and plausible programs fit together. The bad news is that there could be very little information if things go wrong.

Parsing

Parsing is the process of matching input symbols to a grammar. The grammar has a start symbol; for pj this is the first nonterminal in the input. A syntax tree uses the start symbol as the root, the branch nodes are nonterminal symbols, and the leaves are terminal symbols. The tree is ordered: the descendants of each branch node must form one right hand side of the grammar rule for the node's nonterminal.

Considering the grammar expr.pj from step 1, the rule

term: '+' term | '-' term | '(' sum ')' | Number;

permits the following tree fragments:

The sequence of leaves of a syntax tree is called a sentence; it consists only of terminal symbols. A parser is faced with a potential sentence, i.e., a sequence of terminal symbols provided by a scanner. The parser has to compute (but not necessarily store) a syntax tree to prove that the sequence is in fact a sentence. For example, for expr.pj the input

1 - 2

results in the following syntax tree:

As an aside: Depending on the grammar, a syntax tree might not be unique. The grammar with the single rule

sum: Number | sum '+' sum;

can produce the following different syntax trees for the same sentence:

Such a grammar is called ambiguous. Step 2 showed that a similarly left- and right-recursive grammar badexpr.pj has lots of shift/reduce conflicts.

If jay does not report conflicts for a grammar, a syntax tree computed for a sentence will be unique; therefore, jay can prove that a grammar is not ambiguous. (The converse is not true: jay is not happy with the grammar badbits.pj from step 1 but the grammar does produce unique syntax trees.)

The First Parser

expr.pj is the grammar for lines with arithmetic expressions developed previously. If pj generates a scanner with JLex and a parser with jay it also includes a main program to test the parser.

$ mkdir step4
$ java -jar pj.jar -jay -scanner:static - expr.pj > step4/expr.java

The parser remains silent if the input is acceptable:

$ javac step4/expr.java
$ java step4.expr
- 10 + 20 * 30 / (40 % 7)

Incorrect input produces a more violent reaction:

$ java step4.expr
- 10 ** 2
(1) at "*": syntax error, expecting '(' '+' '-' Number
Exception in thread "main" step4.expr$yyException: irrecoverable syntax error
        at step4.expr.yyparse(expr.java:348) ...
The generated parser produces a single, reasonably meaningful error message, but then it throws a yyException. Step 5 will deal with discovering more then one error.

Error Position

If a parser is checked interactively, the input line causing the error is usually easy to find. However, real grammars and real programs are a bit bigger and it is helpful if the parser error message contains some information to locate the position of the error. The scanner needs to be asked, because the parser has no idea where the unexpected input symbol came from.

Step 3 discussed the interface between the scanner and the parser. To keep the scanner as simple as possible there is no explicit provision for a position indication. If the scanner reads, e.g., from a graphical user interface a position indication might not even be a string value. However, step 3 illustrated that both, a scanner generated by pj and a scanner derived from StreamTokenizer, provide a good textual position indication as the result of toString().

Such a position indication should be made part of the parser error message which is always produced by the method yyerror(String, String[]) which jay generates as part of the parser. Therefore, if pj generates a scanner, it also generates a main program to test the parser which overrides the error message method:

public static void main (String[] args) throws yyException, IOException {
  final yyInput scanner = new yyLex(System.in);
  expr parser = new expr() {
    public void yyerror (String message, String[] expecting) {
      super.yyerror(scanner+": "+message, expecting);
    }
  };
  parser.yyparse(scanner);
}

Tracing

While the example above demonstrated an error reaction by the parser, the actual error was caused by defective input, not by a defective grammar. Using expr.pj there simply is no way to build a syntax tree for

-10 ** 2

because a single * is an input symbol and there is no rule or combination of rules that permits two * adjacent to each other.

inc.pj adds identifiers and pre- and postfix increment operations to the arithmetic expressions:

expr: expr '+' expr
  | ...
  | Number
  | Id
  | "++" Id
  | Id "++"
  ;

$ java -jar pj.jar -jay -scanner:static - inc.pj > step4/inc.java

A classic C puzzle is the following:

$ javac step4/inc.java
$ java step4.inc
a +++++ b
(1) at "++": syntax error, expecting '\n' '%' '*' '+' '-' '/'

The problem is the greedy scanner, as a trace of the scanner shows:

$ java step4.inc'$'yyLex
scanner         token   value
a +++++ b
(1) at "a"      Id
(1) at "++"     _0
(1) at "++"     _0
(1) at "+"      '+'
(1) at "b"      Id
(1) at "\n"     '\n'

Scanners usually try to assemble the longest possible input symbols; therefore, the five + are assembled into two consecutive postfix increment operations, an arrangement that is not supported by the grammar. The following does not provoke an error:

$ java step4.inc
a +++ ++ b
A similar puzzle is
$ java step4.inc
a +++ b
The input matches the grammar, but it might be useful to know if ++ is applied to a or to b. Tracing the scanner provides a clue
$ java step4.inc'$'yyLex
scanner         token   value
a +++ b
(1) at "a"      Id
(1) at "++"     _0
(1) at "+"      '+'
(1) at "b"      Id
(1) at "\n"     '\n'
The increment operator is assembled before the addition operator; therefore, it will have to be matched to the grammar following a and not preceding b.

However, there is a more reliable way to check the exact behavior of a parser:

$ java -jar pj.jar -jay -scanner:static - -trace -verbose inc.pj > step4/inc.java

If the -trace option is set for pj, the parse method yyparse() can be called with an additional parameter, an observer yyDebug that will be informed about every action that the parser takes:

yyLoop:
yyDebug.push
yyDiscarded:
yyInput.advance, yyDebug.lex
operation from table
shift
-- yyErrorFlag (to 0)
yyDebug.shift
continue yyLoop
accept
yyDebug.accept, return last $$
error
yyErrorFlag
0
yyerror(), yyDebug.error
1,2
yyErrorFlag = 3
error
  on top of stack
 
yyDebug.shift error
continue yyLoop
yyDebug.pop
until stack empty
yyDebug.reject, yyException
3
if eof reject, yyException
yyDebug.discard
continue yyDiscarded
reduce
yyDebug.reduce
{ action }
yyDebug.shift // goto
continue yyLoop

Compilation and execution of the parser now requires the archive yydebug.jar which contains the package jay.yydebug defining the observer interface yyDebug. The package also contains a simple implementation yyDebugAdapter which writes messages to diagnostic output and is included by pj if the -trace option is set:

$ javac -classpath .:yydebug.jar step4/inc.java
$ java -cp .:yydebug.jar step4.inc
push    state 0 value null
reduce  state 0 uncover 0       rule (3) lines :
goto    from state 0 to 2
push    state 2 value null
a +++ b
lex     state 2 reading Id      value null
shift   from state 2 to 3       0 left to recover
push    state 3 value null
lex     state 3 reading _0      value null
shift   from state 3 to 12      0 left to recover
push    state 12        value null
reduce  state 12        uncover 2       rule (16) expr : Id _0
...

There is a lot of output which provides the most information if it is carefully analyzed together with the parser description in y.output. Often it is sufficient to just extract the lines beginning with lex and reduce which show which input symbol the parser reads and which right-hand side of a rule is used to compute the syntax tree:

$ java -cp .:yydebug.jar step4.inc |
> egrep '^(lex|reduce)'
reduce  state 0 uncover 0       rule (3) lines :
a +++ b
lex     state 2 reading Id      value null
lex     state 3 reading _0      value null
reduce  state 12        uncover 2       rule (16) expr : Id _0
lex     state 11        reading '+'     value null
lex     state 18        reading Id      value null
lex     state 3 reading '\n'    value null
reduce  state 3 uncover 18      rule (14) expr : Id
reduce  state 24        uncover 2       rule (5) expr : expr '+' expr
...
Clearly, the increment operator (which has been given the name _0 by pj) is connected to the first identifier, a, using right-hand side (16) and the result is connected into a sum using right-hand side (5) much later.

Tracing the parser also provides the definitive answer to the dangling else question. If.pj was originally introduced in step 1 and meanwhile combines arithmetic expressions with if statements without a trailing fi. The grammar has the usual shift/reduce conflict:

$ java -jar pj.jar -jay -scanner:static - -trace -verbose If.pj > step4/If.java
jay: 1 shift/reduce conflict.

Tracing demonstrates how an else is attached in a nest of if statements:

$ javac -classpath .:yydebug.jar step4/If.java
$ java -cp .:yydebug.jar step4.If |
> grep '^reduce.*stmt'
if 1 then if 2 then 3; else 4;
reduce ... rule (7) stmt : expr ';'
reduce ... rule (7) stmt : expr ';'
reduce ... rule (6) stmt : __if expr __then stmt __else stmt
reduce ... rule (5) stmt : __if expr __then stmt
reduce ... rule (9) opt_stmt : stmt
reduce ... rule (2) line : opt_stmt '\n'

The parser always computes the syntax tree bottom-up, i.e., beginning with the leaves and ending with the root. The fact that right-hand side (6) is used before (5) documents that the else is attached to the innermost if statement.

Animation

yyAnim is another implementation of the observer interface for the parser which provides animation. pj includes animation if a flag value is attached to -trace. The value includes 1 if standard input and 2 if standard output should be simulated in a text area in the animation window.

$ java -jar pj.jar -jay -scanner:static - -trace:3 If.pj > step4/expr.java

3 allows for interactive use of the parser animation:

$ javac -classpath .:yydebug.jar step4/expr.java
$ java -cp .:yydebug.jar step4.expr

In this case input can be typed into the text area at the bottom. A line can be edited using backspace and control-U; once newline or control-D is typed the line is made available to the scanner. Following Unix conventions, control-D at the beginning of a line acts as end of file.

The remaining areas show the operation of the parser. The comment area shows more or less the same information as the trace described earlier. The animation stops if something changes in an area where the checkbutton is set. A click on the continue button continues the animation.

Summary

Once a scanner is generated or hand-crafted, pj can be used to test if plausible programs match a proposed grammar. If there are error messages, scanner and parser can be traced to see exactly how the grammar and the program fit together. It is particularly helpful to compare the trace and y.output to follow the parser operation.

If a grammar has conflicts, they should be carefully checked. It might be advisable to write little programs and trace their recognition by the parser to verify that conflict resolution works as intended.