Package step5

Step 5: Error Recovery.

See:
          Description

Interface Summary
a.yyInput must be implemented by a scanner object to supply input to the parser.
b.yyInput must be implemented by a scanner object to supply input to the parser.
many.yyInput must be implemented by a scanner object to supply input to the parser.
 

Class Summary
a direct error recovery.
a.yyLex  
b indirect error recovery.
b.yyLex  
many generated error recovery.
many.yyLex  
 

Exception Summary
a.yyException thrown for irrecoverable syntax errors and stack overflow.
b.yyException thrown for irrecoverable syntax errors and stack overflow.
many.yyException thrown for irrecoverable syntax errors and stack overflow.
 

Package step5 Description

Step 5: Error Recovery.

error

An input error results when the current input does not match the grammar. Step 4 demonstrated that the parser will throw a yyException when it detects the first input error thus terminating the parser.

This behavior is often undesirable. Therefore, there is a special reserved input symbol error which will match an input error (and which significantly changes the parsing algorithm as discussed below). As a first example, consider a.pj which defines a sequence of numbers and permits an input error within the sequence:

seq: // empty
  | seq Number
  | seq error;
Some experimenting shows that this grammar will cope with any kind of input, because whenever there is an input error the grammar is already expecting it, thanks to the error symbol specified in the right place:
$ mkdir step5
$ java -jar pj.jar -jay -scanner:static - a.pj > step5/a.java
$ javac step5/a.java
$ java step5.a
1 - 2 - 3 4 5 - 6
(1) at "-": syntax error, expecting Number
(1) at "-": syntax error, expecting Number

The input shown above contains three errors: the three minus signs. However, the parser only produces two error messages. Careful inspection of the trace shows why:

$ java step5.a 2>1 |
> egrep 'left to recover|expecting|rule .3.'
1 - 2 - 3 4 5 - 6
shift   from state 1 to 3       0 left to recover
(1) at "-": syntax error, expecting Number
reduce  state 2 uncover 0       rule (3) seq : seq error
shift   from state 1 to 3       2 left to recover
reduce  state 2 uncover 0       rule (3) seq : seq error
shift   from state 1 to 3       2 left to recover
shift   from state 1 to 3       1 left to recover
shift   from state 1 to 3       0 left to recover
(1) at "-": syntax error, expecting Number
reduce  state 2 uncover 0       rule (3) seq : seq error
Three input errors are duly found and matched to the grammar using rule 3. However, the parser maintains a counter yyErrorFlag which is set to three whenever an input error is detected and decremented by one back to zero whenever a real input is matched. A syntax error message is produced only if the counter is set from zero to three. This technique only reports errors that are separated by three or more valid input symbols and avoids annoying bursts of many similar messages.

error Search

error is useful if the exact position of an input error can be anticipated. However, as it stands one would have to add rules replacing just about every input symbol by error in order to make the parsing process reasonably robust. The resulting grammars would certainly be unreasonable and likely to be loaded with conflicts. Therefore, the parsing process is changed when an input error needs to be matched.

Recall from step 1 that jay can produce a file y.output describing the parser that it constructs. y.output documents the parsing process: The parser starts in state 0. Every input 'shifts' to a new state which is placed on the state stack. y.output shows that conceptually a marker is moved through the applicable rules. Once the marker reaches the end of a rule, the rule is 'reduced', i.e., its states are removed from the stack, and by 'goto' the parser reaches a new state reflecting the nonterminal to which the rule is reduced. Therefore, at any time the state stack contains all the marked positions in all the applicable rules.

If an input error happens, the parser tries to match error not only in the top state on the stack, but also in all states, one after another, that are below the top state. As the parser searches for a match for error, the state stack is popped. Parsing can continue, with or without an error message, if a match is found before the stack is empty:

yyLoop:
yyDebug.push
yyDiscarded:
yyInput.advance, yyDebug.lex
operation from table
shift
-- yyErrorFlag (to 0)
yyDebug.shift
continue yyLoop
accept
yyDebug.accept, return last $$
error
yyErrorFlag
0
yyerror(), yyDebug.error
1,2
yyErrorFlag = 3
error
  on top of stack
 
yyDebug.shift error
continue yyLoop
yyDebug.pop
until stack empty
yyDebug.reject, yyException
3
if eof reject, yyException
yyDebug.discard
continue yyDiscarded
reduce
yyDebug.reduce
{ action }
yyDebug.shift // goto
continue yyLoop

b.pj defines a sequence of numbers, each followed by a semicolon, and permits an input error within the sequence:

seq: // empty
  | seq item
  | seq error;

item: Number ';';
This grammar does not explicitly anticipate that a number is not followed by a semicolon. However, some experimenting shows that this grammar will still cope with any kind of input, because the error search will always eventually uncover rule 3 where the grammar expects an input error:
$ java -jar pj.jar -jay -scanner:static - b.pj > step5/b.java
$ javac step5/b.java
$ java step5.b
1; 2 + 3; 4 - 5; 6; * 7;
(1) at "+": syntax error, expecting ';'
(1) at "-": syntax error, expecting ';'
(1) at "*": syntax error, expecting Number

The trace shows that for two of the three input errors the state stack has to be popped in order to match error:

(1) at "+": syntax error, expecting ';'
pop     state 3 on error
(1) at "-": syntax error, expecting ';'
pop     state 3 on error
(1) at "*": syntax error, expecting Number

Robust Iterations

The previous example suggests important principles for an efficient placement of error: If error is used close to the start symbol of the grammar it is likely that the parser will not terminate by throwing yyException. If error is used in iterations it is likely that the parser will recover from input errors within the iterated items.

Step 2 introduced the optional and iterative constructs which pj supports:

name ? specifies that name appears at most once.
name * specifies that name appears zero or more times.
name + specifies that name appears one or more times.
name / item specifies that name appears one or more times, separated by a (non-iterative) item such as a literal symbol.

many.pj is a simple example that exercises each notation:

program: // zero or more
  | program many | program some | program opt | program list;

many: "many" m*;       // zero or more 'm'
some: "some" s+;       // one or more 's'
opt: "opt" o?;         // zero or one 'o'
list: "list" l / ',';  // one or more 'l', separated by commas

m: "m";
s: "s";
o: "o";
l: "l";
If the option -recover is specified pj will add error rules when the iterative constructs are expanded. Inspecting the jay input is a good way to see how typical iterations can be made robust:

$ java -jar pj.jar -recover many.pj

Zero or more items require one extra rule to be completely robust:

ms: // zero or more
  | ms m
  | ms error
  ;
One or more items require two extra rules to catch an input error in either the first or a subsequent item:
ss: s // one or more
  | ss s
  | error
  | ss error
  ;
A delimited list requires more rules:
ls: l // one or more, separated by ','
  | ls ',' l
  | error
  // likely to introduce r/r conflicts
  // | ls error
  | ls error l
  | ls ',' error
  ;
This shows the limits of automating error recovery: A trailing input error without a preceding delimeter is likely to cause reduce/reduce conflicts when iterative constructs are nested. pj cannot provide exact control over the order of rules presented to jay. It is not at all clear that these conflicts would be resolved in a useful fashion. Therefore, pj does not provide totally robust recovery for delimeted lists.

The same argument, unfortunately, applies to optional items, too:

opt_o: // null or one
  | o
  // likely to introduce r/r conflicts
  // | error
  ;
error is a reserved token that can be specified within rules where it has the usual effect, see [1]. However, if a right hand side includes error explicitly its action will return null. pj supports a systematic way to use error within all iterations; therefore, explicit use of error should hardly be necessary.

If the property pj.recover is true pj adds error tokens and manages yyErrorFlag so that the iteration rules are very likely to recover from any input error. Unfortunately, the technique tends to introduce shift/reduce conflicts into the resulting grammar.

The conflicts should be investigated but they are likely to be harmless -- an input error is usually reduced in the innermost active iteration, i.e., most of the remaining parse process will still work as intended.