Package step3

Step 3: Generating the Scanner.

Interface Summary
a.yyInput       must be implemented by a scanner object to supply input to the parser.
badIf.yyInput   must be implemented by a scanner object to supply input to the parser.
expr.yyInput    must be implemented by a scanner object to supply input to the parser.
If.yyInput      must be implemented by a scanner object to supply input to the parser.
screen.yyInput  must be implemented by a scanner object to supply input to the parser.

Class Summary
a             invalid lexical analysis for lines with arithmetic expressions.
a.yyLex
badIf         invalid lexical analysis for lines with open if statements and arithmetic expressions with identifiers.
badIf.yyLex
expr          lexical analysis for lines with arithmetic expressions.
expr.yyLex
If            lexical analysis for lines with closed if statements and arithmetic expressions with identifiers.
If.yyLex
Scanner       hand-crafted lexical analysis for lines with arithmetic expressions.
screen        lexical analysis for lines with closed if statements and arithmetic expressions with identifiers.
screen.yyLex

Exception Summary
a.yyException       thrown for irrecoverable syntax errors and stack overflow.
badIf.yyException   thrown for irrecoverable syntax errors and stack overflow.
expr.yyException    thrown for irrecoverable syntax errors and stack overflow.
If.yyException      thrown for irrecoverable syntax errors and stack overflow.
screen.yyException  thrown for irrecoverable syntax errors and stack overflow.

Package step3 Description

Step 3: Generating the Scanner.

Step 2 presented a typical grammar for lines with arithmetic expressions. The next step is to check whether supposedly legitimate input actually conforms to the grammar. There are two problems: input characters have to be assembled into symbols such as a Number, and the sequence of input symbols has to be matched against the grammar. A scanner deals with the first problem; a parser uses a scanner and deals with the second. pj can generate both.

The First Scanner

Lexical analysis is the process of assembling input characters into the symbols that the grammar rules expect, and of silently discarding, e.g., white space or comments. pj can almost completely automate the creation of a scanner class which performs lexical analysis for typical programming languages. In order to compile and use this class, and more, Java code has to be added to the input for pj.

a.pj adds a prolog and an epilog to the expression grammar of Step 2:

package step3;

public class a {

%left '+' '-'
%left '*' '/' '%'
%%

program: ...

%%
}
The prolog is Java code until the first % at the beginning of a line (and not within a comment). This code is output first if pj and jay generate a Java file; therefore the prolog can contain package and import statements.

The prolog must start a Java class which the epilog concludes. pj assumes that the class name in the first class header in the prolog is the name of the parser class.

The epilog is Java code which follows the grammar rules after a line containing %%. This code is output last if pj and jay generate a Java file. It can contribute to the class started in the prolog and it could contain additional, non-public classes as well.

Prolog and epilog should not introduce identifiers that start with the letters yy because pj's tools JLex and jay reserve this prefix for their own use.

The -scanner option is used to insert a scanner into the output of pj and jay. The output should be placed into the appropriate Java file and compiled.

$ mkdir step3
$ java -jar pj.jar -jay -scanner:static=a.lex - a.pj > step3/a.java

With this form of the option the scanner is a static inner class of a and pj can insert a main function so that the scanner can be tested:

$ javac step3/a.java
$ java step3.a'$'yyLex
scanner         token   value
- 10
(1) at "-"      '-'
(1) at "1"      49
(1) at "0"      48

Unfortunately, these are not the symbols that the grammar expects: 1 and 0 have not been assembled as a Number and all whitespace, including the newline characters which the grammar expects at the end of each line, has been discarded.

pj generates input for JLex to create the scanner. The command above saved this input as a.lex. The last few lines show how the scanner works:

%%
[\ \b\f\n\r\t]+ {
  }
.|\n {
    token = yytext().charAt(0); return true;
  }
Regular expressions try to match input characters. The longest match, or the first of several equal matches, triggers a piece of Java code which can access the matched input by calling yytext(). By convention, the value representing the symbol -- for single characters the character value itself -- is assigned to token, and an information object can be assigned to value. The code then returns true.
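
For example, a rule that assembles a Number could look like this; a sketch only, since a.lex does not yet contain such a rule, and it assumes a {digits} macro and the token name Number:

{digits} {
    token = Number; value = yytext(); return true;
  }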

The last few lines of a.lex show why this scanner does not assemble the expected symbols: Thus far pj has generated only two rules, one to ignore white space and the almost mandatory last rule to pass the remaining single characters. Newline \n is specified in both rules, but in such a case the earlier rule is used.

It is important to test the scanner. A test program can be quite simple:

public static void main (java.lang.String[] args) throws Exception {
  yyLex scanner = new yyLex(System.in);
  while (scanner.advance())
    System.out.println(scanner.token()+"\t\""+scanner.value()+"\"");
}
As long as advance() is successful, token and value can be retrieved as method values. a.lex contains a more sophisticated test program which pj supplies. It asks the scanner for position information, avoids printing invisible characters, and it uses a table yyNames[] in which jay maps expected token numbers to symbolic names.

Input Categories

expr.pj adds two directives and macro definitions so that Number has a reasonable representation:

%skip {comment} {space}
%token Number {digits}

comment = ("#".*)
space   = [\ \t\b\015]+
digits  = [0-9]+
Macro definitions immediately follow the directives and are sent unchanged to JLex if pj generates a scanner. A macro definition gives a name to a regular expression.

A generated scanner will ignore white space by default. However, the default can be replaced with a skip directive which specifies one or more macro names in braces, JLex-style. Once skip is used, only input matching the regular expressions of these macros will be ignored by the generated scanner.

Number represents a category of inputs in the grammar. A token directive is used to define Number as a terminal symbol and, optionally, to associate a macro, or even a string, with it. Input equal to the string, or matching the regular expression of the macro, is then collected and presented as one symbol to the grammar.

$ java -jar pj.jar -jay -scanner:static=expr.lex - expr.pj > step3/expr.java

Again, the Java program has to be put into an appropriate file and compiled, and then the scanner can be tested. It will now ignore comments and white space, assemble numbers, and return newlines in addition to the operators:

$ javac step3/expr.java
$ java step3.expr'$'yyLex
scanner         token   value
# this is a comment
- 10
(1) at "\n"     '\n'
(2) at "-"      '-'
(2) at "10"     Number
(2) at "\n"     '\n'
Matching Order

badIf.pj combines the if statements sketched in Step 1 with arithmetic expressions and identifiers. As before, there is one shift/reduce conflict, but the Java program can still be created, compiled, and tested:

$ java -jar pj.jar -jay -scanner:static=badIf.lex - badIf.pj > step3/badIf.java
jay: 1 shift/reduce conflict.

Unfortunately, this scanner does not work as intended:

$ javac step3/badIf.java
$ java step3.badIf'$'yyLex
scanner         token   value
if a then 2 else c
(1) at "if"     Id
(1) at "a"      Id
(1) at "then"   Id
(1) at "2"      Number
(1) at "else"   Id
(1) at "c"      Id
(1) at "\n"     '\n'

if and the other symbols are recognized as identifiers, not as reserved words with a special meaning to the grammar. badIf.pj uses macros and a token directive to define the reserved words for the if statement:

%token Id {id}
%token IF {if} THEN {then} ELSE {else} // used too late!

if      = "if"
then    = "then"
else    = "else"

id      = ([a-zA-Z][a-zA-Z]*)
%%

stmt: IF expr THEN stmt
  | IF expr THEN stmt ELSE stmt
  | expr ';'
  ;
pj generates the matching rules for JLex in the order in which the symbols appear in directives and grammar rules. The JLex rules in badIf.lex show the problem:
{id} {
    token = Id; value = "{id}"; return true;
  }
{if} {
    token = IF; value = "{if}"; return true;
  }
Directives like token and skip can appear in the input of pj in any order, preceding the macro definitions. Here, however, {id} is used before {if}, etc.; therefore, the JLex rule to match {id} also appears before the rule to match {if}. The input if to the scanner matches both regular expressions and in such a case the first JLex rule wins, i.e., if is interpreted as an Id and not as an IF for the parser.

This problem can be repaired in the input to pj in two ways: either by reordering the directives so that the more specific macros are referenced first, or by making use of the fact that pj deliberately generates rules for all double-quoted literal strings before rules for macro references.
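
For the first repair, the two token directives of badIf.pj would simply trade places, e.g.:

%token IF {if} THEN {then} ELSE {else}
%token Id {id} // now referenced after the reserved words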

As an aside: while pj and jay reserve only the name error for their own use, token identifiers show up in JLex rules and from there in the final Java code. It would therefore not have been a good idea to use a word such as if, which is reserved in Java, as a token name in this grammar. By convention, token names are spelled in upper case for symbols, or with a leading upper case character for input categories.

Strings before Macros

If.pj defines almost the same language as badIf.pj but it repairs the shift/reduce conflict and arranges for a correct scanner to be generated:

$ java -jar pj.jar -jay -scanner:static=If.lex - If.pj > step3/If.java

This scanner works as intended:

$ javac step3/If.java
$ java step3.If'$'yyLex
scanner         token   value
if a then 2 else c fi
(1) at "if"     __if
(1) at "a"      Id
(1) at "then"   __then
(1) at "2"      Number
(1) at "else"   __else
(1) at "c"      Id
(1) at "fi"     __fi
(1) at "\n"     '\n'

if and the other symbols are now recognized as reserved words because they have been specified as double-quoted literals in the grammar and pj generates JLex rules for such literals before rules for macros:

stmt: "if" expr "then" stmt "fi"
  | "if" expr "then" stmt "else" stmt "fi"
  | expr ';'
  ;
The shift/reduce conflict has disappeared because the grammar requires that each if statement be terminated with fi; there can be no question as to where an else belongs.

Screening

Double-quoted literals and their precedence over macros in the generated JLex rules provide a very convenient way to specify reserved words in a grammar. A large number of reserved words, however, can result in a somewhat bloated scanner.

Screening is an alternative: if some input matches a regular expression specified as a macro the input is used as a key for a hashtable. The value from the table, if any, provides the grammar symbol which the input represents. If there is no value, a default symbol is sent to the grammar.

pj supports screening through screen directives and single-quoted literals. screen.pj defines the same language as If.pj and uses screening:

$ java -jar pj.jar -jay -scanner:static=screen.lex - screen.pj > step3/screen.java

This scanner works as intended, too:

$ javac step3/screen.java
$ java step3.screen'$'yyLex
scanner         token   value
if a then 2 else c fi
(1) at "if"     _if
(1) at "a"      Id
(1) at "then"   _then
(1) at "2"      Number
(1) at "else"   _else
(1) at "c"      Id
(1) at "fi"     _fi
(1) at "\n"     '\n'

a and c represent the same token, which classifies them as identifiers, while reserved words like if each represent the different single-quoted literals that have been specified in the grammar:

%screen {id}
%token Id {id}

...

stmt: 'if' expr 'then' stmt 'fi'
  | 'if' expr 'then' stmt 'else' stmt 'fi'
  | expr ';'
  ;
expr: expr '+' expr
  ...
A screen directive specifies macro references that result in JLex rules with code consulting a global hashtable. All single-quoted literals are entered into the hashtable and can be found if input matches the regular expression of one of the macros. If the matching input cannot be found in the hashtable and if the macro is also associated with a token name such as Id above, the input is considered to represent the token name, otherwise an error results.
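
The lookup itself is simple. The following is a minimal Java sketch of the screening idea, not the code pj emits; the class name and the token numbers are made up for illustration:

import java.util.Hashtable;

class ScreenSketch {
  static final int Id = 257, IF = 258, THEN = 259, ELSE = 260, FI = 261;
  static final Hashtable<String, Integer> reserved = new Hashtable<>();
  static {
    // one entry per single-quoted literal in the grammar
    reserved.put("if", IF);
    reserved.put("then", THEN);
    reserved.put("else", ELSE);
    reserved.put("fi", FI);
  }

  // called when input matches a screen macro such as {id}
  static int screen(String text) {
    Integer word = reserved.get(text);
    return word != null ? word : Id;  // Id: the token associated with the macro
  }
}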

Input which does not match any of the screen macros' regular expressions will not be looked up in the hashtable. This may be unfortunate; however, remaining single input characters will still be found by the last JLex rule and they represent single-quoted single characters in the grammar rules.

StreamTokenizer as a Scanner

Scanner and parser communicate using an interface which contains three functions: advance() returns true if it was able to move to the next input symbol, token() returns the integer value which the parser expects to represent the symbol, and value() can return an object with more information about the symbol.
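
In the generated code this contract is the yyInput interface listed in the summaries above; a sketch, where the throws clause is an assumption about the generated interface:

public interface yyInput {
  boolean advance() throws java.io.IOException; // move to the next input symbol
  int token();                                  // current symbol as a token number
  Object value();                               // additional information, if any
}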

A scanner need not be generated by pj. In fact, pj itself requires a scanner Rfc.Scanner which uses the state feature of JLex to employ different kinds of scanning rules in different parts of the input.

Scanner.java shows how a scanner can be derived from a StreamTokenizer. This could perhaps be advantageous if JLex is not available; however, the base class has shortcomings that cannot readily be circumvented:

$ javac step3/Scanner.java
$ java step3.Scanner
scanner         token   value
if a10 then b.20.30 c fi
Token[if], line 1       Id      if
Token[a10], line 1      Id      a10
Token[then], line 1     Id      then
Token[b], line 1        Id      b
Token['.'], line 1      46
Token[n=20.3], line 1   Number  20.3
Token[fi], line 1       Id      fi
Token[EOL], line 2      '\n'

pj provides tracing for a generated scanner; this can be reused in a hand-crafted scanner. However, screening has to be implemented explicitly.

One of the advantages of StreamTokenizer is that backslash escapes in strings are interpreted automatically and that skipping for both kinds of Java-style comments can be turned on. However, the implementation of StreamTokenizer.parseNumbers() adds periods and minus signs as possible characters to numbers and identifiers -- this usually has to be countermanded. The code above shows that a leading period can then not be made part of a number.
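
The countermanding can be done with calls such as the following; this is a hedged sketch, not the actual Scanner.java of this package:

import java.io.*;

class TokenizerSetup {
  static StreamTokenizer configure(Reader in) {
    StreamTokenizer st = new StreamTokenizer(new BufferedReader(in));
    st.eolIsSignificant(true);   // the grammar expects '\n' at the end of each line
    st.slashSlashComments(true); // skip // comments
    st.slashStarComments(true);  // skip /* */ comments
    st.ordinaryChar('-');        // '-' is an operator, not part of a number
    st.ordinaryChar('.');        // keep '.' out of numbers and identifiers
    return st;
  }
}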

Similarly, there is no clean way to scan operators that consist of more than one special character. There is StreamTokenizer.pushBack(), so that with look-ahead an operator could be composed of two special characters, but there would be no control over whitespace and comments between the characters.
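
A hypothetical fragment shows both the technique and the problem; GE is a made-up token number for an operator ">=":

import java.io.*;

class LookAhead {
  static final int GE = 262; // made-up token number for ">="

  // called after '>' was read: try to compose ">=" with one look-ahead token
  static int afterGreater(StreamTokenizer st) throws IOException {
    if (st.nextToken() == '=') return GE;
    st.pushBack();             // not '=': leave the symbol for the next call
    return '>';
  }
}

Because nextToken() skips white space and comments, "> =" would be composed just like ">=" -- this is the lack of control mentioned above.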

Summary

A scanner assembles input characters into symbols for the grammar and discards, e.g., whitespace and comments. pj can generate input for JLex to create a scanner. This usually requires a skip directive referencing macro definitions for white space and comments, and token directives for input categories such as Number or Id referencing macros that describe appropriate input through regular expressions. Reserved words can be specified as double-quoted literals directly in the grammar.

The scanner can be tested if pj is used to create a Java program with a scanner as a static inner class. The class header for the Java program is specified in the prolog for pj and the epilog must at least close the class definition.

JLex uses macro definitions with regular expressions from the input to pj and has a table of rules where input matching a regular expression triggers the execution of some Java code to generate the grammar symbol and additional information. The order of these JLex rules is important: the scanner tries for the longest match but breaks a tie in favour of the first JLex rule that can match the input.

token directives in pj pair symbol names and macro references or literals. They result in JLex rules to match double-quoted literals and macros where JLex rules for double-quoted literals are generated first. skip directives reference macros and result in JLex rules where input is ignored. scan directives can also be specified; they take literals as arguments and can be used to influence the order of JLex rules because all but the precedence directives (see Step 2) can be specified in any order. Finally, literals can appear in the grammar rules in pj.

pj generates JLex rules first for all double-quoted literals in the order in which they appear in the directives and grammar rules, and then for all macro references, again in order of appearance. There is a last, default rule that catches the remaining single characters.

Screening can improve a scanner. A screen directive contains macro references. Input matching such a macro is looked up in a hashtable constructed from all single-quoted literals in the pj specification. If the input cannot be found and the macro is also associated with a token, the input is considered to represent the token name.