|
|||||||||
| PREV PACKAGE NEXT PACKAGE | FRAMES NO FRAMES | ||||||||
See:
Description
| Interface Summary | |
|---|---|
| a.yyInput | must be implemented by a scanner object to supply input to the parser. |
| badIf.yyInput | must be implemented by a scanner object to supply input to the parser. |
| expr.yyInput | must be implemented by a scanner object to supply input to the parser. |
| If.yyInput | must be implemented by a scanner object to supply input to the parser. |
| screen.yyInput | must be implemented by a scanner object to supply input to the parser. |
| Class Summary | |
|---|---|
| a | invalid lexical analysis for lines with arithmetic expressions. |
| a.yyLex | |
| badIf | invalid lexical analysis for lines with open if statements and arithmetic expressions with identifiers. |
| badIf.yyLex | |
| expr | lexical analysis for lines with arithmetic expressions. |
| expr.yyLex | |
| If | lexical analysis for lines with closed if statements and arithmetic expressions with identifiers. |
| If.yyLex | |
| Scanner | hand-crafted lexical analysis for lines with arithmetic expressions. |
| screen | lexical analysis for lines with closed if statements and arithmetic expressions with identifiers. |
| screen.yyLex | |
| Exception Summary | |
|---|---|
| a.yyException | thrown for irrecoverable syntax errors and stack overflow. |
| badIf.yyException | thrown for irrecoverable syntax errors and stack overflow. |
| expr.yyException | thrown for irrecoverable syntax errors and stack overflow. |
| If.yyException | thrown for irrecoverable syntax errors and stack overflow. |
| screen.yyException | thrown for irrecoverable syntax errors and stack overflow. |
Step 3: Generating the Scanner.
StreamTokenizer as a Scanner
Step 2 presented a typical grammar for lines
with arithmetic expressions. The next step is to check if input
that should be legitimate actually conforms to the grammar. There
are two problems: input characters have to be assembled to represent
symbols such as a Number, and the sequence of input
symbols has to be matched to the grammar. A scanner deals with
the first problem, a parser uses a scanner and deals with the
second problem. pj can generate both.
Lexical analysis is the process of assembling input characters
to represent the symbols that the grammar rules expect, and to
silently discard, e.g., white space or comments. pj can
almost automate the creation of a scanner class which performs
lexical analysis for typical programming languages. In order to
compile and use this class and more, Java code has to be added to
the input for pj.
a.pj adds a
prolog and an epilog to the expression grammar of Step 2:
package step3;
public class a {
%left '+' '-'
%left '*' '/' '%'
%%
program: ...
%%
}
The prolog is Java code until the first % at the beginning
of a line (and not within a comment). This code is output first
if pj and jay generate a Java file; therefore the
prolog can contain package and import statements.
The prolog must start a Java class which the epilog concludes.
pj assumes that the class name in the first class header
in the prolog is the name of the parser class.
The epilog is Java code which follows the grammar rules after a
line containing %%. This code is output last if pj
and jay generate a Java file. It can contribute to the
class started in the prolog and it could contain additional,
non-public classes as well.
Prolog and epilog should not introduce identifiers that start
with the letters yy because pj's tools JLex
and jay reserve this prefix for their own use.
The -scanner option is used to insert a scanner into
the output of pj and jay. The output should be
placed into the appropriate Java file and compiled.
$ mkdir step3
$ java -jar pj.jar -jay -scanner:static=a.lex - a.pj > step3/a.java
With this form of the option the scanner is a static
inner class of a and pj can insert a main
function so that the scanner can be tested:
$ javac step3/a.java $ java step3.a'$'yyLex scanner token value - 10 (1) at "-" '-' (1) at "1" 49 (1) at "0" 48
Unfortunately, these are not the symbols that the grammar expects: 1 and 0 have not been assembled as a Number and all whitespace, including the newline characters which the grammar expects at the end of each line, has been discarded.
pj generates input for JLex to create the scanner.
The command above saved this input as a.lex. The last few
lines show how the scanner works:
%%
[\ \b\f\n\r\t]+ {
}
.|\n {
token = yytext().charAt(0); return true;
}
Regular expressions try to match input characters. The longest
match, or the first of several equal matches, triggers a piece
of Java code which can access the input by calling yytext().
By convention the symbol value -- for single characters the
character value itself -- is assigned to token and an
information object could be assigned to value. The code
then returns true.
The last few lines of a.lex show why this scanner
does not assemble the expected symbols: Thus far pj has
generated only two rules, one to ignore white space and the almost
mandatory last rule to pass the remaining single characters. Newline
\n is specified in both rules, but in such a case the
earlier rule is used.
It is important to test the scanner. A test program can be quite simple:
public static void main (java.lang.String[] args) throws Exception {
yyLex scanner = new yyLex(System.in);
while (scanner.advance())
System.out.println(scanner.token()+"\t\""+scanner.value()+"\"");
}
As long as advance() is successful, token and
value can be retrieved as method values. a.lex
contains a more sophisticated test program which pj supplies.
It asks the scanner for position information, avoids printing
invisible characters, and it uses a table yyNames[] in
which jay maps expected token numbers to symbolic names.
expr.pj adds two directives and macro definitions so that Number has a reasonable representation:
%skip {comment} {space}
%token Number {digits}
comment = ("#".*)
space = [\ \t\b\015]+
digits = [0-9]+
Macro definitions immediately follow the directives and are sent
unchanged to JLex if pj generates a scanner. A macro
definition gives a name to a regular
expression.
A generated scanner will ignore white space by default. However, the default can be replaced with a skip directive which specifies one or more macro names in braces, JLex-style. Once skip is used, only input matching the regular expressions of these macros will be ignored by the generated scanner.
Number represents a category of inputs in the grammar. A token directive is used to define Number as a terminal symbol and, optionally, to associate a macro, or even a string, with it. Input equal to the string, or matching the regular expression of the macro, is then collected and presented as one symbol to the grammar.
$ java -jar pj.jar -jay -scanner:static=expr.lex - expr.pj > step3/expr.java
Again, the Java program has to be put into an appropriate file and compiled, and then the scanner can be tested. It will now ignore comments and white space, assemble numbers, and return newlines in addition to the operators:
$ javac step3/expr.java $ java step3.expr'$'yyLex scanner token value # this is a comment - 10 (1) at "\n" '\n' (2) at "-" '-' (2) at "10" Number (2) at "\n" '\n'
badIf.pj
combines the if statements sketched in Step 1
with arithmetic expressions and identifiers. As before, there
is one shift/reduce conflict, but the Java program can still be
created, compiled, and tested:
$ java -jar pj.jar -jay -scanner:static=badIf.lex - badIf.pj > step3/badIf.java
jay: 1 shift/reduce conflict.
Unfortunately, this scanner does not work as intended:
$ javac step3/badIf.java $ java step3.badIf'$'yyLex scanner token value if a then 2 else c (1) at "if" Id (1) at "a" Id (1) at "then" Id (1) at "2" Number (1) at "else" Id (1) at "c" Id (1) at "\n" '\n'
if and the other symbols are recognized as identifiers, not as reserved words with a special meaning to the grammar. badIf.pj uses macros and a token directive to define the reserved words for the if statement:
%token Id {id}
%token IF {if} THEN {then} ELSE {else} // used too late!
if = "if"
then = "then"
else = "else"
id = ([a-zA-Z][a-zA-Z]*)
%%
stmt: IF expr THEN stmt
| IF expr THEN stmt ELSE stmt
| expr ';'
;
pj generates the matching rules for JLex in the
order in which the symbols appear in directives and grammar rules. The
JLex rules in badIf.lex show the problem:
{id} {
token = Id; value = "{id}"; return true;
}
{if} {
token = IF; value = "{if}"; return true;
}
Directives like token and skip can appear in the
input of pj in any order, preceding the macro definitions.
Here, however, {id} is used before {if}, etc.;
therefore, the JLex rule to match {id} also appears
before the rule to match {if}. The input if to
the scanner matches both regular expressions and in such a case
the first JLex rule wins, i.e., if is interpreted
as an Id and not as an IF for the parser.
This problem can be repaired in the input to pj in two
ways: either by reordering the directives so that the more specific
macros are referenced first, or by making use of the fact that
pj deliberately generates rules for all double-quoted
literal strings before rules for macro references.
As an aside: while pj and jay only reserve the
name error for their own use, token identifiers
will show up in JLex rules and from there in the final
Java code, i.e., it would not have been a good idea to use a word
such as if that is reserved in Java as a token
name in this grammar; by convention, token names are usually
spelled in upper case for symbols or with a leading upper case
character for input categories.
If.pj defines almost the same language as badIf.pj but it repairs the shift/reduce conflict and arranges for a correct scanner to be generated:
$ java -jar pj.jar -jay -scanner:static=If.lex - If.pj > step3/If.java
This scanner works as intended:
$ javac step3/If.java $ java step3.If'$'yyLex scanner token value if a then 2 else c fi (1) at "if" __if (1) at "a" Id (1) at "then" __then (1) at "2" Number (1) at "else" __else (1) at "c" Id (1) at "fi" __fi (1) at "\n" '\n'
if and the other symbols are now recognized as reserved
words because they have been specified as double-quoted literals
in the grammar and pj generates JLex rules for such
literals before rules for macros:
stmt: "if" expr "then" stmt "fi" | "if" expr "then" stmt "else" stmt "fi" | expr ';' ;The shift/reduce conflict has disappeared because the grammar requires that each if statement has to be terminated with fi so that there can be no question as to where an else belongs.
Double-quoted literals and their precedence over macros in the generated JLex rules provide a very convenient way to specify reserved words in a grammar. A large number of reserved words, however, can result in a somewhat bloated scanner.
Screening is an alternative: if some input matches a regular expression specified as a macro the input is used as a key for a hashtable. The value from the table, if any, provides the grammar symbol which the input represents. If there is no value, a default symbol is sent to the grammar.
pj supports screening through screen directives
and single-quoted literals. screen.pj defines the same language
as If.pj and uses
screening:
$ java -jar pj.jar -jay -scanner:static=screen.lex - screen.pj > step3/screen.java
This scanner works as intended, too:
$ javac step3/screen.java $ java step3.screen'$'yyLex scanner token value if a then 2 else c fi (1) at "if" _if (1) at "a" Id (1) at "then" _then (1) at "2" Number (1) at "else" _else (1) at "c" Id (1) at "fi" _fi (1) at "\n" '\n'
a and c represent the same token which classifies them as identifiers, while reserved words like if all represent the different single-quoted literals that have been specified in the grammar:
%screen {id}
%token Id {id}
...
stmt: 'if' expr 'then' stmt 'fi'
| 'if' expr 'then' stmt 'else' stmt 'fi'
| expr ';'
;
expr: expr '+' expr
...
A screen directive specifies macro references that result
in JLex rules with code consulting a global hashtable. All
single-quoted literals are entered into the hashtable and can be
found if input matches the regular expression of one of the macros.
If the matching input cannot be found in the hashtable and if the
macro is also associated with a token name such as Id
above, the input is considered to represent the token name,
otherwise an error results.
Input which does not match any of the screen macros' regular expressions will not be looked up in the hashtable. This may be unfortunate; however, remaining single input characters will still be found by the last JLex rule and they represent single-quoted single characters in the grammar rules.
StreamTokenizer as a Scanner
Scanner and parser communicate using an interface which contains three functions: advance() returns true if it was able to move to the next input symbol, token() returns the integer value which the parser expects to represent the symbol, and value() can return an object with more information about the symbol.
A scanner need not be generated by pj. In fact, pj itself requires a scanner Rfc.Scanner which uses the state feature of JLex to employ different kinds of scanning rules in different parts of the input.
Scanner.java shows how a scanner can be derived from a StreamTokenizer. This could perhaps be advantageous if JLex is not available; however, the base class has shortcomings that cannot readily be circumvented:
$ javac step3/Scanner.java $ java step3.Scanner scanner token value if a10 then b.20.30 c fi Token[if], line 1 Id if Token[a10], line 1 Id a10 Token[then], line 1 Id then Token[b], line 1 Id b Token['.'], line 1 46 Token[n=20.3], line 1 Number 20.3 Token[fi], line 1 Id fi Token[EOL], line 2 '\n'
pj provides tracing for a generated scanner; this can
be reused in a hand-crafted scanner. However, screening has to be
implemented explicitly.
One of the advantages of StreamTokenizer is
that backslash escapes in strings are interpreted automatically and
that skipping for both kinds of Java-style comments can be turned on.
However, the implementation of StreamTokenizer.parseNumbers() adds periods and minus
signs as possible characters to numbers and identifiers -- this
usually has to be countermanded. The code above shows that a
leading period can then not be made part of a number.
Similarly, there is no clean way to scan operators that consist
of more then one special character. There is StreamTokenizer.pushBack() so that with look-ahead an
operator could be composed of two special characters, but there
is would be no control over whitespace and comments between the
characters.
A scanner assembles input character into symbols for the grammar
and discards, e.g., whitespace and comments. pj can generate
input for JLex to create a scanner. This usually requires
a skip directive referencing macro definitions for white
space and for comments and token directives for input
categories such as Number or Id referencing
macros describing approriate input through regular expressions. Reserved
words can be specified as double-quoted literals directly in the
grammar.
The scanner can be tested if pj is used to create a
Java program with a scanner as a static inner class. The
class header for the Java program is specified in the prolog for
pj and the epilog must at least close the class definition.
JLex uses macro definitions with regular expressions from the
input to pj and has a table of rules where input matching
a regular expression triggers the execution of some Java code to
generate the grammar symbol and additional information. The order
of these JLex rules is important: the scanner tries for
the longest match but breaks a tie in favour of the first JLex
rule that can match the input.
token directives in pj pair symbol names and
macro references or literals. They result in JLex rules
to match double-quoted literals and macros where JLex rules
for double-quoted literals are generated first. skip
directives reference macros and result in JLex rules where
input is ignored. scan directives can also be specified;
they take literals as arguments and can be used to influence the
order of JLex rules because all but the precedence directives
(see Step 2) can be specified in any order. Finally,
literals can appear in the grammar rules in pj.
pj generates JLex rules first for all double-quoted
literals in the order in which they appear in the directives and
grammar rules, and then for all macro references, again in order
of appearance. There is a last, default rule that catches the
remaining single characters.
Screening can improve a scanner. A screen directive
contains macro references. Input matching such a macro is looked
up in a hashtable constructed from all single-quoted literals in
the pj specification. If the input cannot be found and the
macro is also associated with a token, the input is
considered to represent the token name.
|
|||||||||
| PREV PACKAGE NEXT PACKAGE | FRAMES NO FRAMES | ||||||||