|
Language Processing v2.0 |
|||||||||
| PREV PACKAGE NEXT PACKAGE | FRAMES NO FRAMES | |||||||||
See:
Description
| Interface Summary | |
|---|---|
| Visitor | what a visitor for the XML-based arithmetic expression must do. |
| Class Summary | |
|---|---|
| Adder | a visitor to compute the value of an XML-based arithmetic expression. |
| AddingMachine | combine XMLBuilder and Adder. |
| NaiveScannerDemo | try to extract numbers and operators with a Scanner. |
| Parser | check if standard input is suitable for the adding machine, i.e., consists of numbers separated by operators. |
| ScannerDemo | extract numbers and operators with a Scanner. |
| StreamTokenizerDemo | extract numbers and operators with a StreamTokenizer. |
| XMLBuilder | represent adding machine input using XML. |
1: How to structure a compiler.
As an overview to illustrate the structure of a compiler consider a simple adding machine which
reads a sequence of numbers separated by + and - signs and produces the total:
The following implementation is overly complicated but it is structured like a compiler would be: the input character sequence is partitioned into numbers and operators, the resulting token sequence is grouped into phrases and represented as a tree, and then the tree is evaluated.
Originally, Java provided the StreamTokenizer class which "takes an input stream and
parses it into "tokens", allowing the tokens to be read one at a time." The class is quite useful as long
as only ASCII characters have to be considered and if the partitioning problem fits the design of the
class. For example, if the adding machine is to process integers, StreamTokenizer already
has to be coerced a bit:
StreamTokenizer scanner = new StreamTokenizer(new InputStreamReader(System.in));
scanner.resetSyntax(); // collect every single character, but...
scanner.whitespaceChars('\0', ' '); // ignore from '\0' to space
scanner.wordChars('0', '9'); // collect digits as a word
StreamTokenizer configures sets of characters: whitespace is ignored, words are
delivered as strings, string delimiters enclose a string with Java-like escape conventions, and other
characters are delivered one at a time:
while (scanner.nextToken() != StreamTokenizer.TT_EOF)
switch (scanner.ttype) {
case StreamTokenizer.TT_WORD: // collected digits
System.out.println(new Integer(scanner.sval));
break;
case '+': // collected single-character operator
case '-':
System.out.println((char)scanner.ttype);
break;
default: // collected other character
System.out.println("error: "+(char)scanner.ttype);
}
Java 5 introduced Scanner, "a simple text scanner which can parse primitive types
and strings using regular expressions." This class works with Unicode
and seems to be very easy to use:
Pattern operator = Pattern.compile("[-+]"); // recognizes the operators
Scanner scanner = new Scanner(System.in); // scans standard input
// partition standard input
for (;;)
if (scanner.hasNextInt()) // (signed) integer available
System.out.println(scanner.nextInt());
else if (scanner.hasNext(operator)) // operator available
System.out.println(scanner.next(operator));
else if (scanner.hasNext()) // other string available
System.out.println("error: "+scanner.next());
else break;
Unfortunately, the Scanner architecture (almost) requires that tokens are separated
by delimiters and it takes significant coaxing to extract operators and integer values which are not
necessarily separated by blanks or other mandatory delimiters. The following solution involves different
delimiter settings, including a delimiter which can even match no input at all:
Pattern spaces = Pattern.compile("\\s*"); // (possibly no) whitespace
Pattern notADigit = Pattern.compile("[^0-9]"); // anything but a digit
Pattern operator = Pattern.compile("[-+*/]"); // recognizes the operators
Scanner scanner = new Scanner(System.in); // scans standard input
// partition standard input
for (;;)
if (scanner.useDelimiter(spaces).hasNextInt()) // (signed?) integer
System.out.println(scanner.skip(spaces).useDelimiter(notADigit).nextInt());
else if (scanner.useDelimiter(spaces).hasNext(operator)) // operator
System.out.println(scanner.next(operator));
else if (scanner.hasNext()) // other stuff
System.out.println("error: "+scanner.next());
else break;
Unfortunately, neither StreamTokenizer nor Scanner are good building
blocks when it comes to partitioning input for a compiler. From the examples it should be clear that a
more comprehensive tool to forge Java code for input partitioning will come in handy. Such a tool will be
discussed in step 3.
The frontend of a compiler checks that the input is in fact a program and represents it in a form which is easy for a backend to further analyze, convert to code, or interpret. Parser is a class which shows how to check input:
protected static Scanner scanner;
public static void main (String... arg) {
scanner = new Scanner(System.in);
sum();
}
The main program sets up input partitioning with a Scanner and
calls sum which knows about a sum as a sequence of numbers and operators. sum in turn calls number which uses pieces of the input partitioning
logic shown earlier to obtain a number:
public static void sum () {
number();
while (summand())
;
}
public static void number () {
if (!scanner.useDelimiter(spaces).hasNextInt())
throw new RuntimeException("expecting a number");
scanner.skip(spaces).useDelimiter(notADigit).nextInt();
}
Next, sum repeatedly calls summand which knows about a
summand as a combination of an operator and another number. summand uses pieces of
the input partitioning logic shown earlier to obtain an operand or return false once the end of
the input is reached:
public static boolean summand () {
if (scanner.useDelimiter(spaces).hasNext(operator)) {
scanner.next(operator);
number();
return true;
} else if (scanner.hasNext()) // something unexpected
throw new RuntimeException("expecting + or -");
else
return false; // end of input
}
It is quite striking how closely the code of sum and summand
corresponds to the linguistic definitions of a sum and a summand. Step 5 discusses tools
which arrange to produce the recognition functions directly from the language descriptions.
Binary trees are a natural way to represent arithmetic expressions such as the ones accepted by the
adding machines. XML, in turn, is a popular notation to describe
trees. XMLBuilder is a frontend which reads input for the adding machine from standard input and
writes a XML-based representation to standard output.
XMLBuilder closely follows the architecture of Parser. The main
program again sets up sets up input partitioning with a Scanner and it contains the
boilerplate from the Java API for XML Processing
(JAXP) to construct a XML-based Document and to use a trivial XSLT Transformer to send the
document to standard output:
protected static Scanner scanner;
protected static Document tree;
public static void main (String... arg) throws Exception {
scanner = new Scanner(System.in);
tree = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
tree.appendChild(sum());
TransformerFactory.newInstance().newTransformer()
.transform(new DOMSource(tree), new StreamResult(System.out));
}
sum, number, and summand
are based on their counterparts from Parser, but now they create Element
objects which represent the input that has been recognized:
public static Element sum () {
Element result = number(); // leftmost number
for (Element e; (e = summand()) != null; result = e)
e.insertBefore(result, e.getFirstChild()); // combine with next summand
return result;
}
public static Element number () {
if (!scanner.useDelimiter(spaces).hasNextInt())
throw new RuntimeException("expecting a number");
int number = scanner.skip(spaces).useDelimiter(notADigit).nextInt();
Element result = tree.createElement("number"); // <number>
result.appendChild(tree.createTextNode(""+number)); // digits
return result;
}
public static Element summand () {
if (scanner.useDelimiter(spaces).hasNext(operator)) {
Element result = tree.createElement( // <add> or <subtract>
scanner.next(operator).equals("+") ? "add" : "subtract");
result.appendChild(number()); // <number> digits
return result;
} else if (scanner.hasNext()) // something unexpected
throw new RuntimeException("expecting + or -");
else
return null; // end of input
}
XMLBuilder should really be a subclass or an observer of Parser. Step 6
explores this further and discusses tools which can eliminate the need for XML and generate
language-specific classes to represent programs.
A visitor is an object which receives a stucture and applies node-specific functions to
nodes in the structure. Typically this involves "visiting" descendants of the nodes, i.e., the visitor
will send descendants of a node to itself. Classic applications of the visitor design pattern include tree traversal,
where the visitor applies an operation exactly once to each node of a tree in a particular order, and in
particular the evaluation of arithmetic expressions. In Java, a visitor is normally described with an
interface such as Visitor which has one method for each class of objects found in the structure to
be visited:
public interface Visitor {
Object document (Document d); // deal with the Document node
Object element (Element e); // deal with every Element node
}
Adder is a Visitor which calculates the value of an expression which XMLBuilder has represented using XML. The main program uses boilerplate from the Java API for XML Processing to load the XML-based tree from standard
input and sends the Document to an object implementing Visitor:
public static void main (String... arg) throws Exception {
System.out.println(new Adder().document(
DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(System.in)
));
}
A visitor should apply a node-specific function to each incoming node. The Visitor interface
tries to "divide and conquer" with one method per node class (which leads to complications once more
classes are added to a structure library). However, in a XML-based tree most nodes are Elements; therefore, Adder uses the element tag and reflection to send Element nodes to methods based on the element name:
public Integer document (Document d) {
return element(d.getDocumentElement());
}
public Integer element (Element e) {
try {
return (Integer)getClass().getMethod(e.getTagName(), Element.class).invoke(this, e);
} catch (Exception ex) {
throw new RuntimeException("invalid tree", ex);
}
}
Just as an aside: an interface such as Visitor ought to consider what exceptions the methods
might raise. The code above illustrates how easy it is to circumvent Java's requirement of checked
exceptions. It is a sad fact that just because a method does not declare an exception it does not really
mean that the method will not throw a serious one.
Applied properly, the visitor design pattern is a very successful approach to divide and conquer. The remaining methods in
Adder are small and deal with one Element name at a time:
public Integer add (Element e) { // <add> (like <subtract)
return new Integer(
(int)element((Element)e.getFirstChild())
+ (int)element((Element)e.getLastChild())
);
}
public Integer number (Element e) { // <number>
return new Integer(e.getTextContent());
}
If a compiler frontend represents a program as a tree, visitors are an elegant mechanism to structure
algorithms for further analysis, code generation, or interpretation. Step 7 explores this
further and discusses a tool which helps to avoid the restrictions which the visitor pattern tends to
impose on the class hierarchy used to represent the program.
XMLBuilder and Adder are intended to be run in a command pipeline. Alternatively,
AddingMachine is a subclass of XMLBuilder with a new main program which combines frontend
and backend of the adding machine:
public static void main (String... arg) throws Exception {
scanner = new Scanner(System.in);
tree = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
tree.appendChild(sum());
System.out.println(new Adder().document(tree));
}
It would have been much easier to implement the adding machine as a simple loop over the input characters or perhaps numbers and operator symbols which the adding machine is expected to process. The separation of concerns into layers for partitioning input, recognizing, representing phrases as a tree, and finally visitors for further processing such as evaluation was intended as a blueprint for much more complicated compiler programs.
A significant amount of code in a compiler is usually generated with tools which require some theory, e.g., about grammars. The adding machine example tried to show where and why tools will avoid tedious and error-prone manual coding.
Finally, compilers are probably confronted with more incorrect than correct programs and they (or the tools) have to expend a significant effort on dealing with user errors. The examples above have largely ignored user errors and exception handling in an attempt to make the overall structure of a compilation process more apparent.
|
(c) 2008 Axel T. Schreiner |
|||||||||
| PREV PACKAGE NEXT PACKAGE | FRAMES NO FRAMES | |||||||||