Language Processing
v2.0

Package step01

1: How to structure a compiler.

See:
          Description

Interface Summary
Visitor what a visitor for the XML-based arithmetic expression must do.
 

Class Summary
Adder a visitor to compute the value of an XML-based arithmetic expression.
AddingMachine combine XMLBuilder and Adder.
NaiveScannerDemo try to extract numbers and operators with a Scanner.
Parser check if standard input is suitable for the adding machine, i.e., consists of numbers separated by operators.
ScannerDemo extract numbers and operators with a Scanner.
StreamTokenizerDemo extract numbers and operators with a StreamTokenizer.
XMLBuilder represent adding machine input using XML.
 

Package step01 Description

1: How to structure a compiler.

As an overview to illustrate the structure of a compiler consider a simple adding machine which reads a sequence of numbers separated by + and - signs and produces the total:

Run the adding machine.

The following implementation is overly complicated but it is structured like a compiler would be: the input character sequence is partitioned into numbers and operators, the resulting token sequence is grouped into phrases and represented as a tree, and then the tree is evaluated.

Partitioning Input

Originally, Java provided the StreamTokenizer class which "takes an input stream and parses it into "tokens", allowing the tokens to be read one at a time." The class is quite useful as long as only ASCII characters have to be considered and if the partitioning problem fits the design of the class. For example, if the adding machine is to process integers, StreamTokenizer already has to be coerced a bit:

StreamTokenizer scanner = new StreamTokenizer(new InputStreamReader(System.in));
scanner.resetSyntax();              // collect every single character, but...
scanner.whitespaceChars('\0', ' '); //   ignore from '\0' to space
scanner.wordChars('0', '9');        //   collect digits as a word

StreamTokenizer configures sets of characters: whitespace is ignored, words are delivered as strings, string delimiters enclose a string with Java-like escape conventions, and other characters are delivered one at a time:

while (scanner.nextToken() != StreamTokenizer.TT_EOF)
  switch (scanner.ttype) {
  case StreamTokenizer.TT_WORD: // collected digits
    System.out.println(new Integer(scanner.sval));
    break;

  case '+': // collected single-character operator
  case '-':
    System.out.println((char)scanner.ttype);
    break;

  default: // collected other character
    System.out.println("error: "+(char)scanner.ttype);
  }

Edit and run a StreamTokenizer example.

Java 5 introduced Scanner, "a simple text scanner which can parse primitive types and strings using regular expressions." This class works with Unicode and seems to be very easy to use:

Pattern operator = Pattern.compile("[-+]"); // recognizes the operators
Scanner scanner = new Scanner(System.in);   // scans standard input

// partition standard input
for (;;)
  if (scanner.hasNextInt())                 // (signed) integer available
    System.out.println(scanner.nextInt());
  else if (scanner.hasNext(operator))       // operator available
    System.out.println(scanner.next(operator));
  else if (scanner.hasNext())               // other string available
    System.out.println("error: "+scanner.next());
  else break;

Edit and run a naive Scanner example.

Unfortunately, the Scanner architecture (almost) requires that tokens are separated by delimiters and it takes significant coaxing to extract operators and integer values which are not necessarily separated by blanks or other mandatory delimiters. The following solution involves different delimiter settings, including a delimiter which can even match no input at all:

Pattern spaces = Pattern.compile("\\s*");      // (possibly no) whitespace
Pattern notADigit = Pattern.compile("[^0-9]"); // anything but a digit
Pattern operator = Pattern.compile("[-+*/]");  // recognizes the operators
Scanner scanner = new Scanner(System.in);      // scans standard input

// partition standard input
for (;;)
  if (scanner.useDelimiter(spaces).hasNextInt())            // (signed?) integer
    System.out.println(scanner.skip(spaces).useDelimiter(notADigit).nextInt());
  else if (scanner.useDelimiter(spaces).hasNext(operator))  // operator
    System.out.println(scanner.next(operator));
  else if (scanner.hasNext())                               // other stuff
    System.out.println("error: "+scanner.next());
  else break;

Edit and run a more useful Scanner example.

Unfortunately, neither StreamTokenizer nor Scanner are good building blocks when it comes to partitioning input for a compiler. From the examples it should be clear that a more comprehensive tool to forge Java code for input partitioning will come in handy. Such a tool will be discussed in step 3.

Recognizing Phrases

The frontend of a compiler checks that the input is in fact a program and represents it in a form which is easy for a backend to further analyze, convert to code, or interpret. Parser is a class which shows how to check input:

protected static Scanner scanner;

public static void main (String... arg) {
  scanner = new Scanner(System.in);
  sum();
}

The main program sets up input partitioning with a Scanner and calls sum which knows about a sum as a sequence of numbers and operators. sum in turn calls number which uses pieces of the input partitioning logic shown earlier to obtain a number:

public static void sum () {
  number();
  while (summand())
    ;
}

public static void number () {
  if (!scanner.useDelimiter(spaces).hasNextInt())
    throw new RuntimeException("expecting a number");
  scanner.skip(spaces).useDelimiter(notADigit).nextInt();
}

Next, sum repeatedly calls summand which knows about a summand as a combination of an operator and another number. summand uses pieces of the input partitioning logic shown earlier to obtain an operand or return false once the end of the input is reached:

public static boolean summand () {
  if (scanner.useDelimiter(spaces).hasNext(operator)) {
    scanner.next(operator);
    number();
    return true;
  } else if (scanner.hasNext()) // something unexpected
    throw new RuntimeException("expecting + or -");
  else
    return false; // end of input
}

Edit and run the Parser.

It is quite striking how closely the code of sum and summand corresponds to the linguistic definitions of a sum and a summand. Step 5 discusses tools which arrange to produce the recognition functions directly from the language descriptions.

Representing Phrases

Binary trees are a natural way to represent arithmetic expressions such as the ones accepted by the adding machines. XML, in turn, is a popular notation to describe trees. XMLBuilder is a frontend which reads input for the adding machine from standard input and writes a XML-based representation to standard output.

XMLBuilder closely follows the architecture of Parser. The main program again sets up sets up input partitioning with a Scanner and it contains the boilerplate from the Java API for XML Processing (JAXP) to construct a XML-based Document and to use a trivial XSLT Transformer to send the document to standard output:

protected static Scanner scanner;
protected static Document tree;

public static void main (String... arg) throws Exception {
  scanner = new Scanner(System.in);
  tree = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
  tree.appendChild(sum());
  TransformerFactory.newInstance().newTransformer()
    .transform(new DOMSource(tree), new StreamResult(System.out));
}

sum, number, and summand are based on their counterparts from Parser, but now they create Element objects which represent the input that has been recognized:

public static Element sum () {
  Element result = number(); // leftmost number
  for (Element e; (e = summand()) != null; result = e)
    e.insertBefore(result, e.getFirstChild()); // combine with next summand 
  return result;
}

public static Element number () {
  if (!scanner.useDelimiter(spaces).hasNextInt())
    throw new RuntimeException("expecting a number");
  int number = scanner.skip(spaces).useDelimiter(notADigit).nextInt();
  Element result = tree.createElement("number");       // <number>
  result.appendChild(tree.createTextNode(""+number));  //   digits
  return result;
}

public static Element summand () {
  if (scanner.useDelimiter(spaces).hasNext(operator)) {
    Element result = tree.createElement(  // <add> or <subtract>
      scanner.next(operator).equals("+") ? "add" : "subtract");
    result.appendChild(number());         //   <number> digits
    return result;
  } else if (scanner.hasNext()) // something unexpected
    throw new RuntimeException("expecting + or -");
  else
    return null; // end of input
}

XMLBuilder should really be a subclass or an observer of Parser. Step 6 explores this further and discusses tools which can eliminate the need for XML and generate language-specific classes to represent programs.

Edit and run the XMLBuilder.

Processing Trees

A visitor is an object which receives a stucture and applies node-specific functions to nodes in the structure. Typically this involves "visiting" descendants of the nodes, i.e., the visitor will send descendants of a node to itself. Classic applications of the visitor design pattern include tree traversal, where the visitor applies an operation exactly once to each node of a tree in a particular order, and in particular the evaluation of arithmetic expressions. In Java, a visitor is normally described with an interface such as Visitor which has one method for each class of objects found in the structure to be visited:

public interface Visitor {
  Object document (Document d);  // deal with the Document node
  Object element (Element e);    // deal with every Element node
}

Adder is a Visitor which calculates the value of an expression which XMLBuilder has represented using XML. The main program uses boilerplate from the Java API for XML Processing to load the XML-based tree from standard input and sends the Document to an object implementing Visitor:

public static void main (String... arg) throws Exception {
  System.out.println(new Adder().document(
    DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(System.in)
  ));
}

A visitor should apply a node-specific function to each incoming node. The Visitor interface tries to "divide and conquer" with one method per node class (which leads to complications once more classes are added to a structure library). However, in a XML-based tree most nodes are Elements; therefore, Adder uses the element tag and reflection to send Element nodes to methods based on the element name:

public Integer document (Document d) {
  return element(d.getDocumentElement());
}

public Integer element (Element e) {
  try {
    return (Integer)getClass().getMethod(e.getTagName(), Element.class).invoke(this, e);
  } catch (Exception ex) {
    throw new RuntimeException("invalid tree", ex);
  }
}

Just as an aside: an interface such as Visitor ought to consider what exceptions the methods might raise. The code above illustrates how easy it is to circumvent Java's requirement of checked exceptions. It is a sad fact that just because a method does not declare an exception it does not really mean that the method will not throw a serious one.

Applied properly, the visitor design pattern is a very successful approach to divide and conquer. The remaining methods in Adder are small and deal with one Element name at a time:

public Integer add (Element e) { // <add> (like <subtract)
  return new Integer(
    (int)element((Element)e.getFirstChild())
    + (int)element((Element)e.getLastChild())
  );
}

public Integer number (Element e) { // <number>
  return new Integer(e.getTextContent());
}

Edit and run the Adder.

If a compiler frontend represents a program as a tree, visitors are an elegant mechanism to structure algorithms for further analysis, code generation, or interpretation. Step 7 explores this further and discusses a tool which helps to avoid the restrictions which the visitor pattern tends to impose on the class hierarchy used to represent the program.

XMLBuilder and Adder are intended to be run in a command pipeline. Alternatively, AddingMachine is a subclass of XMLBuilder with a new main program which combines frontend and backend of the adding machine:

public static void main (String... arg) throws Exception {
  scanner = new Scanner(System.in);
  tree = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
  tree.appendChild(sum());
  System.out.println(new Adder().document(tree));
}

Edit and run the adding machine.

Summary

It would have been much easier to implement the adding machine as a simple loop over the input characters or perhaps numbers and operator symbols which the adding machine is expected to process. The separation of concerns into layers for partitioning input, recognizing, representing phrases as a tree, and finally visitors for further processing such as evaluation was intended as a blueprint for much more complicated compiler programs.

A significant amount of code in a compiler is usually generated with tools which require some theory, e.g., about grammars. The adding machine example tried to show where and why tools will avoid tedious and error-prone manual coding.

Finally, compilers are probably confronted with more incorrect than correct programs and they (or the tools) have to expend a significant effort on dealing with user errors. The examples above have largely ignored user errors and exception handling in an attempt to make the overall structure of a compilation process more apparent.



(c) 2008 Axel T. Schreiner