Department of Computer Science
Rochester Institute of Technology
http://www.cs.rit.edu/~ats/talks/xml/ [ pdf ] [ code ] [ code.zip ]
XML is hailed to replace ASCII as a human-readable way to represent structured data; however, the design placed only a low priority on the need to generate XML documents by hand. Still, XSLT and other programming languages use XML syntax and tend to be cumbersome for humans to read and write.
This talk describes prototypes for an imperative and a declarative language to manipulate XML documents and tries to show that less can be more. Neither of these little languages uses XML syntax and that seems to be quite an improvement...
|
Terminology — XML — Extensible Markup Language, SAX, DOM, XSLT |
XML [ W3C 1998 ] is a simplified version of SGML [ ISO 1986 ] and inherits it's confused syntax.
"The design goals for XML are:
|
It shall be easy to write programs which process XML documents. |
|
|
The number of optional features in XML is to be kept to the absolute minimum, ideally zero. |
|
[From the Recommendation .]
The following XML-based document could be the result of a database query:
<?xml version='1.0' encoding='ISO-8859-1'?> <!DOCTYPE people [ <!ELEMENT people (person*)> <!ELEMENT person (first,last)> <!ATTLIST person pid ID #REQUIRED boss IDREF #IMPLIED sex (male|female) #REQUIRED > <!ELEMENT first (#PCDATA)> <!ELEMENT last (#PCDATA)> ]> <people> <person pid="i4711" boss="i4712" sex="female" > <first>Jane</first> <last>Doe</last> </person> <person pid="i4712" sex="male"> <first>John</first> <last>Doe</last> </person> </people>
SAX [ Megginson 1998 ] essentially is an observer interface for an XML parser, i.e., as the parser recognizes XML elements etc. it calls certain methods in certain handlers.
yacc
calls actions at the phrase level.
SAX calls the handler at the lexical level.
XML parsers supporting SAX are readily available; it is thus trivial to write a Java program to sequentially process an XML-based document.
DOM [ W3C 1998 ] is an IDL -based system of interfaces for objects to represent and manipulate XML-based documents.
Given a Java implementation of DOM and a SAX parser it is trivial to write a Java program to represent an XML-based document as a tree of generic objects.
XML parsers supporting DOM are readily available.
DOM permits tree traversal and queries of an XML-based document.
package doc; import java.io.File; import java.net.URL; import javax.xml.parsers.DocumentBuilderFactory; import org.w3c.dom.Document; import org.w3c.dom.Element; import org.w3c.dom.NodeList; public class Boss { /** finds all people working for each boss. */ public static void main (String args []) throws Exception { URL cwd = new File(System.getProperty("user.dir")).toURL(); Document doc = DocumentBuilderFactory.newInstance() .newDocumentBuilder() .parse(System.in, cwd.toString()); NodeList person = doc.getDocumentElement() .getElementsByTagName("person"); // recursive... // find people without 'boss'... for (int c = 0; c < person.getLength(); ++ c) { Element boss = (Element)person.item(c); if (boss.getAttributeNode("boss") == null) { System.out.print(first(boss)+":"); String bossPid = boss.getAttribute("pid"); // find people matching 'boss' to bossPid... for (int s = 0; s < person.getLength(); ++ s) { Element sub = (Element)person.item(s); if (bossPid.equals(sub.getAttribute("boss"))) System.out.print(" "+first(sub)); } System.out.println(); } } } /** encapsulates access to text of 'first'. */ public static String first (Element person) { person = (Element)person.cloneNode(true); // avoid side effect person.normalize(); return person.getElementsByTagName("first") .item(0).getFirstChild().getNodeValue(); } }
XSL [ W3C 2001 ] uses XSLT and Formatting Objects to describe high-quality rendering of XML-based documents.
XSLT [ W3C 1999 ] is an XML-based functional language to describe tree transformations.
XSLT looks like macro processor code: output material interspersed with instructions.
<?xml version="1.0"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <!-- convert tree to HTML table --> <xsl:template match="/"> <html><body><table> <xsl:for-each select="/people/person"> <tr> <td><xsl:value-of select="substring(@pid, 2)"/></td> <td> <xsl:if test="@sex='female'">Mrs.</xsl:if> <xsl:if test="@sex='male'">Mr.</xsl:if> </td> <td><xsl:value-of select="first"/></td> <td><xsl:value-of select="last"/></td> <td><xsl:value-of select="substring(@boss, 2)"/></td> <td> <xsl:call-template name="first"> <xsl:with-param name="boss" select="@boss"/> </xsl:call-template> </td> </tr> </xsl:for-each> </table></body></html> </xsl:template> <!-- return first name of subordinate --> <xsl:template name="first"> <xsl:param name="boss"/> <xsl:value-of select="/people/person[@pid=$boss]/first"/> </xsl:template> </xsl:stylesheet>
Don't create a language with a new syntax and new features [
Hoare 1974
].
XSLT combines pattern matching and macro processing wrapped into XML syntax.
XSLT is essentially a successor to DSSSL for SGML which had a LISP-like syntax.
SL is XSLT without the X. It uses a prefix notation with braces for lists with more than one element. Dependencies can be shown by indentation but whitespace is irrelevant.
// convert tree to HTML table main <html> <body> <table> for "/people/person" <tr> { <td> value-of "substring(@pid, 2)" <td> if "@sex='female'" "Mrs." elif "@sex='male'" "Mr." fi <td> value-of "first" <td> value-of "last" <td> value-of "substring(@boss, 2)" <td> first(boss = "@boss") } // return first name of subordinate proc first ( boss ) value-of "/people/person[@pid=$boss]/first"
SL is simply a frontend for the algorithms underlying XSLT.
oops
creates and serializes a Java parser from an EBNF grammar for SL.
The parser creates a DOM tree for an SL program using a DOM implementation like
Xerces
.
The DOM tree is run as a transformer using
JAXP
and an XSLT implementation like
Saxon
.
pairs = { // a global 'variable' containing nodes <pair> { <x> 36 <y> 54 } <pair> { <x> 54 <y> 36 } } main { message { "pairs: " copy-of "$pairs" } for "$pairs/pair" { result = gcd(x = "x", y = "y") <gcd value="{$result}"/> } } proc gcd ( x, y ) if "$x = $y" value-of "$x" elif "$x < $y" gcd(x = "$x", y = "$y - $x") else gcd(x = "$x - $y", y = "$y") fi
proc "add" () { // pattern for <add>: binary operator node left = apply-to "*[1]" right = apply-to "*[2]" typed(arg = "$left + $right") } proc "minus" () { // pattern for <minus>: unary operator node right = apply-to "" typed(arg = "- $right") } proc "literal" () // pattern for <literal value="n"> value-of "@value" proc typed ( arg ) { // typed evaluation, looks for <?eval.sl int?> pi = value-of "processing-instruction('eval.sl')" if "$pi = 'int'" if "$arg >= 0" value-of "floor($arg)" else value-of "ceiling($arg)" fi else value-of "$arg" fi }
awk combines a processing model (records/lines with fields), patterns, hashtables, and some conventional programming language concepts into a powerful report generator for flat data.
print name(list = input("people.xml")) ":\n"; // get document and print root name for (boss in list) // show boss and subordinates if (! (attributes(boss) contains boss)) { // a boss has no reference to boss print text(boss:first) ":"; for (person in list) if (person.boss == boss.pid) // a subordinate references the boss print " " text(person:first); print "\n"; }
Syntax and control structures are similar to C and
awk
.
Usual arithmetic expressions , string concatenation, and embedded assignment.
Dynamically typed variables can contain numbers, strings, hashtables, and nodes.
Period references node attributes (like structure components).
Colon references contained nodes (like Mac path components).
A XOML program is an object tree which operates on a stack of Java objects, among them numbers, strings, hashtables, and nodes.
oops
creates a Java parser from an EBNF grammar for XOML.
The parser creates an executable object tree for a XOML program.
input
uses a SAX parser to map an XML-based document to nodes consisting of an
ArrayList
referencing nested nodes and
String
texts and a
HashMap
for attributes.
for (person in input("people.xml")) // count sexes in hashtable count[person.sex] = count[person.sex] + 1; for (sex in count) // loop over hashtable print count[sex] " " sex "\n";
function traverse ( indent, tree ) { print indent name(tree); // show name for (attr in attributes(tree)) // show attributes if any print " " attr "='" tree.(attr) "'"; t = trim(tree); // show text if any if (t != "") print ": " t; if (name(tree) == "person") // signal bosses if (tree.boss == "") print ": boss"; print "\n"; for (node in tree) // show content traverse(indent " ", node); } traverse("", input("people.xml"));
list = input("people.xml"); do { // assume each element is a person for (a in attributes(list:person)) // remember each attribute name... attr[a] = 1; // ...in an associative array for (a in attr) // delete each attribute delete list:person.(a); print list:person "\n"; delete list:person; // print and delete each person } while (elements(list) > 0);
xhtml = new("html"); // new root node xhtml.id = "value"; // new attribute if (xhtml contains id) print "subnode?\n"; // no subnode by tag name if (xhtml contains 0) print "index 0?\n"; // no subnode by position if (attributes(xhtml) contains id) xhtml.id = "mmr"; // replace attribute xhtml:0 = new("body"); // new subnodes into positions p = xhtml:body:0 = new("p"); p:0 = "Happy "; // add text p:1 = new("b"); // add subnode p:b:0 = "Birthday"; // add text p:2 = "!"; // add text p:2 = ", " xhtml.id p:2; // replace with attribute text print xhtml "\n"; // <html id="mmr"><body><p>Happy <b>Birthday</b>, mmr!</p></body></html>
SL is just syntactic sugar; it does not improve on XSLT's inefficiencies or learning curve.
SL tends to unearth the algorithms behind XML's white noise imposed upon XSLT.
SL could be improved if XPath expressions were integrated into the syntax and not just included as strings.
XOML is a conventional programming paradigm, hopefully with a flat learning curve.
Extending XOML is the subject of a master's thesis at Osnabrück.
Adding regular expressions was trivial.
Adding XPath expressions (tree pattern matching) is more difficult; it requires DOM to represent input.