5.2 Representation

The Recommendation is based on Unicode (more precisely, ISO/IEC 10646) and defines a legal character as follows:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
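As an illustration, production [2] can be transcribed almost literally into a range check. The following is a minimal sketch in Python; the function name is_xml_char is an assumption made for this example and is not part of the Recommendation.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if the single character ch matches production [2] Char."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, LINEFEED, RETURN
        or 0x20 <= cp <= 0xD7FF        # everything below the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # skips the surrogates, #xFFFE and #xFFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )
```

Most control characters below #x20, the surrogate block, and #xFFFE and #xFFFF fall outside these ranges and are therefore rejected.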

A document can be stored in a different character encoding, which is declared in the prolog of the document. Upper and lower case are distinguished.

Markup

A document entity

[1] document ::= prolog element Misc*

consists of markup, that is, start tags, end tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations and processing instructions, together with character data.

References act as a kind of preprocessor and are used, among other things, to represent special characters. They start with &; the rest of the markup starts with <, and a tag must not contain another <.

White space

White space consists of spaces, tabs, RETURN and LINEFEED:

[3] S ::= (#x20 | #x9 | #xD | #xA)+

Within markup, white space is consumed by the parser; outside of markup it is part of the content. In principle a parser has to pass all content on to the application. Only a validating parser has to report whether white space is significant or not; the intended handling is supposed to be indicated with the attribute xml:space. It remains unclear whether an empty element contains white space or not.

Line separators (a RETURN LINEFEED pair as well as a lone RETURN) are normalized to a single LINEFEED during recognition.
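Production [3] and the line-end normalization can be sketched in a few lines of Python. This is only an illustration under the rules stated above; the function names are hypothetical.

```python
import re

def is_white_space(text: str) -> bool:
    """Return True if text matches production [3] S (one or more of the four characters)."""
    return re.fullmatch(r"[\x20\x09\x0D\x0A]+", text) is not None

def normalize_line_ends(text: str) -> str:
    """Replace every RETURN LINEFEED pair and every lone RETURN with one LINEFEED."""
    return re.sub(r"\r\n?", "\n", text)
```

Note that S is narrower than what many languages call white space: a no-break space (#xA0) or a vertical tab, for example, does not qualify.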

Names

A Name (identifier) is defined as follows:

[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*
[6] Names ::= Name (S Name)*

Letter, Digit, CombiningChar and Extender are certain ranges in Unicode; they include at least the relevant parts of ASCII. The sequence XML, in any combination of upper and lower case, is reserved at the beginning of names. The colon is used for namespaces; however, these are not defined as part of XML.
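For names restricted to ASCII, productions [4] and [5] reduce to a short regular expression. The sketch below deliberately ignores the non-ASCII parts of Letter, Digit, CombiningChar and Extender, so it accepts only a subset of the legal names; the identifiers NAME_RE, is_ascii_name and is_reserved_name are assumptions made for this example.

```python
import re

# ASCII-only approximation of [5] Name: a letter, '_' or ':' followed by NameChars.
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")

def is_ascii_name(text: str) -> bool:
    """Return True if text is a Name that uses ASCII characters only."""
    return NAME_RE.fullmatch(text) is not None

def is_reserved_name(name: str) -> bool:
    """Return True if name starts with 'xml' in any combination of case."""
    return name[:3].lower() == "xml"
```

For example, xsl:template and _a.b-c pass the check, whereas 1abc and -a fail because a digit or a hyphen may not start a name.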

Strings

Markup itself only contains white space and NameChar as well as literal data, i.e., arbitrary strings in single or double quotes:

[10] AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"
[67] Reference ::= EntityRef | CharRef
[68] EntityRef ::= '&' Name ';'
[66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'

These strings cannot contain <, & or the respective quote, but they can use references instead:

character reference   entity reference   character
&#x22;                &quot;             double quote
&#x26;                &amp;              &
&#x27;                &apos;             single quote
&#x3c;                &lt;               <
&#x3e;                &gt;               >
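To make the role of references concrete, the following sketch expands character references and the five entity references listed above in a single pass. It is not a complete implementation: names are matched with an ASCII-only approximation of Name, other entities are left untouched, and the identifiers PREDEFINED, REFERENCE_RE and expand_references are assumptions made for this example.

```python
import re

# The five entity references from the table above.
PREDEFINED = {"quot": '"', "amp": "&", "apos": "'", "lt": "<", "gt": ">"}

# [67] Reference: hexadecimal CharRef, decimal CharRef, or EntityRef (ASCII names only).
REFERENCE_RE = re.compile(
    r"&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z_:][A-Za-z0-9._:\-]*));"
)

def expand_references(text: str) -> str:
    """Replace character references and the five known entity references."""
    def replace(match):
        hex_digits, dec_digits, name = match.groups()
        if hex_digits is not None:
            code_point = int(hex_digits, 16)
        elif dec_digits is not None:
            code_point = int(dec_digits)
        else:
            # Unknown entities are left as they are in this sketch.
            return PREDEFINED.get(name, match.group(0))
        if code_point > 0x10FFFF:
            return match.group(0)  # outside Unicode, leave untouched
        return chr(code_point)
    return REFERENCE_RE.sub(replace, text)
```

Because the substitution makes only one pass over the input, the result of an expansion is not scanned again: &amp;lt; yields the four characters &lt; and not <.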

Comments

Comments can contain arbitrary characters except the sequence --. They must end with -->:

[15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

Comments cannot be nested, and they may only appear outside of other markup or at the places within the document type declaration that the grammar permits.
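Production [15] translates directly into a regular expression. The sketch below is an approximation in that it does not additionally restrict the content to Char; the names COMMENT_RE and is_comment are assumptions made for this example.

```python
import re

# Body: any character except '-', or a single '-' followed by a character other than '-'.
COMMENT_RE = re.compile(r"<!--(?:[^-]|-[^-])*-->")

def is_comment(text: str) -> bool:
    """Return True if text as a whole has the shape of production [15]."""
    return COMMENT_RE.fullmatch(text) is not None
```

The pattern also shows why nesting is impossible: an inner <!-- already contains the forbidden sequence --. For the same reason a comment cannot end with --->.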

Processing Instructions

Processing instructions (PIs) are meant for the programs that process XML. They can contain anything except ?>:

[16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
[17] PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

Once again, XML in any spelling may not be used as a target. Longer names that begin with it are reserved, which is how a target such as xml-stylesheet comes to exist.
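Productions [16] and [17] can be sketched as a small recognizer that splits a processing instruction into target and data. The target is matched with an ASCII-only approximation of Name, the data is not checked against Char, and the names PI_RE and parse_pi are assumptions made for this example.

```python
import re

PI_RE = re.compile(
    r"<\?([A-Za-z_:][A-Za-z0-9._:\-]*)"   # PITarget (ASCII names only)
    r"(?:[ \t\r\n]+((?:(?!\?>).)*))?"     # optional S, then data without '?>'
    r"\?>",
    re.DOTALL,
)

def parse_pi(text: str):
    """Return (target, data) for a processing instruction, or None if text is not one."""
    match = PI_RE.fullmatch(text)
    if match is None:
        return None
    target, data = match.group(1), match.group(2) or ""
    if target.lower() == "xml":
        return None  # excluded by production [17]
    return target, data
```

With this sketch, <?xml-stylesheet type="text/xsl" href="a.xsl"?> is accepted because only the exact name xml is excluded, whereas <?xml version="1.0"?> is rejected as a processing instruction.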

CDATA Sections

A CDATA section is used to include arbitrary text and still remain well-formed:

[18] CDSect ::= CDStart CData CDEnd
[19] CDStart ::= '<![CDATA['
[20] CData ::= (Char* - (Char* ']]>' Char*))
[21] CDEnd ::= ']]>'

References are not recognized inside CData, i.e., there is no way to escape the terminating ]]> within a CDATA section. Therefore, CDATA sections cannot be nested.
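A common workaround, which is not part of the productions above, is to split the text at every ]]> and to distribute the pieces over adjacent CDATA sections. The following sketch assumes that the text consists of legal Char only; the function names and the element name r are hypothetical, and Python's xml.etree.ElementTree is used merely to check the result.

```python
import xml.etree.ElementTree as ET

def to_cdata(text: str) -> str:
    """Wrap text in CDATA sections, starting a new section inside every ']]>'."""
    # Each ']]>' is cut after ']]': one section ends there and the next
    # section begins with the remaining '>'.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

def round_trip(text: str) -> str:
    """Embed text in a hypothetical element <r> and read it back with a parser."""
    return ET.fromstring("<r>" + to_cdata(text) + "</r>").text or ""
```

A parser delivers the content of adjacent CDATA sections as one run of character data, so round_trip should return the original text even if it contains ]]>, < or &.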