The Recommendation refers to Unicode (precisely ISO/IEC 10646):
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
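Production [2] can be checked mechanically. The following is a minimal sketch in Python that simply transcribes the ranges of the production; the helper names `is_xml_char` and `first_illegal_char` are made up for this illustration and are not part of any library.

```python
# Ranges of production [2] as inclusive (low, high) code point pairs.
XML_CHAR_RANGES = [
    (0x9, 0xA),           # tab and LINEFEED
    (0xD, 0xD),           # RETURN
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_xml_char(ch: str) -> bool:
    """Return True if the single character ch matches production [2]."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in XML_CHAR_RANGES)


def first_illegal_char(text: str) -> int:
    """Return the index of the first character outside Char, or -1 if none."""
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index
    return -1
```

Most control characters below #x20, the surrogate block and the code points #xFFFE and #xFFFF fall outside all ranges and are therefore rejected.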
A document can be represented in a different character encoding, which is declared in the prolog of the document. Upper and lower case are distinguished.
[1] document ::= prolog element Misc*
A document consists of markup, that is start tags, end tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations and processing instructions, and finally character data.
References act as a kind of preprocessor and are used, among other things, to represent special characters. They start with &; the rest of the markup starts with < and must not contain a further <.
White space consists of spaces, tabs, RETURN and LINEFEED:
[3] S ::= (#x20 | #x9 | #xD | #xA)+
Within markup, white space is processed; outside of markup, it is significant. In theory, a parser has to pass on all content. Only a validating parser has to signal whether white space is significant or not; the attribute xml:space is meant to indicate this. It remains unclear whether an empty element contains white space or not.
Line ends should be normalized to LINEFEED during parsing.
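White space recognition and line-end normalization are both small enough to sketch. The Python fragment below is an illustration only; the function names are hypothetical. It assumes the usual reading that a RETURN followed by a LINEFEED, as well as a RETURN on its own, is replaced by a single LINEFEED.

```python
import re

# Production [3]: one or more of space, tab, RETURN, LINEFEED.
_WHITE_SPACE = re.compile(r"[ \t\r\n]+")

# RETURN LINEFEED pairs are tried first, then any remaining lone RETURN.
_LINE_END = re.compile(r"\r\n|\r")


def is_white_space(text: str) -> bool:
    """Return True if text as a whole matches production [3]."""
    return _WHITE_SPACE.fullmatch(text) is not None


def normalize_line_ends(text: str) -> str:
    """Replace every RETURN LINEFEED pair and every lone RETURN by LINEFEED."""
    return _LINE_END.sub("\n", text)
```

Normalizing before any other processing means that later stages only ever have to deal with LINEFEED.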
A Name (identifier) is defined as follows:
[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*
[6] Names ::= Name (S Name)*
Letter, Digit, CombiningChar and Extender are certain ranges in Unicode; they include at least the relevant parts of ASCII. The sequence XML in any spelling at the beginning of names is reserved. The colon is used for namespaces; however, these are not defined as part of XML.
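For names restricted to ASCII, productions [4] to [6] reduce to short regular expressions. The sketch below is deliberately an approximation: it treats only A-Z and a-z as Letter and 0-9 as Digit, and it ignores CombiningChar and Extender entirely. All identifiers are invented for this example.

```python
import re

# ASCII-only approximation of productions [4] and [5].
_NAME = r"[A-Za-z_:][A-Za-z0-9._:\-]*"
_S = r"[ \t\r\n]+"

NAME = re.compile(_NAME)
# Production [6]: names separated by white space.
NAMES = re.compile(rf"{_NAME}(?:{_S}{_NAME})*")


def is_ascii_name(text: str) -> bool:
    """Return True if text is a Name under the ASCII-only approximation."""
    return NAME.fullmatch(text) is not None


def is_ascii_names(text: str) -> bool:
    """Return True if text is a white-space separated list of such names."""
    return NAMES.fullmatch(text) is not None


def is_reserved_name(text: str) -> bool:
    """Return True if text begins with XML in any spelling."""
    return text[:3].lower() == "xml"
```

A name such as `xsl:template` is accepted because the colon is an ordinary name character as far as these productions are concerned.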
Markup itself contains only white space and NameChar as well as literal data, i.e., arbitrary strings in single or double quotes:
[10] AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"
[67] Reference ::= EntityRef | CharRef
[68] EntityRef ::= '&' Name ';'
[66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
These strings cannot contain <, & or the respective quote, but they can use references.
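Character references as in production [66] can be expanded with a single substitution. The following Python sketch uses a hypothetical function name and makes two simplifying assumptions: entity references such as `&amp;` are left untouched, and the resulting code point is not checked against Char.

```python
import re

# Production [66]: decimal form in group 1, hexadecimal form in group 2.
CHAR_REF = re.compile(r"&#([0-9]+);|&#x([0-9a-fA-F]+);")


def expand_char_refs(value: str) -> str:
    """Replace every character reference in value by the character it names."""

    def replace(match: re.Match) -> str:
        decimal, hexadecimal = match.groups()
        if decimal is not None:
            code_point = int(decimal)
        else:
            code_point = int(hexadecimal, 16)
        # chr raises ValueError for code points beyond #x10FFFF.
        return chr(code_point)

    return CHAR_REF.sub(replace, value)
```

This is how < and & can still end up in an attribute value: `&#60;` and `&#x26;` are expanded only after the literal has been recognized. Note that the x of the hexadecimal form must be lower case.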
Comments can contain arbitrary characters except the sequence --. They must end with -->:
[15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
Comments cannot be nested, and they cannot appear everywhere; in particular, they cannot appear inside other markup.
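Production [15] translates almost literally into a regular expression. The Python sketch below uses a made-up function name and does not additionally check that every character is a legal Char.

```python
import re

# Body of a comment: either a character other than '-', or a single '-'
# followed by a character other than '-'. This rules out "--" in the body
# and also a body that ends in '-'.
COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_comment(text: str) -> bool:
    """Return True if text as a whole matches production [15]."""
    return COMMENT.fullmatch(text) is not None
```

Because -- is excluded from the body, an attempt to nest comments fails at the inner `<!--` already.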
Processing instructions (PIs) are meant for programs processing XML. They can contain everything except ?>:
[16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
[17] PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
Once again, XML in any spelling may not be used as a target; names that merely begin with it, however, are in use, as in xml-stylesheet.
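Productions [16] and [17] can be sketched as a small recognizer. The Python fragment below is an illustration under stated assumptions: the target is matched with an ASCII-only approximation of Name, the data is not checked against Char, and the function name `parse_pi` is hypothetical.

```python
import re
from typing import Optional, Tuple

# ASCII-only approximation of Name (production [5]).
_NAME = r"[A-Za-z_:][A-Za-z0-9._:\-]*"

# The data part may contain any character as long as "?>" does not start there.
PI = re.compile(rf"<\?({_NAME})(?:[ \t\r\n]+((?:(?!\?>).)*))?\?>", re.DOTALL)


def parse_pi(text: str) -> Optional[Tuple[str, str]]:
    """Return (target, data) if text is a processing instruction, else None."""
    match = PI.fullmatch(text)
    if match is None:
        return None
    target = match.group(1)
    data = match.group(2) or ""
    # Production [17]: the target must not be exactly XML in any spelling.
    if target.lower() == "xml":
        return None
    return target, data
```

Only a target that is exactly XML is rejected here, so `<?xml-stylesheet type="text/xsl" href="style.xsl"?>` is recognized with the target `xml-stylesheet`.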
A CDATA section is used to include arbitrary text and still remain well-formed:
[18] CDSect ::= CDStart CData CDEnd
[19] CDStart ::= '<![CDATA['
[20] CData ::= (Char* - (Char* ']]>' Char*))
[21] CDEnd ::= ']]>'
References are not recognized inside CData, i.e., one cannot include the terminating ]]> in a CDATA section. Therefore, CDATA sections cannot be nested.
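A common workaround, not described in the text above, is to split the text into two consecutive CDATA sections wherever ]]> occurs, so that neither section contains the end delimiter. The Python sketch below illustrates this; the function names are invented, and the round trip relies on the standard `xml.etree.ElementTree` parser joining the adjacent sections into one text node.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA sections, splitting each "]]>" between "]]" and ">"."""
    # "]]>" becomes "]]" + end of section + start of section + ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def round_trip(text: str) -> str:
    """Embed text in an element via to_cdata and read it back with a parser."""
    root = ET.fromstring("<r>" + to_cdata(text) + "</r>")
    return root.text or ""
```

The sections stand side by side rather than inside one another, so this does not contradict the statement that CDATA sections cannot be nested.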