Regular Expressions

abc non-whitespace characters denote themselves unless...
"abc" characters in double quotes denote themselves.
\" denotes double quote only within double quotes.
\n \t \b \f \r denote newline, tab, backspace, formfeed, and carriage return.
\ooo \xhh \uhhhh denote characters as octal, hexadecimal, and Unicode values. pj does not use JLex to support Unicode.
\^C denotes the corresponding control character.
\x denotes any other character x.
. matches any character but newline.
[abd-x...] matches any character from the class; minus denotes a range unless it appears as first character; backslash escapes can be used.
[^abd-x...] matches any character not from the class.
$ matches the end of a line.
^ matches the beginning of a line, but a scanner generated by JLex discards the preceding newline.
x* matches zero or more occurrences of x.
x+ matches one or more occurrences of x.
x? matches zero or one occurrence of x.
xy concatenation, matches x and then y.
x|y alternative, matches x or y, has lower precedence than concatenation.
(...) grouping, has highest precedence.
{name} references a macro.

As a rule of thumb, all literal information should be enclosed in double quotes and macros should be used liberally to build expressions from simple blocks. Regular expressions in macro definitions should normally be enclosed in parentheses to avoid precedence surprises.
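
The precedence surprise mentioned above can be sketched in plain Java. This is only an illustration: it assumes that a macro reference is replaced textually by its definition, it uses java.util.regex rather than JLex, and the macro name sgn and the helper expand are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class MacroDemo {
    // Hypothetical helper: replaces every {name} by its definition, textually.
    static String expand(String regex, Map<String, String> macros) {
        String result = regex;
        for (Map.Entry<String, String> e : macros.entrySet()) {
            result = result.replace("{" + e.getKey() + "}", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> bare = new LinkedHashMap<>();
        bare.put("sgn", "-|\\+");        // definition without parentheses
        Map<String, String> wrapped = new LinkedHashMap<>();
        wrapped.put("sgn", "(-|\\+)");   // definition with parentheses

        String rule = "{sgn}[0-9]+";     // intended: a sign followed by digits

        String loose = expand(rule, bare);     // -|\+[0-9]+  means  - or (+ digits)
        String tight = expand(rule, wrapped);  // (-|\+)[0-9]+

        System.out.println(loose + " matches -12: " + Pattern.matches(loose, "-12"));
        System.out.println(tight + " matches -12: " + Pattern.matches(tight, "-12"));
    }
}
```

Without the parentheses the alternative swallows the digits that follow the macro reference, so only the parenthesized definition accepts -12.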

Typical Regular Expressions

alpha = [a-zA-Z]
alnum = [a-zA-Z_0-9]
oct =   [0-7]
dec =   [0-9]
hex =   [0-9a-fA-F]
sign =  [-+]?
exp =   ([eE]{sign}{dec}+)
L =     [lL]
X =     [xX]
"/*"([^*]|"*"+[^/*])*"*"+"/"
"//".*$
Java comments.
"{"[^}]*"}"
"(*"([^*]|"*"+[^*)])*"*"+")"
Pascal comments.
\"([^\"\\\n]|\\.|\\\n)*\" string (mostly).
'([^'\n]|'')+' Pascal string.
0{oct}+ octal number.
0{oct}+{L} octal long.
0{X}{hex}+ hexadecimal number.
{dec}+ decimal number.
{dec}+"."{dec}*{exp}?|{dec}*"."{dec}+{exp}?|{dec}+{exp} floating point number.
'([^'\\\n]|\\[^0-7\n]|\\[0-7][0-7]?[0-7]?)' character.
{alpha}{alnum}* identifier.
.|\n rest, any one character.
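
A few of the expressions above can be tried out without generating a scanner. The following sketch translates them by hand into java.util.regex syntax, which has no double-quoted literals or macros, so the quoting differs from the JLex originals. The class and field names are made up for this example.

```java
import java.util.regex.Pattern;

class TypicalPatterns {
    // "/*"([^*]|"*"+[^/*])*"*"+"/"
    static final Pattern BLOCK_COMMENT =
        Pattern.compile("/\\*([^*]|\\*+[^/*])*\\*+/");

    // \"([^\"\\\n]|\\.|\\\n)*\"
    static final Pattern STRING =
        Pattern.compile("\"([^\"\\\\\\n]|\\\\.|\\\\\\n)*\"");

    // {alpha}{alnum}* with both macros expanded by hand
    static final Pattern IDENTIFIER =
        Pattern.compile("[a-zA-Z][a-zA-Z_0-9]*");

    // True if the pattern matches the whole input.
    static boolean accepts(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts(BLOCK_COMMENT, "/* a * b **/"));
        // The comment expression cannot run past the first "*/".
        System.out.println(accepts(BLOCK_COMMENT, "/* a */ b */"));
        System.out.println(accepts(STRING, "\"a\\\"b\""));
        System.out.println(accepts(IDENTIFIER, "x_1"));
        System.out.println(accepts(IDENTIFIER, "_x"));
    }
}
```

Note that alpha does not contain the underscore, so an identifier may contain an underscore but not start with one.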

Ambiguities are acceptable; they are resolved in favour of the longest match and, among matches of equal length, the expression that appears first. This makes it much simpler, e.g., to write a regular expression to match floating point numbers.
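
The longest-then-first rule can be emulated in a few lines of Java. This is a minimal sketch, not the algorithm JLex uses (a generated scanner works with a finite automaton): it tries every rule at the current position with java.util.regex and keeps the longest match, preferring the earlier rule on a tie. The rule names and the keyword rule for if are assumptions added for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Illustrative rules; their order only matters when two matches tie.
    static final String[] NAMES = {"keyword", "float", "decimal", "identifier", "rest"};
    static final Pattern[] RULES = {
        Pattern.compile("if"),
        Pattern.compile("[0-9]+\\.[0-9]*([eE][-+]?[0-9]+)?"
            + "|[0-9]*\\.[0-9]+([eE][-+]?[0-9]+)?"
            + "|[0-9]+[eE][-+]?[0-9]+"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("[a-zA-Z][a-zA-Z_0-9]*"),
        Pattern.compile(".", Pattern.DOTALL)   // any one character, like .|\n
    };

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestLen = 0;
            int bestRule = -1;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input).region(pos, input.length());
                // Strictly longer only: on a tie the earlier rule stays selected.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = i;
                }
            }
            // The "rest" rule always matches one character, so bestRule is valid.
            tokens.add(NAMES[bestRule] + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("if"));     // tie with identifier, first rule wins
        System.out.println(scan("iffy"));   // identifier is longer than keyword
        System.out.println(scan("1.5e3"));  // float is longer than decimal
        System.out.println(scan("42"));
    }
}
```

Because the longer match wins, the decimal expression does not have to exclude the prefixes of floating point numbers, and the identifier expression does not have to exclude keywords.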