0
$\begingroup$

Quoted from Introduction to the Theory of Computation by Sipser, a regular expression is defined as:

Say that R is a regular expression if R is

  1. a for some a in the alphabet $\Sigma$,
  2. $\epsilon$,
  3. $\emptyset$,
  4. $(R_1 \cup R_2 )$, where $R_l$ and $R_2$ are regular expressions,
  5. $(R_1 \cdot R_2 )$, where $R_l$ and $R_2$ are regular expressions, or
  6. $(R_1^*)$, where $R_1$ is a regular expression.

In items 1 and 2, the regular expressions a and e represent the languages {a} and {$\epsilon$}, respectively. In item 3, the regular expression $\emptyset$ represents the empty language. In items 4, 5, and 6, the expressions represent the languages obtained by taking the union or concatenation of the languages $R_l$ and $R_2$ , or the star of the language $R_1$, respectively.

Reading the definition and Wikipedia's version, I was wondering if a regular expression is a string/word, or a set of strings, i.e., a formal language? Thanks!

  • 0
    @Tim: al$r$ight :) Good to see you back he$r$e!2011-07-11

3 Answers 3

2

Both, in a sense. You have to separate syntax and semantics.

The syntax of regular expressions is what is defined by 1. - 6. in your definition. It clearly defines a (non-regular) set of strings $Reg$ in an inductive way.

The semantics of regular expressions is what is defined in the blob of text; you can also write the same thing down more formally by defining an interpretation $i : Reg \to 2^{\Sigma^*}$ of regular expressions to languages:

  1. $i(a) := \{a\}$ for any $a \in \Sigma$,
  2. $i(\epsilon) := \{\epsilon\}$,
  3. $i(\emptyset) := \{\}$,
  4. $i((R_1 \cup R_2)) := i(R_1) \cup i(R_2)$, where $R_1, R_2 \in Reg$,
  5. $i((R_1 \cdot R_2)) := i(R_1) \cdot i(R_2)$, where $R_1, R_2 \in Reg$,
  6. $i((R^∗_1)) := (i(R_1))^*$ where $R_1 \in Reg$.

The difference is more apparent if you use other typical syntaces, e.g. with $+$ or $|$ instead of $\cup$; the right hand side of $i$ remains the same while the left changes. We still have overloading which makes the difference hard to see; in particular, I use $(,)$ in two different functions in 1. to 6.: as symbols in regular expressions and as usual part of function application.

So, formally, if you want to talk about the language generated by a regular expression $r$, you have to write $i(r)$. In practice, though, people often drop the $i$ and identify a regular expression and its interpretation if unambiguous.

Note that this separation and pairing of syntax and semantics is a general scheme for finite representations of infinite objects: bit strings and numbers, formal grammars and languages, automata and languages, program code and functions, ... Therefore, I think it is worthwhile to look at it in a clear and formally correct way at least once. The distinction becomes really, really important when you deal with (mathematical) logics, because you suddenly reason about reasoning.

2

The regular expression itself can be considered a string. However we mainly use it as a shorthand expression that generates a set of strings.

For example $a^{*}b^{*}$ generates $(\epsilon, a, b, ab, aabb,...)$.

2

As the definition says, a regular expression represents a language. You can regard it as a string if you like; not a string in the language being represented, but a string in the language of regular expressions for that language. It's not different in this respect from other expressions; for instance, you can regard the expression "$3+5$" as a string if you like, as long as you're aware that it represents a number and not a string. The possible confusion in the case of regular expressions arises because the regular expression may itself contain letters of the alphabet of the language being represented, but note that it also contains other symbols ('$\cup$', '$*$') which are not taken from that alphabet -- so if you want to regard the regular expression itself as a string in some language, then the alphabet of that language is a superset of the alphabet of the language being represented.