2
$\begingroup$

Problem: Given an arbitrary, finite alphabet $\Sigma$ with $|\Sigma| > 1$, define the language

$\qquad L = \{w \in \Sigma^* \mid w \text{ has no subword of the form } aaa, a \in \Sigma\}$.

Let $L_n = L \cap \Sigma^n$. Determine $|L_n|$; either a closed form or efficient computation scheme is desired.

Example:

Let $\Sigma = \{a, b\}$. Then, $aabaa, ababa, baaba \in L_5$ but $aaabb, baaab,aabbb \notin L_5$.

Note:

This isn't homework or anything... I am a developer at Oklahoma University and this question came up in a discussion with some fellow developer friends of mine. I have some C# code that uses recursion but I believe it to be far from efficient. I can post it if anybody is interested in looking at it as well.

  • 0
    Can (e.g.) 'A' appear in C twice? If so, how does that affect constraints 2 and 3?2011-12-08
  • 0
    Are 'string' and 'string permutation' the same here?2011-12-08
  • 0
    @msh210: "Twice in a row" is idiomatic. It means "two times in succession". There's no actual row.2011-12-08
  • 0
    Define 'row", and no characters in C[] are unique... editing post to reflect that.2011-12-08
  • 0
    Original posted edited for clarification.2011-12-08
  • 0
    @Brandon:I guess you are using some sort of backtracking or dynamic programming to brute-force the results?2011-12-08
  • 0
    @MaX I am not sure what you mean. I just wrote a simple recursive method in C# that I am sure is less than efficient.2011-12-08
  • 0
    You could use dynamic programming with memorization that would increase your efficiency :)2011-12-08
  • 0
    @MaX I say it is less than efficient because I calculate ALL possible permutations without constraint using the characters in C[] and than using String methods filter out the ones I do not want.2011-12-08
  • 0
    @MaX Not sure what you mean by "memorization" and "dynamic programming". New concepts on me! It is amazing how I still learn new stuff every day even after being a programmer for 10 years.2011-12-08
  • 0
    @Brandon:It's elementary algorithm analysis :) [Here](http://www.topcoder.com/tc?d1=tutorials&d2=dynProg&module=Static) is a nice application based tutorial. In short if you have the recurrence relation you can always use dynamic programming to make it fast! :)2011-12-08
  • 0
    @Brandon:Sorry for my initial spelling mistake I meant [Memoization](http://en.wikipedia.org/wiki/Memoization)2011-12-08
  • 0
    So, to put it in terms perhaps more familiar to math people, you have a set $C$ of $n$ letters and want to know how many words of length $L\ge n$ there are with no $3$ successive letters identical.2011-12-08
  • 0
    ...oh. And what those words are.2011-12-08

3 Answers 3

2

Assume there are $\ell+1\geqslant2$ letters. Call $A_n$ the set of words of length $n$ terminating with twice the same letter, $B_n=L_n\setminus A_n$, $a_n=|A_n|$ and $b_n=|B_n|$, hence $|L_n|=a_n+b_n$.

Each word in $A_{n+1}$ corresponds to a unique word in $B_n$ to which its last letter is appended, hence $a_{n+1}=b_{n}$. Each word in $B_{n+1}$ corresponds to a unique word in $L_n$ to which a letter other than its last letter is appended, hence $b_{n+1}=\ell(a_n+b_{n})$.

Thus, $b_{n+2}=\ell b_{n+1}+\ell b_n$ for every $n\geqslant0$, with $b_0=0$ and $b_1=\ell+1$. The rest is routine. Using $k=\sqrt{\ell^2+4\ell}$, one gets for every $n\geqslant1$, $$ |L_n|=\frac{\ell+1}{k}\left(\left(\frac{\ell+k}2\right)^n-\left(\frac{\ell-k}2\right)^n+\left(\frac{\ell+k}2\right)^{n-1}-\left(\frac{\ell-k}2\right)^{n-1}\right). $$ One can check that this is indeed an integer by expanding each of the four binomial terms and noting that the odd powers of $k$ disappear, leaving only the even powers of $k$, which are in fact polynomials in $k^2=\ell^4+4\ell$.

The fastest way to compute $|L_n|$ might be to note that $|L_n|=\ell_n\pm\vartheta^{n-1}(1-\vartheta)$ with $$ \vartheta=\dfrac{k-\ell}2=\dfrac{2\ell}{\ell+k},\qquad \ell_n=\frac{(\ell+1)(\ell+k+2)}{k(\ell+k)}\left(\frac{\ell+k}2\right)^{n}. $$ Since $\vartheta\in(0,1)$, for every $n\geqslant2$, the error term is at most $\pm\frac14$. One knows that $|L_n|$ is an integer, hence $|L_n|$ is the unique integer in the interval of length $\frac12$ centered at $\ell_n$.

  • 0
    Unfortunately, computing this result might be more expensive than evaluating the recursion with memoisation. Depends on the implementation of $.^n$; I guess most compilers do it naively.2012-01-04
  • 0
    Not to mention those huge floating-point numbers!2012-01-04
  • 0
    Oh, you won't even be able to use integer arithmetics (thanks to $k$). That's bad (for the OP). Note that I appreciate the solution in itself; it is just no good for using it in a program.2012-01-04
  • 0
    @TonyK: huge numbers will occur using any algorithm as the result grows exponentially.2012-01-04
  • 0
    Sorry for the nitpicking, but $L_0=\{\epsilon\}$ thus $|L_0|=1$ however the formular above gives $|L_0|=2$.2012-01-07
  • 1
    @uli You might wish to read again: there is no "formula above" for $|L_0|$.2012-01-08
1

Let $a_n$ the number of words in $L_n$ that end in $ba$ for $a,b \in \Sigma, a\neq b$ and $b_n$ similarly the number of words in $L_n$ that end in $baa$. Obviously, $|L_n| = a_n + b_n$.

The following recursions hold (with $s = |\Sigma|$):

$\begin{align*}\qquad a_1 &= s \\ \qquad a_n &= (a_{n-1} + b_{n-1})\cdot(s-1) \qquad \text{ for } n > 1\end{align*}$

and

$\begin{align*}\qquad b_1 &= 0 \\ \qquad b_2 &= s \\ \qquad b_n &= (a_{n-2} + b_{n-2})\cdot(s-1) \qquad \text{ for } n > 2\end{align*}$

If you have trouble seeing that, imagine taking valid strings of length $n-1$ and $n-2$ and appending symbols up to length $n$ while maintaining validity. Now, with $c_n = a_n + b_n$ we get

$\begin{align*}\qquad c_1 &= s \\ c_2 &= s^2 \\ c_n &= (c_{n-2} + c_{n-1}) \cdot (s-1) \qquad \text{ for } n>2\end{align*}$

With the ansatz $C(z) = \sum_{n>0}c_nz^n$ you get that

$\displaystyle\qquad C(z) = \frac{sz(z + 1)}{(1-s)z^2 + (1-s)z + 1}$

Now $|L_n| = c_n = [z^n]C(z)$; unfortunately, the latter expression seems to be hard to find and/or ugly.

Fortunately, recursions of this kind can always be computed by using memoisation in time $\mathcal{O}(n)$ and space $\mathcal{O}(1)$:

calc(n : int, s : int ) {
  c = new int[] {s, s*s}

  if ( n <= 0 ) return 0
  else if ( n == 1 ) return c[0]
  else {
    for ( i=3; i<=n; i++ ) {
      ci = (c[0] + c[1]) * (s-1)
      c[0] = c[1]; c[1] = ci
    }
    return c[1]
  }
}

Note that this scheme also works if you can not find a nice recursion for $c_n$; just compute $a_n$ and $b_n$ simultaneously.

If you expect many calls to calc with different parameters, it might be worthwhile to store all values ever computed. Depending on your queries, this might result in amortised constant evaluation time for the price of as much memory as your largest query.

  • 1
    With $\Sigma=\{a,b\}$ we find $L_0=\{\epsilon\}$, $L_1=\{a,b\}$, $L_2=\{aa,ab,ba,bb\}$, $L_3=\{aab,aba,abb,baa,bab,bba\}$, thus $|L_0|=1$, $|L_1|=2$, $|L_2|=4$, $|L_3|=6$ however $c_3$ as defined above is $2c_1+c_2=2\cdot2+4=8$ thus the recursion is wrong.2012-01-07
  • 0
    Oh, thanks. The recursion for $b_n$ counts words multiple times, e.g. $abb$ is counted as both $ab \cdot b$ and $a \cdot bb$. Easy to fix, in fact.2012-01-08
  • 0
    Hope it's correct now. The recursion for $c_n$ seems obvious now, even without the detour via $a_n, b_n$; I leave those in the answer for illustration, as this is how I proceeded to find the solution.2012-01-08
0

A string in $L_n$ contains $r$ doublets and $n-2r$ singlets, for some $r$ with $0 \le 2r \le n$. Let $S = |\Sigma|$. Then such a string has $S$ choices for its first character, and $S-1$ choices for each of the subsequent $n-r-1$ changes of character. Also, the number of possible positions of the doublets is $\binom{n-r}{r}$.

So the number of strings containing $r$ doublets is $\binom{n-r}{r}S(S-1)^{n-r-1}$. To get the total number of strings in $L_n$, just sum this over all valid $r$.

Edited to add: As uli points out in the comments, $n=0$ is a special case: $|L_0|$ is equal to $1$, not $S/(S-1)$.

  • 0
    Unfortunately, computing this result is more expensive than evaluating the recursion with memoisation.2012-01-04
  • 0
    @Raphael: You can evaluate the sum progressively, updating $\binom{n-r}{r}$ and $S(S-1)^{n-r-1}$ as you go. So I don't think there will be much difference between the two methods.2012-01-04
  • 0
    Right, did not think of that. You'd have to code this explicitly, though, resulting in far messier code (I guess).2012-01-04
  • 0
    @Raphael: I agree that your method is simpler.2012-01-04
  • 1
    Sorry for the nitpicking, but $L_0=\{\epsilon\}$ thus $|L_0|=1$ however with $0\le 2r\le 0$ and $S=2$ we find $\binom{0-0}{0}2=2$.2012-01-08
  • 0
    @uli: You are absolutely right! I have edited my answer accordingly.2012-01-08