3
$\begingroup$

Imagine there is a sequence of base-4 numerals. The sequence is random but different numbers have different probabilities of occurring. $a$ and $b$ each have a probability of $40$%, $c$ and $d$ each have a probability of $10$%.

Thus, for any $4$ consecutive numerals, the sub-sequence $X = [abcd]$ has probability $\frac {2}{5}\frac {2}{5}\frac {1}{10}\frac {1}{10} = \frac{1}{625}$ of occurring. Since the probability of $X$ occurring is $\frac{1}{625}$, the next occurrence of it, to my reckoning, should on average begin at the $625$th numeral following the former occurrence. Ie, $[a,b,c,d,n_1,n_2,...n_{624},a,b,c,d]$. Thus, it would seem that the average distance between $X$s is $624$.

I only ask because my genetics module has the answer to a similar question as $625$ rather than $624$. I'm skeptical that that's correct, and if it is then I'm missing something. Maybe they count $[abcdabcd]$ as having 'spacing' of $1$, but that seems semantically wrong to me.

  • 0
    I th$i$nk it's a question of definition. I would define abcdabcd to be a spacing of 4 because the 2nd abcd is 4 numerals to the right of the 1st abcd. I really think we're arguing about vocabulary, not Mathematics.2011-11-24

1 Answers 1

5

Actually, since the probability of $X$ occurring is $1/625$, the next occurrence of it, on average, ends at the 625th numeral following the former occurrence, i.e., $[a,b,c,d,n_1,n_2,\dots,n_{621},a,b,c,d]$. Thus, the average distance between $X$'s is 621.


Update:

Here's a derivation. If you are interested in different methods for solving such problems, I suggest looking at these Math Stack Exchange questions on coin tossing and random bits, and also the references at the end of this post.

To solve your problem, we need to consider four related versions of the game. In any version, we keep adding a randomly selected letter $\{a,b,c,d\}$ to the sequence until we see a new occurrence of the pattern $abcd$. The added letters are represented by the $*$s below. The letters to the left of $\,|\,$ can be used as part of the new pattern, but we only start counting after $\,|\,$.

$\begin{array}{lrl} \mbox{version 1}:& a\,|***\cdots \\ \mbox{version 2}:& a\,b\,|***\cdots \\ \mbox{version 3}:& a\,b\,c\,|***\cdots \\ \mbox{version 4}:& a\,b\,c\,d\,|***\cdots \\ \end{array} $

For example, in version 2 of the game, if the values to the right of $\,|\,$ are $cd\dots$, then the game ends at $2$ steps. Alternatively, if the values to the right of $\,|\,$ are $bdabdcdabcd\dots$, then the game ends at $11$ steps.

Because the pattern $abcd$ has no overlaps, "version 4" is the same as "version 0", i.e., an initial $d$, $cd$, or $bcd$ are all useless, so it's the same as starting from scratch. This is the version you are interested in.

For $1\leq i\leq 4$, let $e_i$ be the expected number of additional steps to see $abcd$ again, using version $i$. Then first-step analysis gives the equations:

$\begin{array}{ll} e_1 &= 1+2e_1/5 + 2e_2/5 + e_4/5 \\ e_2 &= 1+2e_1/5 + e_3/10 +e_4/2 \\ e_3 &= 1+2e_1/5 + e_4/2 \\ e_4 &= 1+2e_1/5 + 3e_4/5 \\ \end{array}$

Let me explain the top row; the others are similar. We are playing version 1 of the game. The $1+$ on the right hand side comes from the first step. If the first letter is $a$ ($p=2/5$), then we are back to version 1. If it is $b$ ($p=2/5$), we have made some progress and are now in version 2. If it is either $c$ or $d$ ($p=1/5$) then we have to start over, and are in version 4.

The linear algebra problem above has solution $[e_1,e_2,e_3,e_4]=[1245/2, 2475/4, 1125/2, 625]$ which gives $e_4=625$.

Note: Having gone through a full derivation, I ought to mention that if your pattern has no overlaps (like $abcd$), then it is not necessary to go through all these calculations. The expected number of trials to get the pattern is simply 1 over the probability of occurrence. If there are overlaps, you have to work a bit harder.


References

  1. Sections 3.6.4, 7.9.1., and example 4.22. of Introduction to Probability Models (10th edition) by Sheldon M. Ross.
  2. Section 8.4 of Concrete Mathematics (2nd edition) by Ronald Graham, Donald Knuth, and Oren Patashnik
  3. Section 1.4 and Chapter 14 of Problems and Snapshots from the World of Probability by Gunnar Blom, Lars Holst, and Dennis Sandell.
  • 0
    Very awesome method. Thank you for taking the time to post it. Now I only wish that I could know the flaw in my reasoning; clearly I was wrong, but I don't see where e$x$actly the error lies.2011-11-27