0
$\begingroup$

The problem is related to nucleotides and genomes, but it's better to simplify it using dice.

EDIT 2: My original question wasn't very clear, so I've reworded it

I know that the probability of getting the 7-digit long sequence "4213433" if I throw a four-sided die 7 times is $\frac{1}{4^7}$

But, what is the expected occurrence of the sequence "4213433" if I throw a four-sided die $4.5 \times 10^6$ times?

EDIT 1: Okay, I think I found my answer. If I throw a four-sided die $4.5 \times 10^6$ times, the sequence 4213433 will occur randomly $\frac{4.5 \times 10^6}{4^7}$ or 274.658 times. Is that correct? EDIT 3:Almost. Like @joriki said in the comments, the exact answer would be $\frac{4.5 \times 10^6 - 6}{4^7}$

  • 0
    @nachocab: Your boldface edit overemphasizes the error in your previous edit. Your only mistake was not to subtract $6$; the six decimal digits you gave are nevertheless correct, as is the idea that the expectation value for the overlapping sequences can be calculated by treating each of the $4.5\times10^6$ chances for an occurrence independently. I suggest to edit the question again, as it currently gives too much of an impression that there's something fundamentally wrong with your approach.2011-07-26

4 Answers 4

0

This does not directly answer your question, but I think it's relevant to note that, if $k$ is the the expected number of successful outcomes in $n$ independent trials, then the probability of getting at least one successful outcome in $n$ trials is approximately $1-\exp(-k)$, independent of $n$, and that this approximation is quite good as long as $n$ is large (say, over 10) and/or $\frac kn$ is small (say, less than 0.1).

Also, if $k$ itself is small (say, less than 0.1), the probability of at least one success is approximately equal to $k$.

3

A combinatorial approach:
If you throw a four sided die $n$ times then you have a total of $4^n$ possible sequences. Now you want to count in how many of those the sequence "$4213433$" exists.
Denote $A_1$ the number of sequences in which the string exists at least once (with multiplicities - if it exists twice, we count it twice). Then $A_1=4^{n-7}(n-7+1)$ (You think of the string as a block, then you write any sequence of length $n-7$ and add the string in any of the $n-7+1$ spaces)
In the same way, denote by $A_j$ the number of sequences in which the string exists at least $j$ times (with multiplicities). Then $A_j=4^{n-7j}\binom{n-7j+1}{j}$.
Clearly $1\leq j\leq\frac{n}{7}$.
Now, the number of strings in which the string appears at least once (without multiplicities) is $\sum_{j=1}^{\left\lfloor \frac{n}{7} \right\rfloor}(-1)^{j+1}A_j=\sum_{j=1}^{\left\lfloor \frac{n}{7} \right\rfloor}(-1)^{j+1}4^{n-7j}\binom{n-7j+1}{j}$ The probability you desire is $\frac{1}{4^n}\sum_{j=1}^{\left\lfloor \frac{n}{7} \right\rfloor}(-1)^{j+1}4^{n-7j}\binom{n-7j+1}{j}=\sum_{j=1}^{\left\lfloor \frac{n}{7} \right\rfloor}\frac{(-1)^{j+1}}{4^{7j}}\binom{n-7j+1}{j}$

  • 0
    @joriki: Thanks for your help. I understand it now.2011-07-26
3

The answer you gave in your edit is not the answer to your original question. It's the (almost but not quite correct) answer to the question "What is the expected value of the number of sequences 4213433 occurring if I throw a four-sided die $4.5\times 10^6$ times?" This is in fact a much easier question because of the linearity of expectation. You don't have to worry about correlations, you can just add up the expectation values for each of the slots at which the sequence might occur. However, there aren't $4.5\times10^6$ of these, only $4.5\times10^6-6$, so the expected number of occurrences is actually $(4.5\times10^6-6)/4^7$, but that's the same up to the decimal places you specified.

Your original question is a bit harder, since to find the probability of at least one such sequence occurring, you need to take into account that the events of the sequence occurring in overlapping slots aren't independent. For instance, the probability of getting the sequence 44 at least once if you roll the die $3$ times is $7/4^3$, whereas the probability of getting the sequence $43$ at least once is $8/4^3$, even though the expected number of occurrences in both cases is $2/4^2$.

2

I've used Mr. Gulko's formula above to work several problems by hand, and I've also compared the results given by his formula to brute force computations done by Python script. I can confirm that his formula returns correct results. However, it requires one minor tweak.

Given an alphabet of size $a$ (e.g., $a=4$ for the alphabet {a,b,c,d}), the probability that a non-overlapping word of length $w$ appears one or more times in a sequence of $n$ randomly chosen letters is given by

$P=\frac{1}{a^n}\sum_{j=1}^{\left\lfloor \frac{n}{w} \right\rfloor}(-1)^{j+1}a^{n-wj}\binom{n-wj+j}{j}$

This solution is basically identical to Mr. Gulko's solution given above, with one exception: I have replaced

$\binom{n-wj+1}{j}$

with

$\binom{n-wj+j}{j}$

This change is necessary because as $j$ increments, we need to compute the number of ways we can place $j$ words in $n-wj+j$ slots, not $j$ words in $n-wj+1$ slots.