0
$\begingroup$

The problem is related to nucleotides and genomes, but it's better to simplify it using dice.

EDIT 2: My original question wasn't very clear, so I've reworded it

I know that the probability of getting the 7-digit long sequence "4213433" if I throw a four-sided die 7 times is $\frac{1}{4^7}$

But, what is the expected occurrence of the sequence "4213433" if I throw a four-sided die $4.5 \times 10^6$ times?

EDIT 1: Okay, I think I found my answer. If I throw a four-sided die $4.5 \times 10^6$ times, the sequence 4213433 will occur randomly $\frac{4.5 \times 10^6}{4^7}$ or 274.658 times. Is that correct? EDIT 3:Almost. Like @joriki said in the comments, the exact answer would be $\frac{4.5 \times 10^6 - 6}{4^7}$

  • 0
    This is a basic question in probability, and right now I'm racking my brain to try and remember how to solve it...or even what the name is... >.<2011-07-26
  • 0
    In response to your edit, you're close, but you actually overshoot by a little bit. Probably close enough for your purposes.2011-07-26
  • 0
    It’s not quite that simple. If you throw a four-sided die $7$ times, and repeat this experiment $4.5 \times 10^6$ times, the expected number of times that the $7$-throw experiment results in $4213433$ is $\frac{4.5 \times 10^6}{4^y}$, but one long string of $4.5 \times 10^6$ tosses doesn’t give you $4.5 \times 10^6$ independent opportunities to get $4213433$.2011-07-26
  • 0
    @Brian, I see my mistake now. So, what would be the formal answer?2011-07-26
  • 1
    @Brian: Independence doesn't enter into it; because of linearity of expectation, the expectation value is indeed simply $(4.5\times10^6-6)/4^7$ occurrences in the one long string.2011-07-26
  • 0
    @joriki: Yep; I was worried about (other) sequences that overlap themselves and missed the easy way to see that it doesn’t matter. (I was also commenting on the missing $6$ starting points, but that got obscured by the ‘independent’.)2011-07-26
  • 0
    @nachocab: Your boldface edit overemphasizes the error in your previous edit. Your only mistake was not to subtract $6$; the six decimal digits you gave are nevertheless correct, as is the idea that the expectation value for the overlapping sequences can be calculated by treating each of the $4.5\times10^6$ chances for an occurrence independently. I suggest to edit the question again, as it currently gives too much of an impression that there's something fundamentally wrong with your approach.2011-07-26

4 Answers 4

0

This does not directly answer your question, but I think it's relevant to note that, if $k$ is the the expected number of successful outcomes in $n$ independent trials, then the probability of getting at least one successful outcome in $n$ trials is approximately $1-\exp(-k)$, independent of $n$, and that this approximation is quite good as long as $n$ is large (say, over 10) and/or $\frac kn$ is small (say, less than 0.1).

Also, if $k$ itself is small (say, less than 0.1), the probability of at least one success is approximately equal to $k$.

3

A combinatorial approach:
If you throw a four sided die $n$ times then you have a total of $4^n$ possible sequences. Now you want to count in how many of those the sequence "$4213433$" exists.
Denote $A_1$ the number of sequences in which the string exists at least once (with multiplicities - if it exists twice, we count it twice). Then $A_1=4^{n-7}(n-7+1)$ (You think of the string as a block, then you write any sequence of length $n-7$ and add the string in any of the $n-7+1$ spaces)
In the same way, denote by $A_j$ the number of sequences in which the string exists at least $j$ times (with multiplicities). Then $A_j=4^{n-7j}\binom{n-7j+1}{j}$.
Clearly $1\leq j\leq\frac{n}{7}$.
Now, the number of strings in which the string appears at least once (without multiplicities) is $$\sum_{j=1}^{\left\lfloor \frac{n}{7} \right\rfloor}(-1)^{j+1}A_j=\sum_{j=1}^{\left\lfloor \frac{n}{7} \right\rfloor}(-1)^{j+1}4^{n-7j}\binom{n-7j+1}{j}$$ The probability you desire is $$\frac{1}{4^n}\sum_{j=1}^{\left\lfloor \frac{n}{7} \right\rfloor}(-1)^{j+1}4^{n-7j}\binom{n-7j+1}{j}=\sum_{j=1}^{\left\lfloor \frac{n}{7} \right\rfloor}\frac{(-1)^{j+1}}{4^{7j}}\binom{n-7j+1}{j}$$

  • 1
    Just plug in $n=4.5\cdot10^6$...2011-07-26
  • 0
    I'm afraid this is way over my head. Does this solve the issue of overlap raised by @joriki?2011-07-26
  • 0
    Yes, that's exactly why we have a $(-1)^{j+1}$ in the sigma - some events overlap, so we count them twice and more, so we have to subtract them. BTW, according to Mathematica, for $n=10^5$ it's already about 0.995...2011-07-26
  • 0
    Thanks Dennis, I'm going to try and rack my brain until I understand your answer. But until then, do you know of any reasonable approximation that is easier to compute? Would it be wrong to assume that if "4213433" occurs more than 274.658 times, it's likely to not be random?2011-07-26
  • 1
    @nachocab: There seems to be a crucial misunderstanding here. This answer computes what you originally seemed to asked for, the probability of getting at least one occurrence of the sequence. Your later edit suggests that you're actually interested in the expected number of occurrences, which seems a lot more useful, since with your numbers it's practically certain there will be at least one occurrence. I raised the issue of overlap *only* with respect to your *original* question, and emphasized that it isn't relevant if you ask for the expected number of occurrences.2011-07-26
  • 0
    @nachocab: You correctly calculated the expected number of occurrences to six decimal digits yourself (despite omitting the small detail of first subtracting $6$ from a huge number), so if what you really want is the expected number of occurrences, you're done, and if you understand why you should have subtracted $6$ first, you've understood all you need to understand about it. Regarding your question "it's likely to not be random?": No, that's more complicated; it could well occur $276$ or even $300$ times and yet be random. See e.g. http://en.wikipedia.org/wiki/Hypothesis_testing.2011-07-26
  • 0
    @joriki: Thanks for your help. I understand it now.2011-07-26
3

The answer you gave in your edit is not the answer to your original question. It's the (almost but not quite correct) answer to the question "What is the expected value of the number of sequences 4213433 occurring if I throw a four-sided die $4.5\times 10^6$ times?" This is in fact a much easier question because of the linearity of expectation. You don't have to worry about correlations, you can just add up the expectation values for each of the slots at which the sequence might occur. However, there aren't $4.5\times10^6$ of these, only $4.5\times10^6-6$, so the expected number of occurrences is actually $(4.5\times10^6-6)/4^7$, but that's the same up to the decimal places you specified.

Your original question is a bit harder, since to find the probability of at least one such sequence occurring, you need to take into account that the events of the sequence occurring in overlapping slots aren't independent. For instance, the probability of getting the sequence 44 at least once if you roll the die $3$ times is $7/4^3$, whereas the probability of getting the sequence $43$ at least once is $8/4^3$, even though the expected number of occurrences in both cases is $2/4^2$.

2

I've used Mr. Gulko's formula above to work several problems by hand, and I've also compared the results given by his formula to brute force computations done by Python script. I can confirm that his formula returns correct results. However, it requires one minor tweak.

Given an alphabet of size $a$ (e.g., $a=4$ for the alphabet {a,b,c,d}), the probability that a non-overlapping word of length $w$ appears one or more times in a sequence of $n$ randomly chosen letters is given by

$$P=\frac{1}{a^n}\sum_{j=1}^{\left\lfloor \frac{n}{w} \right\rfloor}(-1)^{j+1}a^{n-wj}\binom{n-wj+j}{j}$$

This solution is basically identical to Mr. Gulko's solution given above, with one exception: I have replaced

$$\binom{n-wj+1}{j}$$

with

$$\binom{n-wj+j}{j}$$

This change is necessary because as $j$ increments, we need to compute the number of ways we can place $j$ words in $n-wj+j$ slots, not $j$ words in $n-wj+1$ slots.