3
$\begingroup$

Assume x is a letter whose probability is determined by P(x). Given N outcomes, what are the chances of observing m of the possible outcomes together at least once? e.g. For a string of lower-case text reduced to letters only to a length of 20 characters, what is the probability of observing at least one char of each of the following, 'a','b','c','d' at least one. i.e. the text contains 'a', 'b', 'c','d' in any order, duplicates allowed. Thank you very much.

3 Answers 3

1

The probability $Q$ to observe at least once each letter in a set $B$ of size $b$, in a word of length $\ell$ written in an alphabet $A$ of size $a\ge b$ when the word is chosen uniformly at random, is $ Q=\sum_{n=0}^b(-1)^n{b\choose n}\left(1-\frac{n}a\right)^\ell. $ Hence the number of such words is $ a^\ell Q=\sum_{n=0}^b(-1)^n{b\choose n}(a-n)^\ell. $ If $b=4$, $a=26$ and $\ell=20$, W|A says that this sum of $b+1=5$ terms is $ Q=\frac{188,637,413,085,495,043,814,491,263}{2,491,018,611,901,176,144,042,524,672}=7.573\%. $ The proof of the general formula for $Q$ is rather simple if one uses the inclusion-exclusion principle. To wit, for each letter $x$ in $A$, consider the event $C_x$ that the word does not contain $x$. Then, one sees that $Q$ is the probability of the event $ \bigcap_{x\in B}(\Omega\setminus C_x)=\Omega\setminus C,\qquad C=\bigcup_{x\in B}C_x, $ hence $ 1-Q=\mathrm P(A)=\sum_{n=1}^b (-1)^{n-1}\sum_{I:|I|=n} \mathrm P(C_I), \qquad C_I=\bigcap_{x\in I} C_x, $ where the sums are over subsets $I$ of $B$. There are ${b\choose n}$ such subsets $I$ of size $|I|=n$ and for each one of them, $C_I$ is the event that each of the $\ell$ letters of the word is chosen in the alphabet $A\setminus I$ of size $(a-n)$ instead of the whole alphabet $A$ of size $a$, hence $ \mathrm P(C_I)=\left(1-\frac{n}a\right)^\ell. $ Rearranging and adding $1$ as the $n=0$ term yields the formula for $Q$ stated above.

0

Perhaps we should do this in some order. So let's first wonder whether there is an 'a'. We can find that probability easily. Now, we are one branch along in a probability tree.

Next, we ask ourselves: given that there is an 'a' (if there is not an 'a', we don't care - we don't have 1 of each), what is the probability that there is a 'b'?

Rinse wash and repeat for 'c' and 'd,' and we will have a tree.

Now, I note that there is a bit of cheating here - and if you think this makes it too imprecise, let me know. But this carries the implicit assumption that the probability of finding a 'b', say, does not change if we already know there is an 'a.' For large samples, this is perfectly fine - the assumption doesn't really matter. But this is not actually true - and it seems very challenging to me to compensate for this.

  • 0
    Unfortunately, not knowing how many a's are in the string means you don't know the chance that there is a b.2011-09-24
0

Note: I am assuming that all letters are equally likely. The number of combinations does not change, but the equal liklihood of each combination does. There are many ways to attack this. My favorite would be to count the strings as follows: let $A(n)$ be the number of $n$ character strings with none of abcd, so $A(n)=22^n$. Let $B(n)$ be the number of strings of $n$ characters that include any number of exactly one of abcd. So $B(1)=4, B(n)=23B(n-1)+4A(n-1)$ because you can add any of $23$ letters to a string that already has one of abcd and still have one, or add any of $4$ letters to a string that doesn't have any and have a string with one. Similarly, let $C(n)$ be the $n$ character strings that include two different letters (any number of times) of abcd. So $C(1)=0, C(n)=3B(n-1)+24C(n-1)$. Then $D(n)$ has three letters, $D(0)=0, D(n)=25D(n-1)+2C(n-1)$. Finally, $E(n)$ is the number of strings that have all four. $E(0)=0, E(n)=26E(n-1)+D(n-1)$. This can be put easily into a spreadsheet (if the numbers don't get too big or you can allow floating point errors) or an arbitrary precision language like Python. I get about $7.57\%$ of all $20$ letter strings have all of abcd.

For this case, I get $1509099304683960350515930104/19928148895209409152340197376$ as the exact answer

This is a recurring theme-forget what you don't need to know. In this case, there are only two kinds of letters- abcd and everything else. Among abcd, all you need to know is how many are already in the string, not which ones.

  • 0
    @wespiserA: If we keep the denominator $1$, the $22$ in $A$ becomes the sum of the $P(x)$s over the non-abcd letters. The $4$ in the expression for $B$ becomes the sum of the $P(x)$s for abcd. I think the $23$ in $B$ becomes the sum of the non-abcd plus a weighted sum of the abcd, but am not sure how to weight it, and so on.2011-09-24