Assume x is a letter whose probability is determined by P(x). Given N outcomes, what are the chances of observing m of the possible outcomes together at least once? e.g. For a string of lower-case text reduced to letters only to a length of 20 characters, what is the probability of observing at least one char of each of the following, 'a','b','c','d' at least one. i.e. the text contains 'a', 'b', 'c','d' in any order, duplicates allowed. Thank you very much.
Probability of occurrence of 4 letters in a sample of text
3 Answers
The probability $Q$ to observe at least once each letter in a set $B$ of size $b$, in a word of length $\ell$ written in an alphabet $A$ of size $a\ge b$ when the word is chosen uniformly at random, is $$ Q=\sum_{n=0}^b(-1)^n{b\choose n}\left(1-\frac{n}a\right)^\ell. $$ Hence the number of such words is $$ a^\ell Q=\sum_{n=0}^b(-1)^n{b\choose n}(a-n)^\ell. $$ If $b=4$, $a=26$ and $\ell=20$, W|A says that this sum of $b+1=5$ terms is $$ Q=\frac{188,637,413,085,495,043,814,491,263}{2,491,018,611,901,176,144,042,524,672}=7.573\%. $$ The proof of the general formula for $Q$ is rather simple if one uses the inclusion-exclusion principle. To wit, for each letter $x$ in $A$, consider the event $C_x$ that the word does not contain $x$. Then, one sees that $Q$ is the probability of the event $$ \bigcap_{x\in B}(\Omega\setminus C_x)=\Omega\setminus C,\qquad C=\bigcup_{x\in B}C_x, $$ hence $$ 1-Q=\mathrm P(A)=\sum_{n=1}^b (-1)^{n-1}\sum_{I:|I|=n} \mathrm P(C_I), \qquad C_I=\bigcap_{x\in I} C_x, $$ where the sums are over subsets $I$ of $B$. There are ${b\choose n}$ such subsets $I$ of size $|I|=n$ and for each one of them, $C_I$ is the event that each of the $\ell$ letters of the word is chosen in the alphabet $A\setminus I$ of size $(a-n)$ instead of the whole alphabet $A$ of size $a$, hence $$ \mathrm P(C_I)=\left(1-\frac{n}a\right)^\ell. $$ Rearranging and adding $1$ as the $n=0$ term yields the formula for $Q$ stated above.
Perhaps we should do this in some order. So let's first wonder whether there is an 'a'. We can find that probability easily. Now, we are one branch along in a probability tree.
Next, we ask ourselves: given that there is an 'a' (if there is not an 'a', we don't care - we don't have 1 of each), what is the probability that there is a 'b'?
Rinse wash and repeat for 'c' and 'd,' and we will have a tree.
Now, I note that there is a bit of cheating here - and if you think this makes it too imprecise, let me know. But this carries the implicit assumption that the probability of finding a 'b', say, does not change if we already know there is an 'a.' For large samples, this is perfectly fine - the assumption doesn't really matter. But this is not actually true - and it seems very challenging to me to compensate for this.
-
0Unfortunately, not knowing how many a's are in the string means you don't know the chance that there is a b. – 2011-09-24
Note: I am assuming that all letters are equally likely. The number of combinations does not change, but the equal liklihood of each combination does. There are many ways to attack this. My favorite would be to count the strings as follows: let $A(n)$ be the number of $n$ character strings with none of abcd, so $A(n)=22^n$. Let $B(n)$ be the number of strings of $n$ characters that include any number of exactly one of abcd. So $B(1)=4, B(n)=23B(n-1)+4A(n-1)$ because you can add any of $23$ letters to a string that already has one of abcd and still have one, or add any of $4$ letters to a string that doesn't have any and have a string with one. Similarly, let $C(n)$ be the $n$ character strings that include two different letters (any number of times) of abcd. So $C(1)=0, C(n)=3B(n-1)+24C(n-1)$. Then $D(n)$ has three letters, $D(0)=0, D(n)=25D(n-1)+2C(n-1)$. Finally, $E(n)$ is the number of strings that have all four. $E(0)=0, E(n)=26E(n-1)+D(n-1)$. This can be put easily into a spreadsheet (if the numbers don't get too big or you can allow floating point errors) or an arbitrary precision language like Python. I get about $7.57\%$ of all $20$ letter strings have all of abcd.
For this case, I get $1509099304683960350515930104/19928148895209409152340197376$ as the exact answer
This is a recurring theme-forget what you don't need to know. In this case, there are only two kinds of letters- abcd and everything else. Among abcd, all you need to know is how many are already in the string, not which ones.
-
0thank you! I'll run this through Haskell to avoid floating point errors! – 2011-09-24
-
0just to make sure i'm on the right track here, plugging in values would make B(n) = ( ( 23 * (expt 4 19)) + ( 4 * (expt 22 19))) = 128259908852079784775122944 – 2011-09-24
-
0I get $B(20)=4042905330592729194863789700.\ \ 23\cdot4^{19}+4\cdot22^{19}$ looks like the first term counts $19$ place strings from abcd followed by a letter that is not the first one of abce and the second counts 19 place strings without abcd followed by one of abcd. But you won't get a string of 10 non-abcd, followed by a, followed by 9 non-bcd nor a string of 10 non-abcd, followed by aa, followed by 8 non-bcd. – 2011-09-24
-
0so what should the equation for B(20) be? I'm a little confused whether my math, or calculation is off. – 2011-09-24
-
0It should be $23B(19)+4A(19),$ but $B(19)\neq 4^{19}$ because that requires the string to have all been chosen from four letters. I get $B(19)=170201974858289379664370716$ – 2011-09-24
-
0@Ross: The OP stated that he knew the probability that a particular letter x, which he denoted $P(x)$, would occur. Is this incorporated into your answer? I seem to think that you assume all strings are equally likely. Is that correct, or am I missing something? – 2011-09-24
-
0You are correct, I am assuming that all letters are equally likely. I will specify that. – 2011-09-24
-
1@mixedmath, yes you are correct. I am working the problem now as a didactic exercise then will get to that. If you have a solution with P(x) probabilities, please share! – 2011-09-24
-
0@wespiserA: If we keep the denominator $1$, the $22$ in $A$ becomes the sum of the $P(x)$s over the non-abcd letters. The $4$ in the expression for $B$ becomes the sum of the $P(x)$s for abcd. I think the $23$ in $B$ becomes the sum of the non-abcd plus a weighted sum of the abcd, but am not sure how to weight it, and so on. – 2011-09-24