
I have a four-letter alphabet containing A, B, C, and D. What is the expected number of times a random string of length $m$ occurs inside a larger random string of length $n$, both generated from the same alphabet? I think I've got it so far for the uniform distribution, where each letter has probability $0.25$:

$(n-m)\cdot\left(\frac 1 4\right)^m$

What if the letters are not evenly distributed? Say A and B each have probability $0.115$, and C and D each have probability $0.385$. How does that change the problem?

2 Answers


You miscounted by $1$: a string of length $m$ fits into a string of length $m$ once, not $0$ times, so the number of starting positions should be $n-m+1$.

The problem is essentially the same if the probabilities differ. By linearity of expectation, you still just multiply $n-m+1$ by the probability of a match at a single position. If you have $k$ letters $a_1$ through $a_k$ that occur with probabilities $p_1$ through $p_k$, the expected number of matches is

$ (n-m+1)\left(\sum_{i=1}^kp_i^2\right)^m\;. $
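For the distribution in the question ($p_A=p_B=0.115$, $p_C=p_D=0.385$), the inner sum is $2\cdot 0.115^2+2\cdot 0.385^2=0.3229$, so the expectation is $(n-m+1)\cdot 0.3229^m$.

As a sanity check, here is a quick Monte Carlo simulation of the formula (a sketch; the sizes $n=20$, $m=3$ and the trial count are illustrative choices, not from the answer):

```python
import random

def count_occurrences(pattern, text):
    """Count (possibly overlapping) occurrences of pattern in text."""
    m = len(pattern)
    return sum(text[i:i + m] == pattern for i in range(len(text) - m + 1))

alphabet = "ABCD"
probs = [0.115, 0.115, 0.385, 0.385]   # the question's skewed distribution
n, m, trials = 20, 3, 200_000          # illustrative sizes (assumptions)

total = 0
for _ in range(trials):
    x = "".join(random.choices(alphabet, weights=probs, k=m))  # random pattern
    y = "".join(random.choices(alphabet, weights=probs, k=n))  # random text
    total += count_occurrences(x, y)

coincidence = sum(p * p for p in probs)      # sum_i p_i^2 = 0.3229
predicted = (n - m + 1) * coincidence ** m   # (n-m+1) * (sum_i p_i^2)^m
print(f"simulated: {total / trials:.4f}   predicted: {predicted:.4f}")
```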

  • @joriki – Thanks for the "linearity of expectation", saved my day. Can you please explain why the inner part of your expression is $\sum p_i^2$? Shouldn't it be the probability for $a_1\ldots a_k$, which is the product of all the $p$'s? (2015-10-31)

This is too long for a comment to the answer by @joriki, but it's intended to fill in some details for those commenters who questioned whether the sum of squared probabilities is correct (it is).

For an alphabet of $k$ letters $\{a_1,\ldots,a_k\}$, we're given two random words on that alphabet, $X_1\ldots X_m$ and $Y_1\ldots Y_n$ with $m\le n$, where all of these random variables are i.i.d. Then the required expectation is found as follows, where $[\,\cdot\,]$ denotes Iverson brackets:

$E\big(\text{number of occurrences of }X_1\ldots X_m\text{ in }Y_1\ldots Y_n\big)\\
\begin{align}
&= E\bigg(\sum_{i=1}^{n-m+1}[Y_i\ldots Y_{i+m-1}=X_1\ldots X_m]\bigg)\\
&= \sum_{i=1}^{n-m+1}P(Y_i\ldots Y_{i+m-1}=X_1\ldots X_m)\\
&= (n-m+1)\,P(Y_1\ldots Y_m=X_1\ldots X_m)\\
&= (n-m+1)\,P\bigg(\bigcap_{i=1}^m \{Y_i=X_i\}\bigg)\\
&= (n-m+1)\prod_{i=1}^m P(Y_i=X_i)\\
&= (n-m+1)\big(P(Y_1=X_1)\big)^m\\
&= (n-m+1)\bigg(\sum_{i=1}^k P\big(\{Y_1=a_i\}\cap\{X_1=a_i\}\big)\bigg)^m\\
&= (n-m+1)\bigg(\sum_{i=1}^k P(Y_1=a_i)\,P(X_1=a_i)\bigg)^m\\
&= (n-m+1)\bigg(\sum_{i=1}^k p_i^2\bigg)^m
\end{align}
$
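As a concrete check of this derivation, one can compute the expectation exactly by brute-force enumeration over all pattern/text pairs for tiny sizes and compare it with the closed form (a sketch; $n=4$, $m=2$ and the probabilities from the question are illustrative choices):

```python
from itertools import product

alphabet = "ABCD"
p = {"A": 0.115, "B": 0.115, "C": 0.385, "D": 0.385}
n, m = 4, 2  # tiny sizes keep the full enumeration cheap (4^n * 4^m terms)

def prob(word):
    """Probability of drawing this exact word letter by letter."""
    out = 1.0
    for a in word:
        out *= p[a]
    return out

expected = 0.0
for x in product(alphabet, repeat=m):        # every possible pattern
    for y in product(alphabet, repeat=n):    # every possible text
        matches = sum(y[i:i + m] == x for i in range(n - m + 1))
        expected += prob(x) * prob(y) * matches

formula = (n - m + 1) * sum(q * q for q in p.values()) ** m
print(f"enumeration: {expected:.6f}   formula: {formula:.6f}")
```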

If the shorter word were fixed instead of random, say $x_1\ldots x_m$, then a similar derivation gives

$E\big(\text{number of occurrences of }x_1\ldots x_m\text{ in }Y_1\ldots Y_n\big)\\
\begin{align}
&= \qquad\cdots\\
&= (n-m+1)\prod_{i=1}^m P(Y_i=x_i)\\
&= (n-m+1)\prod_{i=1}^m p(x_i)
\end{align}
$

where $p()$ is the probability mass function for the given distribution on the finite alphabet.
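The fixed-pattern case can be checked by simulation as well (a sketch; the pattern "ACD", $n=20$, and the probabilities from the question are illustrative choices):

```python
import random

alphabet = "ABCD"
pmf = {"A": 0.115, "B": 0.115, "C": 0.385, "D": 0.385}
x, n, trials = "ACD", 20, 200_000  # fixed pattern, illustrative parameters
m = len(x)
weights = [pmf[a] for a in alphabet]

total = 0
for _ in range(trials):
    y = "".join(random.choices(alphabet, weights=weights, k=n))
    total += sum(y[i:i + m] == x for i in range(n - m + 1))

match_prob = 1.0
for letter in x:          # prod_i p(x_i)
    match_prob *= pmf[letter]
predicted = (n - m + 1) * match_prob
print(f"simulated: {total / trials:.5f}   predicted: {predicted:.5f}")
```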