6
$\begingroup$

I have a four-letter alphabet containing A, B, C, and D. What is the expected number of times a string of length $m$ occurs inside of larger random substring of length $n$, both generated from the same alphabet? I think I've got it so far for an even distribution, where each letter has a probability of $0.25$:

$$(n-m)\cdot\left(\frac 1 4\right)^m$$

What if the letters are not evenly distributed? What if A and B had probabilities of $0.115$, and C and D had probabilities of $0.385$? How does that change the problem?

2 Answers 2

5

You miscounted by $1$: A string of length $m$ fits into a string of length $n$ once, not $0$ times, so it should be $(n-m+1)$.

The problem is essentially the same if the probabilities differ. By linearity of expectation, you still only have to take $n-m+1$ times the probability for a single match. If you have $k$ letters $a_1$ through $a_k$ that occur with probabilities $p_1$ through $p_k$, the expected number of matches is

$$ (n-m+1)\left(\sum_{i=1}^kp_i^2\right)^m\;. $$

  • 1
    Could you please explain this is little more detail. I am sorry but I just am not able to get it in my head. What I was doing is expected counts = summation (count * probability) . I am not able to reach to the formula you mentioned with it.2015-08-08
  • 0
    @Naman: Do you understand linearity of expectation? Did you check out the link?2015-08-08
  • 0
    @jorki Yes I did check and I understand that. I am reading book (https://books.google.com/books?id=_DMvZC6zJqAC&pg=PA224&lpg=PA224&dq=expected+number+of+substrings+in+random+string&source=bl&ots=h-_nWwcOu7&sig=776dBkifUhi7Yp0I4qEH0KqnYuk&hl=en&sa=X&ved=0CFYQ6AEwB2oVChMInLX47J2axwIVRhmSCh3ehgoq#v=onepage&q=expected%20number%20of%20substrings%20in%20random%20string&f=false) page 223 for same explanation. But unable to understand properly.2015-08-08
  • 0
    @Naman: It seems that that book treats "cyclic strings"? That's why they have $m$ (our $n$) where we have $n-m+1$, because we're not allowing the substring to wrap around the string cyclically. (In case this doesn't resolve your problem, please be a lot more specific on what your problem is.)2015-08-08
  • 3
    I could understand that part. It should be (n-m+1) but not able to get the probability part. Shouldn't it be prod(probability of each alphabet in given substring) ? I am not able to understand the second part of your formula. Why is it summing over squared probability over alphabets?2015-08-08
  • 0
    Okay. Thanks for help.2015-08-08
  • 1
    @joriki, I think that it should be $p_i$, NOT $p_i^2$. Please confirm2015-10-31
  • 2
    @joriki - Thanks for the "linearity of expectation", saved my day.. Can you please explain why the inner part of your expression is SUM(p^2)? It should be the probability for a1..ak which is the product of all p's, isn't it?2015-10-31
4

This is too long for a comment to the answer by @joriki, but it's intended to fill in some details for those commenters who questioned whether the sum of squared probabilities is correct (it is).

For an alphabet of $k$ letters $\{a_1\ldots a_k\}$, we're given two random words on that alphabet: $X_1\ldots X_m$ and $Y_1\ldots Y_n,\ \ m, and all of these random variables are i.i.d. Then the required expectation is found as follows, where $[...]$ are Iverson brackets:

$$E\bigg(\text{number of occurrences of }X_1\ldots X_m\text{ in }Y_1\ldots Y_n\bigg)\\ \begin{align} &= E\bigg(\sum_{i=1}^{n-m+1}[Y_i\ldots Y_{i+m-1}=X_1\ldots X_m]\bigg)\\ &= \sum_{i=1}^{n-m+1}P(Y_i\ldots Y_{i+m-1}=X_1\ldots X_m)\\ &=(n-m+1)\ P(Y_1\ldots Y_m=X_1\ldots X_m)\\ &=(n-m+1)\ P\bigg(\bigcap_{i=1}^m \{Y_i=X_i\}\bigg)\\ &=(n-m+1)\prod_{i=1}^m P(Y_i=X_i)\\ &=(n-m+1)\big( P(Y_1=X_1)\big)^m\\ &=(n-m+1)\bigg( \sum_{i=1}^k P(\{Y_1=a_i\}\cap\{X_1=a_i\}\bigg)^m\\ &=(n-m+1)\bigg( \sum_{i=1}^k P(Y_1=a_i)P(X_1=a_i)\bigg)^m\\ &=(n-m+1)\bigg( \sum_{i=1}^k p_i^2\bigg)^m\\ \end{align} $$

If the shorter word were fixed instead of random, say $x_1\ldots x_m$, then similar derivation gives

$$E\bigg(\text{number of occurrences of }x_1\ldots x_m\text{ in }Y_1\ldots Y_n\bigg)\\ \begin{align} &=\qquad ...\\ &=(n-m+1)\prod_{i=1}^m P(Y_i=x_i)\\ &=(n-m+1)\prod_{i=1}^m p(x_i)\\ \end{align} $$

where $p()$ is the probability mass function for the given distribution on the finite alphabet.