4
$\begingroup$

in the genome we have 4 nucleotides (A,T,C,G). Now given a nucleotide sequence like

AGT CG TA CG ATCT CG ,

we can count the number of "CG" pairs. That's 3 in this case. (we count all the pairs so, ACT has pairs AC and CT)

Now I would like to test the significance of my results, or how likely is it that I would get 3 CG pairs if that sequence was random. I could test that with a permutation test, but that's not completely accurate and might also take time.

Now the question: What is the probability distribution of such CG pair, given the length of the sequence and the count of each element (A,C,T,G), so that I could calculate the exact probability that my result could come from a random sequence.

  • 0
    An interesting applied combinatorics problem. I'd say the cases of successive $C$s, and the case of a $C$ at the end should all be counted separately. For any given arrangement of $C$s, you can then (relatively easily) calculate the probability of a $G$ succeeding a $C$ thrice. Possibly the problem can be solved by induction on the length of the formula (at least, that could give you a recurrence relation which can be fed to a computer).2012-10-22

1 Answers 1

4

I'll assume that the intended question is this: Given the length of a sequence and the counts of all four nucleotides in this sequence (as opposed to their frequency in sequences in general), what is the probability that a sequence drawn randomly uniformly from all sequences fulfilling that description would contain exactly a certain number $k$ of CG pairs?

Denote the counts of the nucleotides by $\def\n#1{n_{\text #1}}\n A$, $\n C$, $\n G$ and $\n T$ and their sum, the length of the sequence, by $n$. Then we can form $k$ CG pairs and distribute these $k$ pairs and the remaining $n-2k$ individual nucleotides in

$$ \binom{n-k}{\n A,\n T,\n C-k,\n G-k,k} $$

different ways (see multinomial coefficients). But this overcounts, since we're allowing the remaining C and G nucleotides to form pairs. Every combination with $m$ pairs is being counted $\binom mk$ times, where it shouldn't be counted at all. Making use of

$$ \sum_{j=k}^m\binom mj\binom jk(-1)^{j-k}=\delta_{km}\;, $$

we can correct for the overcounting and calculate the desired count of sequences fulfilling the description as

$$ \begin{align} &\sum_{j=k}^\infty\binom{n-j}{\n A,\n T,\n C-j,\n G-j,j}\binom jk(-1)^{j-k}\\=&\sum_{j=k}^\infty\binom{n-j}{\n A,\n T,\n C-j,\n G-j,j-k,k}(-1)^{j-k}\;, \end{align} $$

where the sum actually only runs up to $\min(\n C,\n G)$ and the remaining terms are zero. This count has to be divided by the total number of sequences fulfilling the description, which is

$$ \binom n{\n A,\n T,\n C,\n G}\;. $$

In your example, with $\n A=3$, $\n C=\n G=\n T=4$, $n=15$ and $k=3$, the result would be

$$ \binom{15}{3,4,4,4}^{-1}\left(\binom{12}{3,4,1,1,3}-\binom{11}{3,4,1,3}\right)=\frac{44}{1365}\approx3\%\;. $$

[Edit in response to comment:]

If you want to count the sequences with at least $k$ pairs, we still need to correct for overcounting, since each of the sequences with more than $k$ pairs is counted more than once, but the correction is slightly different. The required binomial coefficient identity is

$$ \sum_{j=k}^m\binom mj\binom{j-1}{k-1}(-1)^{j-k}=1\;, $$

and the resulting sum is

$$ \sum_{j=k}^\infty\binom{n-j}{\n A,\n T,\n C-j,\n G-j,j}\binom{j-1}{k-1}(-1)^{j-k}\;. $$

In your example, with $\n A=3$, $\n C=\n G=\n T=4$, $n=15$ and $k=3$, the result would be

$$ \binom{15}{3,4,4,4}^{-1}\left(\binom{12}{3,4,1,1,3}\binom22-\binom{11}{3,4,4}\binom32\right)=\frac{3}{91}\approx3\%\;. $$

The change relative to the result for exactly $3$ pairs is less than one tenth of a percent. The difference in the counts,

$$ \left(\binom{12}{3,4,1,1,3}\binom22-\binom{11}{3,4,4}\binom32\right) - \left(\binom{12}{3,4,1,1,3}-\binom{11}{3,4,1,3}\right) = \binom{11}{3,4,4} \;, $$

is precisely the number of sequences with $4$ CG pairs.

  • 0
    Shouldn't the overcounting corrections start at $k+1$?2012-10-22
  • 1
    @Lord: I guess that was slightly ambiguous -- that expression isn't the correction but the corrected value -- I've changed the formulation.2012-10-22
  • 0
    I get it now. Your derivation using combinatorics is nice; sadly, I've always been at best mediocre with the binomials and their friends.2012-10-22
  • 0
    This seems to be correct, but I'll do a bit more testing to see how well these results match my permutation tests. Thank you very much2012-10-22
  • 0
    Hey, to do a significance test, I would have to know the probability of there being k or more CG pairs, and is there a simpler, way of getting that instead of summing up all the probabilities for i going from k to min(nc,ng)? Because computing that on a pc for sequences of a few milion, would not be possible.2012-10-24
  • 0
    @zidarsk8: There is; I saw your earlier (now deleted) comment and was about to post an update in response; give me an hour or so :-)2012-10-24
  • 0
    The earlyer comment also had a suggested solution, which was wrong, so that's why I deleted it. Thank you for all your help.2012-10-24
  • 0
    @zidarsk8: You're welcome. I've updated the answer with a modified solution for $k$ or more pairs.2012-10-24
  • 1
    @zidarsk8: I just reread your comment, and now I'm not sure whether you were saying that a double sum to calculate the "$k$ or more" solution would have been impractical (I agree), or that even the single sum in either of the two solutions is too much? I wouldn't know how to do this without any sum at all, but it might be possible. But I would have thought that a single sum should be doable, even for a few million base pairs.2012-10-24
  • 0
    Thank you for the update @joriki.2012-10-24