4
$\begingroup$

in the genome we have 4 nucleotides (A,T,C,G). Now given a nucleotide sequence like

AGT CG TA CG ATCT CG ,

we can count the number of "CG" pairs. That's 3 in this case. (we count all the pairs so, ACT has pairs AC and CT)

Now I would like to test the significance of my results, or how likely is it that I would get 3 CG pairs if that sequence was random. I could test that with a permutation test, but that's not completely accurate and might also take time.

Now the question: What is the probability distribution of such CG pair, given the length of the sequence and the count of each element (A,C,T,G), so that I could calculate the exact probability that my result could come from a random sequence.

  • 0
    An interesting applied combinatorics problem. I'd say the cases of successive $C$s, and the case of a $C$ at the end should all be counted separately. For any given arrangement of $C$s, you can then (relatively easily) calculate the probability of a $G$ succeeding a $C$ thrice. Possibly the problem can be solved by induction on the length of the formula (at least, that could give you a recurrence relation which can be fed to a computer).2012-10-22

1 Answers 1

4

I'll assume that the intended question is this: Given the length of a sequence and the counts of all four nucleotides in this sequence (as opposed to their frequency in sequences in general), what is the probability that a sequence drawn randomly uniformly from all sequences fulfilling that description would contain exactly a certain number $k$ of CG pairs?

Denote the counts of the nucleotides by $\def\n#1{n_{\text #1}}\n A$, $\n C$, $\n G$ and $\n T$ and their sum, the length of the sequence, by $n$. Then we can form $k$ CG pairs and distribute these $k$ pairs and the remaining $n-2k$ individual nucleotides in

$ \binom{n-k}{\n A,\n T,\n C-k,\n G-k,k} $

different ways (see multinomial coefficients). But this overcounts, since we're allowing the remaining C and G nucleotides to form pairs. Every combination with $m$ pairs is being counted $\binom mk$ times, where it shouldn't be counted at all. Making use of

$ \sum_{j=k}^m\binom mj\binom jk(-1)^{j-k}=\delta_{km}\;, $

we can correct for the overcounting and calculate the desired count of sequences fulfilling the description as

$ \begin{align} &\sum_{j=k}^\infty\binom{n-j}{\n A,\n T,\n C-j,\n G-j,j}\binom jk(-1)^{j-k}\\=&\sum_{j=k}^\infty\binom{n-j}{\n A,\n T,\n C-j,\n G-j,j-k,k}(-1)^{j-k}\;, \end{align} $

where the sum actually only runs up to $\min(\n C,\n G)$ and the remaining terms are zero. This count has to be divided by the total number of sequences fulfilling the description, which is

$ \binom n{\n A,\n T,\n C,\n G}\;. $

In your example, with $\n A=3$, $\n C=\n G=\n T=4$, $n=15$ and $k=3$, the result would be

$ \binom{15}{3,4,4,4}^{-1}\left(\binom{12}{3,4,1,1,3}-\binom{11}{3,4,1,3}\right)=\frac{44}{1365}\approx3\%\;. $

[Edit in response to comment:]

If you want to count the sequences with at least $k$ pairs, we still need to correct for overcounting, since each of the sequences with more than $k$ pairs is counted more than once, but the correction is slightly different. The required binomial coefficient identity is

$ \sum_{j=k}^m\binom mj\binom{j-1}{k-1}(-1)^{j-k}=1\;, $

and the resulting sum is

$ \sum_{j=k}^\infty\binom{n-j}{\n A,\n T,\n C-j,\n G-j,j}\binom{j-1}{k-1}(-1)^{j-k}\;. $

In your example, with $\n A=3$, $\n C=\n G=\n T=4$, $n=15$ and $k=3$, the result would be

$ \binom{15}{3,4,4,4}^{-1}\left(\binom{12}{3,4,1,1,3}\binom22-\binom{11}{3,4,4}\binom32\right)=\frac{3}{91}\approx3\%\;. $

The change relative to the result for exactly $3$ pairs is less than one tenth of a percent. The difference in the counts,

$ \left(\binom{12}{3,4,1,1,3}\binom22-\binom{11}{3,4,4}\binom32\right) - \left(\binom{12}{3,4,1,1,3}-\binom{11}{3,4,1,3}\right) = \binom{11}{3,4,4} \;, $

is precisely the number of sequences with $4$ CG pairs.

  • 0
    Thank you for the update @joriki.2012-10-24