I am trying to run a simple Markov model on the following data:
block_M1 block_M2 hybrid_block block_S1 block_S2
A|T T|A A|C C|G T|A
T|G C|T T|A A|T C|A
C|A A|G C|G G|A C|G
G|T A|T G|T C|T A|T
So, block M contains several strings that belong to one category. Reading down each side of the | separator, block M1 contains the strings ATCG and TGAT, and block M2 contains TCAA and ATGT. Similarly, block S also has 4 strings in total.
The hybrid block has two strings (ATCG and CAGT), where one string is inherited from block M and the other from block S.
I am trying to build a Markov model that can help me identify which string in the hybrid block came from which block. In this example I can tell that, in the hybrid block, ATCG came from block M and CAGT came from block S.
I followed the example in this link: http://web.stanford.edu/class/stats366/exs/HMM1.html
Unlike the CpG island problem, which computes transition probabilities from nucleotide 's' to nucleotide 't' along the length of a given sequence, our model needs to compute the transition probability from nucleotide 's' (in the previous position) to nucleotide 't' (in the next position) for both blocks (M and S).
So, the transition probability under the Markov model for block M can be written as:
$a_{s,t}^{m} = \frac{c_{s,t}^{m}}{\sum_{k} c_{s,k}^{m}}$
A similar Markov model can be built for block S.
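To make the estimate concrete, here is a minimal Python sketch of the formula above. It assumes the block M strings are the ones read down the columns of block_M1 and block_M2 in the table, and that counts are pooled over all strings in the block. The function name `transition_probs` and the `pseudocount` argument are illustrative choices that are not part of the formula; the pseudocount is only there so that transitions never seen in the training strings do not get probability zero.

```python
ALPHABET = "ACGT"

def transition_probs(strings, pseudocount=1.0):
    """Estimate a[s][t] = c[s][t] / sum_k c[s][k] from a list of strings.

    pseudocount is an assumption added for smoothing; with pseudocount=0
    this is exactly the plain count ratio from the formula.
    """
    # c[s][t]: how often nucleotide t directly follows nucleotide s
    counts = {s: {t: pseudocount for t in ALPHABET} for s in ALPHABET}
    for seq in strings:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1

    probs = {}
    for s in ALPHABET:
        total = sum(counts[s].values())
        # A row with no observations (only possible when pseudocount=0) stays at 0.0
        probs[s] = {t: (counts[s][t] / total if total else 0.0) for t in ALPHABET}
    return probs

# Strings read down the columns of block_M1 and block_M2 in the table
m_strings = ["ATCG", "TGAT", "TCAA", "ATGT"]
a_m = transition_probs(m_strings)

for s in ALPHABET:
    print(s, {t: round(p, 3) for t, p in a_m[s].items()})
```

The same function applied to the block S strings would give the second transition matrix.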
But, unlike in the CpG island problem, I have to model the Markov process so that deciding which string in the hybrid block belongs to which main block combines the probabilities across all observations in the string, something like p(A|C) p(G|A) p(C|G) p(C|C), computed for both block M and block S.
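One way this scoring could look in code is sketched below, in Python. It is not a definitive method, only an illustration under several assumptions: the training strings are read down the columns of the table, each block gets a first-order Markov chain with add-one smoothing, the probability of the first nucleotide is ignored (treated as equal under both models), and the product of transition probabilities is computed as a sum of logs to avoid underflow on longer strings. All function and variable names are hypothetical.

```python
import math

ALPHABET = "ACGT"

def transition_probs(strings, pseudocount=1.0):
    """Row-normalised transition counts; pseudocount > 0 avoids log(0) later."""
    counts = {s: {t: pseudocount for t in ALPHABET} for s in ALPHABET}
    for seq in strings:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    return {s: {t: counts[s][t] / sum(counts[s].values()) for t in ALPHABET}
            for s in ALPHABET}

def log_likelihood(seq, probs):
    """Sum of log transition probabilities along seq (start state ignored)."""
    return sum(math.log(probs[s][t]) for s, t in zip(seq, seq[1:]))

# Strings read down the columns of the M and S blocks in the table
a_m = transition_probs(["ATCG", "TGAT", "TCAA", "ATGT"])
a_s = transition_probs(["CAGC", "GTAT", "TCCA", "AAGT"])

# The two strings of the hybrid block
h1, h2 = "ATCG", "CAGT"

# Compare the two possible assignments of hybrid strings to blocks
score_h1_from_m = log_likelihood(h1, a_m) + log_likelihood(h2, a_s)
score_h1_from_s = log_likelihood(h1, a_s) + log_likelihood(h2, a_m)

log_odds = score_h1_from_m - score_h1_from_s
origin_of_h1 = "M" if log_odds > 0 else "S"
print(f"log-odds that {h1} came from M and {h2} from S: {log_odds:.2f}")
print(f"{h1} is assigned to block {origin_of_h1}")
```

A positive log-odds favours the assignment where the first hybrid string came from block M; with only four short training strings per block the estimate is of course very noisy.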
Any suggestions on how I can take this further?
Also, is it better to model the start state as p(A|A), or rather to model the end state as p(C|C)?
Please let me know if the problem isn't clear.
Additionally, I want to write a program to model this out. What approach/module should I look into?
Thanks,