Conditional entropy can be expressed in terms of joint and marginal entropy as $H(Y|X) = H(X,Y) - H(X)$.
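Spelled out from the definitions, for concreteness:

$$H(Y|X) = -\sum_{x,y} p(x,y)\log p(y|x) = -\sum_{x,y} p(x,y)\log p(x,y) + \sum_{x} p(x)\log p(x) = H(X,Y) - H(X).$$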
Given a sequence $\{x_{1}, x_{2}, \ldots, x_{n}\}$, the bigrams are $\{(x_{i}, x_{i+1})\}$, and the trigrams are $\{(x_{i}, x_{i+1}, x_{i+2})\}$.
In general, $H(X|Y) \ne H(Y|X)$. However, in the case of bigrams both conditional entropies are obtained by subtracting a marginal from the same joint entropy: $H(X_{i+1}|X_{i}) = H(X_{i}, X_{i+1}) - H(X_{i})$ and $H(X_{i}|X_{i+1}) = H(X_{i}, X_{i+1}) - H(X_{i+1})$. Since $H(X_{i}) = H(X_{i+1})$, it follows that $H(X_{i+1}|X_{i}) = H(X_{i}|X_{i+1})$. (Or very nearly: in fact the empirical distribution of $X_{i}$ omits the last term of the sequence, and that of $X_{i+1}$ omits the first term.)
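To make this concrete, here is a minimal sketch of the kind of plug-in (empirical) estimate I have in mind, using a made-up example sequence rather than my real data:

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Plug-in (empirical) entropy, in bits, from a Counter of observations."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

seq = list("abracadabraabracadabra")   # hypothetical example sequence

pairs = list(zip(seq, seq[1:]))                    # bigrams (x_i, x_{i+1})
H_joint = entropy(Counter(pairs))                  # H(X_i, X_{i+1})
H_first = entropy(Counter(x for x, _ in pairs))    # H(X_i): omits the last symbol
H_second = entropy(Counter(y for _, y in pairs))   # H(X_{i+1}): omits the first symbol

print("H(X_{i+1} | X_i) =", H_joint - H_first)     # forward conditional entropy
print("H(X_i | X_{i+1}) =", H_joint - H_second)    # backward conditional entropy
```

The two printed values differ only through the boundary terms mentioned above.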
Question 1: The above argument shows that the conditional entropy of the bigrams is equal to the conditional entropy of their reverses, but is there an intuitive way to see this? Why should the conditional entropy going forwards be the same as the conditional entropy going backwards?
Question 2: When I calculate $H(X_{i+2}|X_{i},X_{i+1})$ and $H(X_{i}|X_{i+1},X_{i+2})$ in code, they come out the same. I find this surprising. Is there a bug in my code? If not, what is the explanation? (Again, if possible I would like an intuitive explanation as well as a mathematical one.)
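For reference, this is essentially the computation I mean (again a sketch with the same hypothetical sequence, not my actual code):

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Plug-in (empirical) entropy, in bits, from a Counter of observations."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

seq = list("abracadabraabracadabra")           # hypothetical example sequence
triples = list(zip(seq, seq[1:], seq[2:]))     # trigrams (x_i, x_{i+1}, x_{i+2})

H_joint = entropy(Counter(triples))                              # H(X_i, X_{i+1}, X_{i+2})
H_first_pair = entropy(Counter((a, b) for a, b, _ in triples))   # H(X_i, X_{i+1})
H_last_pair = entropy(Counter((b, c) for _, b, c in triples))    # H(X_{i+1}, X_{i+2})

# Conditional entropies via H(Y|X) = H(X,Y) - H(X):
print("H(X_{i+2} | X_i, X_{i+1}) =", H_joint - H_first_pair)   # forward
print("H(X_i | X_{i+1}, X_{i+2}) =", H_joint - H_last_pair)    # backward
```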
Question 3: If my sequence is produced by some Markov process, would it be valid to eliminate values $x_{i}$ for which we have "insufficient evidence" (i.e. too few samples)? Would this change the answers to questions 1 and 2?