
If I have a sequence which is composed of one of $10$ prefixes, one of $5$ suffixes, and a variable-length middle, how do I compute the entropy of the sequence?

Using the Shannon entropy $$H= -\sum_{i=1}^{m} p_i \ln(p_i)$$

I can compute the entropy contributions from the $10$ starting and $5$ ending sequences using this sum, but I am unsure how to compute the variable-length section. If the distribution of lengths is:

  • length $0$: $0.5$
  • length $1$: $0.25$
  • length $2$: $0.2$
  • length $3$: $0.05$
And if each character can be one of $16$ equally likely symbols (the hex digits $0$–$F$, say), how do I calculate the number of bits of entropy? If I simply use the sum above, I am underestimating the contribution from the longer sequences, which have more possibilities. Do I also need to ignore the $p_0$ term, since $0.5\ln(0.5)$ isn't actually adding any additional characters to the sequence?
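For concreteness, applying the sum above to just the length distribution gives $$H_{\text{len}} = -(0.5\ln 0.5 + 0.25\ln 0.25 + 0.2\ln 0.2 + 0.05\ln 0.05) \approx 1.16 \text{ nats},$$ which says nothing about the extra per-character possibilities of the longer middles.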

1 Answer


If all $16$ possibilities have equal probability, the entropy per character must be $\ln 16\approx 2.77$ nats; this is the maximum, $H_{\max}= -\sum_{i=1}^{n} \frac1n \ln(\frac1n) = \ln n$. Adding new symbols with "equal" probability does not add any extra information beyond this per-character maximum.
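As for combining the pieces, here is a minimal sketch, assuming the prefix, the suffix, the middle length, and each middle character are all chosen independently (the independence assumption and variable names are mine, not given in the question). By the chain rule, the middle contributes $H(L) + \mathbb{E}[L]\cdot\ln 16$: the entropy of the length itself plus the expected per-character content, so the $p_0$ term is kept inside $H(L)$ rather than ignored.

    import math

    # Values from the question; treated as independent choices (assumption).
    n_prefixes = 10                  # uniform over 10 prefixes
    n_suffixes = 5                   # uniform over 5 suffixes
    alphabet = 16                    # 16 equally likely middle characters
    length_dist = {0: 0.5, 1: 0.25, 2: 0.2, 3: 0.05}

    def entropy(probs):
        """Shannon entropy in nats: H = -sum p ln p (terms with p = 0 skipped)."""
        return -sum(p * math.log(p) for p in probs if p > 0)

    # Uniform choices contribute ln(n) each.
    h_prefix = math.log(n_prefixes)
    h_suffix = math.log(n_suffixes)

    # Chain rule for the middle: first the length, then the characters.
    # H(middle) = H(L) + sum_L p_L * L * ln(alphabet) = H(L) + E[L] * ln(alphabet).
    # The p_0 term is not ignored: it lives inside H(L).
    h_length = entropy(length_dist.values())
    mean_len = sum(length * p for length, p in length_dist.items())
    h_middle = h_length + mean_len * math.log(alphabet)

    h_total = h_prefix + h_suffix + h_middle
    print(f"{h_total:.3f} nats = {h_total / math.log(2):.3f} bits")

With the question's numbers this comes to about $7.29$ nats, i.e. roughly $10.5$ bits.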

  • While I agree that adding one character with $16$ equal possibilities adds $\log(16)$ entropy, the problem here is how to cope with a variable-length sequence. (2013-05-23)