
As I understand it, the KL divergence measures how different two probability distributions $P$ and $Q$ are.

However, say the two distributions are:

P = [0.2 0.2 0.2 0.4]; Q = [0 0.5 0.1 0.4]; 

Then the KL-divergence is infinity. What is the justification for these distributions being infinitely different?
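
For reference, here is a quick numerical check of both directions (a minimal sketch in Python/NumPy; the helper kl_divergence is just for illustration, and it uses the natural logarithm, so values are in nats):

    import numpy as np

    P = np.array([0.2, 0.2, 0.2, 0.4])
    Q = np.array([0.0, 0.5, 0.1, 0.4])

    def kl_divergence(p, q):
        # D(p||q) = sum_i p_i * log(p_i / q_i), in nats (natural log).
        total = 0.0
        for pi, qi in zip(p, q):
            if pi == 0:
                continue        # convention: 0 * log(0/q) contributes 0
            if qi == 0:
                return np.inf   # p_i > 0 but q_i = 0: the divergence is infinite
            total += pi * np.log(pi / qi)
        return total

    print(kl_divergence(P, Q))  # inf  (P puts mass on the first outcome, Q does not)
    print(kl_divergence(Q, P))  # ~0.3888 nats (~0.561 bits)

The infinity comes entirely from the first component, where $p_1 = 0.2 > 0$ but $q_1 = 0$.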

  • In general $D(P \Vert Q) \neq D(Q \Vert P)$; for the example you have given, $D(P \Vert Q) = \infty$, whereas $D(Q \Vert P)$ is finite. (2011-02-09)
  • Right. But I suppose $D(P \Vert Q)$ gives a measure of how different $P$ is from $Q$? So why is that infinite? (2011-02-09)
  • The same question was asked recently here: http://math.stackexchange.com/questions/20961/in-what-situations-is-the-integral-equal-to-infinity (2011-02-09)
  • Actually $D(Q \Vert P) \approx 0.3888$ (in nats). (2016-11-09)

1 Answer


One interpretation of the Kullback-Leibler divergence $D_{KL}(P \Vert Q)$ is that it measures the extra message length per character needed to send a message drawn from distribution $P$ using an encoding scheme optimized for distribution $Q$. If one character is assigned a very small probability by $Q$, then the encoding scheme will use a very large number of bits to encode that character, since it is expected to occur so rarely. (By the Kraft inequality, an optimal prefix code can assign it roughly $-\log_2(q_i)$ bits, where $q_i$ is the probability of the unlikely character.) If that character in fact occurs frequently under $P$, then a message drawn from $P$ takes many bits to encode: that character alone contributes $-p_i \log_2(q_i)$ bits per character on average. Therefore, if the $i$-th component of $Q$ goes to zero while the corresponding component of $P$ remains fixed and positive, the KL divergence $D_{KL}(P \Vert Q)$ should (and does) diverge like $-p_i \log_2(q_i)$, i.e. logarithmically in $1/q_i$.
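
A small sketch of this coding picture (Python/NumPy, with the zero entry of the question's $Q$ replaced by a hypothetical small $\varepsilon$ so that all quantities stay finite): the extra bits per character equal the KL divergence in base 2.

    import numpy as np

    P = np.array([0.2, 0.2, 0.2, 0.4])
    eps = 1e-6                                 # hypothetical stand-in for the zero entry of Q
    Q = np.array([eps, 0.5 - eps, 0.1, 0.4])

    code_lengths = -np.log2(Q)                 # ideal code length (bits) for each symbol under Q
    cross_entropy = np.sum(P * code_lengths)   # average bits/character when the data actually follow P
    entropy_P = -np.sum(P * np.log2(P))        # best possible average bits/character for P
    kl_bits = np.sum(P * np.log2(P / Q))       # D_KL(P || Q) in bits

    print(cross_entropy - entropy_P)           # extra bits per character ...
    print(kl_bits)                             # ... equals the KL divergence (about 3.9 bits here)

As $\varepsilon \to 0$, the term $-0.2 \log_2 \varepsilon$ dominates and the extra cost grows without bound, which is exactly the infinity in the question.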

  • I see, thanks. So if a character is assigned zero probability by $Q$, then it will not be assigned a symbol in the encoding table, and thus it effectively cannot be represented when a message drawn from $P$ comes along? (2011-02-09)
  • @Bill: That's right... I think strictly the KL divergence is undefined if $q_i = 0$ and $p_i > 0$, but that's the intuition for what happens as $q_i \rightarrow 0$: the encoding table is "unprepared" to represent the offending character without a large overhead. A small numerical sketch of this limit follows below. (2011-02-09)
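
A minimal numerical sketch of that limit (again Python/NumPy, shrinking the first entry of $Q$ toward zero while keeping $P$ fixed):

    import numpy as np

    P = np.array([0.2, 0.2, 0.2, 0.4])

    def kl_bits(p, q):
        return np.sum(p * np.log2(p / q))       # finite here because q has no zeros

    for q1 in [1e-2, 1e-4, 1e-6, 1e-8]:
        Q = np.array([q1, 0.5 - q1, 0.1, 0.4])  # still sums to 1
        dominant = 0.2 * np.log2(0.2 / q1)      # contribution of the "offending" symbol
        print(f"q1={q1:.0e}  D(P||Q)={kl_bits(P, Q):.3f} bits  dominant term={dominant:.3f} bits")

The total tracks the dominant term $p_1 \log_2(p_1/q_1)$, growing like $\log_2(1/q_1)$, while the remaining terms stay bounded.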