
As I understand it, the KL divergence measures how different two probability distributions $P$ and $Q$ are.

However, say the two distributions are:

P = [0.2 0.2 0.2 0.4]; Q = [0 0.5 0.1 0.4]; 

Then the KL divergence $D_{KL}(P \Vert Q)$ is infinite. What is the justification for calling these distributions infinitely different?
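A quick sketch of the computation (in Python, nats, using the usual conventions $0 \log 0 = 0$ and $p_i \log(p_i/0) = \infty$ for $p_i > 0$; the helper name kl_divergence is just illustrative):

    import math

    def kl_divergence(p, q):
        """D(p || q) = sum_i p_i * ln(p_i / q_i), in nats."""
        total = 0.0
        for pi, qi in zip(p, q):
            if pi == 0:
                continue            # convention: 0 * log(0 / q) = 0
            if qi == 0:
                return math.inf     # p_i > 0 but q_i = 0: the divergence is infinite
            total += pi * math.log(pi / qi)
        return total

    P = [0.2, 0.2, 0.2, 0.4]
    Q = [0.0, 0.5, 0.1, 0.4]

    print(kl_divergence(P, Q))  # inf, because q_1 = 0 while p_1 = 0.2 > 0
    print(kl_divergence(Q, P))  # ~0.3888 nats: the reverse direction is finite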

  • Actually, $D(Q \Vert P) = 0.3888$. (2016-11-09)

1 Answer

One interpretation of the Kullback-Leibler divergence $D_{KL}(P \Vert Q)$ is that it measures the extra message length per character, beyond what a code optimized for $P$ itself would need, to send a message drawn from distribution $P$ using an encoding scheme optimized for distribution $Q$. If a character is assigned a very small probability by $Q$, then the encoding scheme will use a very large number of bits for that character, since it is expected to occur so rarely. (By Kraft's inequality, the codeword length should be $-\log_2(q_i)$ bits, where $q_i$ is the probability of the unlikely character.) If that character in fact occurs frequently under $P$, then a message drawn from $P$ will take a large number of bits to encode: that character alone contributes $-p_i \log_2(q_i)$ bits per character on average. Therefore, if the $i$-th component of $Q$ goes to zero while the corresponding component of $P$ remains fixed and positive, the KL divergence $D_{KL}(P \Vert Q)$ should (and does) diverge like $-p_i \log_2(q_i)$, i.e., logarithmically in $1/q_i$.
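As a rough numerical illustration of this, here is a sketch that reuses the question's $P$, shrinks $q_1$ toward zero while taking the missing mass from the last entry of $Q$, and compares $D_{KL}(P \Vert Q)$ (computed in bits as cross-entropy minus entropy) with the $-p_1 \log_2(q_1)$ term; the helper names are not from any particular library.

    import math

    def entropy_bits(p):
        # H(P): optimal average code length when P is encoded with a code built for P
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def cross_entropy_bits(p, q):
        # average code length when symbols drawn from P get length -log2(q_i),
        # i.e. are encoded with a scheme optimized for Q
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    def kl_bits(p, q):
        # D_KL(P || Q) in bits = cross-entropy minus entropy = extra bits per character
        return cross_entropy_bits(p, q) - entropy_bits(p)

    p = [0.2, 0.2, 0.2, 0.4]
    for eps in [1e-1, 1e-3, 1e-6, 1e-9]:
        # give the first symbol probability eps under Q, taking the mass from the last symbol
        q = [eps, 0.5, 0.1, 0.4 - eps]
        print(f"q_1 = {eps:g}: D_KL = {kl_bits(p, q):.3f} bits, "
              f"-p_1 * log2(q_1) = {-p[0] * math.log2(eps):.3f} bits")

Both printed quantities grow like $0.2 \log_2(1/q_1)$, which is the logarithmic blow-up described above.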

  • @Bill: That's right... I think strictly the KL divergence is undefined if $q_i = 0$ and $p_i > 0$, but that's the intuition for what happens as $q_i \rightarrow 0$: the encoding table is "unprepared" to represent the offending character without a large overhead. (2011-02-09)