
The KL-divergence is defined as:

$D_{KL} (p(x) \parallel q(x)) = \sum_x p(x) \log \frac{p(x)}{q(x)} $

If $A$ and $B$ are discrete variables, does it make sense to calculate $D_{KL}(p(A, B) \parallel p(A))$? Namely, the divergence between the joint distribution $p(A, B)$, and the marginal distribution $p(A)$. Or must they be of the same type (i.e. defined over the same and all variables)?

I tried, and I found this:

$ \begin{align} D_{KL}(p(A, B) \parallel p(A)) &= \sum_{a \in A} \sum_{b \in B} p(a, b) \log \frac {p(a, b)}{p(a)} \\ &= \sum_{a \in A} \sum_{b \in B} p(a, b) \log p(b \mid a) \\ &= -H\,(B \mid A) \end{align} $

Here, $H$ is the Shannon entropy.
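The double sum is easy to check numerically. Below is a small sketch (with a made-up joint distribution and natural logarithms) that computes $\sum_{a,b} p(a,b) \log \frac{p(a,b)}{p(a)}$ and the conditional entropy $H(B \mid A)$ side by side; the sum comes out as the negative of the conditional entropy, and in particular it is negative:

```python
import math

# Made-up joint distribution p(a, b) over A = {0, 1}, B = {0, 1}.
p_joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}

# Marginal p(a) = sum_b p(a, b).
p_a = {}
for (a, b), pab in p_joint.items():
    p_a[a] = p_a.get(a, 0.0) + pab

# The double sum: sum_{a,b} p(a,b) * log(p(a,b) / p(a)).
d = sum(pab * math.log(pab / p_a[a]) for (a, b), pab in p_joint.items())

# Conditional entropy H(B|A) = -sum_{a,b} p(a,b) * log(p(b|a)),
# using p(b|a) = p(a,b) / p(a).
h_b_given_a = -sum(pab * math.log(pab / p_a[a])
                   for (a, b), pab in p_joint.items())

print(d, h_b_given_a)  # d is negative, and d == -H(B|A)
```

Note that a genuine KL divergence is always non-negative, so the fact that this quantity can be negative is already a hint that something is off with treating $p(A)$ as the second argument.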

Now, if needed, we could make the second distribution the same type by defining $D_{KL}(p(A, B) \parallel q(A, B))$, where we simply set $q(A, B) = \Pr(A)$. Would it then be okay to calculate $D_{KL}(p(A, B) \parallel q(A, B))$?

In the book "Elements of Information Theory", by Cover and Thomas, it says that $D_{KL}(p(x) \parallel q(x)) = \infty $ if the distribution $q$ doesn't define a probability value for every symbol that $p$ defines.

1 Answer


It seems to me that you have already answered your own question. Namely, $D_{KL}(p(A, B) \parallel p(A)) = -H\,(B \mid A)$.


Update: the above is not right. The definition of $D_{KL}$ requires two valid probability functions defined on the same space. It's true that we could regard $p(A)$ as a function of two variables that happens to be constant in the second variable, as you wrote: $q(A,B) = P(A)$. But then the sum over the two variables would not (in general) equal one, hence $P(A)$ would not be a valid joint probability function.

Hence, no, the conditional entropy cannot be written as a $KL$-divergence.


In the book "Elements of Information Theory", by Cover and Thomas, it says that $D_{KL}(p(x) \parallel q(x)) = \infty $ if the distribution $q$ doesn't define a probability value for every symbol that $p$ defines.

That's true, but it's inconsequential here. It means that $D_{KL}(p(x) \parallel q(x)) = \infty $ if $q(x)=0$ for some value $x$ ("symbol") such that $p(x)>0$. But that's not your case, because for any pair $(a,b)$ with $p(a,b)>0$ you have $q(a,b) \triangleq p(a) = \sum_{b'} p(a,b') \ge p(a,b) > 0$. So, no problem.
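The support condition can also be checked mechanically. A small sketch (same style of made-up distribution) confirming that $q(a,b) = p(a)$ is positive wherever $p(a,b)$ is positive, so the log ratio never divides by zero:

```python
# Made-up joint distribution p(a, b) over A = {0, 1}, B = {0, 1}.
p_joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}

# Marginal p(a) = sum_b p(a, b).
p_a = {}
for (a, b), pab in p_joint.items():
    p_a[a] = p_a.get(a, 0.0) + pab

# Wherever p(a, b) > 0, check that q(a, b) := p(a) > 0 as well.
ok = all(p_a[a] > 0 for (a, b), pab in p_joint.items() if pab > 0)
print(ok)  # True
```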

  • Neat! Thank you. $D_{KL}$ seems to appear everywhere, so I'm trying to get a good grasp of it, hence the question. What does the little triangle above the equal sign in your last equation mean? (2017-02-07)
  • The triangle above the $=$ sign means a definition. (2017-02-11)
  • As a side note, the sum over all values of $A$ and $B$ of $q(a,b)=p(a)$ adds up to $\sum_{b\in B}1$, i.e. the number of values $B$ can take. Thus it will be a valid joint probability (summing to $1$) only if $B$ takes only one value, which is a degenerate case in which even the joint probability doesn't make much sense. (2017-04-06)