11

The Kullback-Leibler divergence is defined as $$K(f:g) = \int \left(\log \frac{f(x)}{g(x)} \right) \ dF(x)$$

It measures the distance between two distributions $f$ and $g$. Why would this be better than the Euclidean distance in some situations?
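For concreteness, here is a minimal numerical sketch (added for illustration, not part of the original question) of evaluating $K(f:g)$ by direct integration for two Gaussian densities, checked against the known closed form for Gaussians. The parameter values and the use of NumPy/SciPy are assumptions made only for this example.

```python
# Minimal sketch: evaluate K(f:g) = \int f(x) log(f(x)/g(x)) dx numerically
# for two Gaussian densities, and compare with the Gaussian closed form.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu_f, sd_f = 0.0, 1.0   # parameters of f (chosen for illustration)
mu_g, sd_g = 1.0, 2.0   # parameters of g (chosen for illustration)

def integrand(x):
    fx = norm.pdf(x, mu_f, sd_f)
    gx = norm.pdf(x, mu_g, sd_g)
    return fx * np.log(fx / gx)

kl_numeric, _ = quad(integrand, -20, 20)

# Closed form for univariate Gaussians:
# K(f:g) = log(sd_g/sd_f) + (sd_f^2 + (mu_f - mu_g)^2)/(2 sd_g^2) - 1/2
kl_exact = (np.log(sd_g / sd_f)
            + (sd_f**2 + (mu_f - mu_g)**2) / (2 * sd_g**2)
            - 0.5)

print(kl_numeric, kl_exact)  # the two values should agree closely
```

Note the asymmetry: swapping $f$ and $g$ generally gives a different value, which is one way the KL divergence differs from a Euclidean-style distance.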

  • 1
    Because $K(f\mid g)$ measures the ratio between the (un)likelihood that a $g$ sample is like an $f$ sample, and its typical likelihood as a $g$ sample. (2011-12-11)
  • 4
    There is an interpretation in terms of information theory, see http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Motivation. (2011-12-11)

2 Answers

16

The short answer is that KL divergence has a probabilistic/statistical meaning (many of them, in fact) while Euclidean distance does not. For example, a given difference $f(x)-g(x)$ has a completely different meaning depending on the absolute sizes of $f(x)$ and $g(x)$.

The WP page on the subject is a must read, naturally. Let me explain only one interpretation of KL divergence. Assume a random i.i.d. sample $\mathfrak X=(x_k)_{1\leqslant k\leqslant n}$ follows the distribution $f$ and a random i.i.d. sample $\mathfrak Y=(y_k)_{1\leqslant k\leqslant n}$ follows the distribution $g$. A way to distinguish $\mathfrak X$ from $\mathfrak Y$ is to ask for the likelihood that $\mathfrak Y$ behaves like $\mathfrak X$, that is, that $\mathfrak Y$ behaves like a typical sample from $f$.

More precisely, one wants to estimate how unlikely $\mathfrak Y$ becomes when one asks that $\mathfrak Y$ behaves like an $f$ sample, compared to its ordinary likelihood as a $g$ sample.

The computation is rather simple and based on the following. Assume that $N(x,x+\mathrm dx)$ values from the sample fall in each interval $(x,x+\mathrm dx)$. The probability under $g$ of observing these occupation numbers is the multinomial expression $$ \frac{n!}{\prod N(x,x+\mathrm dx)!}\prod \left(g(x)\mathrm dx\right)^{N(x,x+\mathrm dx)}, $$ where the combinatorial factor counts the orderings of the sample that produce the same occupation numbers. By Stirling's formula, this scales like $$ \exp\left(\sum N(x,x+\mathrm dx)\log g(x)-\sum N(x,x+\mathrm dx)\log \frac{N(x,x+\mathrm dx)}{n\mathrm dx}\right). $$

For a typical $f$ sample, $N(x,x+\mathrm dx)\approx nf(x)\mathrm dx$ when $n\to\infty$, for every $x$, hence the likelihood of $\mathfrak Y$ masquerading as an $f$ sample scales like $$ \ell_n(f\mid g)\approx\exp\left(n\int f(x)\log g(x)\mathrm dx-n\int f(x)\log f(x)\mathrm dx\right). $$ On the other hand, for a typical $g$ sample, $N(x,x+\mathrm dx)\approx ng(x)\mathrm dx$ when $n\to\infty$, for every $x$, hence the likelihood of $\mathfrak Y$ behaving like a typical $g$ sample scales like $$ \ell_n(g\mid g)\approx\exp\left(n\int g(x)\log g(x)\mathrm dx-n\int g(x)\log g(x)\mathrm dx\right)=\mathrm e^{o(n)}. $$

Thus $\ell_n(f\mid g)\ll\ell_n(g\mid g)$, as was to be expected, and the ratio $\dfrac{\ell_n(f\mid g)}{\ell_n(g\mid g)}$ decreases exponentially fast when $n\to\infty$, approximately like $\mathrm e^{-nH}$, where $$ H=\int f(x)\log f(x)\mathrm dx-\int f(x)\log g(x)\mathrm dx=K(f\mid g). $$
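For readers who want to see this exponential rate numerically, here is a minimal Monte Carlo sketch (not part of the original answer). It uses the closely related law-of-large-numbers identity $\frac1n\sum_i\log\frac{f(X_i)}{g(X_i)}\to K(f\mid g)$ for $X_i$ i.i.d. from $f$, rather than the occupation-number argument above; the Gaussian densities and sample sizes are assumptions chosen only for illustration.

```python
# Minimal sketch: the per-observation log-likelihood ratio of an f-sample
# under f versus under g converges to K(f | g), i.e. the exponential rate
# e^{-nK} discussed above.  Gaussian f and g are assumed for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu_f, sd_f = 0.0, 1.0
mu_g, sd_g = 1.0, 2.0

# exact K(f | g) for two univariate Gaussians
kl_exact = (np.log(sd_g / sd_f)
            + (sd_f**2 + (mu_f - mu_g)**2) / (2 * sd_g**2)
            - 0.5)

for n in (100, 10_000, 1_000_000):
    x = rng.normal(mu_f, sd_f, size=n)   # i.i.d. sample from f
    rate = np.mean(norm.logpdf(x, mu_f, sd_f) - norm.logpdf(x, mu_g, sd_g))
    print(n, rate)                       # should approach kl_exact as n grows

print("exact K(f|g):", kl_exact)
```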

  • 0
    Very nice explanation. Thank you very much. (2011-12-12)
  • 0
    What do you mean by $N(x,dx)$ here? (2015-07-21)
  • 0
    @dimebucker91 The thing is defined in the post. What is not clear about the definition? (2015-07-21)
  • 0
    @Did So you are defining N as the number of values within that interval? I just thought that N represented some known quantity. (2015-07-21)
  • 0
    @Did I guess I was just confused because you use the term $N(x,x+dx)$, is this a typo? Also, I'm confused when you say "the likelihood scales like...": how do you get that expression from the usual way of thinking about a likelihood, as the product of the $g(x_i)$'s? (2015-07-21)
  • 0
    @dimebucker91 Were you in fact alluding to a typo? Anyway, I changed $N(x,\mathrm dx)$ into $N(x,x+\mathrm dx)$. (2015-07-21)
  • 0
    @Did Could you elaborate on the statement "the likelihood scales like..."? How do you get that from the usual definition of the likelihood? (2015-07-22)
  • 0
    @dimebucker91 This approximates the continuous distributions with densities $f$ and $g$ by discrete distributions putting mass $f(x)dx$ and $g(x)dx$ on some regularly spaced values of $x$, each at distance $dx$ from the others. The usual likelihood for distributions $(p_i)$ and $(q_i)$ and numbers $(N_i)$ would be $$\prod_iq_i^{N_i}$$ By the law of large numbers, $N_i\approx np_i$, and we chose $p_i=f(x)dx$ and $q_i=g(x)dx$ at the $i$th point $x$, thus the "true" likelihood is approximately $$\prod_x(g(x)dx)^{nf(x)dx}\propto\prod_xg(x)^{nf(x)dx}$$ and you should see why the rest follows. Do you? (2015-07-22)
  • 0
    @Did This definitely makes more sense now, thank you. Just one last question about how you calculated the rate of decrease for the ratio of likelihoods: could you point me in the direction of, or link me to, a simple example of this kind of calculation? (2015-07-23)
  • 0
    @dimebucker91 Simply by forming the ratio of the approximate formulas for $\ell_n(f\mid g)$ and $\ell_n(g\mid g)$ given above. (2015-07-23)
  • 0
    @Did So is it due to the absolute continuity that we may say that the ratio of the likelihoods, $\exp (n \int ( f(x) \log g(x) - g(x) \log g(x) ) dx)$, is approximately equal to $\exp (-n \int ( f(x) \log f(x) - f(x) \log g(x) ) dx)$? (2015-07-23)
  • 0
    @dimebucker91 Ahh, finally... *10 comments* to arrive at the question of whether $(f,g)$ should not be $(g,f)$, in the end? Listen, I think I will let you decide the case since, between my post and the WP page, you have at hand everything you need to make up your mind. But do not forget to tell me your conclusion. Done deal? (2015-07-23)
  • 0
    @Did I guess I am missing something obvious? Your answer implies that $H(f,g) - H(g) \approx H(f,g) - H(f) \implies H(f) \approx H(g)$. (2015-07-24)
  • 0
    @dimebucker91 No, my answer does not imply that. Did you read my last comment? It could help. (2015-07-24)
5

Kullback-Leibler divergence can be regarded as better in the following sense:

For two probability measures $P$ and $Q$, Pinsker's inequality states that $$ |P-Q|\le [2 KL(P\|Q)]^{\frac{1}{2}},$$ where the l.h.s. is the total variation distance (which corresponds to the $\ell_1$-norm). So convergence in the KL-divergence sense is stronger than convergence in total variation. The motivation comes from information theory, as Jeff pointed out.
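As a quick illustration, here is a small numerical check of Pinsker's inequality for discrete distributions (added for this write-up, not part of the answer); the random distributions on 10 points are an assumed setup for the demo.

```python
# Small sketch: for discrete p and q, the l1 distance should never exceed
# sqrt(2 * KL(p || q)), as Pinsker's inequality asserts.
import numpy as np

rng = np.random.default_rng(0)

for _ in range(5):
    # random strictly positive distributions on 10 points (illustrative setup)
    p = rng.random(10); p /= p.sum()
    q = rng.random(10); q /= q.sum()

    l1 = np.abs(p - q).sum()          # l1 / total variation side
    kl = np.sum(p * np.log(p / q))    # KL(p || q)

    print(f"l1 = {l1:.4f}  <=  sqrt(2 KL) = {np.sqrt(2 * kl):.4f}")
```

The inequality only goes one way: KL divergence controls the $\ell_1$ distance, but not conversely (for distributions with disjoint supports the $\ell_1$ distance stays bounded by 2 while the KL divergence is infinite).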