
Problem statement:

The entropy of a probability vector $ p = (p_1, ... , p_n)^T $ is defined as $ H(p)= - \sum\limits_{i=1}^{n} p_i \log{p_i} $, subject to $ \sum\limits_{i=1}^{n} p_i = 1 \mbox{ and } p_i \geq 0$, where $ 0\log{0} = 0 $ by convention.

What are the largest and smallest possible entropies of a probability vector?

Discussion

I can maximise the expression using Lagrange multipliers without a problem. It's equally easy to prove that the smallest entropy of any probability vector is 0, by noting that the entropy must be non-negative and providing an example with entropy 0.

However, I would very much like to prove that the minimum value is 0 using Lagrange multipliers (the fact that I can't seems to me to suggest that there is something that I haven't understood).

I set up the Lagrangian: $ L(p, \lambda) = - \sum p_i\log{p_i} + \lambda(\sum p_i - 1) $ and by differentiating to find stationary points obtain the system: $ {\frac{\partial{L}}{\partial{p_i}}}= -\log{p_i} - 1 + \lambda = 0 $. The obvious solution to this gives me my maximum point, but I am expecting another solution for the minimum.
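(For reference, the stationary point I am referring to: the condition forces every $p_i$ to equal the same constant $e^{\lambda - 1}$, and the constraint $\sum_i p_i = 1$ then gives $p_i = 1/n$, the uniform distribution, with maximum entropy $H = \log n$.)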

  • Am I misusing Lagrange multipliers?
  • Is there another solution to my system that I am missing (maybe involving the "convention" about $ 0\log{0} = 0$)?
  • Do I need to worry about slackness and the condition $p_i \geq 0$ to find the minimum?
  • Is finding this minimum beyond the scope of this method for some reason, and are my lecturers in fact expecting me to resort to the "easy" method described above?

Any hints or answers are welcome; thank you in advance.

  • There are two separate issues here. One is just the fact that the function is not differentiable on the whole domain. The second is that the inequality constraints $p_i \geq 0$ need to be incorporated; handling the latter involves a generalization using the KKT conditions. Both are issues here.

2 Answers

2

I think the standard convention requires the constraints to be written as $f_j(p) \leq 0$ for a minimization; the multipliers can then be taken to be nonnegative, so the Lagrangian is always below the cost function.

Minimize $H(p)$

subject to

$\sum_j p_j = 1$; let $\lambda_0$ be the associated multiplier,

$p_j \geq 0$, i.e. $-p_j \leq 0$; let $\lambda_j \geq 0$ be the associated multiplier.
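Written out explicitly for reference, the Lagrangian with this sign convention is

$$ L(p, \lambda_0, \lambda) = -\sum_j p_j \log p_j + \lambda_0\Big(\sum_j p_j - 1\Big) - \sum_j \lambda_j p_j, \qquad \lambda_j \geq 0. $$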

Now to find the optimum candidates:

$\frac{\partial L}{\partial p_j} = -\log p_j - 1 + \lambda_0 - \lambda_j = 0$
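(Solving this stationarity condition for $p_j$ gives $p_j = e^{\lambda_0 - \lambda_j - 1}$, which is used below.)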

and the remaining KKT conditions:

$\sum_j p_j = 1$

$\sum_j p_j \lambda_j = 0$, with $\lambda_j \geq 0$

Note from this last equation that, since both $\lambda_j$ and $p_j$ are nonnegative, the sum being zero means every term is zero, i.e., for each $j$ either $\lambda_j$ or $p_j$ equals zero. This and 'the convention' would let you solve for the multipliers, but we'll skip to a faster route:

  1. Look for the indices $i$ with $\lambda_i = 0$.

This implies $p_i = \exp(\lambda_0 - 1)$, which does not depend on $i$, so all such $p_i$ share one constant value.

Let $p_i = \exp(\lambda_0 - 1) \triangleq 1/k$, where $k$ is the number of non-zero mass elements.
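To see where the $1/k$ comes from, sum the equality constraint over the $k$ non-zero entries:

$$ \sum_{i:\, p_i > 0} p_i = k\, e^{\lambda_0 - 1} = 1 \quad\Longrightarrow\quad e^{\lambda_0 - 1} = \frac{1}{k}. $$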

  2. The original problem now becomes:

Minimize $-\sum_{i=1}^{k} \frac{1}{k} \log \frac{1}{k} = \log k$

Subject to

$k \in \mathbb{Z}, \; k \geq 1$

And clearly the minimum is achieved at $k = 1$, which is the least entropic (degenerate) distribution, with entropy $\log 1 = 0$.

Note: to be thorough, you should also consider the other, non-zero multipliers, then solve for $\lambda_0$, and then for the remaining multipliers, which have to be equal... but this is enough for the question, I think.
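Not needed for the argument, but here is a small numerical sanity check of the reduced problem, assuming NumPy is available (and picking an arbitrary small $n$, say $n = 5$): it evaluates the entropy of the candidate distributions with $k$ equal non-zero masses and confirms that it equals $\log k$, smallest at $k = 1$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy with the 0 * log(0) = 0 convention."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]          # drop zero entries, per the convention
    return -np.sum(nz * np.log(nz))

n = 5
for k in range(1, n + 1):
    # candidate: k equal non-zero masses of 1/k, padded with n - k zeros
    p = np.concatenate([np.full(k, 1.0 / k), np.zeros(n - k)])
    print(k, entropy(p), np.log(k))  # the two values agree; minimal at k = 1
```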

1

(the fact that I can't seems to me to suggest that there is something that I haven't understood).

That something is more basic than Lagrange multipliers: the extrema of a function do not necessarily occur at critical points (i.e. points where its derivative is zero). You also have to check points where the function is not differentiable, and/or the boundary of the domain.

For example, suppose you must find the extrema of the function $f(x) = (x-2)^3 - 3x + 6$ on the domain $x \geq 0$. Setting the derivative to zero, you get the critical points $x_1 = 1$ (a local maximum) and $x_2 = 3$ (a local minimum). But you would miss the minimum that occurs at the boundary point $x_0 = 0$, which you must check separately.
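Checking the numbers: $f'(x) = 3(x-2)^2 - 3$, which vanishes at $x = 1$ and $x = 3$, and

$$ f(1) = 2, \qquad f(3) = -2, \qquad f(0) = -2, $$

so the boundary point $x_0 = 0$ attains the minimum value as well, yet it is invisible to the derivative test.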

The same applies to Lagrange multipliers, which likewise only find extrema that correspond to critical points (along the constraint set).
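To connect this back to the entropy problem: the minimizers are the degenerate vectors, which have some $p_i = 0$. There $\partial H / \partial p_i = -\log p_i - 1$ blows up as $p_i \to 0^+$, so the Lagrangian from the question has no stationary point at the minimizer; the minimum sits where the constraints $p_i \geq 0$ are active, which is why a KKT-style treatment (as in the other answer) is needed.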

  • One minor comment: the word boundary is overloaded. The feasible set here is $\Sigma = \{p \in \mathbb{R}^n \mid p_i \in [0,1],\ \sum_{i=1}^{n} p_i = 1\}$, and as a subset of $\mathbb{R}^n$ the entire set is its own boundary, i.e., $\partial \Sigma = \Sigma$. A perhaps more relevant concept is the relative boundary (see, e.g., Rockafellar, "Convex Analysis", or Boyd and Vandenberghe, "Convex Optimization"). In this case, the relative boundary of $\Sigma$ consists of the points of $\Sigma$ with at least one zero coordinate, i.e., its intersection with the coordinate hyperplanes $\{p_i = 0\}$.