
Exercise 2.27 in Elements of Information Theory (2nd ed.) reads:

Let $\mathbf{p} = (p_{1}, p_{2}, \ldots, p_{m})$ be a probability distribution on $m$ elements (i.e., $p_{i} \geq 0$ and $\sum_{i=1}^{m} p_{i} = 1$). Define a new distribution $\mathbf{q}$ on $m - 1$ elements as $q_{1} = p_{1}, q_{2} = p_{2}, \ldots, q_{m-2} = p_{m-2}$, and $q_{m-1} = p_{m-1} + p_{m}$ [i.e., the distribution $\mathbf{q}$ is the same as $\mathbf{p}$ on $\{1, 2, \ldots, m - 2\}$, and the probability of the last element in $\mathbf{q}$ is the sum of the last two probabilities of $\mathbf{p}$]. Show that $$ H(\mathbf{p}) = H(\mathbf{q}) + (p_{m-1} + p_{m})H\left( \frac{p_{m-1}}{p_{m-1} + p_{m}}, \frac{p_{m}}{p_{m-1} + p_{m}} \right). $$

Question 1. What does $H\left( \frac{p_{m-1}}{p_{m-1} + p_{m}}, \frac{p_{m}}{p_{m-1} + p_{m}} \right)$ mean in this context?

Question 2. Where is this notation introduced in the text?

Please do not provide a solution to the exercise.

  • 0
    It is the entropy of a hypothetical random variable $X$ with support containing $2$ points ($\{ a,b \}$ say). Furthermore, $Pr(X=a)$ is $\frac{p_{m-1}}{p_{m-1}+p_m}$, and $Pr(X=b)$ is the second term.2011-09-04
  • 0
    I scanned through the book on Google Books and there appears to be a symbol index in the back. Did you look there?2011-09-04
  • 1
    This is just to confirm Srivatsan's interpretation. You can write this as $H(\mathbf{p}) = H(\mathbf{q})+(p_{m-1}+p_m)H(\mathbf{r})$, where $\mathbf{r} = \left( \frac{p_{m-1}}{p_{m-1} + p_{m}}, \frac{p_{m}}{p_{m-1} + p_{m}}\right)$. Note that $\mathbf{p,q,r}$ are all vectors satisfying the hypothesis for the definition of $H$.2011-09-04
  • 0
    Further, near Example 4.2.1, the book also expresses the entropy rate of a 2-state Markov chain as $H(\mathcal{X})=[\beta H(\alpha) + \alpha H(\beta)]/(\alpha + \beta)$. Apparently, $H(\alpha) \equiv H(\alpha, 1-\alpha)$.2016-07-08

1 Answer


To define the entropy $H$, all you need is a probability distribution; that is, a vector of nonnegative real numbers summing to $1$. In this case, it happens to be the vector $\mathbf r = (a,b)$ with $$ a := \frac{p_{m-1}}{p_{m-1}+p_m}, b := \frac{p_m}{p_{m-1} + p_m}. $$

Note that before defining $H(\mathbf r) = H(a,b)$, you must still verify that $\mathbf r = (a,b)$ is a well-defined probability distribution. That is, you should check that: $a \geq 0, b \geq 0$, and $a+b = 1$. (Easy exercise!)

Added: The OP remarked that the book defines $H(X)$ for a random variable $X$, but not $H(\mathbf p)$ for a distribution vector $\mathbf p$. So, as per @Didier's comment below, I will say explicitly what this notation means. (As it turns out, this exercise is where the book first uses $H(\mathbf p)$; see the comments. I will keep the explicit definition in my answer anyway.)

For a probability distribution $\mathbf p$, we can define $H(\mathbf p)$ by the familiar Shannon formula: $$ H(\mathbf p) := \sum_{i=1}^{n} p_i \log_2 \left(\frac{1}{p_i}\right). $$ We can also relate this definition to the entropy of a random variable. Imagine a random variable $X$ with support $\{ 1,2, \ldots, n \}$ such that $\Pr[X = i] = p_i$. Then directly from the two definitions, we see that $H(X)$ is equal to $H(\mathbf p)$.
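The Shannon formula above translates directly into code. The sketch below (the function name `H` and the example vector are my own choices, not from the book) computes $H(\mathbf p)$ for a probability vector and evaluates it on the two-point distribution $\mathbf r = (a, b)$ from the answer:

```python
import math

def H(p):
    """Shannon entropy (in bits) of a probability vector p.

    Terms with p_i = 0 contribute 0, by the usual convention
    0 * log2(1/0) = 0.
    """
    assert all(x >= 0 for x in p) and abs(sum(p) - 1) < 1e-12
    return sum(x * math.log2(1 / x) for x in p if x > 0)

# An example distribution p on m = 4 elements:
p = [0.5, 0.25, 0.125, 0.125]

# The two-point distribution r from the answer:
a = p[-2] / (p[-2] + p[-1])
b = p[-1] / (p[-2] + p[-1])

print(H(p))       # H(p) = 0.5*1 + 0.25*2 + 2*(0.125*3) = 1.75 bits
print(H([a, b]))  # here a = b = 1/2, so H(a, b) = 1 bit
```

Note that `a + b == 1` holds automatically, since the two terms share the denominator $p_{m-1} + p_m$; this is exactly the "easy exercise" mentioned above.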

  • 0
    Of course, in the above explanation, I implicitly assume probability distributions have finite support.2011-09-04
  • 0
    You're right, thanks. And to answer Question 2: the notation is introduced in this very problem!2011-09-04
  • 0
    @Quinn Rather, wherever $H$ was first defined :-). Also, an unrelated point, if you wondering why the $\mathbf r$ was defined that way, the vector $(\frac{x}{x+y}, \frac{y}{x+y})$ gives you a probability distribution with individual probabilities proportional to $x$ and $y$ (irrespective of the value of $x+y$).2011-09-04
  • 0
    Only $H(X)$ was defined prior to this problem, where $X$ is a random variable. This is the first time the notation $H(\mathbf{p})$ is used, where $\mathbf{p}$ is a vector with non-negative entries that sum to 1.2011-09-04
  • 0
    Hence your Question 1 is really: Let $\mathbf p$ denote a probability distribution on a finite set; what does $H(\mathbf p)$ mean?2011-09-05
  • 0
    @Didier, Thanks for your comment. It's kind of strange that the book does not (apparently) define $H(\mathbf p)$. I'll add a note about that in my answer.2011-09-05
  • 0
    @Didier Actually what happened is that I understood the definition of $H(\mathbf{p})$ just fine and even (correctly) thought that $H(\alpha, 1-\alpha)$ (where $\alpha \in [0,1]$) should mean $\alpha \log \alpha + (1-\alpha)\log (1-\alpha)$. However, I made an arithmetical error and concluded that it must mean something else. I also didn't notice that $H(\alpha, 1-\alpha)$ was a special case of $H(\mathbf{p})$.2011-09-05
  • 0
    Hence a second edit seems necessary, to avoid wrongly blaming the authors of the book.2011-09-05
  • 0
    @Didier Thanks, done.2011-09-06
  • 0
    @Quinn There is in fact an error in your final comment as well. The definition of $H(\alpha, 1-\alpha)$ is missing a negative sign; the correct formula is $- \alpha \log \alpha - (1-\alpha) \log (1-\alpha)$.2011-09-06
  • 0
    @Srivatsan You're right, thanks.2011-09-06