
Theorem 7 in Shannon's seminal paper A Mathematical Theory of Communication states:

"The output of a finite state transducer driven by a finite state statistical source is a finite state statistical source, with entropy (per unit time) less than or equal to that of the input. If the transducer is non-singular they are equal."

Basically, it says that a transducer cannot increase the entropy of the source. The paper includes a proof, which I can't make much sense of. I would like a justification (even if only by intuition) of why this is true.

Here is the included proof:

Let $\alpha$ represent the state of the source, which produces a sequence of symbols $x_i$; and let $\beta$ be the state of the transducer, which produces, in its output, blocks of symbols $y_j$. The combined system can be represented by the "product state space" of pairs $(\alpha, \beta)$. Two points in the space $(\alpha_1, \beta_1)$ and $(\alpha_2, \beta_2)$, are connected by a line if $\alpha_1$ can produce an $x$ which changes $\beta_1$ to $\beta_2$, and this line is given the probability of that $x$ in this case. The line is labeled with the block of $y_j$ symbols produced by the transducer. The entropy of the output can be calculated as the weighted sum over the states. If we sum first on $\beta$ each resulting term is less than or equal to the corresponding term for $\alpha$, hence the entropy is not increased. If the transducer is non-singular, let its output be connected to the inverse transducer. If $H'_1$, $H'_2$ and $H'_3$ are the output entropies of the source, the first and second transducers respectively, then $H'_1 \geq H'_2 \geq H'_3 = H'_1$ and therefore $H'_1 = H'_2$.
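
For concreteness, here is a minimal numerical sketch (in Python) of the simplest case I can think of: a memoryless source and a symbol-by-symbol deterministic transducer, with made-up symbols and probabilities, so Shannon's finite-state memory is ignored entirely. Merging symbols (a singular transducer) lowers the entropy, while a mere relabelling (non-singular) leaves it unchanged, but I'd like to understand why this holds in the general finite-state case.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A made-up source distribution over four symbols.
p_x = np.array([0.4, 0.3, 0.2, 0.1])

# A singular (many-to-one) transducer: symbols 2 and 3 collapse to the same output.
singular = {0: 'a', 1: 'b', 2: 'c', 3: 'c'}
# A non-singular (invertible) transducer: a pure relabelling of the symbols.
invertible = {0: 'a', 1: 'b', 2: 'c', 3: 'd'}

def output_distribution(p_x, f):
    """Push the source distribution through the deterministic map f."""
    out = {}
    for symbol, prob in enumerate(p_x):
        out[f[symbol]] = out.get(f[symbol], 0.0) + prob
    return np.array(list(out.values()))

print("H(X)                =", entropy(p_x))
print("H(f(X)), singular   =", entropy(output_distribution(p_x, singular)))    # strictly smaller
print("H(f(X)), invertible =", entropy(output_distribution(p_x, invertible)))  # equal to H(X)
```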

  • I edited the question based on the paper. The complete paper can be found here: http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

1 Answer


This is known as the data processing inequality. (Note that Shannon's $H'_i$ are entropies per unit time, i.e. entropy rates; the argument below is phrased in terms of ordinary and conditional entropies of single variables, but the idea is the same.)

First note that given $Y = f(X)$, the behaviour of $f$ is entirely described by $P(Y|X)$.
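
As a minimal illustration (the map $f$ and the alphabet sizes below are made up), a deterministic $f$ corresponds to a conditional distribution $P(Y|X)$ whose rows are point masses:

```python
import numpy as np

# Hypothetical deterministic map: X takes values {0, 1, 2}, Y = f(X) takes values {0, 1},
# with f(0) = 0, f(1) = 1, f(2) = 1.
f = [0, 1, 1]

# P(Y|X) as a row-stochastic matrix: rows indexed by x, columns by y.
# Because f is deterministic, each row puts all of its mass on a single column.
P_y_given_x = np.zeros((len(f), max(f) + 1))
for x, y in enumerate(f):
    P_y_given_x[x, y] = 1.0

print(P_y_given_x)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]]
```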

Now an explicit statement of the theorem is: given a Markov chain $X \to Y \to Z$ with factorization $P(X,Y,Z) = P(X)P(Y|X)P(Z|Y)$, we have $H(X|Y) \le H(X|Z)$.

That is, the amount of uncertainty that remains about $X$ after learning $Y$ is never greater than the amount of uncertainty that remains about $X$ after learning $Z$.

This follows from the conditional independence of $X$ and $Z$ given $Y$, which you can see by dividing the factorization by $P(Y)$:

$P(X,Z|Y) = \frac{P(X)P(Y|X)P(Z|Y)}{P(Y)} = \frac{P(X,Y)P(Z|Y)}{P(Y)} = P(X|Y)P(Z|Y)$

Of course $P(X,Z|Y) = P(X|Y)P(Z|Y)$ is exactly the definition of conditional independence, which might be better expressed as a property of the mutual information: in our case $I(X;Z|Y) = 0$.
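
Here is a quick numerical check of this step, using an arbitrary, randomly generated Markov chain $X \to Y \to Z$ (the alphabet sizes and kernels are just placeholders); it verifies that the factorization forces $P(X,Z|Y) = P(X|Y)P(Z|Y)$ for every value of $Y$:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_stochastic(rows, cols):
    """A random row-stochastic matrix (each row is a conditional distribution)."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

# An arbitrary Markov chain X -> Y -> Z with 3, 4 and 3 states respectively.
P_x = rng.random(3)
P_x /= P_x.sum()
P_y_given_x = random_stochastic(3, 4)
P_z_given_y = random_stochastic(4, 3)

# Joint distribution from the factorization P(x, y, z) = P(x) P(y|x) P(z|y).
P_xyz = P_x[:, None, None] * P_y_given_x[:, :, None] * P_z_given_y[None, :, :]

# Check P(x, z | y) = P(x | y) P(z | y) for every value of y.
P_y = P_xyz.sum(axis=(0, 2))
for y in range(4):
    P_xz_given_y = P_xyz[:, y, :] / P_y[y]
    marg_x = P_xz_given_y.sum(axis=1)   # P(x | y)
    marg_z = P_xz_given_y.sum(axis=0)   # P(z | y)
    assert np.allclose(P_xz_given_y, np.outer(marg_x, marg_z))

print("P(X,Z|Y) = P(X|Y) P(Z|Y) holds for every y, i.e. I(X;Z|Y) = 0")
```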

With the definition $I(X;Y) = H(X) - H(X|Y)$, we can rephrase the theorem $H(X|Y) \le H(X|Z)$ in terms of mutual information:

$I(X;Y) \ge I(X;Z)$

Using the chain rule for mutual information, $I(X;Y,Z)$ can be expanded in two different orders: $I(X;Y,Z) = I(X;Y) + I(X;Z|Y)$ and $I(X;Y,Z) = I(X;Z) + I(X;Y|Z)$. Equating the two expansions gives

$I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z)$

and because we have already established that $I(X;Z|Y) = 0$, this becomes

$I(X;Y) = I(X;Z) + I(X;Y|Z)$

Finally, $I(X;Y|Z) \ge 0$ (mutual information is always non-negative), and we find that

$I(X;Y) \ge I(X;Z)$

which is the original claim.
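
As a sanity check, here is a small Python sketch (again with an arbitrary, randomly generated Markov chain; only numpy is assumed) that evaluates all of the quantities above through joint entropies and confirms $I(X;Z|Y) \approx 0$, $I(X;Y) = I(X;Z) + I(X;Y|Z)$, $H(X|Y) \le H(X|Z)$ and $I(X;Y) \ge I(X;Z)$:

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic(rows, cols):
    """A random row-stochastic matrix (each row is a conditional distribution)."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

# Arbitrary Markov chain X -> Y -> Z, built exactly as in the earlier sketch.
P_x = rng.random(3)
P_x /= P_x.sum()
P_xyz = P_x[:, None, None] * stochastic(3, 4)[:, :, None] * stochastic(4, 3)[None, :, :]

def H(p):
    """Entropy in bits of a (possibly multi-dimensional) distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Pairwise and single-variable marginals.
P_xy, P_yz, P_xz = P_xyz.sum(axis=2), P_xyz.sum(axis=0), P_xyz.sum(axis=1)
P_X, P_Y, P_Z = P_xy.sum(axis=1), P_xy.sum(axis=0), P_xz.sum(axis=0)

# Mutual informations expressed through joint entropies.
I_xy         = H(P_X) + H(P_Y) - H(P_xy)
I_xz         = H(P_X) + H(P_Z) - H(P_xz)
I_xz_given_y = H(P_xy) + H(P_yz) - H(P_Y) - H(P_xyz)
I_xy_given_z = H(P_xz) + H(P_yz) - H(P_Z) - H(P_xyz)

print("I(X;Z|Y)          =", I_xz_given_y)          # ~0 by the Markov property
print("I(X;Y)            =", I_xy)
print("I(X;Z) + I(X;Y|Z) =", I_xz + I_xy_given_z)   # equals I(X;Y)
print("H(X|Y) <= H(X|Z)  :", H(P_xy) - H(P_Y) <= H(P_xz) - H(P_Z))
print("I(X;Y) >= I(X;Z)  :", I_xy >= I_xz)
```

If the kernel $P(Z|Y)$ is replaced by an invertible deterministic map (a permutation matrix), the same computation gives $I(X;Y|Z) = 0$ and hence $I(X;Y) = I(X;Z)$, which corresponds to the non-singular case in Shannon's theorem.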