The inherent unpredictability, or randomness, of a probability distribution can be measured by the extent to which it is possible to compress data drawn from that distribution.
$$ \text{more compressible} ≡ \text{less random} ≡ \text{more predictable}$$
Suppose there are $n$ possible outcomes, with probabilities $p_1, p_2, . . . , p_n$. If a sequence of $m$ values is drawn from the distribution, then the $i^{th}$ outcome will pop up roughly $mp_i$ times (if $m$ is large). For simplicity, assume these are exactly the observed frequencies, and moreover that the $p_i$’s are all powers of $2$ (that is, of the form $\frac{1}{2^k}$). It can be seen by induction that the number of bits needed to encode the sequence is: $$\sum_{i=1}^{n} mp_i \log\left(\frac{1}{p_i}\right)\tag{1}$$ Thus the average number of bits needed to encode a single draw from the distribution is: $$\sum_{i=1}^{n} p_i \log\left(\frac{1}{p_i}\right)\tag{2}$$ This is the entropy of the distribution, a measure of how much randomness it contains. $(3)$
How are the equations $(1)$ and $(2)$ derived (the excerpt mentions induction but does not provide further proof) and how does the transition from compressibility to entropy at $(3)$ follow? Note when encoding is mentioned the excerpt is refering to Huffman encoding.
