Shannon formally defined the amount of information in a message as a function of the probability of occurrence of each possible message [1]. Given a universe of messages $\mathbf{M} = \{ m_1, m_2, \dots, m_n \}$ and a probability $p(m_i)$ for the occurrence of each message, the expected information content of a message drawn from $\mathbf{M}$ is given by:
$-\sum\limits_{i=1}^{n} p(m_{i})\log_{2} p(m_{i})$
How is this formula derived? And why did Shannon use $\log_2$ in particular?
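For concreteness, here is a minimal Python sketch that evaluates the sum numerically (the function name and the example probabilities are my own illustrations, not from the paper or the text above):

```python
import math

def expected_information(probs):
    """Expected information content, in bits, of a message drawn from
    a universe whose messages occur with the given probabilities."""
    # Each message contributes -p(m_i) * log2(p(m_i)) to the sum;
    # zero-probability messages are skipped, since p*log2(p) -> 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two equally likely messages (a fair coin flip): exactly 1 bit per message.
print(expected_information([0.5, 0.5]))   # 1.0
# A heavily skewed universe carries less information per message.
print(expected_information([0.9, 0.1]))   # ~0.47
```

Note how the fair coin yields exactly 1 bit, the maximum for a two-message universe, while skewing the distribution shrinks the sum: the more predictable the source, the less information each message conveys on average.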
[1] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, 1948.