Suppose you are to build a communication system. Every second, a randomly chosen message is delivered to you at point A, and you are to make sure that, with high probability, it will be delivered at point B within a finite time. You know in advance which messages are possible and what their distribution is. (Successive messages are assumed to be independent.)
However, bandwidth (the number of bits you can send per second) is expensive, and you want to lease a channel with the minimal capacity that lets you deliver messages without building an ever-increasing backlog (again, with high probability).
If there are $n$ possible messages, you could meet your goal by buying enough bandwidth to send $\log_2(n)$ bits per second. But that could be wasteful -- say that 99% of the time the message is "no comments". Then you can encode that message as a single 0 bit, and send everything else as a 1 bit followed by a message number. That way you only have to buy bandwidth for about $1+\log_2(n-1)/100$ bits per second. This leaves you enough room to send a 0 each time nothing interesting happens. Once in a while, when something interesting does happen, you send a 1 plus additional bits, which takes about $\log_2(n)$ extra seconds and builds up a small backlog of messages, all of which are likely to be single 0's. But since you can send those 0's slightly faster than one per second, you can expect the backlog to clear, on average, before the next interesting thing happens.
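Here is a minimal sketch of that two-case code in Python; the message-set size $n = 1000$ and the function names are my own illustration, not part of the scheme beyond what's described above:

```python
import math

def encode(msg_id, n):
    """Message 0 ("no comments") -> '0'; message m > 0 -> '1' plus
    m-1 written in a fixed width of ceil(log2(n-1)) bits."""
    if msg_id == 0:
        return "0"
    width = math.ceil(math.log2(n - 1))
    return "1" + format(msg_id - 1, f"0{width}b")

def decode(bits, n):
    """Inverse of encode: the leading flag bit picks the case."""
    if bits[0] == "0":
        return 0
    return int(bits[1:], 2) + 1

n = 1000                               # illustrative message-set size
width = math.ceil(math.log2(n - 1))    # 10 bits for the message number
expected = 0.99 * 1 + 0.01 * (1 + width)
print(f"expected bits/message: {expected:.3f}")  # ~1.10 vs log2(1000) ~ 9.97

assert decode(encode(0, n), n) == 0
assert decode(encode(437, n), n) == 437
```

For $n = 1000$ this comes to about 1.1 bits per second of leased bandwidth, versus roughly 10 for the naive fixed-width code.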
(There are safety-margin refinements from queueing theory here that I won't go into.)
The moral of this example is that if you are designing a coding system and want to minimize the expected number of bits needed to send a message, you can "afford" to spend more bits on a rare message, because you don't have to do it very often.
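In symbols, with the example's numbers (the notation $\ell_i$ for the codeword lengths is mine):

$$\mathbb{E}[L] \;=\; \sum_i p_i\,\ell_i \;=\; 0.99 \cdot 1 \;+\; 0.01\,\bigl(1+\lceil\log_2(n-1)\rceil\bigr) \;\approx\; 1 + \frac{\log_2(n-1)}{100}.$$

The rare messages' long codewords enter the expectation weighted by their tiny probability, which is exactly why they are cheap.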
And it turns out that, in the limit as $N\to\infty$, the lowest possible expected number of bits needed to send $N$ independent, identically distributed messages (where the minimum is taken over all possible coding strategies) is $N$ times the Shannon entropy $H = -\sum_i p_i \log_2 p_i$ of the probability distribution.
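To make that concrete, here is a small sketch (my own illustration, not part of the original argument) that Huffman-codes blocks of $k$ messages from a two-message version of the example ($p = 0.99$ for "no comments") and watches the cost per message fall toward the entropy:

```python
import heapq
import math
from itertools import count, product

def huffman_lengths(probs):
    """Codeword lengths of an optimal prefix (Huffman) code for probs."""
    tie = count()                      # tie-breaker: heap never compares lists
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:              # every symbol under the merged node...
            lengths[i] += 1            # ...sits one bit deeper in the code tree
        heapq.heappush(heap, (p1 + p2, next(tie), s1 + s2))
    return lengths

p = [0.99, 0.01]                       # "no comments" vs. everything else
H = -sum(q * math.log2(q) for q in p)  # Shannon entropy, ~0.0808 bits/message
print(f"entropy: {H:.4f} bits/message")

for k in (1, 2, 4, 8):                 # Huffman-code blocks of k messages
    blocks = list(product(range(len(p)), repeat=k))
    probs = [math.prod(p[s] for s in b) for b in blocks]
    lens = huffman_lengths(probs)
    avg = sum(q * l for q, l in zip(probs, lens)) / k
    print(f"block size {k}: {avg:.4f} bits/message")
```

With $k = 1$ the best you can do is one full bit per message, but coding blocks jointly pushes the average down toward $H \approx 0.081$ bits; that is the sense in which the entropy is the limiting cost.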