Entropy is defined as $H=-\sum_i p_i\log p_i$ in the discrete case and as $H=-\int f(x)\log f(x)\,dx$ (where $f$ is the probability density function) in the continuous case. The base of the logarithm is a free choice, and changing it only rescales $H$ by a constant factor. Entropy represents the amount of uncertainty in a stochastic system or model, or the combinatorial complexity of a finite sample space, i.e. how hard the contained information should be to compress a priori, without any prior shared reference point.
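As a concrete illustration of the discrete formula (my own sketch, not part of the sources quoted below; the function name `entropy` and the `base` argument are choices of this example):

```python
import math

def entropy(p, base=2):
    """Shannon entropy H = -sum_i p_i log p_i of a discrete distribution.

    `p` is a sequence of probabilities summing to 1; terms with p_i = 0
    contribute nothing, by the convention 0 log 0 = 0. The `base` argument
    only rescales the result (bits for base 2, nats for base e).
    """
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

# A fair coin carries 1 bit of uncertainty; a certain outcome carries none.
print(entropy([0.5, 0.5]))  # 1.0
print(entropy([1.0, 0.0]))  # 0.0 (may print as -0.0)
```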
Claude Shannon's motivation was to formulate and quantify, in a statistical sense, the amount of information contained (stored, transmitted, increased or lost) within various forms of communication. He is widely credited as the father of information theory, which also has deep connections to thermodynamics, to data compression, where the asymptotic equipartition property plays a fundamental role, and even to Bayesian statistics, where the principle of maximum entropy becomes an axiom. The concept of entropy is in fact so fundamental that one can also speak of conditional entropy, joint entropy (just as in statistics), or relative entropy.
Chris Hillman states in his An Entropy Primer that Shannon's Noiseless Coding Theorem and a general version of the Equipartition Theorem, together, give Shannon's way of understanding what his entropy "means". But in What is Information, he warns:
There are also some non-relations which are insufficiently appreciated. For instance, the "entropy" of a density is analogous not to probabilistic entropy but rather to a related quantity called divergence.
Or as Mohri et al. (2008) put it, citing the same work:
The relative entropy, or Kullback-Leibler divergence, is one of the most commonly used measures of the discrepancy of two distributions p and q [Cover and Thomas, 1991].
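To make the distinction concrete, here is a minimal sketch (my own illustration, not taken from the cited works) of the relative entropy $D(p\|q)=\sum_i p_i\log(p_i/q_i)$ between two discrete distributions; the function name `kl_divergence` is an assumption of this example:

```python
import math

def kl_divergence(p, q, base=2):
    """Relative entropy D(p || q) = sum_i p_i log(p_i / q_i).

    `p` and `q` are discrete distributions over the same support
    (with q_i > 0 wherever p_i > 0). D is zero iff p == q and is
    not symmetric in its arguments.
    """
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # positive: the distributions differ
print(kl_divergence(p, p))  # 0.0: a distribution has zero divergence from itself
```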
And from Wikipedia:
Another idea, championed by Edwin T. Jaynes, is to use the principle of maximum entropy (MAXENT). The motivation is that the Shannon entropy of a probability distribution measures the amount of information contained in the distribution. The larger the entropy, the less information is provided by the distribution. Thus, by maximizing the entropy over a suitable set of probability distributions on X, one finds the distribution that is least informative in the sense that it contains the least amount of information consistent with the constraints that define the set. For example, the maximum entropy prior on a discrete space, given only that the probability is normalized to 1, is the prior that assigns equal probability to each state. And in the continuous case, the maximum entropy prior given that the density is normalized with mean zero and variance unity is the standard normal distribution. The principle of minimum cross-entropy generalizes MAXENT to the case of "updating" an arbitrary prior distribution with suitability constraints in the maximum-entropy sense.
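As a quick numerical illustration of those two maximum-entropy claims (my own sketch, redefining the `entropy` helper assumed above so the block is self-contained):

```python
import math
import random

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution, in bits by default."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

# Discrete case: on a 4-point space, no randomly drawn normalized
# distribution exceeds the uniform prior's entropy of log2(4) = 2 bits.
uniform = [0.25] * 4
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(4)]
    p = [wi / sum(w) for wi in w]
    assert entropy(p) <= entropy(uniform) + 1e-12

# Continuous case: the standard normal's differential entropy is
# (1/2) log(2*pi*e) nats, the maximum over densities with mean 0, variance 1.
print(0.5 * math.log(2 * math.pi * math.e))  # ~1.4189 nats
```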