
We know that the expectation operator is defined for a random variable $x$ as:

$$ \mathbb{E} \left\{x\right\} = \int_{-\infty}^{\infty} x \: p_x(x) \; \mathrm{d}x $$

where $p_x(x)$ is the PDF of the random variable $x$.

If there is an arbitrary (suitably well-behaved) function $f$ acting on the random variable $x$, then the expected value of this function can also be written as:

$$ \mathbb{E}\left\{f(x) \right\} = \int_{-\infty}^{\infty} f(x) \: p_x(x) \: \mathrm{d}x $$
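
For concreteness, if $x$ is uniform on $[0, 1]$, so that $p_x(x) = 1$ on $[0, 1]$ and $0$ elsewhere, these definitions give, for example:

$$ \mathbb{E}\left\{x\right\} = \int_0^1 x \,\mathrm{d}x = \frac{1}{2}, \qquad \mathbb{E}\left\{x^2\right\} = \int_0^1 x^2 \,\mathrm{d}x = \frac{1}{3} $$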

Some context: in many of the algorithms I study (statistical in nature), one often finds oneself taking the expected value of some quantity that is a function of the random variable $x$. In the reverse case, one can also find oneself poking around and manipulating the probability density function of $x$, and then 'taking it back' into an expression using the expectation operator.

Upon evaluating the expected value of $x$, however ($\mathbb{E}\{x\}$), I often come across this estimation formula:

$$ \mathbb{E}\left\{x\right\} \approx \frac{1}{N}\sum_{n=1}^{N} x[n] $$

and similarly,

$$ \mathbb{E}\left\{f(x)\right\} \approx \frac{1}{N}\sum_{n=1}^{N} f(x[n]) $$

where each $x[n]$ is an individual realization of the random variable $x$.
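
As a quick numerical sanity check, here is a minimal NumPy sketch; the Gaussian and the choice $f(x) = x^2$ are arbitrary, picked only because their moments are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# N realizations of x; here x ~ N(2, 1), an arbitrary choice
# whose moments are known exactly.
N = 100_000
x = rng.normal(loc=2.0, scale=1.0, size=N)

# Sample-mean estimate of E{x}; the true value is 2.
print(x.mean())

# Sample-mean estimate of E{f(x)} with f(x) = x^2;
# the true value is Var{x} + (E{x})^2 = 1 + 4 = 5.
print((x ** 2).mean())
```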

My question is: why is this formula true, and how did it come about? Every book I read seems to include it as if it fell from the sky one day, and no explanation is given as to why it holds.

Could someone please give an intuitive and mathematical explanation for why, and more importantly how, this happens to be true? What is the history/rationale behind it?

Many thanks.

  • (+3) Answer: [Strong law of large numbers](http://en.wikipedia.org/wiki/Law_of_large_numbers#Strong_law). (2012-08-21)
  • (+1) It is the Law of Large Numbers. There's a lot of literature on this subject; see, for example, Shiryaev's _Probability_. (2012-08-21)
  • (0) I think you misunderstood the question. The OP is puzzled by the discrete form of the expectation and wants to understand what it is saying. He is not asking about convergence of a sum of a sequence of random variables to a limit. (2012-08-21)
  • (+1) @MichaelChernick: One of his questions is _why_ the approximation can be used, and the answer to that is the Law of Large Numbers. (2012-08-21)
  • (0) @StefanHansen Sorry to disagree, but what he actually said was "why is this formula true", not "why can the approximation be used". Those are very different questions. (2012-08-21)
  • (0) Think of it as a way of making the best guess about the actual expected value E[X], by looking at the different values x_i that you see for X. Intuitively, the average of the seen values is the best guess for the actual expected value. (2012-08-21)
  • (+1) @MichaelChernick: I don't follow. After "I often come across this estimation formula" he states an approximation and then asks why it can be used. (2012-08-21)
  • (0) @StefanHansen I see you are right; I missed the approximately-equal sign. In that case your answer is appropriate. I think the OP was hung up on both issues: (1) why the approximation works and (2) what it represents. (2012-08-21)
  • (0) @StefanHansen and Michael Chernick: Guys, really, no need to argue with each other. :-) Thanks to both of you for your insights; my understanding is coming along. (2012-08-21)

2 Answers


As you are aware, the expected value of a random variable $X$ is computed as:

$$ \mathbb{E}\{X\} = \int_{-\infty}^{\infty} z \, f_X(z) \, \mathrm{d}z, $$

where $f_X$ is the probability density function (PDF) of $X$.

Note that the above integral may not converge (e.g. if $X$ is a Cauchy random variable), and the PDF may not even exist. But let's assume we are not running into such problems, which is indeed the case in most practical settings.

Often, we will not know the PDF, or even its functional form. In such a case, one resorts to estimating the PDF from realizations of the random variable. That is, assume we have $N$ realizations of $X$, say $\{x_i\}_{i=1}^N$, and consider the following estimate:

$$ \hat{f}_X(z) = \frac{1}{N}\sum_{i=1}^N \delta(z-x_i), $$

where $\delta(\cdot)$ is the Dirac delta function.

So, essentially, we are treating the random variable as a discrete random variable that is uniform over the observations, putting a mass of $1/N$ on each of the observed values. The estimate of the expectation of $X$ becomes:

$$ \hat{\mathbb{E}}_N\{X\} = \int_{-\infty}^{\infty} z \, \frac{1}{N}\sum_{i=1}^N \delta(z-x_i) \, \mathrm{d}z = \frac{1}{N}\sum_{i=1}^N \int_{-\infty}^{\infty} z \, \delta(z-x_i) \, \mathrm{d}z $$

By the sifting property of the delta function, each inner integral evaluates to $x_i$. Hence, we arrive at the following expression for the sample expectation:

$$ \hat{\mathbb{E}}_N\{X\} = \frac{1}{N}\sum_{i=1}^N x_i $$

Note that the $x_i$'s are themselves random, since another set of draws from $X$ is likely to give a different set of realizations. Thus the above estimate of the expected value is itself a random variable, dependent on the draws $\{x_i\}_{i=1}^N$ and on the number of draws $N$. Since all the $x_i$'s are distributed identically (with PDF $f_X$) and are independent draws, the laws of large numbers tell us that $\hat{\mathbb{E}}_N\{X\} \rightarrow \mathbb{E}\{X\}$ with probability $1$ (almost surely, by the strong law) and in probability (by the weak law), as $N \rightarrow \infty$.
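
As a numerical sanity check on this convergence, a minimal sketch (the Exponential(1) distribution is an arbitrary choice, picked because $\mathbb{E}\{X\} = 1$ exactly):

```python
import numpy as np

rng = np.random.default_rng(1)

# One million i.i.d. draws from Exponential(1), so E{X} = 1 exactly.
draws = rng.exponential(scale=1.0, size=1_000_000)

# Running sample mean: the estimate \hat{E}_N{X} as a function of N.
running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)

for n in (10, 1_000, 1_000_000):
    print(n, running_mean[n - 1])
# The printed values settle toward 1 as N grows, as the laws of
# large numbers predict; the exact numbers vary with the seed.
```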

  • (0) Thank you very much, very good post. My understanding is slowly materializing on this issue. One thing though: you say, let us assume $\hat{f}_X(z) = \frac{1}{N}\sum_{i=1}^{N} \delta(z - x_i)$. Let me understand: we are doing this because, given no prior information, we are literally weighing each observation we have by $\frac{1}{N}$, to ensure that the PDF integrates to 1 of course. So every observation that happens to exist in the observation set gets this weight, and observations that didn't happen to come about, well, too bad for them. (contd) (2012-08-21)
  • (0) (contd) After that, we insert this estimate into the actual formula of the expectation operator involving the weighted integral of the PDF, and this time around each observation gets weighted by the _value_ of that observation itself (via sifting). So I get that. What I do not get is 'where' the frequency/popularity of some observation is taken into account here. For example, you observed 99 instances of an r.v. equaling 10, but only one instance equaled 11. Where is this information being taken into account in the above? (2012-08-21)
  • (0) Good question. If an observation (say $x_0$) occurs $K$ times, we will have the corresponding term repeated $K$ times in the sum of Dirac deltas. This automatically constitutes a frequency weighting. (2012-08-21)
  • (0) Right. How would we mathematically add this information to the equation you cite, correctly? $\hat{f}_X(z) = \frac{1}{N}\sum_{i=1}^{N} K_{x_i} \; \delta(z-x_i)$. Would this be acceptable? (2012-08-21)
  • (0) This equation is correct, except that the summation would be over *unique* values of $X$ observed in the $N$ samples. (2012-08-21)
  • (0) Yes, but is there a way to write it out mathematically in the correct form (taking into account $K$ for unique values of $X$)? (2012-08-21)
  • (0) @Mohammad: As Kartik said in his first comment above, you don't need to put $K$ into the equation; the formula already takes care of it. For instance, if the sequence of values you observe is 10, 11, 10, 8, 10, 10, 10, 10, 11, then you'll do $(10 + 11 + 10 + 8 + 10 + 10 + 10 + 10 + 11)/9$. You don't have to explicitly put the 6 for the 10 and the 2 for the 11 into the formula anywhere. [Or, as Kartik said in his second comment above, you can adopt the convention that the sum is over distinct values of $X$, and write it as $(6\cdot 10 + 2\cdot 11 + 8)/9$.] (2012-08-22)
  • (0) @ShreevatsaR I understand that, but what I am trying to ascertain is how exactly we write, in mathematical terms, the statement "the sum is over distinct values of $X$" in $\sum$ notation. (2012-08-22)
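
One conventional way to write that "sum over distinct values" (a suggested notation, consistent with the comments above): let $\mathcal{U}$ be the set of distinct values among the $N$ observations, and let $K_u$ count how many times the value $u \in \mathcal{U}$ occurs, so that $\sum_{u \in \mathcal{U}} K_u = N$. Then

$$ \hat{f}_X(z) = \sum_{u \in \mathcal{U}} \frac{K_u}{N} \, \delta(z - u), \qquad \hat{\mathbb{E}}_N\{X\} = \sum_{u \in \mathcal{U}} \frac{K_u}{N} \, u, $$

which reproduces the $(6 \cdot 10 + 2 \cdot 11 + 8)/9$ computation in the comment above.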

This is just the expected value applied to a discrete uniform distribution that puts probability mass $1/N$ on each of the $N$ distinct outcomes of the random variable. Alternatively, it can be viewed as the sample estimate of these expectations, where $x[n]$ represents an individual observation.
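
To make both readings concrete, here is a tiny illustrative Python sketch (it borrows the $[2, 2, 2, 8, 3]$ example from the comments below), showing that weighting each labelled observation by $1/N$ and weighting each distinct value $u$ by its frequency $K_u/N$ give the same number:

```python
from collections import Counter

# Observed realizations x[1..N]; repeated values are allowed.
samples = [2, 2, 2, 8, 3]
N = len(samples)

# Per-observation view: every labelled observation gets weight 1/N.
mean_per_obs = sum(samples) / N

# Distinct-value view: a value u seen K_u times gets weight K_u / N,
# and the sum runs over the distinct values only.
counts = Counter(samples)                      # Counter({2: 3, 8: 1, 3: 1})
mean_per_value = sum(u * k for u, k in counts.items()) / N

assert mean_per_obs == mean_per_value          # both equal 3.4
print(mean_per_obs)
```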

  • (0) *This is just the expected value applied to a discrete uniform distribution that puts probability mass 1/N on each of the N distinct possible values of the random variable*... No, this is not. (2012-08-22)
  • (0) @did Oh come on. I am referring to the formula as given, assuming the $x[n]$ are i.i.d. (2012-08-22)
  • (0) Oh come on, where is the *discrete uniform distribution* you are referring to? (2012-08-22)
  • (0) The distribution that places mass $1/N$ at each $x[n]$ for $n=1,2,\ldots,N$. (2012-08-22)
  • (0) Which are certainly not the *distinct possible values of the random variable*. This seems to confuse the underlying common theoretical distribution of the i.i.d. random variables $x[k]$ with the empirical random distribution of the sample $x[1], \ldots, x[N]$. (2012-08-22)
  • (0) @MichaelChernick I think what you are saying is true, but only if all the samples are unique. Say, $[2, 4, 1, 8, 3]$: then each bin gets weighted by $1/5$. But if there were some repetition, say $[2, 2, 2, 8, 3]$, then the bin for the 2's would be weighted by $3/5$, the bin for 8 by $1/5$, and the bin for 3 also by $1/5$. (2012-08-22)
  • (0) @Mohammad: If the samples are repeated values, then the values get unequal weights, but the labelled observations $x[n]$ each get equal weight. (2012-08-22)
  • (0) @did I see. I meant to say distinct outcomes rather than possible values. I have noted the error and made the change to my answer. (2012-08-22)
  • (0) Have an upvote. You fought for it. :-) (2012-08-22)
  • (0) Upvoted. (2012-08-22)
  • (0) @did Thanks for the change in the vote and for pointing out the problem with the original answer. (2012-08-22)
  • (0) *Vote*, not *change in the vote*. (2012-08-22)
  • (0) @did Okay. Someone downvoted me before; I had thought it was you. (2012-08-22)
  • (0) "You assume too much." (Padme to Qui-Gon, Star Wars I) (2012-08-22)
  • (0) @MichaelChernick I did not downvote you either. (2012-08-22)
  • (0) @Mohammad Yes, I never thought that you might have. (2012-08-22)