Consider a set $\Omega$ with $N$ distinct members, and a function $f$ defined on $\Omega$ that takes the values 0 and 1, such that $\frac{1}{N} \sum_{x \in \Omega} f(x) = p$. For a subset $S \subseteq \Omega$ of size $n$, define the sample proportion $p(S) := \frac{1}{n} \sum_{x\in S} f(x)$. If each subset of size $n$ is chosen with equal probability, calculate the expectation and standard deviation of the random variable $p(S)$.
subsets probability question
-
For each subset containing $x$, there is a corresponding subset without $x$, and vice versa. Therefore each $x$ appears in exactly half of all subsets. Hence... – 2012-06-12
-
This is a question from an old interview paper I am attempting. A more detailed solution would be helpful. – 2012-06-12
-
It's fairly easy to prove that the expectation is $p$, but I have no idea how to get the standard deviation. I think there is an easy way to compute it, but I can't find it. – 2012-06-12
2 Answers
It helps to introduce indicator random variables here. For each $x\in\Omega$, let $Z_x$ be the indicator random variable that takes the value 1 if $x\in S$, and value 0 otherwise.
We can express $$p(S)={1\over n}\sum_{x\in\Omega} Z_x\cdot f(x),$$ where the sum is no longer over the random set $S$. Since all points are equally likely to be elements of $S$, it is not hard to calculate $$\mathbb{E}(Z_x)={n\over N},\quad \text{Var}(Z_x)={n\over N}\left({1-{n\over N}}\right), \quad \text{cov}(Z_x,Z_y)={-n\over N^2} {N-n \over N-1}\text{ for }x\neq y.$$
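As a sketch of where the covariance comes from (this uses the standard count of size-$n$ subsets containing two given points, and is not spelled out in the original answer): for $x\neq y$, $$\mathbb{E}(Z_xZ_y)=P(x\in S,\ y\in S)={\binom{N-2}{n-2}\over\binom{N}{n}}={n(n-1)\over N(N-1)},$$ so that $$\text{cov}(Z_x,Z_y)={n(n-1)\over N(N-1)}-{n^2\over N^2}={n\over N}\cdot{N(n-1)-n(N-1)\over N(N-1)}={-n\over N^2}{N-n\over N-1}.$$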
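Using linearity of expectation, and bilinearity of covariance, after some calculation we get $$\mathbb{E}(p(S))={1\over N}\sum_{x\in\Omega} f(x),$$ and $$\text{Var}(p(S))={1\over n} {N-n \over N-1} \left[{1\over N}\sum_{x\in\Omega} f(x)^2- \left( {1\over N}\sum_{x\in\Omega} f(x)\right)^2\right].$$ Since $f$ takes only the values 0 and 1, we have $f(x)^2=f(x)$, so the expression in square brackets equals $p(1-p)$. Hence $\mathbb{E}(p(S))=p$ and the standard deviation of $p(S)$ is $$\sqrt{{p(1-p)\over n}\,{N-n \over N-1}}.$$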
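The formulas above can be checked by brute force on a small example. The following is only a minimal sketch: the population `[1, 1, 1, 0, 0, 0]` and the function names `exact_moments` and `formula_moments` are made up for illustration, and it assumes $N\ge 2$. It enumerates every subset of size $n$ with exact rational arithmetic and compares the result with the closed-form mean and variance.

```python
from fractions import Fraction
from itertools import combinations


def exact_moments(f_values, n):
    """Mean and variance of the sample proportion over all size-n subsets.

    f_values lists f(x) for each x in the population; positions are treated
    as distinct members, so combinations() yields every subset exactly once.
    """
    props = [Fraction(sum(s), n) for s in combinations(f_values, n)]
    mean = sum(props) / len(props)
    var = sum((q - mean) ** 2 for q in props) / len(props)
    return mean, var


def formula_moments(f_values, n):
    """Closed-form mean and variance from the answer above (assumes N >= 2)."""
    N = len(f_values)
    first = Fraction(sum(f_values), N)                   # (1/N) * sum f(x)
    second = Fraction(sum(v * v for v in f_values), N)   # (1/N) * sum f(x)^2
    var = Fraction(N - n, n * (N - 1)) * (second - first ** 2)
    return first, var


# Hypothetical population with N = 6 and three ones, so p = 1/2.
population = [1, 1, 1, 0, 0, 0]
for n in range(1, len(population) + 1):
    assert exact_moments(population, n) == formula_moments(population, n)
    print(n, formula_moments(population, n))
```

The loop includes the case $n=N$, where the only possible sample is the whole population and the variance must be zero; the factor ${N-n\over N-1}$ is what makes the formula vanish there.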
-
It would appear that this answer is consistent with user58519's below if your expression in front of the square brackets were just $1/n$. I can find no error in your calculation, but somehow I feel the suggestion below is *right*. – 2013-04-16
-
@Delvesy The formula in the other answer is based on the false assumption that the summands in random variable $p(S)$ are independent. As a "sanity check", what *should* the variance in $p(S)$ be when $n=N$? – 2013-04-16
-
Ah of course.. and yes, it did cross my mind once I posted that we must have zero variance for our 'predictor' if we select the entire population for our sample. Thanks for the clarification. – 2013-04-16
-
@ByronSchmuland: I am not sure but I think your computation of the variance is wrong. Here are my computations (note that I use the notation $1_x$ instead of $Z_x$): http://i.stack.imgur.com/om5Eg.png – 2014-03-09
-
@Tom My formula is the same as yours. – 2014-03-09
-
@ByronSchmuland Yes, indeed, I am very sorry. I just didn't see how to tidy up the terms so nicely (at first). – 2014-03-09
-
@Tom No problem! It is not that obvious. – 2014-03-09
I think the answer is:
a) $E[\bar{p}] = p$,
b) SD$[\bar{p}] = \frac{\sqrt{p(1-p)}}{\sqrt{n}}$.
I believe the answer can be found on page 10 of http://math.arizona.edu/~faris/stat.pdf
~JD