12

I am having trouble understanding the concept of a sufficient statistic. I have read "What is a sufficient statistic?" and the Wikipedia article Sufficient statistic.

Can someone please give an example of:

  1. a simple (but non-trivial) statistical model
  2. a sufficient statistic of that model
  3. an insufficient statistic of that model
  4. how you identified 2 & 3 as having and lacking, respectively, the sufficiency property

4 Answers

13

$\def\E{\mathrm{E}}$Consider samples $X = (X_1,X_2)$ from a normally distributed population $N(\mu,1)$ with unknown mean.

Then the statistic $T(X)=X_1$ is an unbiased estimator of the mean, since $\E(X_1)=\mu$. However, it is not a sufficient statistic - there is additional information in the sample that we could use to determine the mean.

How can we tell that $T$ is insufficient for $\mu$? By going to the definition. We know that $T$ is sufficient for a parameter iff, given the value of the statistic, the probability of a given value of $X$ is independent of the parameter, i.e. if

$$P(X=x|T=t,\mu)=P(X=x|T=t)$$

But we can compute this:

$$P(X=(x_1,x_2) | X_1=t,\mu) = \begin{cases} 0 & \mbox{if }t\neq x_1 \\ \tfrac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x_2-\mu)^2} & \mbox{if }t=x_1 \end{cases}$$

which is certainly not independent of $\mu$.

On the other hand, consider $T'(X) = X_1+X_2$. Then we have

$$P(X=(x_1,x_2) | X_1+X_2=t, \mu) = \frac{\tfrac{1}{2\pi}e^{-\frac{1}{2}(x_1-\mu)^2 - \frac{1}{2}(x_2-\mu)^2}}{\tfrac{1}{\sqrt{4\pi}}e^{-\frac{1}{4}(t-2\mu)^2}} \quad \mbox{if }x_1+x_2=t\mbox{ (and }0\mbox{ otherwise)}$$

where the denominator is the density of $X_1+X_2 \sim N(2\mu,2)$ at $t$. Completing the square in the exponent reduces this to $\tfrac{1}{\sqrt{\pi}}e^{-(x_1-t/2)^2}$, which is independent of $\mu$, and hence $T'$ is a sufficient statistic for the mean $\mu$.
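
As an illustrative aside (not part of the original answer), here is a minimal Python simulation, assuming NumPy is available, that checks this numerically: pairs drawn under different values of $\mu$ but conditioned (approximately) on the same value of $X_1+X_2$ all show the same $N(t/2,\,1/2)$ behaviour, whereas conditioning on $X_1$ would still leave a distribution that shifts with $\mu$. The tolerance, sample sizes, seed, and $\mu$ values are arbitrary choices for the sketch.

```python
# Sketch: conditioning on T'(X) = X1 + X2 leaves a distribution that does not depend
# on mu, while conditioning on T(X) = X1 would not. Illustrative values only.
import numpy as np

rng = np.random.default_rng(0)

def conditional_x1_given_sum(mu, t=1.0, tol=0.05, n=2_000_000):
    """Draw pairs (X1, X2) ~ N(mu, 1) and keep those with X1 + X2 close to t."""
    x = rng.normal(mu, 1.0, size=(n, 2))
    keep = np.abs(x.sum(axis=1) - t) < tol
    return x[keep, 0]

for mu in (-1.0, 0.0, 1.0):
    x1 = conditional_x1_given_sum(mu)
    # Theory: X1 | X1 + X2 = t  ~  N(t/2, 1/2), regardless of mu.
    print(f"mu = {mu:+.1f}:  mean(X1 | sum ~ 1) = {x1.mean():.3f},  var = {x1.var():.3f}")
# Expected (approximately): mean ~ 0.5 and var ~ 0.5 for every mu.
```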

  • 0
    Part way through writing this, I wished I'd chosen a Bernoulli distribution with parameter $p$ instead of a normal distribution. It would have been easier to understand, less effort for me to type, and I would have had the patience to do the integral at the end myself. Let me know if you still don't understand, and I'll rewrite my answer.2012-05-17
  • 0
    I think it's clicked - the statistic is a function of the sample. If given the statistic result on the sample, the unknown model parameter becomes conditionally independent of the sample, the statistic is sufficient.2012-05-17
  • 0
    Almost - if, given the statistic, *the sample is independent of the parameter*, then the statistic is sufficient. Conditional independence and independence are different things.2012-05-17
  • 0
    How are they different in this context? I thought the conditioning is on the knowledge of the statistic (this is the third event).2012-05-17
  • 1
    The RHS of the last displayed equation should read $\exp(-(x_1-t/2)^2)/\sqrt{\pi}$ if $x_1+x_2=t$ and $0$ otherwise (thus the argument is correct although the formula in the RHS is not).2012-05-20
  • 1
    One comment and one question. First the comment, I think it would pay if you corrected the last equation according to @Did comment. The RHS is misleading, and squaring doesn't do away with the $\mu$ - here is the equivalent to your RHS: $e^{-\mu ^2-s^2+s t-\frac{t^2}{2}+\mu t}.$ The question is why are you integrating from $-\infty$ to $+\infty.$ You don't need to prove that it is a valid pdf, do you?2018-02-02
  • 0
    If you want to edit to correct the RHS I will be more than happy to accept it!2018-02-02
0

Just look at the form of an exponential family of distributions: the function of the data in the exponent is the sufficient statistic. The most obvious case is the sample mean for normally distributed data with a known variance. The sample mean is a sufficient statistic for the population mean: the full information in the data is the $n$ observations $X_1, X_2,\dots,X_n$, but there is no additional information in those observations that will help in estimating the population mean once the sample mean is given. When the variance is also unknown, the sample mean and the sample variance together form the sufficient statistic for the population mean and variance.

Sufficiency is important because it plays a major role in the theory of parametric point estimation.

An insufficient statistic would be any statistic that fails this property. In the normal example, let $Y_1=X_1-X_2,\ Y_2=X_3-X_4,\ \dots,\ Y_m=X_{n-1}-X_n$ for $m=n/2$ (where, say, $n$ is even). Then $Y_1, Y_2,\dots,Y_m$ is not sufficient for the mean and variance of the normal. So this answers 1-3.
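
As a hedged illustration (mine, not the answerer's), the following Python sketch, assuming NumPy is available, shows why the difference statistics cannot be sufficient for the mean: each $Y_i=X_{2i-1}-X_{2i}\sim N(0,2)$ whatever $\mu$ is, so the $Y$'s carry no information about $\mu$, while the sample mean tracks it. Sample sizes and $\mu$ values are illustrative.

```python
# The difference statistics Y_i = X_{2i-1} - X_{2i} look the same for every mu,
# while the sample mean shifts with mu: the Y's have thrown the mean information away.
import numpy as np

rng = np.random.default_rng(1)
n = 10  # even sample size

for mu in (0.0, 5.0):
    x = rng.normal(mu, 1.0, size=(100_000, n))
    y = x[:, 0::2] - x[:, 1::2]          # Y_1, ..., Y_{n/2}
    print(f"mu = {mu}: mean of Y's = {y.mean():+.3f}, var of Y's = {y.var():.3f}, "
          f"mean of sample means = {x.mean(axis=1).mean():+.3f}")
# The Y summaries come out the same (mean ~ 0, var ~ 2) for mu = 0 and mu = 5.
```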

  • 0
    What is item (3) in your answer?2012-05-17
  • 0
    Is that a typo, you mean "Y1=X1-X2" ? How did you determine that your example for 3 was not sufficient?2012-05-17
  • 0
    and also is the last element meant to be "Ym = Xm-1 - Xm" ?2012-05-17
  • 0
    I have not answered part 4. I think that takes a longer explanation and I haven't found the time to do that yet. The easy and unsatisfying answer, though, is that it does not contain the sufficient statistic $\bar X$. I meant $Y_m=X_{n-1}-X_n$; I used the notation $m$ just to avoid $n/2$ as a subscript.2012-05-17
0

In relation to the final equation in the example in the accepted answer (+1):

The independence from the population parameter $\theta$ of the conditional probability mass function of the random vector $\mathrm X = \left(\mathrm X_1,\mathrm X_2, \dots,\mathrm X_n \right)$, corresponding to $n$ iid samples, given a statistic $T(\mathrm X)$ of this random vector, can be understood through the partition of the sample space induced by the statistic. The intuition is of Venn diagrams grouping together exactly those samples of size $n$ that add up to the same value, i.e. the solutions of $n \bar{ \mathrm x}=\sum_{i=1}^n x_i$, which can be counted as the coefficient $[x^{n \bar{\mathrm x}}]\left(x^0+x^1+x^2+\cdots\right)^n$. For instance, in the case of the Poisson, which has support $\mathbb N\cup\{0\}$, the mean of samples of size $n=10$ would partition the sample space (diagrammatically) as

[Figure: the sample mean partitions the sample space of Poisson samples of size $n=10$ into sets of samples sharing the same total]

This explains why, since the event $\{\mathrm X=\mathrm x\}$ is contained in the event $\{T(\mathrm X)=T(\mathrm x)\},$

$$\Pr\left(\mathrm X=\mathrm x \cap T(\mathrm X)=T(\mathrm x)\right)=\Pr\left(\mathrm X=\mathrm x\right)$$

allowing the following "test" for a sufficient statistic:

$$\begin{align} \Pr\left(\mathrm X=\mathrm x \vert T(\mathrm X)=T(\mathrm x)\right)&=\frac{\Pr\left(\mathrm X=\mathrm x \cap T(\mathrm X)=T(\mathrm x)\right)}{\Pr\left(T(\mathrm X)=T(\mathrm x) \right)}\\[2ex] &=\frac{\Pr\left(\mathrm X=\mathrm x \right)}{\Pr\left(T(\mathrm X)=T(\mathrm x) \right)} \end{align} $$

i.e. if, for all values of $\theta,$ the ratio of the probability of the sample to the probability of the statistic does not depend on $\theta,$ the statistic is sufficient: $\Pr\left(\mathrm X=\mathrm x \vert T(\mathrm X)=T(\mathrm x)\right)$ does not depend on $\theta.$
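
As a concrete, hedged check of this test (my example, not from the answer above), take iid Poisson($\lambda$) observations with $T(\mathrm X)=\sum_i \mathrm X_i\sim\text{Poisson}(n\lambda)$: the ratio reduces to $t!/\left(\prod_i x_i!\; n^t\right)$, free of $\lambda$. A short Python verification (standard library only, with an arbitrary sample $x$):

```python
# Ratio P(X = x | lam) / P(T = t | lam) for iid Poisson(lam) data and T = sum(X):
# it comes out the same for every lam, matching t! / (prod(x_i!) * n**t).
from math import factorial, prod, exp

def pois_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def ratio(x, lam):
    n, t = len(x), sum(x)
    p_x = prod(pois_pmf(xi, lam) for xi in x)      # P(X = x | lam)
    p_t = pois_pmf(t, n * lam)                     # P(T = t | lam), T ~ Poisson(n*lam)
    return p_x / p_t

x = [3, 0, 2, 5, 1]
for lam in (0.5, 1.0, 4.0):
    print(lam, ratio(x, lam))                      # same value for every lam
n, t = len(x), sum(x)
print(factorial(t) / (prod(factorial(xi) for xi in x) * n**t))   # the closed form
```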

Moving on to the example in the accepted answer (two draws from a normal $N(\mu,\sigma^2)$ distribution, $\mathrm X =(\mathrm X_1, \mathrm X_2),$ standing in for the full sample $(\mathrm X_1, \mathrm X_2, \cdots, \mathrm X_n)$ in the more general case), and transitioning from discrete probability distributions (as assumed up to this point) to continuous distributions (from PMF to PDF), the joint pdf of $n$ independent (iid) Gaussians with equal variance is:

$$\begin{align} f_\mathrm X\left(\mathrm X =\mathrm x\vert\mu\right)&=\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left({\frac{-(x_i-\mu)^2}{2\sigma^2}}\right)\\[2ex] &=\frac{1}{(2\pi\sigma^2)^{(n/2)}}\exp\left({\frac{-\sum_{i=1}^n(x_i-\mu)^2}{2\sigma^2}}\right)\\[2ex] &=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left({\frac{-\sum_{i=1}^n(x_i-\bar x + \bar x -\mu)^2}{2\sigma^2}}\right)\\[2ex] &=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left({\frac{-\left(\sum_{i=1}^n(x_i-\bar x)^2 + n(\bar x -\mu)^2\right)}{2\sigma^2}}\right)\\[2ex] \end{align}$$

The ratio of pdfs (the denominator being the pdf of the sampling distribution of the sample mean for the normal, i.e. $N(\mu,\sigma^2/n)$) results in

$$\begin{align} \frac{f_\mathrm X(\mathrm X =\mathrm x\vert \mu)}{q_{T(\mathrm X)}\left(T(\mathrm x)\vert \mu\right)}&=\frac{\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left({\frac{-\left(\sum_{i=1}^n(x_i-\bar x)^2 + n(\bar x -\mu)^2\right)}{2\sigma^2}}\right)} {\frac{1}{\left(2\pi\frac{\sigma^2}{n}\right)^{1/2}}\exp\left({\frac{-n(\bar x-\mu)^2}{2\sigma^2}}\right)}\\[2ex] &\propto \exp{\left(\frac{-\sum_{i=1}^n(x_i-\bar x)^2 }{2\sigma^2} \right)} \end{align}$$

eliminating the dependency on a specific $\mu.$

This is all beautifully explained in Statistical Inference by George Casella and Roger L. Berger.

Consequently, the sample mean is a sufficient statistic.
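
As a small numeric sanity check of the ratio above (my own sketch, assuming NumPy and SciPy are available and taking $\sigma=1$ and an arbitrary fixed sample): $f_\mathrm X(\mathrm x\vert\mu)/q_{\bar X}(\bar x\vert\mu)$ comes out the same for every $\mu$ tried.

```python
# For a fixed Gaussian sample, f_X(x | mu) / q_Xbar(xbar | mu) does not change with mu.
import numpy as np
from scipy.stats import norm

x = np.array([1.3, -0.4, 2.1, 0.7, 0.2])   # an arbitrary sample, sigma assumed = 1
n, xbar = len(x), x.mean()

for mu in (-1.0, 0.0, 2.5):
    joint = norm.pdf(x, loc=mu, scale=1.0).prod()           # f_X(x | mu)
    stat  = norm.pdf(xbar, loc=mu, scale=1.0 / np.sqrt(n))  # pdf of Xbar ~ N(mu, 1/n)
    print(mu, joint / stat)                                  # same number for every mu
```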

In contradistinction, the maximum value of the sample, which is a sufficient statistic for a uniform $[0,\theta]$ distribution with unknown $\theta,$ would not be sufficient to estimate the mean of Gaussian samples. The histogram of the maximum value of samples of size 10 from the uniform $[0,3]$ shows how the $\theta$ parameter is approximated, allowing the rest of the information in the sample to be discarded:

[Figure: histogram of the maximum of samples of size 10 from the uniform $[0,3]$, concentrating near $\theta=3$]

For the Gaussian case, the maximum is simply a more extreme example of the single-observation statistic posted as a counterexample to sufficiency in the accepted answer.

In this case, the pdf of the statistic becomes unwieldy: the maximum of $n$ draws has density $n\,\Phi\!\left(\frac{x-\mu}{\sigma}\right)^{n-1}\phi\!\left(\frac{x-\mu}{\sigma}\right),$ where the CDF $\Phi$ involves the error function,

$$\Phi\!\left(\frac{x-\mu}{\sigma}\right)=\frac{1}{2}+\frac{1}{2}\text{erf}\left(\frac{x-\mu}{\sigma\sqrt 2}\right),$$

which (among other differences between the numerator and denominator of the pdf ratio) precludes getting rid of $\mu.$

Intuitively, knowing the maximum value of each sample does not summarize all the information regarding the population mean, $\mu,$ available in the sample. This is visually clear when plotting the sampling distribution of the mean over $10^6$ simulated samples of size $n=10$ from $N(0,1)$ (on the left) versus the sampling distribution of the maximum values (on the right):

[Figure: sampling distribution of the sample mean (left) versus the sample maximum (right) over $10^6$ samples of size $n=10$ from $N(0,1)$]

The latter discards information available in the complete sample that is needed to estimate the population mean; it is also biased.
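
A hedged sketch (mine, assuming NumPy) reproducing the comparison described above; the $10^6$ replications and $n=10$ follow the text, while the seed is arbitrary:

```python
# Sampling distributions of the sample mean and the sample maximum for N(0, 1), n = 10.
import numpy as np

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, size=(1_000_000, 10))

means = samples.mean(axis=1)
maxes = samples.max(axis=1)

print("sample mean:  E ~", round(means.mean(), 3), " (unbiased for mu = 0)")
print("sample max:   E ~", round(maxes.mean(), 3), " (biased; roughly 1.54 for n = 10)")
# Histograms of `means` and `maxes` give the two panels described in the text:
# the means concentrate around mu, while the maxima sit well above it.
```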

-1

Here is my attempt to answer part 4 of the question. If you have a parametric family of distributions with a parameter $\theta$ (which can be $K$-dimensional for $K\geq 1$), then a statistic is sufficient if all the information about $\theta$ is contained in it. That means that, given the parametric family, the conditional distribution of the data given the sufficient statistic does not depend on the parameter $\theta$. For many problems there are several functions $T$ of the data that are sufficient.

Sufficiency usually carries data reduction with it. For example, say the parametric family is $N(m,1)$ ($m$ is the mean and $1$ is the variance). To estimate $m$ you take a sample of size $n=10$. The data consist of ten values, but the sufficient statistic (the sample mean) is just a single number. That is the least amount of information possible for a sufficient statistic. This type of sufficient statistic achieves the greatest data reduction and is called minimal sufficient.

To determine that a candidate statistic is sufficient you can use the factorization theorem to test it. If the candidate $T$ is sufficient, the density factors as $f(x\mid\theta)= g(T(x)\mid\theta)\, h(x)$, where $f$ is the density of the data given $\theta$, $g$ is the density of $T(x)$ given $\theta$, and the term $h(x)$ does not depend on $\theta$. So you can check sufficiency by calculating $f$ and $g$ and dividing $f$ by $g$: $T$ is sufficient iff the resulting function $h$ does not depend on (involve) $\theta$.

Besides sufficiency and minimal sufficiency there is a concept called complete sufficiency. All these concepts are important in the parametric theory of point estimation. This is nicely covered in Casella and Berger's book Statistical Inference, 2nd edition, Chapter 6.
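
To make the "divide $f$ by $g$" check concrete, here is a small symbolic sketch (my own, using sympy, with $n=3$ chosen only to keep the output short) for the $N(m,1)$ family and $T(x)=\bar x$: the difference of log densities contains no $m$, so $\bar x$ passes the test.

```python
# Symbolic check: log f(x | m) - log g(xbar | m) should contain no m for N(m, 1).
import sympy as sp

n = 3
m = sp.symbols('m', real=True)
x = sp.symbols('x1:4', real=True)          # x1, x2, x3
xbar = sum(x) / n

# log joint density of n iid N(m, 1) observations
log_f = sum(-(xi - m)**2 / 2 for xi in x) - sp.Rational(n, 2) * sp.log(2 * sp.pi)
# log density of the sample mean, Xbar ~ N(m, 1/n)
log_g = -n * (xbar - m)**2 / 2 - sp.Rational(1, 2) * sp.log(2 * sp.pi / n)

log_h = sp.expand(log_f - log_g)
print(log_h)                      # depends only on x1, x2, x3
print(m in log_h.free_symbols)    # False: m has cancelled, so Xbar is sufficient
```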