12

I am having trouble understanding the concept of a sufficient statistic. I have read "What is a sufficient statistic?" and "Sufficient statistic" (Wikipedia).

Can someone please give an example of:

  1. a simple (but non-trivial) statistical model
  2. a sufficient statistic of that model
  3. an insufficient statistic of that model
  4. how you identified 2 & 3 as having and lacking, respectively, the sufficiency property

4 Answers

13

$\def\E{\mathrm{E}}$Consider samples $X = (X_1,X_2)$ from a normally distributed population $N(\mu,1)$ with unknown mean.

Then the statistic $T(X)=X_1$ is an unbiased estimator of the mean, since $\E(X_1)=\mu$. However, it is not a sufficient statistic - there is additional information in the sample that we could use to determine the mean.

How can we tell that $T$ is insufficient for $\mu$? By going to the definition. We know that $T$ is sufficient for a parameter iff, given the value of the statistic, the probability of a given value of $X$ is independent of the parameter, i.e. if

$P(X=x|T=t,\mu)=P(X=x|T=t)$

But we can compute this:

$P(X=(x_1,x_2) | X_1=t,\mu) = \begin{cases} 0 & \mbox{if }t\neq x_1 \\ \tfrac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x_2-\mu)^2} & \mbox{if }t=x_1 \end{cases}$

which is certainly not independent of $\mu$.

On the other hand, consider $T'(X) = X_1+X_2$. Then we have

$P(X=(x_1,x_2) \mid X_1+X_2=t, \mu) = \begin{cases} 0 & \mbox{if }t\neq x_1+x_2 \\[1ex] \dfrac{\tfrac{1}{2\pi}e^{-\frac{1}{2}(x_1-\mu)^2-\frac{1}{2}(x_2-\mu)^2}}{\tfrac{1}{\sqrt{4\pi}}e^{-\frac{1}{4}(t-2\mu)^2}} & \mbox{if }t=x_1+x_2 \end{cases}$

(the numerator is the joint density of the sample and the denominator is the density of $X_1+X_2\sim N(2\mu,2)$), and you can complete the square, using $t=x_1+x_2$, to show that this reduces to $\tfrac{1}{\sqrt{\pi}}e^{-\frac{1}{4}(x_1-x_2)^2}$, which is independent of $\mu$; hence $T'$ is a sufficient statistic for the mean $\mu$.
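As a sanity check on both claims, here is a small simulation sketch (not part of the original answer; the tolerance-band conditioning and the particular values of $\mu$ and $t$ are just illustrative). Conditioning on $T'(X)=X_1+X_2$ leaves a distribution for the sample that does not move with $\mu$, while conditioning on $T(X)=X_1$ still leaves information about $\mu$ in the rest of the sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def keep_near(values, target, tol=0.05):
    """Boolean mask selecting draws whose statistic falls within tol of target."""
    return np.abs(values - target) < tol

t = 1.0
for mu in (-0.5, 0.0, 1.0):
    x = rng.normal(mu, 1.0, size=(500_000, 2))

    near_sum = keep_near(x.sum(axis=1), t)   # condition on T'(X) = X1 + X2 ~ t
    near_x1 = keep_near(x[:, 0], t)          # condition on T(X) = X1 ~ t

    print(f"mu = {mu:+.1f}:  E[X1 | X1+X2 = t] ~ {x[near_sum, 0].mean():.2f}   "
          f"E[X2 | X1 = t] ~ {x[near_x1, 1].mean():.2f}")

# The first column stays near t/2 = 0.5 for every mu (the conditional law given T'
# is free of mu), while the second column tracks mu: conditioning on X1 alone still
# leaves information about mu elsewhere in the sample.
```

Tightening `tol` better approximates conditioning on an exact value of the statistic, at the cost of fewer retained draws.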

  • If you want to edit to correct the RHS I will be more than happy to accept it! (2018-02-02)
0

Just look at the form of an exponential family of distributions: the function of the data in the exponent is the sufficient statistic. The most obvious case is the sample mean for normally distributed data with a known variance. The sample mean is a sufficient statistic for the population mean: the full information in the data is the $n$ observations $X_1, X_2,\dots,X_n$, but there is no additional information in the data that will help in estimating the population mean once the sample mean is given. When the variance is unknown, the sample mean and the sample variance together form a sufficient statistic for the population mean and variance. Sufficiency is important because it plays a major role in the theory of parametric point estimation.

An insufficient statistic would be any statistic that does not carry all of this information. So, in the normal case for example, let $Y_1=X_1-X_2$, $Y_2=X_3-X_4,\dots, Y_m=X_{n-1}-X_n$ with $m=n/2$ (where, say, $n$ is even). Then $(Y_1, Y_2,\dots,Y_m)$ is not sufficient for the mean and variance of the normal. So this answers 1-3.
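As a rough illustration of the insufficiency (a sketch of my own, not part of the answer; the sample size and the two values of the mean are arbitrary), one can simulate the pairwise differences $Y_j$ defined above and see that their distribution is the same $N(0,2)$ for every value of the mean, so they cannot by themselves recover it:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 10                                   # even sample size, so m = n/2 differences
for mu in (0.0, 5.0):
    x = rng.normal(mu, 1.0, size=(100_000, n))
    y = x[:, 0::2] - x[:, 1::2]          # Y_j = X_{2j-1} - X_{2j}, j = 1..m
    print(f"mu = {mu}: mean of the Y's = {y.mean():+.3f}, variance = {y.var():.3f}")

# Both runs print mean ~ 0 and variance ~ 2: the distribution of (Y_1, ..., Y_m)
# is N(0, 2) regardless of mu, so these statistics are not sufficient for the mean.
```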

  • I have not answered part 4. I think that takes a longer explanation and I haven't found the time to do that yet. The easy and unsatisfying answer, though, is that it does not contain the sufficient statistic X bar. I meant Ym = Xn-1 - Xn; I used the notation m just to avoid n/2 as a subscript. (2012-05-17)
0

In relation to the final equation in the example in the accepted answer (+1):

The independence from the population parameter $\theta$ of the conditional probability mass function of the random vector $\mathrm X = \left(\mathrm X_1,\mathrm X_2, \dots,\mathrm X_n \right)$ (corresponding to $n$ iid samples), given a statistic $T(\mathrm X)$ of this random vector, can be understood through the partition of the sample space induced by the statistic. The intuition is of Venn diagrams grouping together exactly those samples of size $n$ that add up to the same value $n \bar{\mathrm x}=\sum_{i=1}^n \mathrm x_i$; the number of such samples can be thought of as the coefficient $[x^{n \bar{\mathrm x}}]\left(x^0+x^1+x^2+\cdots\right)^n.$ For instance, in the case of the Poisson, which has support $\mathbb N\cup\{0\},$ the mean of samples of size $n=10$ would partition the sample space (diagrammatically) as

[Figure: diagram of the partition of the sample space of Poisson samples of size $n=10$ by the value of the sample mean.]

This explains why, since the event $\{\mathrm X=\mathrm x\}$ is a subset of the event $\{T(\mathrm X)=T(\mathrm x)\},$

$\Pr\left(\mathrm X=\mathrm x \cap T(\mathrm X)=T(\mathrm x)\right)=\Pr\left(\mathrm X=\mathrm x\right)$

allowing the following "test" for a sufficient statistic:

$\begin{align} \Pr\left(\mathrm X=\mathrm x \vert T(\mathrm X)=T(\mathrm x)\right)&=\frac{\Pr\left(\mathrm X=\mathrm x \cap T(\mathrm X)=T(\mathrm x)\right)}{\Pr\left(T(\mathrm X)=T(\mathrm x) \right)}\\[2ex] &=\frac{\Pr\left(\mathrm X=\mathrm x \right)}{\Pr\left(T(\mathrm X)=T(\mathrm x) \right)} \end{align} $

i.e. if, for all values of $\theta,$ the ratio of the probability of the sample to the probability of the statistic is constant, the statistic is sufficient: $\Pr\left(\mathrm X=\mathrm x \vert T(\mathrm X)=T(\mathrm x)\right)$ does not depend on $\theta.$
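A quick numerical check of this test for the Poisson case sketched above (my own illustration; the sample values are arbitrary). With $T(\mathrm X)=\sum_i \mathrm X_i \sim \text{Poisson}(n\lambda),$ the ratio comes out identical for every $\lambda$:

```python
from math import exp, factorial, prod

def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]     # an arbitrary sample of n = 10 counts
n, t = len(x), sum(x)                  # T(x) = sum of the sample

for lam in (1.0, 3.0, 5.0):
    p_sample = prod(poisson_pmf(xi, lam) for xi in x)    # P(X = x | lambda)
    p_stat = poisson_pmf(t, n * lam)                     # P(T = t | lambda)
    print(f"lambda = {lam}: P(X = x) / P(T = t) = {p_sample / p_stat:.6e}")

# The ratio equals t! / (n**t * prod of x_i!) for every lambda, so the conditional
# distribution of the sample given its sum carries no information about lambda.
```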

Moving on to the example in the accepted answer (2 draws from a normal $N(\mu,\sigma)$ distribution, $\mathrm X =(\mathrm X_1, \mathrm X_2),$ meant to stand in for the entire sample $(\mathrm X_1, \mathrm X_2, \cdots, \mathrm X_n)$ in the more general case), and transitioning from discrete probability distributions (as assumed up to this point) to continuous distributions (from PMF to PDF), the joint pdf of $n$ independent (iid) Gaussians with equal variance is:

$\begin{align} f_\mathrm X\left(\mathrm X =\mathrm x\vert\mu\right)&=\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left({\frac{-(x_i-\mu)^2}{2\sigma^2}}\right)\\[2ex] &=\frac{1}{(2\pi\sigma^2)^{(n/2)}}\exp\left({\frac{-\sum_{i=1}^n(x_i-\mu)^2}{2\sigma^2}}\right)\\[2ex] &=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left({\frac{-\sum_{i=1}^n(x_i-\bar x + \bar x -\mu)^2}{2\sigma^2}}\right)\\[2ex] &=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left({\frac{-\left(\sum_{i=1}^n(x_i-\bar x)^2 + n(\bar x -\mu)^2\right)}{2\sigma^2}}\right)\\[2ex] \end{align}$

The ratio of pdfs (the denominator being the pdf of the sampling distribution of the sample mean for the normal, i.e. $N(\mu,\sigma^2/n)$) results in

$\begin{align} \frac{f_\mathrm X(\mathrm X =\mathrm x\vert \mu)}{q_{T(\mathrm X)}\left(T(\mathrm x)\vert \mu\right)}&=\frac{\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left({\frac{-\left(\sum_{i=1}^n(x_i-\bar x)^2 + n(\bar x -\mu)^2\right)}{2\sigma^2}}\right)} {\frac{1}{\left(2\pi\frac{\sigma^2}{n}\right)^{1/2}}\exp\left({\frac{-n(\bar x-\mu)^2}{2\sigma^2}}\right)}\\[2ex] &\propto \exp{\left(\frac{-\sum_{i=1}^n(x_i-\bar x)^2 }{2\sigma^2} \right)} \end{align}$

eliminating the dependency on a specific $\mu.$
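A brief numerical check of this ratio (my own sketch; the sample values are arbitrary and $\sigma=1$ is assumed known):

```python
import numpy as np

def normal_pdf(z, mean, sd):
    """Density of N(mean, sd^2) evaluated at z."""
    return np.exp(-(z - mean) ** 2 / (2 * sd ** 2)) / (sd * np.sqrt(2 * np.pi))

x = np.array([0.3, -1.2, 0.8, 2.1, -0.4])   # an arbitrary sample, n = 5
xbar, n, sigma = x.mean(), len(x), 1.0

for mu in (-1.0, 0.0, 2.5):
    f_sample = normal_pdf(x, mu, sigma).prod()             # joint pdf of the sample
    q_stat = normal_pdf(xbar, mu, sigma / np.sqrt(n))      # pdf of x-bar, N(mu, sigma^2/n)
    print(f"mu = {mu:+.1f}: f(x | mu) / q(xbar | mu) = {f_sample / q_stat:.6e}")

# The printed ratio is identical for every mu, matching the completed-square form:
# a constant (free of mu) times exp(-sum((x - xbar)**2) / (2 * sigma**2)).
```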

This is all beautifully explained in Statistical Inference by George Casella and Roger L. Berger.

Consequently, the sample mean is a sufficient statistic.

In contradistinction, the maximum value of the sample, which is a sufficient statistic for a uniform $[0,\theta]$ distribution with unknown $\theta,$ would not be sufficient to estimate the mean of Gaussian samples. The histogram of the maximum value of samples of size 10 from the uniform $[0,3]$ shows how the $\theta$ parameter is approximated, allowing the rest of the information in the sample to be discarded:

[Figure: histogram of the sample maximum over samples of size 10 from the uniform $[0,3]$, concentrating just below $\theta = 3$.]

The maximum is simply a more extreme example of using a single random variable from the sample vector, which was posted as a counterexample to sufficiency in the accepted answer.

In this case, the pdf of the statistic becomes unwieldy: the maximum of $n$ iid Gaussians has density $\frac{n}{\sigma}\,\Phi\!\left(\frac{x-\mu}{\sigma}\right)^{n-1}\phi\!\left(\frac{x-\mu}{\sigma}\right),$ where the cdf

$\Phi\!\left(\frac{x-\mu}{\sigma}\right)=\frac{1}{2}+\frac{1}{2}\text{erf}\left(\frac{x-\mu}{\sigma\sqrt 2}\right)$

involves the error function, which (among other differences between the numerator and denominator of the pdf ratio) precludes getting rid of $\mu.$

Intuitively, knowing the maximum value of each sample does not summarize all the information about the population mean, $\mu,$ available in the sample. This is visually clear when plotting the sampling distribution of the means of $10^6$ simulations of samples of size $n=10$ from $N(0,1)$ (on the left) versus the sampling distribution of the maximum values (on the right):

[Figure: sampling distribution of the sample mean (left) versus the sample maximum (right) over $10^6$ simulated samples of size 10 from $N(0,1)$.]

The latter discards information available in the complete sample that is needed to estimate the population mean, and it is also biased.
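A reduced version of that simulation can be sketched as follows (my own code, with $10^5$ rather than $10^6$ replications; the seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# 10^5 samples of size n = 10 from N(0, 1)
samples = rng.normal(0.0, 1.0, size=(100_000, 10))

means = samples.mean(axis=1)     # sampling distribution of the sample mean
maxima = samples.max(axis=1)     # sampling distribution of the sample maximum

print(f"sample mean: mean = {means.mean():+.3f}, sd = {means.std():.3f}")   # ~ 0, ~ 1/sqrt(10)
print(f"sample max : mean = {maxima.mean():+.3f}, sd = {maxima.std():.3f}") # ~ +1.54, biased upward

# The maxima are centred well above the true mu = 0 and are more dispersed than the
# means: the maximum both discards information about mu and is a biased summary of it.
```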

-1

Here is my attempt to answer part 4 of the question. If you have a parametric family of distributions with a parameter theta (theta can be K-dimensional for K >= 1), then a statistic is sufficient if all the information about theta in the data is contained in it. That means that, within the parametric family, the conditional distribution of the data given the sufficient statistic does not depend on theta. For many problems there are several functions T of the data that are sufficient.

Sufficiency usually carries data reduction with it. For example, say the parametric family is N(m, 1) (m is the mean and 1 is the variance). To estimate m you take a sample of size n = 10. The data involve ten values, but the sufficient statistic (the sample mean) is just a single number. A sufficient statistic achieving the greatest possible data reduction of this kind is called minimal sufficient, and the sample mean is minimal sufficient here.

To determine that a candidate statistic T is sufficient you can use the factorization theorem: T is sufficient iff the density factors as f(x|theta) = g(T(x), theta) h(x), where f is the density for the data given theta, g depends on the data only through T(x), and h(x) does not depend on theta. (Equivalently, as in the previous answer, you can divide f by the density of T(x) given theta and check that the resulting ratio does not involve theta.) Besides sufficiency and minimal sufficiency there is a concept called complete sufficiency. All these concepts are important in the parametric theory of point estimation. This is nicely covered in Casella and Berger's book Statistical Inference, 2nd edition, Chapter 6.
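For instance, for the N(m, 1) family used above with T(x) = x-bar, one valid factorization (a standard worked example, not taken from the answer itself) is

$f(x\mid m)=\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}(x_i-m)^2}=\underbrace{e^{-\frac{n}{2}(\bar x-m)^2}}_{g(T(x),\,m)}\;\underbrace{(2\pi)^{-n/2}e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i-\bar x)^2}}_{h(x)},$

where h does not involve m, so the factorization theorem confirms that the sample mean is sufficient for m.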