29

Which of the following is more surprising?

  1. In a group of 100 people, the tallest person is one inch taller than the second tallest person.
  2. In a group of one billion people, the tallest person is one inch taller than the second tallest person.

Put more precisely, suppose we have a normal distribution with given mean $\mu$ and standard deviation $\sigma$. If we sample from this distribution $N$ times, what is the expected difference between the largest and second largest values in our sample? In particular, does this expected difference go to zero as $N$ grows?

In another question, it is explained how to compute the distribution $MAX_N$ of the maximum, but I don't see how to extract an estimate for the expected value of the maximum from that answer. Though $E(MAX_N)-E(MAX_{N-1})$ isn't the number I'm looking for, it might be a good enough estimate to determine if the value goes to zero as $N$ gets large.
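For a rough sense of the answer before any theory, here is a small Monte Carlo sketch (the sample sizes, trial count, and seed are my own arbitrary choices; one billion samples per trial is impractical, so smaller $N$ are used just to show the trend):

```python
# Estimate the expected gap between the largest and second-largest of N
# standard-normal samples.  Sample sizes, trial count, and seed are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

def mean_top_gap(n, trials=200):
    """Average of (largest - second largest) over repeated samples of size n."""
    gaps = np.empty(trials)
    for t in range(trials):
        x = rng.standard_normal(n)
        second, largest = np.partition(x, -2)[-2:]  # two largest values
        gaps[t] = largest - second
    return gaps.mean()

for n in (100, 10_000, 1_000_000):
    print(f"N = {n:>9}: mean gap ~ {mean_top_gap(n):.3f}")
# The estimated gap shrinks slowly as N grows (mu and sigma only rescale it).
```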

  • 1
    I was thinking continuity is only important because samples can be close together. Think of taking one billion samples from the even naturals between $0$ and $10^{10^{10}}$. You wouldn't be surprised that the largest exceeds the second largest by more than 1, because otherwise you would have to draw two samples from the same bin. But over that range, even with a continuous distribution, you would expect the gap between the top two to be $>1$. (2011-03-03)

6 Answers

8

The precise version of the question was answered in the affirmative in the paper "Extremes, Extreme Spacings, and Tail Lengths: An Investigation for Some Important Distributions," by Mudholkar, Chaubey, and Tian (Calcutta Statistical Association Bulletin 61, 2009, pp. 243-265). (Unfortunately, I haven't been able to find an online copy.)

Let $X_{i:n}$ denote the $i$th order statistic from a random sample of size $n$. Let $S_{n:n} = X_{n:n} - X_{n-1:n}$, the rightmost extreme spacing. The OP asks for $E[S_{n:n}]$ when sampling from a normal distribution.

The authors prove that, for an $N(0,1)$ distribution, $\sqrt{2 \log n}\, S_{n:n}$ converges in distribution to $\log Z - \log Y$, where $f_{Z,Y}(z,y) = e^{-z}$ if $0 \leq y \leq z$ and $0$ otherwise.

Thus $S_{n:n} = O_p(1/\sqrt{\log n})$ and therefore converges in probability to $0$ as $n \to \infty$. So $\lim_{n \to \infty} E[S_{n:n}] = 0$ as well. Moreover, since $E[\log Z - \log Y] = 1$, $E[S_{n:n}] \sim \frac{1}{\sqrt{2 \log n}}$. (For another argument in favor of this last statement, see my previous answer to this question.)

In other words, (2) is more surprising.

Added: This does, however, depend on the fact that the sampling is from the normal distribution. The authors classify a distribution as ES short if $S_{n:n}$ converges in probability to $0$ as $n \to \infty$; ES medium if $S_{n:n}$ is bounded but nonzero in probability; and ES long if $S_{n:n}$ diverges in probability. While the $N(0,1)$ distribution has ES short right tails, the authors show that the gamma family has ES medium right tails (see Shai Covo's answer for the special case of the exponential) and the Pareto family has ES long right tails.
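For readers who want to see the three regimes without the paper, here is a rough simulation sketch (my own illustration, not from Mudholkar, Chaubey, and Tian; the sample sizes, trial count, and Pareto shape parameter are arbitrary, and the Pareto column is noisy because its spacing is heavy-tailed):

```python
# Monte Carlo illustration of ES short / medium / long right tails.
import numpy as np

rng = np.random.default_rng(1)

def mean_spacing(sampler, n, trials=500):
    """Average rightmost spacing X_{n:n} - X_{n-1:n} over repeated samples."""
    gaps = np.empty(trials)
    for t in range(trials):
        x = sampler(n)
        second, largest = np.partition(x, -2)[-2:]
        gaps[t] = largest - second
    return gaps.mean()

for n in (1_000, 100_000):
    normal = mean_spacing(rng.standard_normal, n)                 # ES short
    expo   = mean_spacing(lambda k: rng.exponential(1.0, k), n)   # ES medium
    pareto = mean_spacing(lambda k: rng.pareto(3.0, k), n)        # ES long (noisy)
    print(f"n={n:>6}: normal {normal:.3f} "
          f"(compare 1/sqrt(2 ln n) = {1/np.sqrt(2*np.log(n)):.3f}), "
          f"exponential {expo:.3f}, Pareto {pareto:.3f}")
```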

  • 0
    I had to do the transformation by hand to verify it, but you are correct. Nice observation! For the record, the Mudholkar, Chaubey, and Tian paper does not mention that $\log Z - \log Y$ has that simpler form. (2014-01-13)
7

Revised answer.

A very accurate approximation for the case of the normal distribution can be found in this paper. Let $X_{1:n} \leq X_{2:n} \leq \cdots \leq X_{n:n}$ be the order statistics obtained from a random sample $X_1,X_2,\ldots,X_n$, where $X_i \sim {\rm Normal}(\mu,\sigma^2)$. According to Eq. (2), for $i \geq n/2$ and as $n \to \infty$,
$$ X_{i:n} \approx \mu + \sigma \bigg[\sqrt {2\ln n} - \frac{{\ln (\ln n) + \ln (4\pi ) - 2W_{i:n} }}{{2\sqrt {2\ln n} }}\bigg], $$
where $W_{i:n}$ has the density
$$ g_{i:n} (w) = \frac{1}{{(n - i)!}}\exp ( - (n - i + 1)w - \exp ( - w)), \quad - \infty < w < \infty . $$
Thus, for example,
$$ g_{n:n} (w) = \exp ( - w - \exp ( - w)), \quad - \infty < w < \infty $$
and
$$ g_{n-1:n} (w) = \exp ( - 2w - \exp ( - w)), \quad - \infty < w < \infty . $$
According to Eqs. (3) and (4) of that paper,
$$ {\rm E}[X_{n:n} ] \approx \mu + \sigma \bigg[\sqrt {2\ln n} - \frac{{\ln (\ln n) + \ln (4\pi ) - 2 \cdot 0.5772}}{{2\sqrt {2\ln n} }}\bigg] $$
and
$$ {\rm Var}[X_{n:n} ] \approx \frac{{\sigma ^2 \cdot 1.64493}}{{2\ln n}}. $$
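As a quick sanity check of the quoted approximations (my own, not from the paper), one can compare them against a direct simulation for $\mu = 0$, $\sigma = 1$; the values of $n$ and the number of replications below are arbitrary:

```python
# Compare the quoted approximations for E[X_{n:n}] and Var[X_{n:n}] (mu=0, sigma=1)
# with a Monte Carlo estimate.  n and the number of replications are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 10_000, 5_000

ln_n = np.log(n)
e_approx = np.sqrt(2 * ln_n) - (np.log(ln_n) + np.log(4 * np.pi) - 2 * 0.5772) / (2 * np.sqrt(2 * ln_n))
v_approx = 1.64493 / (2 * ln_n)

maxima = np.array([rng.standard_normal(n).max() for _ in range(reps)])
print(f"E[X_(n:n)]:   approx {e_approx:.4f}, simulated {maxima.mean():.4f}")
print(f"Var[X_(n:n)]: approx {v_approx:.4f}, simulated {maxima.var():.4f}")
```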

Some general facts, which are somewhat useful in our context. If $X_{1:n} \leq X_{2:n} \leq \cdots \leq X_{n:n}$ are the order statistics obtained from a random sample $X_1,X_2,\ldots,X_n$, where the $X_i$ have cdf $F$ and pdf $f$, then
$$ {\rm E}[X_{i:n}] = \frac{{n!}}{{(i - 1)!(n - i)!}}\int_{ - \infty }^\infty {x [ F(x)] ^{i - 1} [ 1 - F(x)] ^{n - i} f(x)\,dx}. $$
By an exercise in a book on order statistics,
$$ {\rm E}[X_{r + 1:n} - X_{r:n} ] = {n \choose r}\int_{ - \infty }^\infty {[F(x)]^r [1 - F(x)]^{n - r}\, dx} ,\quad r = 1, \ldots ,n - 1. $$
Letting $r=n-1$ thus gives
$$ {\rm E}[X_{n:n} - X_{n-1:n} ] = n \int_{ - \infty }^\infty {[F(x)]^{n-1} [1 - F(x)]\, dx}. $$
Applying this formula to the exponential distribution with mean $\theta$ gives a constant difference:
$$ {\rm E}[X_{n:n} - X_{n-1:n} ] = n\int_0^\infty {(1 - e^{ - x/\theta } )^{n - 1} e^{ - x/\theta } \, dx} = \theta. $$
Nevertheless, the corresponding pdf, $\theta ^{ - 1} e^{ - x/\theta } \mathbf{1}(x \geq 0)$, goes to zero much faster than, say, $1/x^2$ as $x \to \infty$. In fact, $X_{n:n} - X_{n-1:n}$ is exponentially distributed with mean $\theta$ (see also leonbloy's answer). Indeed, substituting the exponential cdf $F(x)=(1-e^{-x/\theta})\mathbf{1}(x \geq 0)$ and pdf $f(x)=\theta^{-1} e^{-x/\theta}\mathbf{1}(x \geq 0)$ into the general formula
$$ f_{X_{n:n} - X_{n-1:n} } (w) = \frac{{n!}}{{(n - 2)!}}\int_{ - \infty }^\infty {[F(x)]^{n - 2} f(x)f(x + w)\,dx},\quad 0 < w < \infty $$
for the density of $X_{n:n}-X_{n-1:n}$ (a special case of the formula for $X_{j:n}-X_{i:n}$, $1 \leq i < j \leq n$) gives
$$ f_{X_{n:n} - X_{n-1:n} } (w) = \theta^{-1}e^{-w/\theta}, \quad 0 < w < \infty; $$
that is, $X_{n:n} - X_{n-1:n}$ is exponential with mean $\theta$.
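A small simulation check of the exponential claim (again my own sketch; $\theta$, the sample sizes, and the trial count are arbitrary): the rightmost spacing should have mean $\theta$ and, being exponential, should exceed $\theta$ with probability about $e^{-1} \approx 0.37$, regardless of $n$.

```python
# Check that X_{n:n} - X_{n-1:n} behaves like an exponential with mean theta,
# independently of n.  theta, n, and the trial count are arbitrary.
import numpy as np

rng = np.random.default_rng(3)
theta, trials = 2.0, 2_000

def top_spacings(n):
    gaps = np.empty(trials)
    for t in range(trials):
        x = rng.exponential(theta, n)
        second, largest = np.partition(x, -2)[-2:]
        gaps[t] = largest - second
    return gaps

for n in (10, 1_000, 100_000):
    g = top_spacings(n)
    print(f"n={n:>6}: mean spacing {g.mean():.3f} (theta = {theta}), "
          f"P(spacing > theta) ~ {np.mean(g > theta):.3f} (exp(-1) ~ 0.368)")
```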

  • 0
    @ShaiCovo and others: Could the asymptotic expression for $X_{n:n}$ be used to compute an asymptotic expression for the moment generating function $\mathbb{E}[e^{tX_{n:n}}]$ of $X_{n:n}$, or are there issues? When $\mu=0$ and $\sigma=1$, Mathematica evaluates the resulting integral to $\exp\left[t\left(\sqrt{\log n}-\frac{\log\log n+\log 4\pi}{2\sqrt{2\log n}}\right)\right]\Gamma[1-t]$, valid for $t<1$... (2013-12-02)
6

A quick heuristic attempt at this: first, standard results on order statistics tell us that if we take $n$ samples from any distribution, with CDF $F$, the $k$th shortest person out of $n$ will typically have height around $F^{-1}(k/(n+1))$.

So fix $\mu = 0$ and $\sigma = 1$. Then we expect the height of the tallest out of $n-1$ people to be around $\Phi^{-1}(1-1/n)$, and the height of the second tallest to be around $\Phi^{-1}(1-2/n)$, where $\Phi$ is the standard normal CDF. The question, then, is what happens to $\Phi^{-1}(1-1/n)-\Phi^{-1}(1-2/n)$ as $n$ gets large.

Now, it's a standard estimate that for large $z$, $1-\Phi(z) \approx \phi(z)/z$, where $\phi(z) = e^{-z^2/2}/\sqrt{2\pi}$ is the standard normal PDF. So let $\epsilon = 1-\Phi(z)$; then we get $\epsilon \approx \phi(z)/z$. Inverting gives the approximation

$\Phi^{-1}(1-\epsilon) \approx W\left( {1 \over 2\epsilon^2 \pi} \right)^{1/2}$,

where $W$ is the Lambert $W$ function, the inverse of $x \rightarrow xe^x$. In particular, if $\epsilon = 1/n$, then we have

$\Phi^{-1}(1-1/n) \approx W \left( {n^2 \over 2 \pi} \right)^{1/2}$.

So finally the question becomes, what happens to

$ W\left( {n^2 \over 2 \pi} \right)^{1/2} - W\left( {n^2 \over 8\pi} \right)^{1/2} $

as $n$ gets large? Since $W(x) \approx \log x$ for large $x$, both terms grow like $\sqrt{2 \log n}$ and their difference is of order $1/\sqrt{\log n}$, so it goes to zero as $n$ gets large; that is, smaller gaps are expected between the largest and second largest entries in larger samples from the normal distribution. So (2) is more surprising.
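For what it's worth, here is a quick numerical comparison of the exact quantile gap with the Lambert-W approximation (my own check; it assumes scipy is available):

```python
# Compare the exact quantile gap Phi^{-1}(1-1/n) - Phi^{-1}(1-2/n) with the
# Lambert-W approximation derived above.
import numpy as np
from scipy.special import lambertw
from scipy.stats import norm

for n in (100, 10**6, 10**9):
    exact = norm.ppf(1 - 1 / n) - norm.ppf(1 - 2 / n)
    approx = (np.sqrt(lambertw(n**2 / (2 * np.pi)).real)
              - np.sqrt(lambertw(n**2 / (8 * np.pi)).real))
    print(f"n = {n:>10}: exact gap {exact:.4f}, Lambert-W approx {approx:.4f}")
```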

That being said, I've thrown out a lot here, but I'm guessing that this captures the correct asymptotics.

  • 0
    @Anton Geraschenko: Nor is it true that a decay faster than $1/x^2$ implies that the difference will go to zero. See the exponential case, in my answer or in Shai Covo's. (2011-03-03)
6

Let $\mu_{i:n}$ denote the mean of the $i$th order statistic (the $i$th smallest observation) from a sample of size $n$ from a distribution. The question is about the behavior of $\mu_{n:n} - \mu_{n-1:n}$ for a normal distribution. David and Nagaraja's Order Statistics says that

  1. For an $N(0,1)$ distribution, $\mu_{n:n}$ has the asymptotically dominant term $\sqrt{2\log n}$ (pp. 302-303, see also Shai Covo's answer).
  2. For any distribution, $\mu_{n-1:n} + (n-1)\mu_{n:n} = n \mu_{n-1:n-1}$ (p. 44).

From (2), we have $\mu_{n:n} - \mu_{n-1:n} = n (\mu_{n:n} - \mu_{n-1:n-1})$.

Combining this with the asymptotic from (1), we see that $\mu_{n:n} - \mu_{n-1:n}$ has the asymptotically dominant term $n \left(\sqrt{2\log n} - \sqrt{2\log (n-1)} \right)$. Since $\sqrt{2\log n} - \sqrt{2\log (n-1)} \approx \frac{1}{n \sqrt{2 \log n}}$ (differentiate $\sqrt{2\log n}$ with respect to $n$), this is asymptotically $\frac{1}{\sqrt{2 \log n}}$, which tells us both that $\mu_{n:n} - \mu_{n-1:n} \to 0$ and the rate at which it does so.
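A quick numerical check of that dominant-term calculation (illustration only; the values of $n$ are arbitrary):

```python
# Check that n*(sqrt(2 log n) - sqrt(2 log(n-1))) is close to 1/sqrt(2 log n).
import numpy as np

for n in (10**3, 10**6, 10**9):
    lhs = n * (np.sqrt(2 * np.log(n)) - np.sqrt(2 * np.log(n - 1)))
    rhs = 1 / np.sqrt(2 * np.log(n))
    print(f"n = {n:>10}: n*(difference) = {lhs:.5f}, 1/sqrt(2 log n) = {rhs:.5f}")
```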

2

I'd take this approach:

  • Call $ p_d= P(x_1 > x_2 + d)$

  • The probability that sample $x_1$ is the largest value AND it exceeds the second largest by more than $d$ is just $p_d^{N-1}$

  • The probability that the largest value exceeds the second largest by more than $d$ is then $N p_d^{N-1}$. WRONG - fixed below

To compute $p_d$: that is the probability that the difference of two iid normals exceeds $d$. But the difference of two normals is itself normal, with mean zero and variance $2 \sigma^2$, so $p_d = 1 - \Phi\big(d/(\sigma\sqrt{2})\big)$. What matters to us is that it's a constant that doesn't depend on $N$.

So the probability in question (fixed $d$, $N \to \infty$) tends to zero.


UPDATE: As correctly pointed out in the comments, the second step is wrong: the events are not independent. They are independent, though, if $x_1$ is fixed; that is, they are conditionally independent. So:

$P(x_1 > x_2 + d | x_1) = F_x(x_1 - d) $

$P(x_1 > x_i + d \ \text{for all } i \neq 1 \mid x_1) = F_x^{N-1}(x_1 - d)$

And $P(x_1 > x_i + d \ \text{for all } i \neq 1) = \int_{-\infty}^{\infty}F_x^{N-1}(x - d) f_x(x)\, dx$

(this tends to $1/N$ as $d \to 0^+$, as was to be expected)

So, restricting to $d \ge 0$, the probability that the largest value exceeds the second largest by more than $d$ is (if I have not messed up anything else):

$ P(x_A - x_B > d)= N \int_{-\infty}^{\infty} F_x^{N-1}(x - d) f_x(x) dx $

where $x_A$ is the largest value and $x_B$ the second largest.

The question is about whether the above formula goes to zero or not for fixed $d$ and growing $N$. That seems to depend on the density.
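Here is a numerical evaluation of that formula for the standard normal (my own sketch, not part of the original derivation; the gap $d$ is fixed at one standard deviation, and the integration window is an ad hoc cut-off so that quad does not miss the narrow bump far out in the tail):

```python
# Evaluate P(x_A - x_B > d) = N * integral of F^{N-1}(x - d) f(x) dx
# for the standard normal, with d fixed.  The integration window is ad hoc.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def p_gap_exceeds(d, N):
    integrand = lambda x: norm.cdf(x - d) ** (N - 1) * norm.pdf(x)
    lo = d + norm.ppf(1 - min(0.5, 50.0 / N))  # below this the integrand is negligible
    val, _ = quad(integrand, lo, lo + 20)
    return N * val

for N in (100, 10_000, 1_000_000):
    print(f"N = {N:>9}: P(gap > 1 sigma) ~ {p_gap_exceeds(1.0, N):.4f}")
# For the normal this probability decreases as N grows, unlike the exponential case below.
```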

As a check: if $x$ is exponential with parameter $\lambda$, the integrals can be evaluated, and I get:

$ P(x_A - x_B > d) = \exp( - \lambda d) $

This (besides tending to 1 for $d \to 0^+$, as it should) does not depend on $N$. Hence, for an exponential distribution the two events of the original question are equally surprising.

  • 0
    You are totally right. I tried to fix it. (2011-03-03)
0

I'm no statistician, but it would clearly depend on what type of distribution you're talking about. The highest-rated answer is talking about PDFs and CDFs, and then he goes on to pick a CDF and do calculations off of that. The problem with that is that not everything is a bell curve or other very common distribution.

Given a completely flat (uniform) distribution, such as the raw output of a computer's random number generator, outliers will evaporate very quickly: for a uniform distribution on $[0,1]$, the expected gap between the two largest of $n$ samples is $1/(n+1)$.

Given a bell curve, such as your height example, outliers will be semi-persistent: the gap still shrinks, but only at the slow rate $1/\sqrt{2\log n}$ discussed in the other answers.

For a distribution with sufficiently heavy tails (the extreme case being infinite variance), there is no reason for outliers to have close neighbors.