5

Suppose $n$ real numbers $a_1 < \cdots < a_n$ are selected uniformly at random from the interval $[0, 1)$ and then sorted. If one knows $k$ numbers at consecutive positions, $a_i < \cdots < a_{i+k-1}$, how good an estimator of $n$ is $(k - 1) / (a_{i+k-1} - a_i)$? What other estimators are possible or better?

NOTE: $n \gg k$.

  • $n$ and $k$ are not correlated. (2011-01-17)

3 Answers

4

As in the book A First Course in Order Statistics (see, in particular, Section 2.5), let us denote by $W_{i,j:n}$ the spacing $W_{i,j:n} = U_{j:n} - U_{i:n}$, $1 \leq i < j \leq n$, where $U_{1:n} < \cdots < U_{n:n}$ are $n$ order statistics from a ${\rm uniform}(0,1)$ distribution. By equation (2.5.21) of that book, the density function of $W_{i,j:n}$ is given by $ f_{W_{i,j:n} } (w) = \frac{{n!}}{{(j - i - 1)!(n - j + i)!}}w^{j - i - 1} (1 - w)^{n - j + i} ,\;\; 0 < w < 1, $ so that $W_{i,j:n}$ has a ${\rm Beta}(j - i,\, n - j + i + 1)$ distribution (which depends only on $j-i$, not on $i$ and $j$ individually). Having the distribution of $W_{i,j:n}$, you can check how good an estimator of $n$ the statistic $\frac{{j - i}}{{U_{j:n} - U_{i:n} }}$ is, which is what you asked (in different notation).

EDIT: Using the density function of $W_{i,j:n}$, one can easily calculate the bias (cf. Trevor's answer). My calculation shows that $ {\rm E}\bigg[\frac{{j - i}}{{U_{j:n} - U_{i:n} }}\bigg] = \frac{{j - i}}{{j - i - 1}}n, $ hence the estimator gets better and better as $j-i$ grows. (The expectation is infinite when $j-i=1$; cf. Dinesh's answer.)
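For a quick sanity check, here is a minimal Python simulation sketch (the values of $n$, $i$, $j$ and the number of replications below are arbitrary illustrative choices, not part of the argument) that compares the empirical mean of $\frac{j-i}{U_{j:n}-U_{i:n}}$ with the closed form above and checks the first two moments of the ${\rm Beta}(j-i,\, n-j+i+1)$ claim:

```python
import numpy as np

rng = np.random.default_rng(0)
n, i, j = 1000, 10, 30          # assumed example values; any 1 <= i < j <= n will do
reps = 20_000

# Simulate the spacing W_{i,j:n} = U_{j:n} - U_{i:n} directly from sorted uniforms.
u = np.sort(rng.random((reps, n)), axis=1)
w = u[:, j - 1] - u[:, i - 1]    # 0-based indexing for the i-th and j-th order statistics

# Compare the simulated mean of (j-i)/W with the closed form (j-i)/(j-i-1) * n.
est = (j - i) / w
print("simulated E[(j-i)/W]       :", est.mean())
print("closed form (j-i)/(j-i-1)*n:", (j - i) / (j - i - 1) * n)

# Check the Beta(j-i, n-j+i+1) claim via the first two moments of W.
a, b = j - i, n - j + i + 1
print("simulated mean/var of W:", w.mean(), w.var())
print("Beta mean/var          :", a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1)))
```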

3

Are you constrained to use the estimator that you have indicated? If not, another choice would be to form the joint pdf $f(a_i,\dots,a_{i+k-1} \mid n)$ and use the value of $n$ that maximizes this joint pdf as your estimate of the population size.

The joint pdf of the $k$ consecutive values is

$\begin{eqnarray} f(a_i,\dots,a_{i+k-1} \mid n) = k!{n \choose k} {{n-k} \choose {i-1}} a_i^{(i-1)} (1-a_{i+k-1})^{(n-i+1)-k} \end{eqnarray}$

This can be seen as follows. Let us say we have $n$ unordered random variables. We choose $k$ of them with order (in $k! {n \choose k}$ ways). Of the remaining $(n-k)$ variables, we choose $(i-1)$ (in ${{n-k} \choose {i-1}}$ ways) and designate them to be the first $(i-1)$ ordered variables in some order. They are all less than $a_i$, and the probability of this happening is $a_i^{i-1}$. The remaining variables we designate to be larger than the $(i+k-1)$th variable, which happens with probability $(1-a_{i+k-1})^{(n-i+1-k)}$. It only remains to account for the probability that the $j$th variable takes the value $a_j$ for $i \leq j \leq i+k-1$; but this contributes just $f(a_j)\,da_j = da_j$, since the variables are uniformly selected from $[0,1)$ and so have density $1$.

Maximizing this as a function of $n$ is a potentially messy business. To get an idea of how good this approach is, I used Stirling's approximation for the factorials and further assumed that $n$ is much larger than $i,k$. Then, the maximizing value is approximately

$\begin{eqnarray} \tilde{n} = \frac{i+k-1}{\log \frac{1}{1-a_{i+k-1}}} \end{eqnarray}$

Simulations seem to suggest that this estimator is better in some parameter ranges (e.g., when $n$ is much larger than $i,k$), while in other regimes the estimator in the question is better (when $k,i$ cannot be ignored in comparison with $n$). Even in the latter regime, it might be possible to maximize the joint pdf above more carefully (without asymptotic approximations like Stirling's) to get a better estimator. Edit: I checked in simulation that numerically maximizing the joint pdf without resorting to Stirling's approximation does give a uniformly better estimator.
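For concreteness, here is a rough Python sketch of both versions: the closed-form Stirling approximation above and a brute-force numerical maximization of the log of the joint pdf over integer $n$. The values `n_true`, `i`, `k` and the search bound `n_max` are arbitrary assumptions for illustration only:

```python
import numpy as np
from scipy.special import gammaln

def log_joint_pdf(n, i, k, a_last):
    """Log of the joint pdf above, keeping only the terms that depend on n:
    log[ n!/(n-i-k+1)! ] + (n-i-k+1) * log(1 - a_{i+k-1})."""
    return (gammaln(n + 1) - gammaln(n - i - k + 2)
            + (n - i - k + 1) * np.log(1.0 - a_last))

def mle_n(i, k, a_last, n_max=10**6):
    # Brute-force maximization over integer n >= i + k - 1 (could be refined).
    ns = np.arange(i + k - 1, n_max)
    return ns[np.argmax(log_joint_pdf(ns, i, k, a_last))]

def stirling_n(i, k, a_last):
    # The approximate maximizer from above, valid when n is much larger than i, k.
    return (i + k - 1) / np.log(1.0 / (1.0 - a_last))

# Tiny demo with assumed values of n_true, i, k.
rng = np.random.default_rng(1)
n_true, i, k = 5000, 3, 10
a = np.sort(rng.random(n_true))
block = a[i - 1:i + k - 1]                      # the k consecutive order statistics a_i .. a_{i+k-1}
spacing_est = (k - 1) / (block[-1] - block[0])  # the estimator from the question
print("spacing estimator:", spacing_est)
print("Stirling approx  :", stirling_n(i, k, block[-1]))
print("numerical MLE    :", mle_n(i, k, block[-1]))
```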

Another thing I noted from the simulation is that if $k=2$, the estimator $\frac{1}{a_{i+1}-a_i}$ performs poorly. To see this, note that the distribution of $D = a_{i+1}-a_i$ is independent of $i$ and has density $f_D(d) = n_t(1-d)^{n_t-1}$, where $n_t$ denotes the true value of $n$. This makes the estimator $\tilde{N} = 1/D$ have the density function

$\begin{equation} f_{\tilde{N}}(\tilde{n}) = \frac{n_t}{\tilde{n}^2} \left(1-\frac{1}{\tilde{n}}\right)^{n_t-1} \end{equation}$

This estimator has an infinite expectation and thus an infinite bias. But this is a moot point if for some reason you are constrained to use the estimator you had indicated in the question.
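A small sketch illustrating this (the choices of `n_true` and the number of draws are arbitrary): drawing $D$ from the ${\rm Beta}(1, n_t)$ spacing distribution, the running mean of $1/D$ never settles, while the median stays finite:

```python
import numpy as np

rng = np.random.default_rng(2)
n_true, reps = 1000, 100_000

# D = a_{i+1} - a_i has density n_true * (1 - d)^(n_true - 1), i.e. Beta(1, n_true),
# independently of i.
d = rng.beta(1, n_true, size=reps)
est = 1.0 / d                     # the k = 2 estimator from the question

# The running mean keeps jumping whenever a tiny spacing appears: a symptom of the
# infinite expectation. The median, by contrast, is finite (roughly 1.44 * n_true).
running_mean = np.cumsum(est) / np.arange(1, reps + 1)
print("running mean after 1e3, 1e4, 1e5 draws:",
      running_mean[999], running_mean[9999], running_mean[-1])
print("median of the estimator:", np.median(est))
```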

  • Maybe I should have pointed this out: $i$ is unknown; $k$ and $a_i, \dots, a_{i+k-1}$ are known. (2011-01-18)
0

Let $\hat{n} = \frac{k-1}{a_{i+k-1}-a_i}$. I think you have to find $\mathbb{E}(n)- \mathbb{E}(\hat{n}) = n- \mathbb{E}(\hat{n})$ (i.e. the bias). I think the margin of error at $95 \%$ confidence would be of the following form: $ \mathbb{E}(\hat{n}) \pm \underbrace{l_{0.95}\sqrt{\text{Var}(\hat{n})}}_{\text{margin of error}}$.
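One possible way to fill in these quantities is a sketch like the following, assuming the ${\rm Beta}(k-1,\, n-k+2)$ spacing distribution from the first answer and taking $l_{0.95}$ as the two-sided 95% normal quantile; the values of $n$ and $k$ are arbitrary examples:

```python
import numpy as np
from scipy.stats import norm

def hat_n_moments(n, k):
    """E and Var of hat_n = (k-1)/W with W ~ Beta(k-1, n-k+2), using the standard
    inverse moments of a Beta variable; requires k >= 4 for a finite variance."""
    a = k - 1                                   # first Beta shape parameter of the spacing
    mean = a * n / (a - 1)                      # (k-1)/(k-2) * n, matching the bias noted above
    second = a**2 * n * (n - 1) / ((a - 1) * (a - 2))
    return mean, second - mean**2

# Assumed example values, not from the question.
n, k = 10_000, 20
mean, var = hat_n_moments(n, k)
z = norm.ppf(0.975)                             # two-sided 95% normal quantile, standing in for l_0.95
print("bias n - E(hat_n):", n - mean)
print("margin of error  :", z * np.sqrt(var))
```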

  • You use a normal approximation, which comes from the CLT, but $\hat{n}$ is not a sum of independent random variables. (2011-01-17)