6
$\begingroup$

I have a large array of $n$ integers, some of which may be repeated, and I want to estimate how many distinct integers are in the array. Say the number of distinct integers is $N$. I can sample with replacement easily but can't afford to sample anything like $n$ samples as $n$ is too big. If I sample $y$ positions uniformly with replacement, let $X$ be the number of distinct integers I get in the sample. How can we use $X$ to give an estimate for $N$?

When $y$ is $\Omega(n \log n)$ then we expect to have seen every position in the array and so $X=N$ with high probability by an application of http://en.wikipedia.org/wiki/Coupon_collector%27s_problem . When $y$ is much smaller than $n$ it seems we might only be able to give probabilistic upper and lower bounds as estimates for $N$ depending on the distribution of duplicates in the input array.

EDIT: As pointed out in the comments, there was an error in the original question (now fixed).

  • 2
    This is commonly known as the problem of *species richness estimation*. A web-search on this phrase should turn up lots of references.2012-11-22

1 Answers 1