5
$\begingroup$

I have a large array of $n$ integers, some of which may be repeated, and I want to estimate how many distinct integers are in the array. Say the number of distinct integers is $N$. I can sample with replacement easily but can't afford to sample anything like $n$ samples as $n$ is too big. If I sample $y$ positions uniformly with replacement, let $X$ be the number of distinct integers I get in the sample. How can we use $X$ to give an estimate for $N$?

When $y$ is $\Omega(n \log n)$ then we expect to have seen every position in the array and so $X=N$ with high probability by an application of http://en.wikipedia.org/wiki/Coupon_collector%27s_problem . When $y$ is much smaller than $n$ it seems we might only be able to give probabilistic upper and lower bounds as estimates for $N$ depending on the distribution of duplicates in the input array.

EDIT: As pointed out in the comments, there was an error in the original question (now fixed).

  • 0
    By "sample with replacement" and "sample $y$ positions", do you mean that you take $y$ samples, each at a position independently uniformly selected from the $n$ positions in the array (so a position may be sampled repeatedly)?2012-11-22
  • 0
    @joriki, yes. Sorry that wasn't clear.2012-11-22
  • 0
    Then your formula is wrong. As $y\to\infty$, $E(X)\to N$ and $n/y\to0$, so $E(nX/y)\to0\ne N$.2012-11-22
  • 0
    @joriki. Oh dear. Do you know how to fix it?2012-11-22
  • 0
    $E(X)$ depends on the particular distribution of the integers. If one integer is repeated very many times and all others are unique, you expect to see fewer distinct integers than if all integers are repeated a couple of times.2012-11-22
  • 0
    @joriki, Does that mean that there is no good estimator for counting the number of distinct elements without knowing this a priori?2012-11-22
  • 0
    @joriki, the two extreme cases you give seem to give upper and lower bounds for an estimator based on $X$. Maybe it would make more sense to give a confidence interval rather than a single estimate? I don't know how to do that however.2012-11-22
  • 0
    You can hardly design an estimator if you don't specify the underlying probabilistic model. How are the numbers generated? Are they chosen from a finite set? Are you not really interested in estimating the size of that finite set? Is this related with "German tank" problem?2012-11-22
  • 0
    @leonbloy, I would like probabilistic bounds assuming the worst case. As joriki points out these seem to be on the one hand when each symbol occurs the same number of times and on the other when there are $n-(N-1)$ copies of one symbol and one copy of the other $N-1$ symbols.2012-11-22
  • 0
    @leonbloy, I don't assume the integers are from a particular set. We can just assume that we can tell which are the same and which are different from each other.2012-11-22
  • 0
    @Arnott: But if you dont have a probabilistic model, you can't speak of "probabilistic bounds". You can only tell extreme cases, and that's not an estimator, and that's trivial: extremes are X and X + (n-y). If need something better (a true estimator), you need a probabilistic model.2012-11-22
  • 0
    @leonbloy, there is randomness in the sampling. So for example we can potentially say $P(X \geq a) \leq k$ and $P(X \leq b) \leq k'$ and also $ a' \leq E(X) \leq b'$. These bounds would hold no matter what the input looks like.2012-11-22
  • 0
    @leonbloy, to be a little clearer. We could say something like "$\forall$ inputs, $a' \leq E(X) \leq b'$ and $P(X\geq a) \leq k$ etc." The point is that the probabilistic bounds hold for all possible inputs.2012-11-22
  • 0
    "we can potentially say ..." No, we can't, if we don't know anything about how the numbers were originaly thrown. For example, you can't even say if sampling the first two numbers if (in probability) equivalent to sampling the last two numbers. "For all possible inputs" you say? Then, take the extreme inputs: all equal and all distinct. So, all the bound you can write is $1\le E(X) \le y$ irrespective of the sampling.2012-11-22
  • 0
    @leonbloy, Sorry I think I haven't explained this well. I meant fix $n$, (unknown) $N$, $y$ and then consider over all inputs that satisfy those parameters. So if $n=N$ then $E(X) = y$ (actually not quite right as we may have chosen the same position more than once). We can also say it is very unlikely that $N=y$ for example if $y= N/2$, say, as we would have been very likely to have hit a duplicate.2012-11-22
  • 0
    @leonbloy, downvote?2012-11-22
  • 2
    This is commonly known as the problem of *species richness estimation*. A web-search on this phrase should turn up lots of references.2012-11-22

0 Answers 0