
Here is the problem I'm trying to solve:

In order to send spam, a spammer generates fake nicknames by picking random girl names (and appending a random number to each). I assume the names are picked uniformly at random from a fixed list, and I get to observe the outcomes. Over time, names eventually start repeating.

From the distribution of the repetition counts of the distinct names, is there a way to estimate the size of the fixed list?

Intuitively, if there were no repetitions among the names I have observed, I would have no information about the size of the list beyond the lower bound given by the number of distinct names observed.

On the other hand, if the distribution of repeat counts is concentrated well away from one, I can be reasonably confident that the sample is much larger than the list, and that I have probably observed the full list already.

The problem lies in the middle zone. So far the distribution of repeat counts looks like this:

$ \begin{matrix} n & k \\ 114 & 1 \\ 66 & 2 \\ 30 & 3 \\ 4 & 4 \\ 2 & 5 \\ 1 & 6 \end{matrix} $

Here $n$ is the count of distinct names appearing $k$ times in the sample.
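For reference, the totals implied by this table can be tallied directly (a short Python sketch; the dictionary just transcribes the table above):

```python
# k -> n: number of distinct names that appeared exactly k times
repeat_counts = {1: 114, 2: 66, 3: 30, 4: 4, 5: 2, 6: 1}

N = sum(repeat_counts.values())                   # distinct names observed
K = sum(n * k for k, n in repeat_counts.items())  # total observations

print(N, K)  # 217 368
```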

Is there a simple estimator for the size of the list?

  • One of the seminal papers appears to be [this one](http://statistics.stanford.edu/~ckirby/techreports/BIO/BIO%2009.pdf) by Efron and Thisted, though. (2011-10-25)

3 Answers


Interesting problem; it should be well known. We can try a maximum-likelihood approach. Assume a population of $M$ total names; our experiment has $K = \sum_i n_i k_i = 368$ tries, yielding $N = \sum_i n_i = 217$ different values. To a good approximation (the factor $N^K$ counts all assignments of the $K$ draws to the $N$ chosen values, not only the surjective ones), the probability that in $K$ tries from $M$ elements we get $N$ different values is $ \frac{{M \choose N } N^K}{M^K} = \frac{M!}{M^K (M-N)!} \cdot \frac{N^K}{N!} $

where the second factor does not depend on $M$. The log-likelihood, then, can be written as

$L(M) = \log(M!) - K \log(M) - \log((M - N)!) + c \approx (M - K) \log(M) - (M - N) \log(M - N)$

where $c$ does not depend on $M$, and where we've used the Stirling approximation.

Setting the derivative $L'(M) = \log\frac{M}{M-N} - \frac{K}{M}$ to zero, the maximum occurs at $K/M = - \log(1 - N/M)$, or $M = N/ (1 - \exp(-K/M))$. The last equation suggests an iterative solution; in this case it gives $M \approx 315$.
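A minimal sketch of that fixed-point iteration (Python, with $N$ and $K$ taken from the table in the question):

```python
import math

N, K = 217, 368  # distinct names observed, total observations

# Iterate M <- N / (1 - exp(-K/M)), starting from the trivial
# lower bound M = N; the map contracts here, so the iteration converges.
M = float(N)
for _ in range(100):
    M_next = N / (1 - math.exp(-K / M))
    if abs(M_next - M) < 1e-9:
        break
    M = M_next

print(round(M))  # 315
```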

Of course, this does not validate that the data fits the assumed model.


I would try to fit this to a Poisson distribution. You have a list of $N$ names, and we assume each name is picked at random with replacement, with probability $1/N$ per draw. The sample contains $217$ distinct names. I find it hard to match the observed distribution that way: relative to the number of names seen once, there are too many names seen $2$ and $3$ times, so maybe the names are not truly chosen at random. Fitting by eye suggests a list of 250-300 names.
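Such a comparison can be sketched as follows (a hypothetical helper, not from the answer: under this model the hit count of a fixed name in $K$ draws is approximately Poisson with mean $\lambda = K/M$, so the expected number of names seen exactly $k$ times is $M e^{-\lambda}\lambda^k/k!$; the candidate values of $M$ are illustrative):

```python
import math

# Observed table: k -> number of distinct names seen exactly k times
observed = {1: 114, 2: 66, 3: 30, 4: 4, 5: 2, 6: 1}
K = sum(n * k for k, n in observed.items())  # 368 draws in total

def expected_counts(M, kmax=6):
    """Expected number of names seen exactly k times, for a list of M names,
    under the Poisson approximation with mean K/M."""
    lam = K / M
    return {k: M * math.exp(-lam) * lam ** k / math.factorial(k)
            for k in range(1, kmax + 1)}

# Compare the observed counts against a few candidate list sizes
for M in (250, 300, 315):
    print(M, {k: round(v, 1) for k, v in expected_counts(M).items()})
```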


Using the suggestion from Dilip Sarwate: if $N$ is the number of different names observed, $N_1$ is the number of names seen exactly once, and $K$ is the total number of observations, the simplest Good-Turing estimator of the total number of names is given by $ \hat M = {{N} \over {1-{{N_1} \over K}} }={{217} \over {1-{114 \over 368}} }=314.4,$ which for practical purposes is in good agreement with the maximum-likelihood estimate.
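A one-line check of that arithmetic (numbers taken from the question):

```python
N, N1, K = 217, 114, 368  # distinct names, names seen exactly once, total observations

M_hat = N / (1 - N1 / K)  # simple Good-Turing estimate of the list size
print(round(M_hat, 1))  # 314.4
```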