Here is the problem I'm trying to solve:
In order to send spam, a spammer generates fake nicknames, by picking random girl names (and appending a random number to it). I suppose it randomly and fairly picks names from a fixed list, and I get to observe the outcome. Over time, names eventually start repeating.
From the distribution of the repetition count of every distinct name, is there a way to estimate the size of the fixed list ?
Intuitively, if there was no repetition among the names I have observed, I would have no information about the size of the list, but the minimum bound given by the actual distinct names observed.
On the other hand, if the distribution of the repeat counts is far away from zero, I can be reasonably confident that the sample size is much larger than the word list, and that I probably have observed the full list already.
The problem lies in the middle zone. So far the distribution of repeat counts looks like this:
$ \begin{matrix} n & k \\ 114 & 1 \\ 66 & 2 \\ 30 & 3 \\ 4 & 4 \\ 2 & 5 \\ 1 & 6 \end{matrix} $
Here $n$ is the count of distinct names appearing $k$ times in the sample.
Is there a simple estimator for the size of the list?