13
$\begingroup$

There is a Wikipedia-type website of a fixed size of $S$ number of articles. You start at any article on Wikipedia. You then start to press the "random article" button and count the number of times $N$ that it takes for you to randomly generate the original page again.

The goal is to find the best estimate for $S$ given only the number $N$.

Assume that the "random article" function generates any particular article with a uniform probability and that each click is a (EDIT: independent) random event. The page that you start on does not count as a click and is thus not counted in $N$.

This problem reminds me of the solutions to the locomotive problem and the German tank problem. Since the button is clicked N times, the case with the highest number of articles has all N pages being different pages, yielding an estimate of (I think) $2N+1$. This would mean that the best estimate would be less than $2N+1$, because it is always possible for $N$ to contain an article, other than the one you were looking for, to be repeated multiple times.

Another possible solution is based off of the method of mark and recapture to estimate the size of animal populations in the wild. This method gives the estimate as $S=N$. This would because out of $N$ samples taken during the "recapture", exactly $1/N$ of them are articles viewed during the "mark" phase (original article that is remembered). Since $1/N$ of the pages are known to be picked from the group of one particular page, then the estimate would be that there would be $N$ pages in total.

  • 2
    @EMS: I don't understand what graph structure has to do with this question.2012-03-21

1 Answers 1

7

If $S$ is the total number of articles, the probability of observing $N$ is $\frac{(S-1)^{N-1}}{S^N}.$

We should choose an estimate for $S$ that maximizes this probability. Differentiating gives $\frac{(N-1)(S-1)^{N-2} S^N - N S^{N-1} (S-1)^{N-1}}{S^{2N}} = (S-1)^{N-2} \frac{(N-1)S - N (S-1)}{S^{N+1}}$

which is equal to $0$ when $S = 1$ and when $S = N$; the former is a local minimum and the latter is a local maximum, so indeed $S = N$ is a sensible estimate.


However, if you were actually going to try to do this, I think waiting until the first article repeats is a bad idea. You have to wait very long and in the meantime you're not using information about any other articles that may have repeated. As Ross says in the comments another idea is to wait until some article repeats. If $S$ is the total number of articles, the probability of observing $N$ is now $P(S) = \frac{(S-1)(S-2)\cdots(S-N+1)N}{S^N} = \frac{N(S-1)!}{S^N (S-N)!}.$

Stirling's approximation gives $P(S) \approx C_N \frac{ (S-1)^{S-1} }{S^N (S-N)^{S-N} }$

where $C_N$ is a constant depending only on $N$. Taking logarithms gives $\log P(S) \approx \log C_N + (S-1) \log (S-1) - N \log S - (S-N) \log (S-N)$

and taking derivatives gives \frac{P'(S)}{P(S)} \approx 1 + \log (S-1) - \frac{N}{S} - 1 - \log (S-N).

A local maximum should therefore occur when $\log (S-1) - \log (S-N) \cong \frac{N}{S}$, or when $1 + \frac{N-1}{S-N} \approx e^{ \frac{N}{S} } \approx 1 + \frac{N}{S}$

giving $S(N-1) \approx N(S-N)$ or $S \approx N^2$. So as Ross indicates this is much more efficient for obtaining an estimate.

  • 0
    The validity of all this depends on whether the "random article" function really picks a random article with a uniform distribution, and independence of distinct trials. How would one test _that_?2012-03-21