
We have a box with $m=1\,000\,000$ cards. Each card contains one word. The words are repeated, so there is a relatively small number $n$ of unique words. $n$ is unknown.

We draw a sample of $k=5000$ cards and find that it contains $42$ unique words.

With this information, we know $P(n\geq42) = 1$.

How can we compute $P(n\geq43)$, $P(n\geq44)$, and so on?

Is this problem common, and does it have a "common name"?

PS: We have the frequency of each of our $42$ words in the $5000$-card sample; these can be used in the solution if relevant. Let's call these frequencies $f_1, f_2, \dots, f_{42}$.

  • 1
    This is almost the [German tank problem](https://en.m.wikipedia.org/wiki/German_tank_problem). If we sampled one card at a time, putting each card back before drawing the next, and if the cards were numbered $1,2,\ldots,n$ instead, we would be a lot closer. It's also related to the [coupon collector's problem](https://en.m.wikipedia.org/wiki/Coupon_collector's_problem), except that in that problem we know $n$ and want to know which $k$ gives which probability of collecting them all. 2017-01-31
  • 0
    If you believe that the words in the box have substantially different frequencies from one another, then you have a difficult problem: you will get different answers if you think many words in the box could be there only twice than if you think each one is there at least, say, $100$ times. 2017-01-31
  • 0
    Why is $P(n \ge 42)=1$? 2017-01-31
  • 0
    @Axolotl Because it is guaranteed ($100\%$ probability) that there are at least $42$ unique words in the $1\,000\,000$ card box. 2017-01-31
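
To make the comments above concrete, here is a minimal sketch of one way to get numbers for $P(n\geq43)$ and so on. It rests on assumptions that are not part of the question: all $n$ words are equally frequent, the draws are treated as sampling with replacement (a rough approximation when $k \ll m$), and the prior on $n$ is flat on $42,\dots,n_{\max}$ with an arbitrary cutoff. Under these assumptions the probability of seeing exactly $d$ distinct words in $k$ draws is proportional to $\frac{n!}{(n-d)!}\,n^{-k}$, where the omitted factor does not depend on $n$. The function names and the `n_max` parameter are hypothetical, chosen only for illustration.

```python
import math


def log_likelihood(n, k=5000, d=42):
    """Log-probability, up to a constant that does not depend on n, of seeing
    exactly d distinct words in k draws with replacement from n words.

    ASSUMPTION: all n words are equally likely on every draw.
    """
    if n < d:
        return -math.inf
    # log( n! / (n - d)! ) - k * log(n)
    return math.lgamma(n + 1) - math.lgamma(n - d + 1) - k * math.log(n)


def posterior_tail(threshold, k=5000, d=42, n_max=1000):
    """P(n >= threshold | data) under a flat prior on n in d..n_max.

    ASSUMPTION: the flat prior and the cutoff n_max are arbitrary choices.
    """
    ns = range(d, n_max + 1)
    logs = [log_likelihood(n, k, d) for n in ns]
    peak = max(logs)
    # Subtract the peak before exponentiating to avoid overflow.
    weights = [math.exp(v - peak) for v in logs]
    total = sum(weights)
    return sum(w for n, w in zip(ns, weights) if n >= threshold) / total


if __name__ == "__main__":
    for t in (42, 43, 44):
        print(t, posterior_tail(t))
```

With equal frequencies, $5000$ draws that show only $42$ words leave essentially no room for a 43rd word, so the tail probabilities come out astronomically small. As the second comment points out, this conclusion is driven by the equal-frequency assumption: if some words may appear on only a handful of cards, the answer changes substantially, and the observed frequencies $f_1,\dots,f_{42}$ would have to enter the model.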
