
A box contains 100 balls, each labeled with a number from 1 to 10. How many balls should I draw (with replacement: each ball is put back in the box after drawing) to predict the number of balls for each number with 95% certainty?

A prediction is correct if for each of the 10 numbers, the number of balls with that number is correctly predicted.

  • 0
    @Tunococ I am not saying a priori that the CLT cannot be used, but can you explain a little more how you would solve it using the CLT before I answer whether it is good enough? 2012-11-27

2 Answers

2

You're basically asking for confidence intervals on the probabilities of a multinomial distribution.

Let $Q$ be the number of balls in the box (you said 100), of which there are distinct labels 1 to $m$ (you said 10). Let's say there are $N_i$ balls of type $i$, with $\sum_i N_i = Q$; we'll say $N = (N_1, \dots, N_m)$.

Say you draw $n$ balls and get a vector of counts for each type $x = (x_1, \dots, x_m)$. If $m = 3$ and you drew 10 with label 1, 2 with label 2, and 3 with label 3, you'd have $x = (10, 2, 3)$. $x$ is distributed according to $\text{Multinomial}(n, p_1, \dots, p_m)$, where $p_i = N_i / Q$.

After drawing $n$ balls, your maximum-likelihood estimate of the cell probabilities is $\hat{p}_i = x_i / n$, so your prediction of the ball counts is $\hat{N} = (Q x_1 / n, \dots, Q x_m / n)$, rounded to integers.
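To make the estimate concrete, here is a minimal simulation sketch. It assumes the prediction scales the observed fractions $x_i / n$ by $Q$. The box contents (10 balls per label), the function name `estimate_counts`, and the fixed seed are hypothetical choices for illustration only, not part of the question.

```python
import random
from collections import Counter


def estimate_counts(box_counts, n, seed=0):
    """Draw n balls with replacement; return the counts x and the estimates N_hat.

    box_counts[i] is the true number of balls with label i + 1 (unknown in
    practice; it is supplied here only so the drawing can be simulated).
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    Q = sum(box_counts)
    labels = list(range(1, len(box_counts) + 1))
    # Drawing with replacement: each draw picks label i with probability N_i / Q.
    draws = rng.choices(labels, weights=box_counts, k=n)
    tally = Counter(draws)
    x = [tally[label] for label in labels]
    # Scale the observed fraction x_i / n up to the box size Q.
    n_hat = [Q * xi / n for xi in x]
    return x, n_hat


# Hypothetical box: 10 balls of each of the 10 labels.
true_counts = [10] * 10
x, n_hat = estimate_counts(true_counts, n=20000)
print(x)
print([round(v, 2) for v in n_hat])
```

The estimates always sum to $Q$, since the $x_i$ sum to $n$; what varies with $n$ is how close each component lands to its true integer value.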

Now, you want to pick $n$ large enough such that your 95% confidence region on $N$ will contain only one integer for each component.

Confidence intervals on multinomials are actually somewhat complicated because of the interactions between cells; here are some papers. R has a function gofCI to do this asymptotically, whose help page has some more references.

  • 0
    Still reading the papers, but looks very promising. This is what I was looking for. Thanks. 2012-11-28
0

If I understand correctly, your problem is estimating the parameter of a Bernoulli distribution. I will attempt to find an approximate answer to the problem using CLT. (This is not an exact solution.)

Suppose we're interested in estimating the number of balls with number 1 written on them. Consider a draw a success if you draw a ball with 1 written on it. Then the actual rate of success is the number of balls with number 1 divided by 100. Let $p$ be this rate.

Let $X_n$ be the number of times 1 is drawn after $n$ drawings. Then the CLT says that $\sqrt{n}\left(\frac{X_n}n - p\right)$ converges in distribution to a normal distribution with mean $0$ and variance $p(1 - p)$. A less formal statement is that the distribution of $\frac{X_n}n$ is roughly normal with mean $p$ and variance $\frac 1np(1 - p)$ for large $n$. So my approximation is that you estimate $p$ by $\hat p = \frac{X_n}n$, and the variance $p(1 - p)$ by $\hat p(1 - \hat p)$. The estimation error $\hat p - p$ is then roughly distributed as $\mathcal{N}\left(0, \frac 1n\hat p(1 - \hat p)\right)$, and you pick $n$ large enough that this distribution has probability $0.95$ within the range $[-0.005, 0.005]$. The half-width $0.005$ corresponds to half a ball out of 100, so rounding $100\hat p$ to the nearest integer then gives the correct count.
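The condition above can be solved for $n$: with $z \approx 1.96$ the two-sided 95% normal quantile, we need $z\sqrt{p(1 - p)/n} \le 0.005$, i.e. $n \ge z^2 p(1 - p) / 0.005^2$. The sketch below just evaluates that bound. The function name `clt_sample_size` and the example values of $p$ are assumptions for illustration, and the result covers a single label only, not all 10 labels jointly.

```python
import math
from statistics import NormalDist


def clt_sample_size(p, half_width=0.005, confidence=0.95):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= half_width (CLT approximation).

    p is the (unknown) success rate; in practice one would plug in p_hat
    or the worst case p = 0.5.
    """
    # Two-sided normal quantile, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)


# Hypothetical rates: 10 of the 100 balls carry the label, and the worst case.
for p in (0.1, 0.5):
    print(p, clt_sample_size(p))
```

For $p = 0.1$ this gives roughly 14,000 draws, and for the worst case $p = 0.5$ roughly 38,000, which shows how demanding a half-ball tolerance is even for a single label.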

  • 0
    I just realized that you replied to my answer, and I just discovered that you did not actually read my answer... 2013-01-12