1
$\begingroup$

Let's say you have an urn full of balls. Each ball has one or more colors on it. I'm trying to figure out, given that you draw E balls from an urn without replacement, what is the probability that you will remove all of the balls of one or more colors? I think I have it solved for 2 colors, but beyond that, double-counting becomes problematic, and I'm at a bit of a loss as to the proper way to incorporate it.

I've started by thinking about this problem with two colors, red and green, and no mixed color balls. What is the probability of removing all of the red balls, R, given E draws? Well, we can get that from the density function of the hypergeometric function. For shorthand, dh(a,b,c,d) is the probability of drawing a balls of a color with b balls of that color in the urn, c balls not of that color, and d draws.

The probability of drawing out all of the red balls is dh(R, R, G, E). The probability of drawing out all of the green balls is dh(G,G,R,E). The probability of drawing out all of either the red OR the green (since we can't do both with 2 colors) is dh(R, R, G, E)+dh(G,G,R,E).

Great. This leads to an easy general expression for single color balls. Let's sum over all colors, with i being the number of balls with color i. T is the total number of balls.

p(removing all of 1 or more) = sum(dh(i, i, T-i, E))

Now let's say there are M mixed balls - both red and green. Let's go back to the probability of drawing out all balls with red on it.

Here, we have dh(R+M, R+M, G, E). And we can just flip the Rs and Gs to get the same expression for G.

Can we just sum these two to get the answer to the probability of drawing out all balls with either red or green on them? Do we need to worry about double-counting? Not usually in the case of two colors only. One can only draw out either all of the red or all of the green balls, unless you are removing all of the balls. In which case double counting does become a problem.

How can I derive a general term that corrects for double counting with C number of colors, mixtures having any number of colors on them from 2...C, and E draws. I'm guessing that the 1 color per ball case is a special case. I'm just not seeing the correction term. Thoughts? Just pointers in the right direction would be appreciated.

  • 0
    The nice thing here is that p(no white) = dh(0, W, N-W, N-E). However, you run into the same double-counting problem. I thought the solution could be 1-(p(no white) + p(no red| R-M(red/white)) + p(no black | B-M(red/black, white/black, white/black/red)) such that, p(no red| R-M(red/white)) = dh(0, R-M(r/w), N-R, N-E) where M(r/w) is the number of mixed red/white balls, but that doesn't seem to work. Although...should that second term be N-R-M(r/w)? Hrm...2011-11-30

2 Answers 2

1

Suppose you have $B$ balls in $C$ colours, with $b_j$ balls of colour $j$, and draw $E$ balls. For any subset $S$ of $\{1,2,\ldots, C\}$, let $b_S = \sum_{j \in S} b_j$. The probability $P(S)$ that you get all the balls of all the colours in $S$ is $0$ if $E < b_S$ and ${{B-b_S} \choose {E-b_S}}/{B \choose E}$ if $E \ge b_S$. By the inclusion-exclusion principle, the probability that you get all the balls of at least one colour is $ \sum_S (-1)^{|S|-1} P(S)$, where the sum is over all nonempty subsets $S$ of $\{1,2,\ldots, C\}$ with $b_S \le E$.

  • 0
    Ah - although if $\sum_{j \in S} b_j$ means the total number of balls in $b_s$ rather than totaling over all colors, then this works. See my answer below as well. Now the trick is to find a way to get at this algorithmically without having to go through every single possible set, which I feel like I should be able to do if I know the number of mixed balls of each kind, number of colors, and number of balls. Hrm...2011-12-02
0

Thanks for the leads. I have an answer, but am unsatisfied with it from a computational perspective. First, the answer, then the ongoing quandary. So, assume you have B balls in an urn. There are C possible colors. For any subset of the colors, there is a set of balls that contains those colors - both of single and mixed color. Let's say that any subset of some j colors, $S \subset \{1\ldots C\}, j=\left | S \right |$, can be called $S_i|j$. For any value of j, there are t=$\binom{C}{j}$ combinations. So, using that information and the inclusion/exclusion principle...

$\sum_{j=1}^{C}-1^{(j-1)}\sum_{S_0 |j}^{S_t|j}dh(|S|, |S|, B-|S|, E) $

This is great. But, here's the problem. Computationally, it's a bear, as for even moderately large numbers of colors, it becomes slow to calculate. Indeed, one is often just better at iterating through all of the combinations of balls.

Compare the following R code which works with matrices where columns are colors and rows are balls

anyGone<-function(amat, E){  totals<-colSums(amat)   #combn doesn't play well with E=1  if(E==1) return( mean(colSums(apply(amat, 1, function(x) x==totals))>0) )      mean(combn(1:nrow(amat), E, function(rows) sum(totals==colSums(amat[rows,])))>0)  } 

to this function, which takes the same matrix.

anyGone3<-function(amat, E){     eachColor<-colSums(amat)   ret<-sum(dhyper(eachColor, eachColor, nrow(amat)-eachColor, E))    if(E>1){for (j in 2:ncol(amat)){     values<-combn(1:ncol(amat), j, function(x) sum(rowSums(amat[,x])>0))     ret<- ret+(-1)^(j-1)*sum(dhyper(values, values, nrow(amat)-values, E))   }}    ret } 

For a simple matrix with 15 colors, 15 balls, and 8 draws, the difference in performance is noticeable.

So......I guess the followup question is more algorithmic. Knowing the number of types of mixed balls, the number of colors, and the total number of balls, is there some way to shave off a large amount of time from this calculation.