1

If we draw k times from an urn of n balls (with replacement), what is the probability that the number of times we draw a ball we have seen before is at most m?

(The idea is that on each draw we compare the drawn ball to the list of previously drawn balls and determine if it is in the list. We want to answer yes to this question at most m times. Equivalently, we want to see at least k-m different balls across the k draws.)
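To make the event concrete, here is a minimal sketch that computes this probability exactly for small inputs. It tracks the distribution of the number of distinct balls seen so far, using the fact that with $d$ distinct balls already seen the next draw is a repeat with probability $d/n$. The function name `collision_cdf` is hypothetical, and this is an illustration of the definition rather than a closed form.

```python
from fractions import Fraction


def collision_cdf(k, n, m):
    """P(at most m repeat draws) in k draws with replacement from n balls."""
    # dist[d] = probability that exactly d distinct balls have been seen so far
    dist = {0: Fraction(1)}
    for _ in range(k):
        nxt = {}
        for d, p in dist.items():
            if d > 0:
                # the draw repeats one of the d balls already seen
                nxt[d] = nxt.get(d, 0) + p * Fraction(d, n)
            if d < n:
                # the draw shows a ball not seen before
                nxt[d + 1] = nxt.get(d + 1, 0) + p * Fraction(n - d, n)
        dist = nxt
    # at most m repeats is the same as at least k - m distinct balls
    return sum(p for d, p in dist.items() if k - d <= m)


# m = 0 is the classical birthday problem
print(float(collision_cdf(23, 365, 0)))
```

The cost is on the order of $k \cdot \min(k, n)$ exact rational operations, so this is only practical for moderate $k$ and $n$.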

This seems like something that has been studied before, but I haven't found anything on point. Most searches lead to basic explanations of the birthday paradox.

I'm interested in using collision count information to estimate n.

I have expressions for cases m = 0, 1, 2, but this gave no useful insight into the general case.

A good approximation could help as well, as long as it is reasonably accurate when the CDF probability is in the range 1% to 99%.

  • 0
    It seems that in trying to be brief and clear, I failed on the clarity. I added a parenthetical clarification of what I meant by the definition of *m*. Many thanks to Jeremy and Rus for their attempts to interpret what I meant. 2017-01-23
  • 0
    Thanks for the clarification. Though I did not explain it as well as I could have, this is how I interpreted the problem in giving my solution. 2017-01-24

2 Answers

1

I interpreted the condition in the question differently than Rus May did: I take $m$ to count the draws that repeat a previously seen ball, so that if one ball is drawn $m+1$ times (a first appearance plus $m$ repeats), no other ball can be seen twice.

That said, my solution is not pretty...there is quite possibly a better one and I'd love to upvote it. I'm going to make all of the balls in the urn have different colors and refer to them by color, not sure if this simplifies matters.

First, let us define $Z(N,\lambda)$ to be the number of ways to order $N$ balls of $\lambda$ different colors such that each color is used at least twice and balls of the same color are indistinguishable. We'll worry about how to calculate $Z(N,\lambda)$ in a minute. But with this term defined, the sample space for our experiment is the set of all draws from the urn, which we can think of as all ways to put $n$ balls of $k$ different colors in order, with balls of the same color being indistinguishable. (Note that here $n$ is the number of draws and $k$ is the number of colors in the urn, the reverse of the question's notation.) Obviously our sample space has size $k^n$.

The event of interest has exactly $m$ collisions, where a collision occurs when, examining from the first ball in order to the last, you encounter a ball of a color you've seen before. For any selection in this event, it is possible that a single color collides all $m$ times, or $m$ different colors collide once each, or anything in between. So let $\lambda$ be the number of colors for which there is at least one collision.

If $m=0$, in which case there are no collisions, then the number of orderings meeting our conditions is the usual falling factorial $$\frac{k!}{(k-n)!}$$ interpreted as zero if $n > k$.

If an ordering has $m > 0$ collisions with $\lambda$ different colors having collisions, then we must have $1 \le \lambda \le {{\rm min}\{m,k,n-m\}}$. Since we have a collision, at least one color must be involved, and the number of colors involved in a collision clearly cannot be greater than the number of collisions or the number of colors. $\lambda$ also cannot exceed $n-m$, the number of draws which are not collisions, since this pool provides the set of first appearances of colors with which later collisions collide. Note also that exactly $m+\lambda$ positions in the ordering are involved in a collision: the $m$ collisions themselves, plus the $\lambda$ original occurrences of the colors involved in the collisions.

Let's start counting the number of ways we can create such an ordering. First, we can pick the $\lambda$ colors for which collisions exist in $k \choose \lambda$ ways. Next, we pick the $m+\lambda$ positions to be involved in the collisions, which can be done in ${n \choose {m+\lambda}}$ ways. The $n-(m+\lambda)$ positions not involved in collisions must be distinctly colored using colors other than those involved in collisions, of which there are $k-\lambda$. This can be done in $\frac{(k-\lambda)!}{(k+m-n)!}$ ways, with the usual conventions on the falling factorial if $k+m-n < 0$. To color the positions involved in the collisions, we are creating an ordering of $m+\lambda$ balls of $\lambda$ different colors such that each color appears at least twice (the original appearance and at least one collision), which can be done in $Z(m+\lambda,\lambda)$ ways, using our definition above.

Combining this, we have that the probability that a draw contains exactly $m$ collisions is: $$\frac{\displaystyle \sum_{\lambda=1}^{{\rm min}\{m,k,n-m\}} {k \choose \lambda} {n \choose {m+\lambda}}\frac{(k-\lambda)!}{(k+m-n)!} Z(m+\lambda,\lambda)}{k^n}$$ which we can simplify to $$\frac{\displaystyle k! \sum_{\lambda=1}^{{\rm min}\{m,k,n-m\}} \frac{1}{\lambda!}{n \choose {m+\lambda}}Z(m+\lambda,\lambda)}{k^n(k+m-n)!}$$
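As a sanity check, the formula above can be evaluated exactly for small cases and compared with brute-force enumeration. The sketch below is an illustration under stated assumptions: the names `Z`, `p_exact`, and `brute_exact` are hypothetical, and `Z` is computed by a simple recursion on the positions taken by one fixed color (not by the recurrence derived in this answer), taking $Z(0,0)=1$ for the empty ordering. It uses this answer's notation: $n$ draws, $k$ colors.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product
from math import comb, factorial


@lru_cache(maxsize=None)
def Z(N, lam):
    """Orderings of N balls in lam labeled colors, each color used at least twice."""
    if lam == 0:
        return 1 if N == 0 else 0  # assumption: the empty ordering counts once
    if N < 2 * lam:
        return 0
    # choose the c >= 2 positions taken by one fixed color, recurse on the rest
    return sum(comb(N, c) * Z(N - c, lam - 1) for c in range(2, N + 1))


def p_exact(n, k, m):
    """P(exactly m collisions) in n draws from k colors, via the formula above."""
    if m > n or n - m > k:
        return Fraction(0)  # more distinct colors needed than exist
    if m == 0:
        return Fraction(factorial(k), factorial(k - n) * k**n)
    total = sum(
        Fraction(comb(n, m + lam) * Z(m + lam, lam), factorial(lam))
        for lam in range(1, min(m, k, n - m) + 1)
    )
    return total * Fraction(factorial(k), k**n * factorial(k + m - n))


def brute_exact(n, k, m):
    """Same probability by enumerating all k**n sequences (small cases only)."""
    hits = sum(1 for seq in product(range(k), repeat=n) if n - len(set(seq)) == m)
    return Fraction(hits, k**n)


print(p_exact(5, 3, 2), brute_exact(5, 3, 2))
```

The probability of at most $m$ collisions is then the sum of `p_exact(n, k, j)` over $j = 0, \ldots, m$.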

Now we need to calculate $Z(N,\lambda)$. You can express it in terms of multinomial coefficients, simply excluding all coefficients where any of the lower indices is less than 2, but I'm not sure that really solves the problem. We can also derive a recurrence relation for $Z(N,\lambda)$ as follows. Note that $Z(N,\lambda)$ clearly must be zero if $N < 2 \lambda$, or if $\lambda = 0$ and $N > 0$; together with $Z(0,0) = 1$ (the empty ordering), these form our boundary conditions.

First, let $S$ be any ordering of $N$ balls of $\lambda$ different colors, of which there are $\lambda^N$ possibilities. Let $S_0$ be the set of colors which do not appear in the ordering, and $S_1$ be the set of colors which appear exactly once; obviously $S_0$ and $S_1$ are disjoint. Given any pair of disjoint sets $S_0$ and $S_1$, how many such sequences $S$ are there?

Suppose $|S_0| = i$ and $|S_1|=j$. The colors in $S_0$ do not appear at all. To place one ball of each color in $S_1$ into our ordering, the first one can be placed in $N$ ways, the second in $N-1$, etc., so the total number of ways to place the balls of colors appearing only once is $\frac{N!}{(N-j)!}$. The remaining $N-j$ positions in the ordering must be filled by balls of $\lambda - (i+j)$ colors, with each color appearing at least twice, of which there are $Z(N-j,\lambda-(i+j))$ possibilities.

Letting $S_0$ and $S_1$ vary over all disjoint subsets of the set of colors, we construct all possible orderings of $N$ balls with $\lambda$ colors. We note that only the cardinalities of $S_0$ and $S_1$ affect the count, not their actual members, so we obtain the following recurrence $$\displaystyle \sum_{j=0}^\lambda \left[{\lambda \choose j} \frac{N!}{(N-j)!} \sum_{i=0}^{\lambda-j} {{\lambda-j} \choose i} Z(N-j,\lambda-(i+j))\right] = \lambda^N$$
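The recurrence can be checked numerically for small $N$ and $\lambda$. In the sketch below (function names are hypothetical), $Z$ is computed independently by recursing on the positions of one fixed color, with the assumption $Z(0,0)=1$ for the empty ordering, and the left-hand side of the recurrence is then compared with $\lambda^N$.

```python
from functools import lru_cache
from math import comb, perm


@lru_cache(maxsize=None)
def Z(N, lam):
    """Orderings of N balls in lam labeled colors, each color used at least twice."""
    if lam == 0:
        return 1 if N == 0 else 0  # assumption: the empty ordering counts once
    if N < 2 * lam:
        return 0
    return sum(comb(N, c) * Z(N - c, lam - 1) for c in range(2, N + 1))


def recurrence_lhs(N, lam):
    """Left-hand side of the recurrence: j colors appear once, i colors never."""
    total = 0
    for j in range(min(lam, N) + 1):
        inner = sum(
            comb(lam - j, i) * Z(N - j, lam - (i + j))
            for i in range(lam - j + 1)
        )
        # perm(N, j) = N! / (N - j)! places the j singleton colors
        total += comb(lam, j) * perm(N, j) * inner
    return total


for N in range(0, 8):
    for lam in range(1, 5):
        assert recurrence_lhs(N, lam) == lam**N
```

The check only passes with $Z(0,0)=1$, since that term accounts for orderings in which every color present appears exactly once.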

Using this recurrence, we can derive the following formulas $$\begin{eqnarray*} Z(N,1) & = & 1, \quad N \ge 2\\ Z(N,2) & = & 2^N - 2(N+1), \quad N \ge 4\\ Z(N,3) & = & 3^N - 3(N+2)2^{N-1} +3(N^2+N+1), \quad N \ge 6\\ Z(N,4) & = & 4^N - 4(N+3)3^{N-1} +6(N^2+3N+4)2^{N-2} -4(N^3+2N+1), \quad N \ge 8 \end{eqnarray*}$$

At this point, I don't think it is hard to show that $Z(N,\lambda)$ can be approximated by the first two terms of this pattern, $\lambda^N - \lambda(N+\lambda-1)(\lambda-1)^{N-1}$, which you can use with the earlier formula to derive estimates.
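The closed forms for $\lambda \le 4$ and the two-term approximation can be compared against a direct count for small $N$. This is only a sketch: `Z_brute`, `closed_form`, and `approx` are hypothetical names, and the brute-force count enumerates all $\lambda^N$ sequences, so it is limited to small cases.

```python
from collections import Counter
from itertools import product


def Z_brute(N, lam):
    """Count length-N sequences over lam >= 1 colors using every color at least twice."""
    return sum(
        1
        for seq in product(range(lam), repeat=N)
        if len(set(seq)) == lam and min(Counter(seq).values()) >= 2
    )


def closed_form(N, lam):
    """The four closed forms listed above (stated for N >= 2 * lam)."""
    if lam == 1:
        return 1
    if lam == 2:
        return 2**N - 2 * (N + 1)
    if lam == 3:
        return 3**N - 3 * (N + 2) * 2 ** (N - 1) + 3 * (N * N + N + 1)
    if lam == 4:
        return (4**N - 4 * (N + 3) * 3 ** (N - 1)
                + 6 * (N * N + 3 * N + 4) * 2 ** (N - 2)
                - 4 * (N**3 + 2 * N + 1))
    raise ValueError("closed forms are only listed for lam <= 4")


def approx(N, lam):
    """Two-term approximation from the text."""
    return lam**N - lam * (N + lam - 1) * (lam - 1) ** (N - 1)


for lam in range(1, 5):
    for N in range(2 * lam, 9):
        assert closed_form(N, lam) == Z_brute(N, lam)
```

For $\lambda = 2$ the two-term expression coincides with the exact value, and for $\lambda = 3$ it differs from the exact value by the omitted term $3(N^2+N+1)$.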

This is a very interesting problem...thanks for posting!

EDIT: I made a significant error in my calculation which has been corrected, as well as tightening up the indices of summation.

  • 0
    Note: after some research at OEIS, the problem of calculating my Z function has been addressed by Dennis Walsh, http://capone.mtsu.edu/dwalsh/DOUBSURJ.pdf 2017-01-24
0

Call this probability $p_{k, n, m}$. In a sequence of $k$ draws, let $m_i$ be the number of times ball $i$ (for $i=1,\ldots,n$) appears among the $k$ draws. I take your condition to mean that $m_i\le m$ for each $i=1,\ldots,n$.

Without any constraints, the number of ways that the balls could be drawn with specified counts $m_1,\ldots, m_n$ is the multinomial coefficient $\binom{k}{m_1,\ldots, m_n}$, where $\sum_i m_i=k$, so the probability that ball $i$ appears $m_i$ times for each $i$ is $\binom{k}{m_1,\ldots, m_n}/n^k$. This probability coincides with the coefficient of the $x^k$ term obtained by selecting the term of degree $m_i$ in the $i$th factor, for each $i$, when expanding the product of $n$ factors $$k!\bigl(1+\frac1{1!}\frac xn+\frac1{2!}\bigl(\frac xn\bigr)^2+\cdots \bigr) \bigl(1+\frac1{1!}\frac xn+\frac1{2!}\bigl(\frac xn\bigr)^2+\cdots \bigr) \cdots \bigl(1+\frac1{1!}\frac xn+\frac1{2!}\bigl(\frac xn\bigr)^2+\cdots \bigr).$$

Now, since each ball is to appear at most $m$ times among the $k$ draws, we must bound the terms in these factors to order $x^m$ and then sum over all allowable counts. This gives a concise formula for your probability: \begin{eqnarray*} p_{k,n,m} &=&[x^k]k!\left(1+\frac1{1!}\frac xn+\cdots+\frac1{m!}\left(\frac xn\right)^m\right)^n \\ &=& [x^k]k!\left(T_m\biggl(\frac xn\biggr)\right)^n , \end{eqnarray*} where $T_m$ is the $m$th order Taylor polynomial of the exponential function.
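The coefficient extraction can be carried out with exact polynomial arithmetic. The following sketch (with hypothetical function names) multiplies $n$ copies of $T_m(x/n)$, truncating at degree $k$, and compares the result with brute-force enumeration under this answer's interpretation, namely that every ball appears at most $m$ times.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import factorial


def p_max_count(k, n, m):
    """k! [x^k] T_m(x/n)^n, with every product truncated at degree k."""
    # coefficients of T_m(x/n); degrees above k can never contribute
    base = [Fraction(1, factorial(j) * n**j) for j in range(min(m, k) + 1)]
    poly = [Fraction(1)]
    for _ in range(n):
        new = [Fraction(0)] * min(len(poly) + len(base) - 1, k + 1)
        for a, ca in enumerate(poly):
            for b, cb in enumerate(base):
                if a + b <= k:
                    new[a + b] += ca * cb
        poly = new
    # if the polynomial has degree below k, the coefficient of x^k is zero
    return factorial(k) * poly[k] if k < len(poly) else Fraction(0)


def brute_max_count(k, n, m):
    """P(no ball appears more than m times) by enumerating all n**k sequences."""
    hits = sum(
        1
        for seq in product(range(n), repeat=k)
        if max(Counter(seq).values(), default=0) <= m
    )
    return Fraction(hits, n**k)


print(p_max_count(4, 3, 2), brute_max_count(4, 3, 2))
```

Truncating at degree $k$ keeps the work at roughly $n \cdot k \cdot \min(m, k)$ rational multiplications.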

I think it's not too bad to get reasonable estimates of the probability from this formula, perhaps with Taylor's theorem, but I'll stop here.