8
$\begingroup$

Suppose you throw balls one-by-one into $b$ buckets, uniformly at random. At what time does the size of some (any) bucket exceed size $s$?


That is, consider the following random process. At each of times $t=1, 2, 3, \dots$,

  • Pick up a ball (from some infinite supply of balls that you have).
  • Assign it to one of $b$ buckets, uniformly at random, and independent of choices made for previous balls.

For this random process, let $T = T(s,b)$ be the time such that

  • At time $T-1$ (after the $T-1$th ball was assigned), for each bucket, the number of balls assigned to it was $\le s$.
  • At time $T$ (after the $T$th ball was assigned), there is some bucket for which the number of balls assigned to it is $s + 1$.

What can we say about $T$? If we can get the distribution of $T(s,b)$ that would be great, else even knowing its expected value and variance, or even just expected value, would be good.

Beyond the obvious fact that $T \le bs+1$ (and therefore $E[T]$ exists), I don't see anything very helpful. The motivation comes from a real-life computer application involving hashing (the numbers of interest are something like $b = 10000$ and $s = 64$).

  • 1
    This problem has been extensively studied. See e.g. Theorem 1 in the paper “'Balls into Bins' — A Simple and Tight Analysis” by Martin Raab and Angelika Steger [available online: http://www14.informatik.tu-muenchen.de/personen/raab/publ/balls.pdf].2012-07-15
  • 0
    @Yury: Thanks for the reference. But I forgot to mention that I did encounter this same (I think) paper while doing a Google search, but it is not very useful to me because it gives only *asymptotic* bounds: it's not very helpful to know that some probability is $o(1)$; I am more interested (given the practical motivation) in calculating the actual value, for numbers like the 10000 buckets of size 64 I mentioned. (Also, I think it is not trivial to go from the form given in Theorem 1 to a distribution of $T$, is it? Even if its probability $\Pr[M > k]$ was known as an actual value.)2012-07-15
  • 0
    BTW: I _would_ imagine this has been extensively studied; I am just surprised I am unable to find anything. I don't necessarily need a closed form; some efficient method of calculating this number would be good too.2012-07-15
  • 0
    Roughly speaking, if $T \gg n$, we can assume that the number of balls in each bin is a Gaussian r.v. with expectation $T/b$ and standard deviation $\sqrt{N b (1 - 1/b)}$; r.v. for different bins are “almost” independent. Then the number of balls in the heaviest bin is approximately $T/b + \sqrt{N b (1 - 1/b)} \cdot q_n$, where $q_n$ is such that $Pr[\gamma > q_n] = 1/n$ (here, $\gamma$ is a standard normal variable).2012-07-15
  • 0
    The little $o(1)$ term in Theorem 1 is *really* small. For practical purposes, it is 0.2012-07-15
  • 0
    @Yury: Thanks, that's good to know -- do you think that it is practically 0 even for small $n$ (in their notation) like 1000 or 10000? If you could post an answer helping me get from the form of their theorem to some actual value for $T$, that would be very helpful. I have tried, but find the cases for $m$ etc. confusing...2012-07-15
  • 0
    The paper gives that $T/b + \sqrt{2 \frac{T}{b} \log b} \approx s$, which is a quadratic equation in $T$. You can easily solve it for $T$.2012-07-15
  • 0
    @Yury: Thanks, I see you took the third case (not sure how we know that, but I'm sure you've done it right) -- so that gives (if I have solved the quadratic equation correctly): $T/b \approx (s + \log b) \pm \sqrt{(s+\log b)^2 - s^2)}$. But my question now, as it has been from my first comment, is about the $\approx$: with what confidence can we say that $T(10000, 64)$ is around $10000\times(73.21 \pm 35.55) \approx 377000$ (or is it $1088000$?)? Is there some easy way of getting such confidence bounds from the results in the paper?2012-07-15
  • 0
    It should be obvious that any answer greater than $64 \times 10000$ is too big.2012-07-16
  • 0
    @Henry: Oh yes, that rules out the bigger root of the quadratic equation (which is not a solution to the original equation anyway). Indeed I wrote $T \le bs + 1$ is in the question but didn't do this basic sanity check when posting the comment above. But still I'm not sure how close the actual number is to 377000 -- but it's good to notice that in this case, $T$ is about 60% of $bs$.2012-07-16
  • 0
    @Yury: Would you mind if I collected your comments so far and posted it as an answer? (Assuming you're not too keen on posting an answer yourself that is.)2012-07-18
  • 0
    I don't mind if you add my comments to the answer.2012-07-20

1 Answers 1