5
$\begingroup$

If I have $m$ distinct boxes and $n$ distinct balls, and I put all of the balls into the boxes (a box may contain more than one ball), what is the expected maximum number of balls in any one box?

I'd appreciate your thoughts on solving this problem.

  • 0
    @Yuval: how do you deal with the max? Can you please give more details?2011-02-13
  • 0
    Your balls and boxes should be non-distinguishable, since you do not care which box has the highest number of balls, and "which" balls you count does not matter.2011-02-13
  • 0
    @milcak: what do you mean by "should"? In my case, I just want them to be distinguishable? If you like, you can also please give the non-distinguishable solution.2011-02-13
  • 0
    @Qiang Li what I meant is that the solution to both answers will be the same, so the way you phrased it can confuse someone. Say you distinguish the balls by color. Then in any arrangement you have some box that has some number of balls. But see that the number of balls does not depend on the color of the balls! You count all of them. Same for boxes you don't care *which* box has the most, you do not distinguish.2011-02-13
  • 0
    Or are you fixing one box, and asking how many balls on average are in that box for all arrangements?2011-02-13
  • 0
    @milcak: I do not think they are the same, since the underlying probabilities are different. Just think about 2 balls and 2 boxes. They are different!2011-02-13
  • 0
    @Qiang Li I see what you're getting at, but you need to be more precise when asking. You tagged this probability: so when you have distinct balls, putting a red ball then blue is different than blue then red. If you write $1BR2E$, meaning you have put the blue ball, then the red ball into box one, and the second box is empty, you get the following possibilities: $1RB2E$, $1BR2E$, $1B2R$, $1R2B$, $1E2BR$, $1E2RB$. Or do you? ...2011-02-13
  • 0
    ... Is putting the red ball into box one, then blue into box two the same as first putting the blue ball into box two **first**, then putting the red ball into box one? Which do you count? Which are repetitions? Which are distinct? Your question lies between probability and combinatorics, but you haven't specified the probability space properly. So yes, you are right, but then again, looking at it another way you are wrong. Also, this question is interesting to me, please specify it properly. I would very much like to try to solve it.2011-02-13
  • 0
    Sorry, I didn't notice the "maximum"...2011-02-13
  • 0
    @milcak: There are $m$ boxes and $n$ balls. You put each ball in a random box. In the end, you have $B_i$ balls in box $i$. Notice that $\sum_i B_i = n$. We want the expectation of $\max_{i=1}^m B_i$.2011-02-13
  • 0
    @Yuval I understand that. But if you have $n$ distinct balls, and you put them into one box, do you count this as $1$ arrangement, or $n!$ arrangements? In one case, you count that the maximum (for that one box) is at $n$ balls once, the other time you count it $n!$ times. These two will yield different results, no?2011-02-13
  • 1
    @milcak: Yes, the two would yield different results, and in order to get the expectation Yuval described, you'd need to count that as $1$ of the $m^n$ elementary events, namely the one where for each ball you chose to put it in that one box. It doesn't matter in which order you go through the balls and decide where to put them; what matters is only where you put each ball.2011-02-13
  • 0
    @milcak: "But if you have n distinct balls, and you put them into one box", this is one arrangement. But if you put n distinct balls into n distinct boxes with each box containing one ball exactly, then there are $n!$ arrangements, instead of 1.2011-02-13
  • 0
    @milcak: more specifically, for your example, in {1RB2E, 1BR2E, 1B2R, 1R2B, 1E2BR, 1E2RB}, 1RB2E and 1BR2E are one state, 1E2BR and 1E2RB are one state. So the sample space is {1RB2E, 1B2R, 1R2B, 1E2BR}. While if we have non-distinguishable balls, the sample space would be {20, 11, 02}. Even further, if the boxes are also non-distinguishable, we have {0,1}.2011-02-13
  • 0
    @Qiang Li Yes, that's one way of counting. There are others, and you didn't specify which.2011-02-13
  • 0
    @milcak: isn't that obvious, given my original statement? Note I specifically stated "distinct" in two places in order to be "very very" specific to avoid other possible meanings. :D2011-02-13
  • 0
    @Qiang Li Your question should have been tagged *combinatorics* as well, and should have read: "In a *combinatorial* model of distinct balls and distinct boxes, what is the expected maximum number of balls in one box." Your word **put** has 15 different meanings here.2011-02-13
  • 0
    @milcak: as i said, there "should" be no confusion at all, if you prefer to use "should".2011-02-13
  • 0
    @milcak: I agree with Qiang that it's obvious. While your alternative interpretation is perhaps consistent with the formulation of the question, the "balls into bins" model is so ubiquitous, and the standard interpretation so natural, that there is no need to explicitly describe it. The alternative interpretations are weird from a "practical" point of view, notwithstanding the fact that they describe other kinds of physical situations.2011-02-13
  • 0
    @Yuval Yes, I agree. Balls and urns or boxes or bins are a well known **combinatorial** model. If you ask a question about it, and **not** add a combinatorics tag, it is confusing. And if it wasn't for you, it was for me.2011-02-13
  • 0
    @milcak: are you saying that mathematics, or solution to a particular math problem, depends on how you "tag" it? just kidding...2011-02-13
  • 0
    @Qiang Li Hahaha :) well in this case it does serve as a sort of shorthand, I guess. If this is clear to everyone else, then please ignore me, but **put** (tag: combinatorics) is different from **put** (tag: probability). Especially in the second case, I tend to be as careful as possible.2011-02-13
  • 0
    That's a perfect problem of discrete probability, an area which tends to have a combinatorial flavor.2011-02-13

1 Answer

4

This is answered in a paper by Raab and Steger (they switch your $n$ and $m$). The case $n=m$ is simpler and had been known before (see their introduction).

Intuitively, in order to find the "correct" answer, you calculate the expected number of bins $X_t$ which get at least $t$ balls; the value of $t$ such that $X_t \approx 1$ should be "close" to the expectation.
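This heuristic is easy to compute directly. Here is a short sketch in plain Python (not from the paper; by linearity of expectation, $E[X_t]$ is $m$ times a binomial tail, since each box receives a $\mathrm{Binomial}(n, 1/m)$ number of balls):

```python
from math import comb

def expected_bins_at_least(t, balls, bins):
    # E[X_t] = bins * Pr[Binomial(balls, 1/bins) >= t], by linearity:
    # each box independently receives each ball with probability 1/bins
    p = 1 / bins
    tail = sum(comb(balls, k) * p**k * (1 - p)**(balls - k)
               for k in range(t, balls + 1))
    return bins * tail

# With 10 balls and 5 bins, E[X_t] crosses 1 between t = 3 and t = 4,
# consistent with an expected maximum of about 3.76.
```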

In order to make this rigorous, follow these steps:

  1. Find a critical value $t_0$ such that if $t \ll t_0$ then the probability that a given box gets at least $t$ balls is very close to $1$, and if $t \gg t_0$ then it is very close to $0$.
  2. When $t \gg t_0$, a union bound shows that with high probability no box gets at least $t$ balls.
  3. When $t \ll t_0$, the expected number of boxes with at least $t$ balls is very close to $m$, and so the probability that no box gets at least $t$ balls is very small.
  4. Deduce that most of the probability mass of the expectation is concentrated around $t_0$, and so the expectation itself is close to $t_0$.
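The concentration claimed in step 4 is also easy to see experimentally. A minimal simulation sketch (hypothetical parameter choices, not from the paper):

```python
import random

def max_load(balls, bins, rng):
    # throw each ball into a uniformly random bin, return the max occupancy
    counts = [0] * bins
    for _ in range(balls):
        counts[rng.randrange(bins)] += 1
    return max(counts)

rng = random.Random(0)
n = 1000  # n balls into n bins
samples = [max_load(n, n, rng) for _ in range(50)]
# the observed maxima cluster on just a few adjacent values,
# of order ln(n)/ln(ln(n)) as in Raab and Steger's result
```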

When doing the calculations, we hit a problem in step 3: the expected number of boxes with at least $t$ balls is not $m$ but somewhat smaller. Raab and Steger show that the variable $X_t$ is concentrated around its expectation (using the "second moment method", i.e. Chebyshev's inequality), and so $\Pr[X_t = 0]$ is indeed small.

Most of the work is estimating the binomial distribution of the number of balls in each box, finding the correct $t_0$ for each case; the method fails when the number of balls is significantly smaller than the number of bins.


Here is some Sage code:

def analyze_partition(partition, bins):
    # Number of arrangements whose occupancy multiset equals the given
    # partition: multinomial(*partition) splits the distinct balls into
    # groups of the given sizes, and multinomial(*counts) distributes
    # the group sizes (including the empty slots) over the distinct boxes.
    last = partition[0]
    counts = [1]  # multiplicities of the distinct part values
    for part in partition[1:]:
        if part != last:
            last = part
            counts += [1]
        else:
            counts[-1] += 1
    counts.append(bins - sum(counts))  # number of empty boxes
    return multinomial(*partition) * multinomial(*counts)

def expected_max(balls, bins):
    # average max(partition), weighted by the number of arrangements,
    # over the bins^balls equally likely ways to throw the balls
    return sum([max(partition) * analyze_partition(partition, bins)
                for partition in partitions(balls)
                if len(partition) <= bins])/bins^balls

When plugging $10$ balls and $5$ bins, I get $$1467026/390625 \approx 3.76.$$
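For readers without Sage, the same computation can be mirrored in plain Python with exact rational arithmetic; `partitions` and `multinomial` below are hand-rolled stand-ins for Sage's built-ins (a sketch, not the answerer's code):

```python
from fractions import Fraction
from math import factorial

def partitions(n, max_part=None):
    # integer partitions of n as non-increasing tuples
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for k in range(min(n, max_part), 0, -1):
        for rest in partitions(n - k, k):
            yield (k,) + rest

def multinomial(parts):
    # (sum of parts)! / (parts[0]! * parts[1]! * ...)
    result = factorial(sum(parts))
    for p in parts:
        result //= factorial(p)
    return result

def expected_max(balls, bins):
    total = 0
    for part in partitions(balls):
        if len(part) > bins:
            continue
        # multiplicities of the distinct part values, plus the empty boxes
        counts, last = [], None
        for p in part:
            if p == last:
                counts[-1] += 1
            else:
                counts.append(1)
                last = p
        counts.append(bins - len(part))
        total += max(part) * multinomial(part) * multinomial(counts)
    return Fraction(total, bins ** balls)

# expected_max(10, 5) reproduces 1467026/390625
```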

  • 0
    The question is treated and the answer is bounded, but the question isn't answered :-)2011-02-13
  • 0
    @Yuval: I would agree with joriki. What if I need an exact solution, rather than bounds, or even high-probability bounds?2011-02-13
  • 0
    @Qiang: It is very easy to give an exact formula for the expectation, using the definition of expectation and multinomial coefficients. However, this formula doesn't help you estimate the expectation.2011-02-13
  • 0
    There are many functions which are easy to estimate but hard to evaluate exactly, for example the prime-counting function $\pi$ and the partition function $p$. In these cases, however, there are non-trivial algorithms to calculate the functions explicitly.2011-02-13
  • 0
    I therefore suggest you rephrase your questions as: What is the most efficient algorithm for computing said expectation exactly?2011-02-13
  • 0
    @Yuval: then please give a multinomial-coefficient formula, as simplified as possible, for easy evaluation. I always thought one needs to partition an integer in order to calculate the expectation, and integer partitions are way more complicated.2011-02-13
  • 0
    Let $M$ be the contents of the maximal cell. Then $E[M] = \sum_{t \geq 1} \mathrm{Pr}[M \geq t] = m^{-n} \sum_{t \geq 1} \sum_{M \geq x_1 > x_2 > \cdots > x_k > 0,\, c_1,\ldots,c_k \geq 1,\, \sum c_i = m,\, \sum c_i x_i = n} \frac{n!}{(x_1!)^{c_1} \cdots (x_k!)^{c_k}} \frac{m!}{(c_1)!\cdots(c_k)!}$.2011-02-13
  • 0
    @Yuval: so still with integer partition?! :D2011-02-13
  • 0
    The formula is pretty straightforward, and you can easily program it. It's much better than the trivial $m^n$ algorithm. If all you want is a good concrete estimate for some particular $n,m$, the way to go is to run lots of experiments and take their average. You can even get rigorous bounds on the error using Chebyshev or Chernoff.2011-02-13
  • 0
    @Yuval: I agree with the formula; it's the same one I got, but I'm not sure how to evaluate it. Would you mind giving an example of the evaluation in code for m=5, n=10 (maybe in Mathematica?) and the associated error bounding? If so, I would be very thankful and accept your answer. Many thanks. :)2011-02-13
  • 0
    @Qiang: And if Yuval does not give you all that, you will not accept his answer? Sweet.2011-02-13
  • 0
    @Didier, i might. :)2011-02-13
  • 0
    If you evaluate it in Mathematica then there will be no error... Otherwise, you're trying to measure some random variable $M$ by taking samples. You have a bound on the variance because the random variable is bounded (say by $n$), and so using the Central Limit Theorem (or rather its non-asymptotic counterparts) you can find some error bounds such that the probability that the correct result is not within them is extremely small.2011-02-13
  • 0
    There was actually a small error in my formula, it should be $\geq 0$ rather than $>0$.2011-02-13
  • 0
    @Yuval: there is another typo: $$E[M] = \sum_{t \geq 1} \mathrm{Pr}[M \geq t] = m^{-n} \sum_{t \geq 1} \sum_{t \geq x_1 > x_2 > \cdots > x_k \ge 0,\, c_1,\ldots,c_k \geq 1,\, \sum c_i = m,\, \sum c_i x_i = n} \frac{n!}{(x_1!)^{c_1} \cdots (x_k!)^{c_k}} \frac{m!}{(c_1)!\cdots(c_k)!}$$2011-02-14
  • 0
    @Yuval: wait, a significant error, it should be $E[M] = \sum_{t \geq 1} \mathrm{Pr}[M \geq t] = m^{-n} \sum_{t \geq 1} \sum_{x_1\ge t > x_2 > \cdots > x_k \ge 0,\, c_1,\ldots,c_k \geq 1,\, \sum c_i = m,\, \sum c_i x_i = n} \frac{n!}{(x_1!)^{c_1} \cdots (x_k!)^{c_k}} \frac{m!}{(c_1)!\cdots(c_k)!}$, am I correct?2011-02-14
  • 0
    Yes, now it looks even better... In my code I just implemented the usual formula for expectation.2011-02-14
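The sampling approach with rigorous error bounds suggested in the comments above can be sketched as follows (plain Python, hypothetical parameter choices; the variance bound $\mathrm{Var}[M] \le n^2/4$ follows from $1 \le M \le n$):

```python
import random
from math import sqrt

def sample_max(balls, bins, rng):
    # one experiment: throw the balls, record the fullest box
    counts = [0] * bins
    for _ in range(balls):
        counts[rng.randrange(bins)] += 1
    return max(counts)

def estimate_with_bound(balls, bins, trials, delta, seed=0):
    # Sample mean of M plus a Chebyshev error radius eps:
    # Var[M] <= balls**2 / 4 since 1 <= M <= balls, so
    # Pr[|mean - E[M]| >= eps] <= Var[M] / (trials * eps**2) <= delta
    # when eps = balls / (2 * sqrt(trials * delta)).
    rng = random.Random(seed)
    mean = sum(sample_max(balls, bins, rng) for _ in range(trials)) / trials
    eps = balls / (2 * sqrt(trials * delta))
    return mean, eps

mean, eps = estimate_with_bound(10, 5, trials=20000, delta=0.01)
# with probability at least 0.99, E[M] lies in [mean - eps, mean + eps]
```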