2
$\begingroup$

I'm having some trouble with the following modified version of the birthday problem: Calculate the expected value for the number of people questioned until the first match in the birthday problem, write it (python is preferred), and explain the algorithm used.

So I've gotten as far the simulated probability (for a fuller understanding) which is roughly $24$ and change. I know that the basic idea is to product sum $[365/365]*[1/365]*[(1/365)(364/365)(2/365)]...[n/365]$ where you begin counting matches from person $2$ on up. Also it's an expanding series, which is awkward to translate from paper math to code.

The actual program is not long but the resulting numbers are too large and causing errors in running. I'm wondering if the expected value differs from the simulated (from random values) and why?

If anyone would be nice enough to give me a good idea on how this works I'd appreciate it.

  • 0
    Your problem seems to ask for the mean. There is a small discussion of that in the Wikipedia article on the birthday problem. It is in [Section 6.6.](http://en.wikipedia.org/wiki/Birthday_problem#Average_number_of_people)2012-10-11

1 Answers 1

2

With birthdays iid over $d$ days, the probability that it is the $k+1^{\text{th}}$ term [spot the rare rhyme for month] that first matches one of the previous distinct $k$ birthdays is $$ \frac{k}{d} \times \frac{d \times (d-1) \times \cdots \times (d-k+1)}{d \times d \times \cdots d} = \frac{k \, (d-1)!}{d^{k}\, (d-k)!}$$ so the expected value for the number needed for a match is $$\sum_{k=1}^{d} \frac{(k+1) \, k \, (d-1)!}{d^{k}\, (d-k)!}.$$

In R you might naively code this as

d <- 365
k <- 1:d
sum( (k+1) * k * factorial(d-1) / ( d^k * factorial(d-k) ) )

but this would produce an error, partly because $364! \approx 7 \times 10^{775}$ is too large. So instead use logarithms (e.g. $\log_e(364!) \approx 1786.432$, a more helpful size) as in

sum( exp( log(k+1) + log(k) + lfactorial(d-1) - k*log(d) - lfactorial(d-k) ) )

to give a result of about

[1] 24.61659

which is your "24 and change".

  • 0
    Thanks so much for that. I was actually trying to convert your code to python but R seems to have a few tricks in the way of lists and ranges that python's list comprehensions make for more complexity. This actually also helped me understand one of the advantages of using R over python. I'm still not sure why logs allow it to be computed? Are the numbers smaller before being summed or...?2012-10-12