2

I'm doing a training exercise on 'test' data from a policy claims database, where some of the policy records contain dummy dates (like a policy recorded as taken out on 31 Dec 9999).

I'm planning to run a program over the data to find the most common dates, and I expect it will bring back the dummy data. But in case the dummy date isn't something obviously wrong (say it was 1 April 1970), I'd like to know the probability that $n$ records out of a set of $m$ records share exactly the same date (day, month and year), over a fixed range of years (say 100 years).

I'd try to look into this myself, but I have no clue about probability. I couldn't find it on the wiki page, nor among the various birthday problem questions on this Stack Exchange.

  • 0
    As this is test data, I'm happy to make sweeping assumptions about the distributions of policy dates/birthdates. 2012-11-27

1 Answer

2

If you have $m$ observations, all drawn independently and with equal probability from a set of $k$ possibilities, the probability that at least two of the observations are identical is $1 - \frac{k(k-1)(k-2)\cdots(k-m+1)}{k^m}$ if $k \ge m$, and it is $1$ if $k < m$. In your case, $k = 36500$ (100 years with 365 days each, ignoring leap days). For $m \ge 226$, this probability is $> 1/2$.
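If you want to evaluate this for your own $m$ and $k$, here is a minimal Python sketch of the product formula above. It assumes the same uniform, independent model; the function names `collision_probability` and `smallest_m_above` are just illustrative choices, not anything standard.

```python
def collision_probability(m: int, k: int) -> float:
    """Probability that at least two of m independent, uniform draws
    from k equally likely values coincide (assumed model)."""
    if m > k:
        return 1.0  # pigeonhole: more draws than possible values
    p_distinct = 1.0
    for i in range(m):
        # the (i+1)-th draw must avoid the i values already seen
        p_distinct *= (k - i) / k
    return 1.0 - p_distinct


def smallest_m_above(threshold: float, k: int) -> int:
    """Smallest m whose collision probability exceeds the threshold."""
    m = 1
    while collision_probability(m, k) <= threshold:
        m += 1
    return m


K = 36500  # 100 years of 365 days each, ignoring leap days
print(smallest_m_above(0.5, K))
```

The result should land close to the usual approximation $\sqrt{2k\ln 2} \approx 225$ for $k = 36500$. Real policy dates are unlikely to be uniform, and any clustering only makes coincidences more likely, so treat the output as a rough guide.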

In your particular problem, an out-of-bounds check on the year and the day of the month might also succeed in detecting fake dates like 6/31/1960 or 7/2/2020.
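A rough sketch of such a check in Python, using the standard `datetime.date` constructor to reject impossible calendar dates. The helper name `is_plausible_date` and the default bounds (1 January 1900 up to today) are assumptions I made for illustration; pick whatever range fits your policies.

```python
from datetime import date
from typing import Optional


def is_plausible_date(year: int, month: int, day: int,
                      earliest: date = date(1900, 1, 1),
                      latest: Optional[date] = None) -> bool:
    """Return True if (year, month, day) is a real calendar date
    that falls inside the assumed window [earliest, latest]."""
    if latest is None:
        latest = date.today()  # assumption: no policy starts in the future
    try:
        d = date(year, month, day)  # raises ValueError for e.g. 31 June
    except ValueError:
        return False
    return earliest <= d <= latest


print(is_plausible_date(1960, 6, 31))    # 31 June does not exist
print(is_plausible_date(9999, 12, 31))   # valid date, but outside the window
print(is_plausible_date(1970, 4, 1))     # passes, so needs the frequency check
```

A dummy date like 1 April 1970 passes this kind of check, so it only complements counting the most frequent dates rather than replacing it.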

  • 0
    @Pureferret - I thought so :) 2012-11-27